I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combination of programming language + OS + whatever else you think is important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!
I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application to take advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm thinking of is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. Right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any idea.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations that I would like to perform include transposition of the array (changing b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained it won't be necessary to perform more operations on the data.
EDIT (2):
I would also like to be able to store more information in the future, so the solution should be somehow scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it'll be better if I can use N=512 (which means 512 GB to store it!!).
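For reference, here is a rough sketch in C of the layout and the transposition described above, assuming the whole array fits in RAM and is kept in a single contiguous row-major buffer (for N=256 each buffer is 256^4 * 8 bytes = 32 GiB; for N=512 it is 512 GiB). This is only an illustration of the index arithmetic, not a recommendation of a particular approach:

    #include <stdlib.h>

    /* Row-major linear index into a flat N*N*N*N buffer. */
    static size_t idx(size_t N, size_t i, size_t j, size_t k, size_t l) {
        return ((i * N + j) * N + k) * N + l;
    }

    int main(void) {
        const size_t N = 256;                       /* 256^4 doubles = 32 GiB */
        double *a = malloc(N * N * N * N * sizeof(double));
        double *b = malloc(N * N * N * N * sizeof(double));
        if (!a || !b) { free(a); free(b); return 1; }

        /* ... fill a with the discretized function here ... */

        /* Transposition of the first two indices: b[i,j,k,l] = a[j,i,k,l].
           Note this materializes a second 32 GiB array. */
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                for (size_t k = 0; k < N; k++)
                    for (size_t l = 0; l < N; l++)
                        b[idx(N, i, j, k, l)] = a[idx(N, j, i, k, l)];

        free(b);
        free(a);
        return 0;
    }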
14 Answers
#1
3
Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly.
#2
2
Any decent answer will depend on how you need to access the data. Random access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.
#3
2
As Chris pointed out, it depends on what you are going to do with the data.
Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you, like caching.
#4
2
If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.
#5
2
Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out Wikipedia for a description, and to decide if this might meet your needs: http://en.wikipedia.org/wiki/Sparse_matrix
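If that applies to your data, a minimal sketch of a coordinate-format (COO) store in C might look like the following; the type and field names are made up for illustration, not taken from any particular library:

    #include <stddef.h>

    /* Keep only the nonzero entries as (i, j, k, l, value) records
       instead of the full N^4 grid. */
    typedef struct {
        unsigned short i, j, k, l;   /* indices fit in 16 bits for N <= 512 */
        double value;
    } Entry;

    typedef struct {
        Entry  *entries;             /* dynamically grown array of nonzeros */
        size_t  count;
        size_t  capacity;
    } Sparse4D;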
#6
1
Without more information: if you need the quickest possible access to all the data, I would go with C as the programming language, some flavor of *nix as the OS, and buying RAM; it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But as others have mentioned, it will depend on how you are using this data.
#7
1
So far, there are a lot of very different answers. There are two good starting points mentioned above. David suggests some hardware and someone mentioned learning C. Both of these are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow: if your workflow is linear, that is one thing. If the workflow is not linear, I would design a binary tree referencing pages in memory. There is plenty of information on B-trees on the Internet. In addition, these B-trees will be much easier to work with in C, since you will also be able to set up and manipulate your memory paging.
#8
1
Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.
#9
1
You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32 GB files.
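For what it's worth, on a 64-bit OS mmap can map files well beyond 32 GB; a minimal POSIX sketch (with a hypothetical file name) could look like this:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const size_t N = 256;
        const size_t bytes = N * N * N * N * sizeof(double);   /* 32 GiB */

        int fd = open("array.bin", O_RDONLY);        /* hypothetical data file */
        if (fd < 0) { perror("open"); return 1; }

        double *a = mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
        if (a == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Pages are faulted in on demand; use it like an ordinary array. */
        printf("first element: %f\n", a[0]);

        munmap(a, bytes);
        close(fd);
        return 0;
    }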
#10
0
Database technology as a whole is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e. get a good DBMS principles book and read about indexing, query execution, etc.).
A lot depends on how you need to access the data - if you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you will scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already processed data and read new data.
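As an illustration of that scan-along-one-axis idea, a rough C sketch (hypothetical file names, placeholder per-element work) that keeps only one i-slice in memory at a time might look like this; for N=256 an i-slice is 256^3 doubles, i.e. 128 MiB:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const size_t N = 256;
        const size_t slice = N * N * N;                 /* doubles per i-slice */
        double *buf = malloc(slice * sizeof(double));
        if (!buf) return 1;

        FILE *in  = fopen("array.bin",  "rb");          /* hypothetical input  */
        FILE *out = fopen("result.bin", "wb");          /* hypothetical output */
        if (!in || !out) { free(buf); return 1; }

        for (size_t i = 0; i < N; i++) {
            if (fread(buf, sizeof(double), slice, in) != slice)
                break;                                  /* short read: stop    */
            for (size_t t = 0; t < slice; t++)
                buf[t] *= 2.0;                          /* placeholder work    */
            fwrite(buf, sizeof(double), slice, out);
        }

        fclose(out); fclose(in); free(buf);
        return 0;
    }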
#11
0
For transpositions, it's actually faster to just change your understanding of which index is which. By that, I mean you leave the data where it is and instead wrap an accessor delegate that changes b[i][j][k][l] into a request to fetch (or update) a[j][i][k][l].
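As a sketch of that accessor idea in C (assuming the data is already in one flat row-major buffer), the "transposed" view costs only a different index computation, with no data movement:

    #include <stddef.h>

    /* Row-major linear index into a flat N*N*N*N buffer. */
    static size_t idx(size_t N, size_t i, size_t j, size_t k, size_t l) {
        return ((i * N + j) * N + k) * N + l;
    }

    /* View b[i][j][k][l] == a[j][i][k][l]: the data never moves. */
    static double get_b(const double *a, size_t N,
                        size_t i, size_t j, size_t k, size_t l) {
        return a[idx(N, j, i, k, l)];
    }

    static void set_b(double *a, size_t N,
                      size_t i, size_t j, size_t k, size_t l, double v) {
        a[idx(N, j, i, k, l)] = v;
    }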
#12
0
Would it be possible to solve it with the following procedure?
First, create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and will load some part of the array into the RAM of that core.
A parent process will be the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.
Will this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?
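Something along those lines is commonly done with MPI, which is the usual way of spreading an array across the RAM of many cluster nodes. A very rough sketch (with the actual work left out, and assuming N is divisible by the number of ranks) might look like this:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const size_t N = 256;
        const size_t local = (N / size) * N * N * N;   /* doubles owned by this rank */
        double *chunk = malloc(local * sizeof(double));

        /* ... each rank loads and processes its own i-slices;
           rank 0 can act as the "manager" that gathers results ... */

        free(chunk);
        MPI_Finalize();
        return 0;
    }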
#13
0
The first thing that I'd recommend is picking an object-oriented language, and developing or finding a class that lets you manipulate a 4-D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).
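In C (rather than an object-oriented language), the same idea can be sketched as a small handle type plus accessor functions, so calling code doesn't care whether the storage behind the pointer came from malloc or from a memory-mapped file; the names here are made up for illustration:

    #include <stddef.h>

    typedef struct {
        double *data;   /* backing storage: malloc'd buffer or mmap'd file */
        size_t  N;
    } Array4D;

    static double array4d_get(const Array4D *a,
                              size_t i, size_t j, size_t k, size_t l) {
        return a->data[((i * a->N + j) * a->N + k) * a->N + l];
    }

    static void array4d_set(Array4D *a,
                            size_t i, size_t j, size_t k, size_t l, double v) {
        a->data[((i * a->N + j) * a->N + k) * a->N + l] = v;
    }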
#14
0
How to handle processing large amounts of data typically revolves around the following factors:
- Data access order / locality of reference: Can the data be separated out into independent chunks that are then processed either independently or in a serial/sequential fashion, vs. random access to the data with little or no order?
- CPU vs. I/O bound: Is the processing time spent more on computation with the data or on reading/writing it from/to storage?
- Processing frequency: Will the data be processed only once, every few weeks, daily, etc.?
If the data access order is essentially random, you will need either to get access to as much RAM as possible or to find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.