How to sort a huge (GB-sized) CSV file?

Posted: 2022-07-13 21:33:30

Background

I have a huge CSV file that has several million rows. Each row has a timestamp I can use to order it.

Naive approach

So, my first approach was obviously to read the entire thing into memory and then sort it. That didn't work so well, as you may guess....

Naive approach v2

My second try was to loosely follow the idea behind MapReduce.

So, I would slice this huge file into several parts and sort each one. Then I would combine all the parts into the final file.

The issue here is that part B may have a message that should be in part A. So in the end, even though each part is ordered, I cannot guarantee the order of the final file....

Objective

My objective is to create a function that when given this huge unordered CSV file, can create an ordered CSV file with the same information.

Question

What are the popular solutions/algorithms for sorting data sets this big?

2 Answers

#1 (score: 7)

What are the popular solutions/algorithms for sorting data sets this big?

Since you've already concluded that the data is too large to sort/manipulate in the memory you have available, the popular solution is a database, which will build disk-based structures for managing and sorting more data than can fit in memory.

You can either build your own disk-based scheme or you can grab one that is already fully developed, optimized and maintained (e.g. a popular database). The "popular" solution that you asked about would be to use a database for managing/sorting large data sets. That's exactly what they're built for.

Database

You could set up a table indexed by your sort key, insert all the records into the database, then open a cursor sorted on that key and iterate it, writing the now-sorted records to your new file one at a time. Then, delete the database when done.

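For instance, here's a minimal sketch of that scheme using Python's built-in sqlite3 module. The function name is mine, and it assumes no header row and a timestamp in the first comma-separated field that sorts correctly as text (e.g. ISO-8601); the answer itself doesn't prescribe any of this:

```python
import sqlite3

def sort_csv_with_sqlite(src_path, dst_path, db_path="sort_tmp.db"):
    # Temporary table keyed by the timestamp; the index lets SQLite
    # return rows in key order without an in-memory sort.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE rows (ts TEXT, line TEXT)")
    con.execute("CREATE INDEX idx_ts ON rows (ts)")

    with open(src_path) as f:
        # Assumption: the sort key is everything before the first comma.
        con.executemany(
            "INSERT INTO rows VALUES (?, ?)",
            ((line.split(",", 1)[0], line.rstrip("\r\n")) for line in f),
        )
    con.commit()

    with open(dst_path, "w") as out:
        # The cursor streams records in sorted order, one row at a time,
        # so memory use stays flat no matter how big the file is.
        for (line,) in con.execute("SELECT line FROM rows ORDER BY ts"):
            out.write(line + "\n")
    con.close()
```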

Chunked Memory Sort, Manual Merge

Alternatively, you could do your own chunked sort: break the data into smaller pieces that fit in memory, sort each piece, and write each sorted block to disk. Then merge all the blocks: read the next record from each block into memory, find the lowest one across all the blocks, write it out to your final output file, read the next record from that block, and repeat. Using this scheme, the merge only ever has to hold N records in memory at a time, where N is the number of sorted chunks you have (probably far fewer records than the chunk sort itself holds).

As juvian mentioned, here's an overview of how an "external sort" like this could work: https://en.wikipedia.org/wiki/External_sorting.

One key aspect of the chunked memory sort is deciding how big to make the chunks. There are a number of strategies. The simplest may be to decide how many records you can reliably fit and sort in memory, based on a few simple tests or even just a guess you're sure is safe (picking a smaller number per chunk just means you will split the data across more files). Then read that many records into memory, sort them, and write them out under a known filename. Repeat that process until you have read all the records and they all sit in temp files with known filenames on disk.

Then, open each file, read the first record from each one, find the lowest of the records you read in, write it out to your final file, and read the next record from that file, repeating the process. When you reach the end of a file, just remove it from the set of data you're comparing, since it's now done. When there is no more data, you're done.

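A rough sketch of both phases in Python, to make the scheme concrete. The chunk size and the assumption that the sort key is the first CSV column are mine; the standard-library heapq.merge performs the k-way merge, holding only one record per chunk in memory:

```python
import csv
import heapq
import os
import tempfile

CHUNK_ROWS = 1_000_000  # a guess at what safely fits in memory; tune as needed

def external_sort_csv(src_path, dst_path, key_col=0):
    tmpdir = tempfile.mkdtemp()
    chunk_paths = []

    # Phase 1: read fixed-size chunks, sort each in memory, spill to disk.
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        while True:
            chunk = [row for _, row in zip(range(CHUNK_ROWS), reader)]
            if not chunk:
                break
            chunk.sort(key=lambda r: r[key_col])
            path = os.path.join(tmpdir, f"chunk_{len(chunk_paths)}.csv")
            with open(path, "w", newline="") as out:
                csv.writer(out).writerows(chunk)
            chunk_paths.append(path)

    # Phase 2: k-way merge of the sorted chunks; heapq.merge keeps just
    # one record per chunk in memory at any moment.
    files = [open(p, newline="") for p in chunk_paths]
    with open(dst_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in heapq.merge(*(csv.reader(f) for f in files),
                               key=lambda r: r[key_col]):
            writer.writerow(row)

    for f in files:
        f.close()
    for p in chunk_paths:
        os.remove(p)
    os.rmdir(tmpdir)
```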

Sort Keys only in Memory

If all the sort keys themselves would fit in memory, but not the associated data, then you could make and sort your own index. There are many different ways to do that, but here's one scheme.

Read through the entire original file, capturing two things in memory for every record: the sort key and the file offset at which that record is stored in the original file. Then, once you have all the sort keys in memory, sort them. Finally, iterate through the sorted keys one by one, seeking to the right spot in the file, reading that record, writing it to the output file, and advancing to the next key, repeating until the data for every key has been written in order.

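As a sketch (again assuming the key is the first comma-separated field, which is my assumption, not the answer's), the index can be a plain list of (key, offset) pairs:

```python
def sort_csv_by_key_index(src_path, dst_path):
    index = []  # one (sort_key, byte_offset) pair per record

    # Pass 1: record each row's key and where the row starts in the file.
    with open(src_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            index.append((line.split(b",", 1)[0], offset))

    # Only keys and offsets are in memory; the row data stays on disk.
    index.sort()

    # Pass 2: re-read the records in key order and write them out.
    with open(src_path, "rb") as f, open(dst_path, "wb") as out:
        for _, offset in index:
            f.seek(offset)
            out.write(f.readline())
```

Note that pass 2 does one random seek per record, which is cheap on an SSD but can be slow on a spinning disk.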

BTree Key Sort

If all the sort keys won't fit in memory, then you can get a disk-based BTree library that will let you sort more keys than can fit in memory. You'd use the same scheme as above, but you'd be putting the sort key and file offset into the BTree.

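One pragmatic stand-in, since SQLite stores its tables and indexes as on-disk B-trees: keep only the (key, offset) pairs in an indexed SQLite table and walk it in key order. This is my illustration, with the same first-field assumption as above:

```python
import sqlite3

def sort_csv_via_disk_btree(src_path, dst_path, db_path="keys_tmp.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE keys (ts TEXT, off INTEGER)")
    con.execute("CREATE INDEX idx_ts ON keys (ts)")  # disk-resident B-tree

    # Insert only key + offset; the row data never leaves the CSV.
    with open(src_path, "rb") as f:
        while True:
            off = f.tell()
            line = f.readline()
            if not line:
                break
            con.execute("INSERT INTO keys VALUES (?, ?)",
                        (line.split(b",", 1)[0].decode(), off))
    con.commit()

    # Walk the B-tree in key order, seeking back into the original file.
    with open(src_path, "rb") as f, open(dst_path, "wb") as out:
        for (off,) in con.execute("SELECT off FROM keys ORDER BY ts"):
            f.seek(off)
            out.write(f.readline())
    con.close()
```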

Of course, it's only one step further to put the actual data from the file into the BTree as well, and then you have a database.

#2 (score: 2)

I would read the entire file row by row and write each line into a temporary folder, grouping lines into files by a reasonable time interval (whether the interval should be a year, a day, an hour, etc. is for you to decide based on your data). The temporary folder would then contain an individual file for each interval (for a per-day split, for example: 2018-05-20.tmp, 2018-05-21.tmp, 2018-05-22.tmp, etc.). Now we can read the files in chronological order, sort each one in memory, and append it to the target sorted file.

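A minimal sketch of this bucketing approach in Python, assuming an ISO-style timestamp (YYYY-MM-DD...) in the first column so the first ten characters identify the day; the function name and the day-level split are my choices for illustration:

```python
import os
import tempfile

def sort_csv_by_day_buckets(src_path, dst_path):
    tmpdir = tempfile.mkdtemp()
    buckets = {}  # day string -> open bucket file

    # Pass 1: append every line to the temp file for its day.
    with open(src_path) as f:
        for line in f:
            day = line.split(",", 1)[0][:10]  # "YYYY-MM-DD" prefix
            if day not in buckets:
                # Caveat: one handle per day; a real implementation would
                # reopen in append mode if the data spans more days than
                # the OS allows open files.
                buckets[day] = open(os.path.join(tmpdir, day + ".tmp"), "w")
            buckets[day].write(line)
    for fh in buckets.values():
        fh.close()

    # Pass 2: visit buckets chronologically; each one fits in memory.
    with open(dst_path, "w") as out:
        for day in sorted(buckets):
            path = os.path.join(tmpdir, day + ".tmp")
            with open(path) as bucket:
                # Lexical sort == chronological sort for ISO timestamps.
                out.writelines(sorted(bucket))
            os.remove(path)
    os.rmdir(tmpdir)
```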
