I have 100 million lines of data; each line is a single word no longer than 15 characters. The data is stored across multiple files.
My goal is to find the unique words across all the files.
One solution is to import all the words into a database and add a unique key on that field, but this is too slow for such a large data set.
Is there any faster solution?
Thank you
7 Solutions
#1
I'm not sure that there'll be many faster ways than using a database. Personally, I usually use a UNIX shell script for this:
cat * | sort | uniq
I don't know how fast that would be with 100,000,000 words, and I'm not sure how fast you want it to be. (E.g., do you need to run it lots of times or just once? If just once, I'd go with the sort and uniq option and let it run overnight if you can).
Alternatively, you could write a script in Ruby or a similar language that stores the words in an associative array. I suspect that would almost certainly be slower than the database approach though.
I guess if you really want speed, and you need to carry out this task (or ones like it) often, then you might want to write something in C, but to me that feels a bit like overkill.
Ben
#2
Using a database for this is insane. 100 million records of 15 chars each fit in RAM. If there is at least some duplication, simply build a trie. You should be able to process 50 MB/second or so on a modern machine.
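A minimal sketch of that trie idea in Python (my own illustration, not the answerer's code; the file-name pattern and one-word-per-line input are assumptions):

import glob

END = object()   # sentinel key marking "a word ends here"

def unique_words_trie(pattern="words_*.txt"):   # hypothetical file pattern
    root = {}            # each trie node is a dict: character -> child node
    unique = []
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                word = line.rstrip("\n")
                if not word:
                    continue                    # skip blank lines
                node = root
                for ch in word:                 # walk/extend the trie
                    node = node.setdefault(ch, {})
                if END not in node:             # first time we see this word
                    node[END] = True
                    unique.append(word)
    return unique

Note that a dict-of-dicts trie in Python carries a lot of per-node overhead; the 50 MB/second figure presumably assumes a compact implementation in a lower-level language such as C.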
#3
If you have to stick with the file structure, then you need some way of indexing the files and then maintaining the index.
Otherwise, I would recommend moving to a database and migrating all operations on those files to work with the database.
#4
You could store the words in a hash table. Assuming there are quite a number of duplicates, the O(1) search time will be a big performance boost (see the sketch after this list).
- Read a line.
- Search for the word in the hashtable.
- If not found, add it to the table.
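A rough Python sketch of those three steps (illustrative only; the file-name pattern is an assumption). Python's built-in set is a hash table, so membership tests average O(1):

import glob

def unique_words(paths):
    seen = set()                               # the hash table
    for path in paths:
        with open(path) as f:
            for line in f:
                word = line.rstrip("\n")       # read a line
                if word and word not in seen:  # search the hash table
                    seen.add(word)             # not found: add it
    return seen

# Usage with a hypothetical file-name pattern:
# print(len(unique_words(glob.glob("words_*.txt"))))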
#5
If you have this much data, then it needs to be in a SQL server. This is why SQL was designed in the first place. If you continue to use these files you will forever be stuck with performance issues.
Even if these files are modified by external programs (or via FTP), you need to create an import process to run nightly.
#6
You can conserve speed, space, or your sanity. Pick any two.
Throwing it all into a database sacrificed both speed and space, as you found out. But it was easy.
If space is your main problem (memory, disk space) then partition the work. Filter all of the 1-character lines from the files and use one of the above solutions (sort, uniq). Repeat with the 2-character lines for each file. And so on. The unique solutions from each pass form your solution set.
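A rough Python sketch of those length-partitioned passes (my own illustration; it uses an in-memory set per pass where the answer suggests sort/uniq, and the file-name pattern is an assumption). Each pass re-reads the files but only keeps the distinct words of one length in memory:

import glob

def unique_by_length(pattern="words_*.txt", max_len=15):
    paths = glob.glob(pattern)          # hypothetical file naming
    for length in range(1, max_len + 1):
        seen = set()                    # only words of this length
        for path in paths:
            with open(path) as f:
                for line in f:
                    word = line.rstrip("\n")
                    if len(word) == length and word not in seen:
                        seen.add(word)
                        yield word      # emit each unique word once

# Usage:
# for w in unique_by_length():
#     print(w)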
If your main problem is speed, then read each file exactly once, creating a hash table (dictionary, whatever) to look for duplicates. Depending on the hash implementation, this could eat up bucketloads of memory (or disk). But it'll be fast.
If you need to conserve speed and space, then consider blending the two techniques. But be prepared to sacrifice the third item.
#7
If there's significant duplication within individual files, it may be quicker to do it file by file and then merge the results. Something along the lines of:
{ for n in * ; do sort -u $n ; done } | sort -u
(I'm assuming GNU bash and GNU sort)
I think the choice of best solution will depend heavily on the distribution of duplicates and the number of separate files, though, which you haven't shared with us.
Given myhusky's clarification (plenty of dupes, 10~20 files), I'll definitely suggest this as a good solution. In particular, dense duplication will speed up sort -u versus sort | uniq.