What is the fastest setup for running calculations on a large dataset of strings?

Time: 2022-08-16 00:12:27

For my day job, I have been tasked with setting up a computer system to run calculations on a large database of strings. I have established a proof of concept, but I don't have the low-level knowledge to optimize the hardware and software environment. I was hoping for some guidance on this aspect.


Setup:

  • 100,000 records in a database containing strings
  • I will be performing string similarity calculations to look for approximate duplicates
    • i.e. each string against every other string, so ~5 billion calculations
  • I wrote the proof of concept in Ruby, using SQLite3 as the database, with 1,000 sample rows (a minimal sketch of this kind of loop follows this list)
  • The total job should run in under a few days - the faster the better, but with diminishing returns. This is a one-time pass, so I don't need a supercomputer if a desktop setup can do it within a few days
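
For reference, here is a minimal sketch of the proof-of-concept loop described above. It assumes a hypothetical `records` table with a `value` text column; the database file name and the similarity threshold are illustrative, not the actual setup:

    require "sqlite3"

    # Classic dynamic-programming Levenshtein (edit) distance.
    def levenshtein(a, b)
      prev = (0..b.length).to_a
      a.each_char.with_index(1) do |ca, i|
        curr = [i]
        b.each_char.with_index(1) do |cb, j|
          cost = ca == cb ? 0 : 1
          curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
        end
        prev = curr
      end
      prev.last
    end

    db = SQLite3::Database.new("strings.db")
    strings = db.execute("SELECT value FROM records").flatten

    # Every string against every other string: n * (n - 1) / 2 pairs,
    # i.e. ~5 billion distance calls for n = 100,000.
    strings.combination(2) do |a, b|
      puts "#{a} ~ #{b}" if levenshtein(a, b) <= 2 # illustrative threshold
    end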

What I'm Looking For:

  • If I'm building a custom box to run this job (and potentially future jobs of a similar nature), what hardware should I focus on optimizing? I.e. should I spend my limited budget on a very fast GPU? CPU? Large amounts of RAM? I don't know Ruby on a low enough level to know where the bottlenecks for this type of operation are
  • Am I missing a better approach? I won't get approval for any major purchases of software or expensive hardware, at least until I can prove this method works with this run through. But can anyone suggest a more efficient method of detecting inexact duplicates?

3 Answers

#1


4  

First off, 100,000 strings don't really qualify as a large dataset nowadays, so don't worry too much about the hardware. Here are some suggestions from my previous job (related to search and machine translation) and the current one where I deal with several 100k to millions of XML records all the time:


  • You want RAM. Lots of it.
  • As Soren said, you want to make sure your algorithm is good.
  • Choose your DB wisely. Postgres, for example, has excellent string functions, and doing certain things directly in the DB can be very fast (see the pg_trgm sketch after this list). Have I said you want a lot of RAM?
  • Your job sounds like it would be fairly easy to partition into smaller subtasks which can be tackled in parallel. If that's really the case, you might want to look at MapReduce. At my previous job we had pretty good workstations (4 cores, 8 GB of RAM) which were never turned off, so we turned some of them into a Hadoop cluster that would do useful stuff. Since the machines were quite overpowered for everyday work anyway, the users didn't even notice. It's usually not that difficult to turn something into a MapReduce job, and the other advantage is that you can keep the setup around for similar tasks in the future (a single-machine sketch of the partitioning idea follows below).
  • As for Ruby-specific bottlenecks, the biggest one in MRI is usually garbage collection, which, thanks to its stop-the-world nature, is super slow. When we profile, this regularly turns out to be a problem. See why's article "The Fully Upturned Bin" for details on Ruby GC. If you are set on using Ruby, you might want to compare MRI to JRuby; from my experience with the latter and profilers like JVisualVM, I wouldn't be surprised if JRuby fared better.
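
To make the Postgres point concrete, here is a hedged sketch using the pg_trgm extension, which can find approximate duplicates via indexed trigram similarity instead of a client-side O(n^2) loop. The database name, table, and columns are hypothetical; pg_trgm itself, its % operator, and its similarity() function are standard:

    require "pg"

    conn = PG.connect(dbname: "strings") # hypothetical database
    conn.exec("CREATE EXTENSION IF NOT EXISTS pg_trgm")

    # A trigram GIN index makes the % (similar-to) operator cheap.
    conn.exec(<<~SQL)
      CREATE INDEX IF NOT EXISTS records_value_trgm
        ON records USING gin (value gin_trgm_ops)
    SQL

    # Self-join on similarity; pg_trgm's default threshold is 0.3
    # and can be adjusted with set_limit().
    res = conn.exec(<<~SQL)
      SELECT a.id AS id_a, b.id AS id_b,
             similarity(a.value, b.value) AS sim
      FROM records a
      JOIN records b ON a.id < b.id AND a.value % b.value
      ORDER BY sim DESC
    SQL
    res.each { |row| puts "#{row['id_a']} ~ #{row['id_b']} (#{row['sim']})" }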

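Short of a full Hadoop cluster, the same partitioning idea can be tried on a single multi-core box. A minimal sketch, assuming the strings fit in memory; the input file, worker count, and striping scheme are all illustrative:

    # Split the pair space across one process per core: MapReduce in
    # miniature (partition, compute, let the OS schedule).
    WORKERS = 4
    strings = File.readlines("strings.txt", chomp: true) # hypothetical input

    WORKERS.times do |w|
      fork do
        strings.each_index do |i|
          next unless i % WORKERS == w # worker w owns every WORKERS-th row
          ((i + 1)...strings.length).each do |j|
            # compare strings[i] with strings[j] here; each unordered
            # pair is visited exactly once across all workers
          end
        end
      end
    end
    Process.waitall
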
#2


2  

The total job should run in under a few days...
This is a one-time pass...
Am I missing a better approach...


If this is a one-off task, you should really just run this on Amazon -- get an Extra Large (4-core, 15 GB RAM) machine for a few hours, and just run it there.


#3


1  

Your algorithm for string similarity is much more important than your hardware spec.


The key question on algorithms for string similarity is "when do you expect strings to be similar?" Do you consider substrings, spelling errors, phonetics, typing errors?


This SO link has a great discussion on algorithms. 100,000 records is really very little data (in my world), but for ease of implementation, once you have a good algorithm, you should try to get as much RAM as possible. Doing it in Ruby may not be the best choice from a performance perspective either. (A small sketch of one possible similarity measure follows.)

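As a concrete illustration of how the answer to "when do you expect strings to be similar?" changes the metric, here is a small sketch of trigram (3-gram) overlap, which tolerates typos and is roughly what Postgres's pg_trgm computes; a phonetic scheme like Soundex or a pure edit-distance measure would rank the same pairs differently. The example strings and padding scheme are illustrative:

    require "set"

    # Overlapping 3-grams of a padded, lowercased string.
    def trigrams(s)
      padded = "  #{s.downcase} "
      (0..padded.length - 3).map { |i| padded[i, 3] }.to_set
    end

    # Jaccard overlap of the two trigram sets, in 0.0..1.0.
    def trigram_similarity(a, b)
      ta, tb = trigrams(a), trigrams(b)
      (ta & tb).size.to_f / (ta | tb).size
    end

    puts trigram_similarity("John Smith", "Jon Smith")   # high: looks like a typo
    puts trigram_similarity("John Smith", "Smith, John") # lower: word order differs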
