I am using Tire as the Ruby wrapper for Elasticsearch. My problem is that I need to load 100,000 documents into memory and run fairly complex computations on them. The current procedure looks like this:
- Load all documents
- Computation.new(all_documents)
- Iterate all documents and call computation.calc(document)
This strategy does not work for 100,000 documents as I will reach the memory limits of my machine immediately. The documents (JSON) are loaded into Tire objects which I then convert into Ruby Hashes.
What can I do to make this scale? I thought of the following, but I am not sure whether it is a) good to implement and b) the best solution.
1. Initialize the computation object: c = Computation.new
2. Load m documents
3. c.preprocess(documents)
4. Repeat steps 2 and 3 until all documents are preprocessed
5. Load m documents
6. Iterate the m documents and call c.calc(document)
7. Repeat steps 5 and 6 until all documents are processed
Also from the GC point of view I am not sure how this would work out.
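For illustration, here is a rough sketch of the batched flow I have in mind, using Tire's from/size pagination. I am assuming query { all }, from, size and Item#to_hash behave as documented; the index name 'documents' and BATCH_SIZE are placeholders, and Computation#preprocess / #calc stand in for my real logic:

```ruby
require 'tire'

BATCH_SIZE = 1_000

# Yield the whole index as plain Ruby hashes, one batch at a time,
# so only BATCH_SIZE documents are in memory at once.
def each_batch(index_name)
  offset = 0
  loop do
    results = Tire.search(index_name) do
      query { all }        # match_all
      from offset
      size BATCH_SIZE
    end.results

    docs = results.map { |item| item.to_hash }   # plain hashes, not Tire items
    break if docs.empty?
    yield docs
    offset += BATCH_SIZE
  end
end

c = Computation.new

# Steps 2-4: preprocess in chunks
each_batch('documents') { |docs| c.preprocess(docs) }

# Steps 5-7: run the per-document calculation in chunks
each_batch('documents') do |docs|
  docs.each { |doc| c.calc(doc) }
end
```

Deep from values get slow in Elasticsearch, so if my Tire version has scan/scroll support, that would presumably be the better fit than plain pagination.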
1 Answer
Your question seems to be "How do I serialize 100,000 ElasticSearch JSON objects into Ruby objects without running out of memory?". A better question would be: "How do I run calculations on 100,000 ElasticSearch documents as easily and efficiently as possible?". Since we don't know what kind of calculations you are trying to run, we'll have to keep the answer general.
- Take neil-slater's suggestion and do as much in ElasticSearch as possible. For instance, ES has lots of nice statistical calculations you can do in the DB/store (see the first sketch after this list).
- Do preprocessing on insertion of new documents. For instance, if you know you are going to want counts, averages, or some other calculation against the entire set, just calculate the stats for each item before storing it in ES. If you are using Tire in Rails, add these calc methods to a before_save callback or something (second sketch below).
- Avoid deserializing the ES docs into Ruby objects altogether. Turning all 100,000 into Ruby objects is what is killing your memory. See if you can get performance improvements by fetching the results as straight-up JSON and using the Ruby JSON gem (or a performance-tuned alternative like multi-json) to turn them into Ruby hashes (third sketch below). It will still eat some memory, but not nearly as much as full Rails model objects.
- Try breaking the calculation into steps and feeding them as background jobs or tasks for a daemon (fourth sketch below). If they need to execute in order, you can have the first job fire off the next job as it completes.
- If none of the above works, find a way to get closer to the data (manipulate the JSON directly with some javascript lib) or consider another data store, possibly something like PostgreSQL where you can do TONS of calculations in the DB 1000x faster than you ever could in Ruby/Rails.
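For the first suggestion, here is a sketch of letting ES do the aggregate math with a statistical facet. The index name 'documents' and the :price field are hypothetical, and this assumes your Tire version exposes the statistical facet in its DSL:

```ruby
require 'tire'

# Ask Elasticsearch for the statistics instead of computing them in Ruby.
s = Tire.search('documents') do
  query { all }
  size 0                                    # we only want the facet, not the hits
  facet('price_stats') { statistical :price }
end

stats = s.results.facets['price_stats']
# ES returns count, total, min, max, mean, variance, std_deviation, ...
puts stats['mean']
```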
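For the second one, a sketch of precomputing per-document values at write time, so reads stay cheap. An ActiveRecord model with Tire's mixins is assumed; body, word_count and score are made-up columns, and the formula is a placeholder for whatever you actually need:

```ruby
class Document < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks   # reindexes the record after save

  before_save :precompute_stats

  private

  # Whatever you would otherwise derive for all 100,000 docs at query time
  def precompute_stats
    self.word_count = body.to_s.split.size
    self.score      = word_count * 0.1   # placeholder formula
  end
end
```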
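For the third one, a sketch of skipping Tire's result objects and going straight to hashes. The URL, index name and page size are assumptions for a local ES node:

```ruby
require 'net/http'
require 'uri'
require 'multi_json'

uri  = URI('http://localhost:9200/documents/_search?from=0&size=1000')
body = MultiJson.dump(query: { match_all: {} })

http     = Net::HTTP.new(uri.host, uri.port)
response = http.post(uri.request_uri, body, 'Content-Type' => 'application/json')
payload  = MultiJson.load(response.body)

# _source gives you plain Ruby hashes, with no Tire::Results::Item in between
docs = payload['hits']['hits'].map { |hit| hit['_source'] }

c = Computation.new
docs.each { |doc| c.calc(doc) }
```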
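And for the background-job idea, one way to chain the batches so only one slice is ever in memory at a time. Resque is shown, but Sidekiq or a plain daemon would look much the same; fetch_batch is a hypothetical helper wrapping a paged search like the one above:

```ruby
require 'resque'

class CalcBatchJob
  @queue = :computations
  BATCH_SIZE = 1_000

  def self.perform(offset)
    docs = fetch_batch(offset, BATCH_SIZE)   # hypothetical paged-search helper
    return if docs.empty?                    # nothing left: the chain stops here

    computation = Computation.new
    docs.each { |doc| computation.calc(doc) }

    # fire off the next slice once this one completes
    Resque.enqueue(CalcBatchJob, offset + BATCH_SIZE)
  end
end

Resque.enqueue(CalcBatchJob, 0)   # kick off the chain
```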
Hope that helps!