I am trying to create a Lucene index of around 2 million records. Indexing takes around 9 hours. Could you please suggest how to improve performance?
4 answers
#1
I wrote a post on how to parallelize building a Lucene index. It's truly terribly written, but you'll find it here (there's some sample code you might want to look at).
Anyhow, the main idea is that you split your data into sizable chunks and then process each chunk on a separate thread. When all the chunks are done, you merge them into a single index.
With the approach described above, I'm able to index 4+ million records in approximately 2 hours.
Hope this gives you an idea of where to go from here.
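The chunk, index-per-thread, then merge idea can be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not the code from the linked post: the "index" here is a plain in-memory map from term to record ids, and the names `ParallelIndexSketch`, `indexChunk`, and `buildIndex` are made up for the example. With real Lucene, each task would presumably write to its own index, and the final merge step would typically go through IndexWriter's addIndexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexSketch {

    /** Stand-in for a per-thread index: maps each term to the ids of the records containing it. */
    static Map<String, SortedSet<Integer>> indexChunk(List<String> records, int firstId) {
        Map<String, SortedSet<Integer>> partial = new TreeMap<>();
        for (int i = 0; i < records.size(); i++) {
            for (String term : records.get(i).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // leading whitespace produces an empty first token
                }
                partial.computeIfAbsent(term, t -> new TreeSet<>()).add(firstId + i);
            }
        }
        return partial;
    }

    /** Splits the records into chunks, indexes each chunk on a pool thread, then merges the partial results. */
    static Map<String, SortedSet<Integer>> buildIndex(List<String> records, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, SortedSet<Integer>>>> parts = new ArrayList<>();
            for (int start = 0; start < records.size(); start += chunkSize) {
                final int from = start;
                final List<String> chunk =
                        records.subList(from, Math.min(from + chunkSize, records.size()));
                Callable<Map<String, SortedSet<Integer>>> task = () -> indexChunk(chunk, from);
                parts.add(pool.submit(task));
            }
            // Merge step: runs on the calling thread once each partial index is ready.
            Map<String, SortedSet<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, SortedSet<Integer>>> part : parts) {
                for (Map.Entry<String, SortedSet<Integer>> e : part.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
                }
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Chunk size and thread count are the knobs to experiment with; whether this helps at all depends on indexing being CPU-bound rather than waiting on the data source.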
#2
Apart from the writing side (merge factor) and the computation side (parallelizing), slow indexing is sometimes due to the simplest of reasons: slow input. Many people build a Lucene index from data in a database. Sometimes the query for that data turns out to be too complicated and slow to return all the (2 million?) records quickly. Try running just the query and writing the results to disk; if that alone still takes on the order of 5-9 hours, you've found the place to optimize (the SQL).
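One way to run that check is to drain the record source to a file with no Lucene involved and time it. The sketch below assumes the records can be exposed as an `Iterator<String>` (for a database, a thin wrapper around the JDBC result set would do); the class and method names are invented for the example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;

public class InputTimingSketch {

    /** How many records were read and how long reading plus writing them took. */
    static final class Timing {
        final long records;
        final long elapsedMillis;

        Timing(long records, long elapsedMillis) {
            this.records = records;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Reads every record from the source and writes it to the given file, one record per line. */
    static Timing drainToFile(Iterator<String> source, Path out) {
        long start = System.nanoTime();
        long count = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (source.hasNext()) {
                writer.write(source.next());
                writer.newLine();
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Timing(count, (System.nanoTime() - start) / 1_000_000);
    }

    /** Convenience wrapper: drains to a temporary file and deletes it afterwards. */
    static Timing drainToTempFile(Iterator<String> source) {
        try {
            Path tmp = Files.createTempFile("records", ".txt");
            try {
                return drainToFile(source, tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the elapsed time reported here is already close to the full indexing time, tuning Lucene will not help much and the query is the thing to fix.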
#3
The following article really helped me when I needed to speed things up:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
I found that document construction was our primary bottleneck. After optimizing data access and implementing some of the other recommendations, I was able to substantially increase indexing performance.
#4
The simplest way to improve Lucene's indexing performance is to adjust the value of IndexWriter's mergeFactor setting. This value tells Lucene how often to merge multiple segments together: a higher value means fewer merges while indexing, at the cost of more segments on disk. How many documents are held in memory before being written to disk is, in later Lucene versions, controlled by separate settings (maxBufferedDocs or the RAM buffer size), so those are worth raising as well.
http://search-lucene.blogspot.com/2008/08/indexing-speed-factors.html