Syncing a Lucene.net index across multiple application servers

Date: 2021-07-05 03:08:09

We are designing the search architecture for a corporate web application, using Lucene.net. The indexes will not be big (about 100,000 documents), but the search service must always be available and always up to date. New documents will be added to the index constantly, alongside concurrent searches. Since the search system must be highly available, we have 2 application servers that expose a WCF service to perform searches and indexing (a copy of the service runs on each server). Each server then uses the Lucene.net API to access the indexes.

The problem is, what would be the best solution to keep the indexes synced all the time? We have considered several options:

  • Using one server for indexing and having the 2nd server access the indexes via SMB: no can do, because that creates a single point of failure;

  • Indexing on both servers, essentially writing every index twice: probably lousy performance, and a possibility of desync if, e.g., server 1 indexes OK while server 2 runs out of disk space or whatever;

  • Using SOLR or KATTA to wrap access to the indexes: nope, we cannot have tomcat or similar running on the servers, we only have IIS.

  • Storing the index in a database: I found this can be done with the Java version of Lucene (the JdbcDirectory module), but I couldn't find anything similar for Lucene.net. Even if it meant a small performance hit, we'd go for this option, because it'd cleanly solve the concurrency and syncing problem with minimum development;

  • Using the Lucene.net DistributedSearch contrib module: I couldn't find a single link with documentation about this. Even looking at the code, I can't tell what it does, but it seems to actually split the index across multiple machines, which is not what we want;

  • rsync and friends, copying the indexes back and forth between the 2 servers: this feels hackish and error-prone to us, and if the indexes grow big it might take a while; during that period we would be returning corrupt or inconsistent data to clients, so we'd have to develop some ad hoc locking policy, which we don't want to do.

I understand this is a complex problem, but I'm sure lots of people have faced it before. Any help is welcome!

5 Solutions

#1


It seems that the best solution would be to index the documents on both servers into their own copy of the index.

If you are worried about the indexing succeeding on one server and failing on the other, then you'll need to keep track of the success/failure for each server so that you can re-try the failed documents once the problem is resolved. This tracking would be done outside of Lucene in whatever system you are using to present the documents to be indexed to Lucene. Depending on how critical the completeness of the index is to you, you may also have to remove the failed server from whatever load balancer you are using until the problem has been fixed and indexing has reprocessed any outstanding documents.

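
Such tracking might be sketched like this (the status values, server names, and class are illustrative, not part of the original answer):

```csharp
// Sketch: per-server indexing status kept outside Lucene, so that
// failed documents can be retried once the failing server recovers.
// All names here are illustrative.
using System.Collections.Generic;
using System.Linq;

enum IndexStatus { Pending, Indexed, Failed }

class IndexTracker
{
    // docId -> (serverName -> status)
    private readonly Dictionary<int, Dictionary<string, IndexStatus>> _state =
        new Dictionary<int, Dictionary<string, IndexStatus>>();
    private readonly string[] _servers = { "server1", "server2" };

    public void Submitted(int docId) =>
        _state[docId] = _servers.ToDictionary(s => s, _ => IndexStatus.Pending);

    public void Record(int docId, string server, bool ok) =>
        _state[docId][server] = ok ? IndexStatus.Indexed : IndexStatus.Failed;

    // Documents a background job should re-send to the given server.
    public IEnumerable<int> OutstandingFor(string server) =>
        _state.Where(kv => kv.Value[server] != IndexStatus.Indexed)
              .Select(kv => kv.Key);
}
```

A background job would poll `OutstandingFor` per server and re-submit those documents; the same data can drive the decision to pull a server out of the load balancer.
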
#2


+1 for Sean Carpenter's answer. Indexing on both servers seems like the sanest and safest choice.

If the documents you're indexing are complex (Word/PDF and the like), you could perform some preprocessing on a single server and then hand the result to the indexing servers, to save some processing time.

A solution I've used before involves creating an index chunk on one server, then rsyncing it over to the search servers and merging the chunk into each index, using IndexWriter.AddIndexesNoOptimize. You can create a new chunk every 5 minutes or whenever it gets to a certain size. If you don't have to have absolutely up-to-date indexes, this might be a solution for you.

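
The merge step of that approach might look roughly like this (a sketch assuming the Lucene.Net 2.9/3.x API; paths are placeholders):

```csharp
// Sketch: merge a freshly copied index chunk into the local index.
// Assumes Lucene.Net 2.9/3.x; both paths are placeholders.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

var chunkDir = FSDirectory.Open(@"D:\index\incoming-chunk");
var mainDir  = FSDirectory.Open(@"D:\index\main");

using (var writer = new IndexWriter(mainDir,
        new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
        false /* append to the existing index */,
        IndexWriter.MaxFieldLength.UNLIMITED))
{
    // Pulls the chunk's segments in without forcing a full optimize,
    // so searches on the main index stay responsive during the merge.
    writer.AddIndexesNoOptimize(chunkDir);
    writer.Commit();
}
```

Each search server would run this after the rsync of a chunk completes, then delete the chunk folder.
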
#3


In the Java world, we solved this problem by putting an MQ in front of the index(es). The insert was only complete when the bean pulled from the queue succeeded; otherwise it rolled back any action it had taken, marked the document as pending, and it was tried again later.

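
A rough sketch of that consumer loop, translated to .NET (the `queue`, `store`, and message/helper types are hypothetical stand-ins for whatever broker and bookkeeping you use):

```csharp
// Sketch of the queue-in-front-of-the-index pattern. The queue, store,
// and BuildDocument helpers are hypothetical; any broker (e.g. MSMQ)
// could fill the MQ role. Assumes Lucene.Net 2.9/3.x for the writer.
void OnMessage(IndexMessage msg)
{
    try
    {
        indexWriter.AddDocument(BuildDocument(msg)); // Lucene.Net write
        indexWriter.Commit();
        queue.Ack(msg);                 // the insert is only now complete
    }
    catch
    {
        indexWriter.Rollback();         // undo any partial index changes
        store.MarkPending(msg.DocId);   // flag the doc for a later retry
        queue.Requeue(msg);
    }
}
```
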
#4


I know this is an old question, but I just came across it and wanted to give my 2 cents for anyone else looking for advice on a multi-server implementation.

Why not keep the index files on a shared NAS folder? How is that different from storing the index in a database, as you were contemplating? A database can be replicated for high availability, and so can a NAS!

I would configure the two app servers behind a load balancer. Any index request that comes in will index documents into a machine-specific folder on the NAS. That is, there will be as many indexes on the NAS as you have app servers. When a search request comes in, you do a multi-index search using Lucene. Lucene has a construct (MultiSearcher) built in to do this, and the performance is still excellent.

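
A multi-index search over the per-server folders might be sketched as follows (assuming the Lucene.Net 2.9/3.x API; the UNC paths and field name are placeholders):

```csharp
// Sketch: search both servers' NAS index folders as one logical index.
// Assumes Lucene.Net 2.9/3.x; paths and the "content" field are placeholders.
using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var paths = new[] { @"\\nas\index\server1", @"\\nas\index\server2" };
var searchers = paths
    .Select(p => (Searchable)new IndexSearcher(FSDirectory.Open(p), readOnly: true))
    .ToArray();

using (var multi = new MultiSearcher(searchers))
{
    var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "content",
        new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
    // Hit scores and rankings are merged across both underlying indexes.
    TopDocs hits = multi.Search(parser.Parse("lucene"), 20);
}
```
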
#5


The way we keep our load-balanced servers in sync, each with its own copy of Lucene, is to have a task on another server that runs every 5 minutes, commanding each load-balanced server to update its index to a certain timestamp.

For instance, the task sends a timestamp of '12/1/2013 12:35:02.423' to all the load-balanced servers (the task is submitting the timestamp via querystring to a webpage on each load-balanced website), then each server uses that timestamp to query the database for all updates that have occurred since the last update through to that timestamp, and updates their local Lucene index.

Each server also stores the timestamp in the db, so it knows when each server was last updated. So if a server goes offline, when it comes back online, the next time it receives a timestamp command, it'll grab all the updates it missed while it was offline.

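
The catch-up step each server performs might be sketched as follows (the SQL, table, and helper names are illustrative, not from the original answer):

```csharp
// Sketch: pull all document changes between this server's last-synced
// timestamp and the commanded one, apply them to the local Lucene index,
// then record progress. db, DocChange, and ApplyToLocalIndex are
// illustrative stand-ins.
void UpdateTo(DateTime commandedTs, string serverName)
{
    // Last timestamp this server successfully applied.
    DateTime lastTs = db.GetLastSynced(serverName);

    // Everything that changed after the last sync, up to the command.
    var changes = db.Query<DocChange>(
        "SELECT * FROM DocumentChanges WHERE ChangedAt > @p0 AND ChangedAt <= @p1",
        lastTs, commandedTs);

    foreach (var change in changes)
        ApplyToLocalIndex(change);   // add/update/delete in the local index

    // Persist progress so a server that was offline catches up next time.
    db.SetLastSynced(serverName, commandedTs);
}
```

Because progress is persisted per server, a server that misses several 5-minute cycles simply replays a larger window on its next command.
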