Implementing Lucene with multiple web servers on an existing .NET / SQL Server stack

Time: 2021-06-18 03:09:41

I want to look at using Lucene for a fulltext search solution for a site that I currently manage. The site is built entirely on SQL Server 2008 / C# .NET 4 technologies. The data I'm looking to index is actually quite simple, with only a couple of fields per record and only one of those fields actually searchable.

It's not clear to me which toolset or which architecture I should be using. Specifically:

  1. Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?

  2. If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?

  3. Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?

  4. When I query the index, should I just pull back a bunch of record IDs and then go back to the DB for the actual records, or should I be aiming to pull everything I need for the search straight out of the index?

  5. Is there value in trying to implement something like Solr in this flavour of environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.

1 solution

#1


50  

I'll answer a bit based on how we chose to implement Lucene.Net here on Stack Overflow, and some lessons I learned along the way:

Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?

  • It depends on your goals. We had a severely under-utilized web tier (~10% CPU) and an overloaded database doing full-text searching (around 60% CPU; we wanted it lower). Loading the same index on each web server let us put those machines to use and gave us a ton of redundancy: we can still lose 9 out of 10 web servers and keep the Stack Exchange network up if need be. The downside is that it's very IO (read) intensive for us, and the web tier was not bought with this in mind (this is often the case at most companies). While it works fine, we'll still be upgrading our web tier to SSDs and implementing some other bits left out of the .NET port to compensate for this hardware deficiency (NIOFSDirectory, for example).
  • The other downside is that we index all our databases n times, once per web server. Luckily we're not starved for network bandwidth, and SQL Server caching the results makes this a very fast delta indexing operation each time. With a large number of web servers, that alone may eliminate this option.

If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?

  • You can query it on a file share either way; just make sure only one process is indexing at a time. The write.lock file, the directory locking mechanism, will enforce this and raise an error if you try to open multiple IndexWriters at once.
  • Keep in mind my notes above: this is IO intensive when a lot of readers are flying around, so you need ample bandwidth to your store. With anything short of iSCSI or a fiber SAN, I'd be cautious about this approach for high-traffic use (hundreds of thousands of searches a day).
  • Another consideration is how you update/alert your web servers (or whatever tier is querying the index). When you finish an indexing pass, you'll need to re-open your IndexReaders to see the updated index with the new documents. We use a redis messaging channel to alert whoever cares that the index has been updated, but any messaging mechanism would work here.
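
The two mechanics in the bullets above (one writer at a time, and readers that re-open when told the index changed) can be sketched roughly as follows. This is a language-neutral illustration in Python, not Lucene.Net code: `WriteLock` only mimics the idea behind write.lock using exclusive file creation, and `SearcherHolder`, `open_reader` and `on_index_updated` are hypothetical names. How well exclusive creation behaves on a given network file system is an assumption you would need to verify.

```python
import os


class IndexLockedError(Exception):
    """Raised when another writer already holds the lock."""


class WriteLock:
    """Single-writer guard built on exclusive creation of a lock file.

    Illustrative only: it mimics the idea of Lucene's write.lock,
    not its actual implementation.
    """

    def __init__(self, index_dir):
        self.path = os.path.join(index_dir, "write.lock")
        self._fd = None

    def __enter__(self):
        try:
            # O_EXCL makes the call fail if the file already exists.
            self._fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise IndexLockedError(self.path)
        return self

    def __exit__(self, exc_type, exc, tb):
        os.close(self._fd)
        os.remove(self.path)
        return False


class SearcherHolder:
    """Keeps one open reader and swaps it when the index is updated."""

    def __init__(self, open_reader):
        # open_reader is a hypothetical factory returning a fresh reader.
        self._open_reader = open_reader
        self.reader = open_reader()

    def on_index_updated(self, message=None):
        # Hook this up to whatever messaging mechanism announces
        # that an indexing pass has finished.
        self.reader = self._open_reader()
```

The indexing job would wrap its pass in `with WriteLock(index_dir):` and publish a message when done; each querying server would subscribe and call `on_index_updated`.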

Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?

  • Unfortunately there are none that I know of, but I can share with you how I approached this.
  • When indexing a specific table (its rows are akin to documents in Lucene), we added a rowversion column to that table. When we index, we select based on the last rowversion we saw (a timestamp datatype, pulled back as a bigint). I chose to store the last index date and last indexed rowversion on the file system in a simple .txt file for one reason: everything else in Lucene is stored there. This means that if there's ever a large problem, you can just delete the folder containing the index, and the next indexing pass will recover and produce a fully up-to-date index. Just add some code to treat nothing being there as "index everything".
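
A minimal sketch of that checkpoint idea, in Python for brevity rather than the C# we actually run. The file name and the `fetch_rows_after` / `add_or_update` callables are hypothetical stand-ins for the SQL query (rows whose rowversion is greater than the checkpoint) and for the call that writes a document to the index. Only the rowversion is tracked here; the last index date is left out.

```python
import os

CHECKPOINT_NAME = "last_rowversion.txt"  # hypothetical file name


def read_checkpoint(index_dir):
    """Return the last indexed rowversion, or 0 if there is no checkpoint."""
    path = os.path.join(index_dir, CHECKPOINT_NAME)
    try:
        with open(path, encoding="ascii") as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # Nothing there (file or whole folder deleted) means "index everything".
        return 0


def write_checkpoint(index_dir, rowversion):
    """Persist the checkpoint next to the index files."""
    os.makedirs(index_dir, exist_ok=True)
    path = os.path.join(index_dir, CHECKPOINT_NAME)
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        f.write(str(rowversion))
    os.replace(tmp, path)  # swap in the new file in one step


def index_pass(index_dir, fetch_rows_after, add_or_update):
    """Index every row newer than the checkpoint; return how many were indexed."""
    last = read_checkpoint(index_dir)
    highest = last
    count = 0
    for row in fetch_rows_after(last):
        add_or_update(row)
        highest = max(highest, row["rowversion"])
        count += 1
    if highest != last:
        write_checkpoint(index_dir, highest)
    return count
```

Because the checkpoint lives in the same folder as the index, deleting that folder resets both together, which is the recovery behaviour described above.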

When I query the index, should I just pull back a bunch of record IDs and then go back to the DB for the actual records, or should I be aiming to pull everything I need for the search straight out of the index?

  • This really depends on your data. For us it's not really feasible to store everything in the index (nor is this recommended). What I suggest is that you store in the index the fields your search results need, meaning whatever you need to present the results in a list before the user clicks through to the full [insert type here].
  • Another consideration is how often your data changes. If a lot of the fields you store but don't search on change rapidly, you'll need to re-index those rows (documents) whenever they change, not only when the field you're searching on changes.
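
To make the trade-off concrete, here is a small sketch, again in Python and with made-up field names. `STORED_FIELDS` is a hypothetical list of what a results page shows; everything else stays in the database and is fetched by ID when the user clicks through.

```python
STORED_FIELDS = ("id", "title", "excerpt")  # hypothetical list-page fields


def to_index_document(row):
    """Keep only what the results list needs; the rest stays in the DB."""
    return {name: row[name] for name in STORED_FIELDS}


def needs_reindex(old_row, new_row):
    """A row must be re-indexed when any stored field changed, searchable or not."""
    return any(old_row[name] != new_row[name] for name in STORED_FIELDS)
```

The fewer volatile fields you put in `STORED_FIELDS`, the less often `needs_reindex` is true, which is why storing a rapidly changing column such as a view counter in the index is usually a poor fit.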

Is there value in trying to implement something like Solr in this flavour of environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.

  • Sure there is; it's the centralized search you're talking about. (With a high number of searches you may again hit a limit with a VM setup, so keep an eye on this.) We didn't do this because it introduced a lot of (we feel) unwarranted complexity into our technology stack and build process, but for a larger number of web servers it makes much more sense.
  • What does it buy you? Performance mainly, and dedicated indexing server(s). Instead of n servers crawling a network share (and competing for IO as well), they hit a single server, and only requests and results travel over the network. Crawling the index, which moves a lot more data back and forth, stays local to the Solr server(s). Also, you're not hitting your SQL Server as much, since fewer servers are indexing.
  • What it doesn't buy you is as much redundancy, but it's up to you how important that is. If you can operate fine with degraded search or without it, simply have your app handle that. If you can't, then one or more backup Solr servers may also be a valid solution... and it is potentially another software stack to maintain.
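
"Have your app handle that" could look something like the sketch below. It is Python and purely illustrative: `backends` is a hypothetical ordered list of callables (primary search server first, then any backup), and treating only network-style errors as "server down" is an assumption.

```python
def search_with_fallback(query, backends):
    """Try each search backend in order; return (results, degraded).

    degraded is True when the primary backend did not answer, so the
    caller can show a notice or hide search features.
    """
    for position, backend in enumerate(backends):
        try:
            results = backend(query)
        except OSError:
            # ConnectionError and TimeoutError are OSError subclasses.
            continue
        return results, position > 0
    # Nothing answered: run without search rather than failing the page.
    return [], True
```

With a single backend this gives you "operate without search"; adding a second callable gives you the backup-server option.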
