Has anyone used Lucene.NET instead of the full-text search that comes with SQL Server?
If so, I would be interested in how you implemented it.
Did you, for example, write a Windows service that queried the database every hour and then saved the results to the Lucene.NET index?
5 solutions
#1
57
Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.
I've used Lucene.NET as a general database indexer, so what I got back was basically DB IDs (of indexed email messages), and I've also used it to get back enough info to populate search results and the like without touching the database. It's worked great in both cases, though the SQL can get a little slow, as you pretty much have to get an ID, select by that ID, and so on. We got around this by making a temp table (with just the ID column in it), bulk-inserting into it from a file (which was the output from Lucene), and then joining to the message table. That was a lot quicker.
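For anyone who wants to see the temp-table trick spelled out, here is a minimal sketch. It uses Python with an in-memory SQLite database rather than SQL Server, and the `messages` and `search_hits` tables and the `fetch_messages` helper are made-up names for illustration. `executemany` stands in for the bulk insert from a file, so treat it as the shape of the idea, not what the original setup ran.

```python
import sqlite3

def fetch_messages(conn, hit_ids):
    """Load the rows for a batch of search hits with one join instead of one SELECT per ID."""
    cur = conn.cursor()
    # Temp table with only the ID column; it is private to this connection.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS search_hits (id INTEGER PRIMARY KEY)")
    cur.execute("DELETE FROM search_hits")
    # Stand-in for the bulk insert from the search engine's output file.
    cur.executemany(
        "INSERT OR IGNORE INTO search_hits (id) VALUES (?)",
        [(i,) for i in hit_ids],
    )
    # One set-based join replaces the per-ID round trips.
    cur.execute(
        "SELECT m.id, m.subject FROM messages AS m "
        "JOIN search_hits AS h ON h.id = m.id ORDER BY m.id"
    )
    return cur.fetchall()

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [(1, "hello"), (2, "invoice"), (3, "re: invoice")],
)
rows = fetch_messages(conn, [2, 3])
print(rows)
```

The win comes from letting the database do a single join over the whole hit list, which matters more as the number of hits per search grows.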
Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.
BTW, we were dealing with 20-50 million emails (and around 1 million unique attachments), totaling about 20 GB of Lucene index, I think, and 250+ GB of SQL database plus attachments.
Performance was fantastic, to say the least. Just make sure you think about, and tweak, your merge factors (which control when it merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1 million items each and you have a watcher thread which kills the process if it takes too long (yes, that kicked our arse for a while). So keep the max number of documents per thingy LOW (i.e., don't set it to maxint like we did!).
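To make the merge warning a bit more concrete, here is a toy simulation in Python. It is a deliberately simplified model and not Lucene's actual merge policy. It assumes every flush writes one equal-sized segment, that `merge_factor` equal-sized segments get merged into one, and that the "max number of documents" above means a cap like Lucene's maxMergeDocs setting, beyond which a segment is never merged again. The `simulate` function and its parameters are invented for this sketch.

```python
def simulate(total_flushes, flush_size, merge_factor, max_merge_docs):
    """Return (final segment sizes, size of the largest merge performed)."""
    segments = []        # newest segment is last
    largest_merge = 0
    for _ in range(total_flushes):
        segments.append(flush_size)
        # Cascade merges while the newest merge_factor segments are
        # equal-sized and still small enough to be eligible.
        while len(segments) >= merge_factor:
            tail = segments[-merge_factor:]
            if len(set(tail)) != 1 or tail[0] > max_merge_docs:
                break
            merged = sum(tail)
            del segments[-merge_factor:]
            segments.append(merged)
            largest_merge = max(largest_merge, merged)
    return segments, largest_merge

# 1,000 flushes of 1,000 documents each, merge factor 10.
uncapped = simulate(1000, 1000, 10, 10**9)   # cap effectively "maxint"
capped = simulate(1000, 1000, 10, 50_000)    # low cap
print(uncapped)
print(capped)
```

In this model the uncapped run ends with a single merge that rewrites all one million documents at once, while the capped run stops at ten segments of 100,000 documents. Under these assumptions, that one giant merge is the kind of long-running operation a watchdog thread would kill.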
EDIT: Corey Trager documented how to use Lucene.NET in BugTracker.NET here.
#2
2
I have not done it against a database yet, and your question is kind of open.
If you want to search a DB and can choose to use Lucene, I'd guess that you can also control when data is inserted into the database. If so, there is little reason to poll the DB to find out whether you need to reindex. Just index as you insert, or create a queue table which can be used to tell Lucene what to index.
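A rough sketch of the queue-table idea, in Python with SQLite so it runs on its own. The `documents` and `index_queue` tables and the `save_document` and `drain_queue` helpers are hypothetical names, and the `index_one` callback is a placeholder for whatever actually adds a document to the Lucene index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE index_queue (doc_id INTEGER PRIMARY KEY);
""")

def save_document(conn, body):
    """Insert a row and queue its ID for indexing in the same transaction."""
    with conn:
        cur = conn.execute("INSERT INTO documents (body) VALUES (?)", (body,))
        conn.execute(
            "INSERT OR IGNORE INTO index_queue (doc_id) VALUES (?)",
            (cur.lastrowid,),
        )
    return cur.lastrowid

def drain_queue(conn, index_one):
    """Index every queued document, then remove exactly the entries that were processed."""
    rows = conn.execute(
        "SELECT d.id, d.body FROM index_queue AS q "
        "JOIN documents AS d ON d.id = q.doc_id ORDER BY d.id"
    ).fetchall()
    for doc_id, body in rows:
        index_one(doc_id, body)          # placeholder for the real indexing call
    with conn:
        conn.executemany(
            "DELETE FROM index_queue WHERE doc_id = ?",
            [(doc_id,) for doc_id, _ in rows],
        )
    return len(rows)

# A dict stands in for the search index in this demo.
indexed = {}
save_document(conn, "first message")
save_document(conn, "second message")
processed = drain_queue(conn, indexed.__setitem__)
print(processed, indexed)
```

Because the queue row is written in the same transaction as the data, the indexer only ever touches rows that changed, and a second drain with nothing queued does no work.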
I don't think we need another indexer that is ignorant of what it is doing, reindexes everything every time, or uses resources wastefully.
#3
2
I have also used Lucene.NET as a storage engine, because it's easier to distribute an index and set up alternate machines with it than with a database. It's just a filesystem copy: you can index on one machine and copy the new files to the other machines to distribute the index. All the searches and details are served from the Lucene index, and the database is used only for editing. This setup has proven to be a very scalable solution for our needs.
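As an illustration of the "it's just a filesystem copy" point, here is one way the copy step might look in Python. This is only a sketch under assumptions: the `publish_index` helper, the `.incoming` and `.old` suffixes, and the placeholder file name are all invented here, and it assumes no reader is holding the old files open while the directory is swapped (readers would need to reopen the index afterwards).

```python
import os
import shutil
import tempfile

def publish_index(source_dir, target_dir):
    """Copy a finished index directory next to the live one, then swap it into place."""
    staging = target_dir + ".incoming"
    retired = target_dir + ".old"
    # Clear leftovers from an earlier interrupted run.
    for leftover in (staging, retired):
        shutil.rmtree(leftover, ignore_errors=True)
    # Copy first, so the live directory is only replaced by a complete copy.
    shutil.copytree(source_dir, staging)
    if os.path.isdir(target_dir):
        os.rename(target_dir, retired)
    os.rename(staging, target_dir)
    shutil.rmtree(retired, ignore_errors=True)

# Demo with a throwaway directory and a placeholder file standing in for index files.
root = tempfile.mkdtemp()
build = os.path.join(root, "build")
replica = os.path.join(root, "replica")
os.makedirs(build)
with open(os.path.join(build, "segments_1"), "w") as f:
    f.write("placeholder index data")

publish_index(build, replica)
published = sorted(os.listdir(replica))
print(published)
```

Copying into a staging directory and renaming at the end keeps a half-copied index from ever sitting at the live path.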
Regarding the differences between SQL Server and Lucene, the principal problem with SQL Server 2005 full-text search is that the service is decoupled from the relational engine, so joins, ordering, aggregates, and filters between the full-text results and the relational columns are very expensive in performance terms. Microsoft claims that these issues have been addressed in SQL Server 2008 by integrating full-text search into the relational engine, but I haven't tested it. They also made the whole full-text search much more transparent: in previous versions the stemmers, stopwords, and several other parts of the indexing were like a black box and difficult to understand, and in the new version it is easier to see how they work.
In my experience, if SQL Server meets your requirements, it will be the easiest way. If you expect a lot of growth or complex queries, or need a lot of control over the full-text search, you might consider working with Lucene from the start, because it will be easier to scale and customise.
#4
1
I used Lucene.NET along with MySQL. My approach was to store the primary key of the DB record in the Lucene document along with the indexed text. In pseudocode it looks like this:
- Store record:
    insert text and other data into the table
    get the latest inserted ID
    create a Lucene document
    put (ID, text) into the Lucene document
    update the Lucene index
- Querying:
    search the Lucene index
    for each Lucene doc in the result set
        load data from the DB by the stored record's ID
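The steps above can be sketched as runnable Python. To keep it self-contained, SQLite stands in for MySQL and a toy in-memory inverted index (`TinyIndex`) stands in for Lucene. Both are substitutions for illustration, and every name here (`TinyIndex`, `store_record`, `query`, the `articles` table) is made up and not part of any Lucene.NET API.

```python
import re
import sqlite3
from collections import defaultdict

class TinyIndex:
    """Toy inverted index standing in for Lucene: term -> set of record IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, record_id, text):
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(record_id)

    def search(self, query_text):
        terms = re.findall(r"\w+", query_text.lower())
        if not terms:
            return []
        # A record matches only if it contains every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

def store_record(conn, index, text):
    """Insert the row, read back its ID, and put (ID, text) into the index."""
    with conn:
        cur = conn.execute("INSERT INTO articles (body) VALUES (?)", (text,))
    index.add(cur.lastrowid, text)
    return cur.lastrowid

def query(conn, index, text):
    """Search the index, then load each hit from the DB by its stored ID."""
    results = []
    for record_id in index.search(text):
        row = conn.execute(
            "SELECT id, body FROM articles WHERE id = ?", (record_id,)
        ).fetchone()
        if row is not None:
            results.append(row)
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
index = TinyIndex()
store_record(conn, index, "Lucene merges index segments")
store_record(conn, index, "Sphinx is another search engine")
store_record(conn, index, "A search index stores record IDs")
found = query(conn, index, "search index")
print(found)
```

The index only needs the ID and the searchable text. Everything else stays in the database and is fetched by primary key after the search.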
Just to note, I later switched from Lucene to Sphinx because of its superb performance.
#5
1
I think this article is a good starting point:
http://www.aspfree.com/c/a/braindump/working-with-lucene-net/