Best practices for combining Lucene.NET and a relational database?

Time: 2021-01-09 03:08:31

I'm working on a project where I will have a LOT of data. It will be searchable through several forms that are very efficiently expressed as SQL queries, but it also needs to be searchable via natural language processing.

My plan is to build an index using Lucene for this form of search.

My question is this: if I do this and perform a search, Lucene will return the IDs of matching documents in the index, and I then have to look up these entities in the relational database.

This could be done in two ways (that I can think of so far):

  • N separate queries, one per ID (horrible)

  • Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). The downsides are that the list is limited by the maximum parameter size, and that a UDF splitting the string into a temporary table performs slowly.
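
A possible middle ground between those two options is to send the IDs in a small number of batched, parameterized queries. The sketch below is only an illustration under assumptions: it uses Python's built-in SQLite module rather than the asker's database, and the `documents` table, the `fetch_by_ids` helper, and the chunk size are all hypothetical names and values, not anything from the question.

```python
import sqlite3


def fetch_by_ids(conn, ids, chunk_size=500):
    """Fetch rows for the given IDs in batches, preserving the order of `ids`."""
    rows_by_id = {}
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One "?" placeholder per ID keeps the query parameterized.
        placeholders = ",".join("?" * len(chunk))
        query = f"SELECT id, title FROM documents WHERE id IN ({placeholders})"
        for row_id, title in conn.execute(query, chunk):
            rows_by_id[row_id] = title
    # Re-order to match the ranking returned by the search index;
    # IDs that no longer exist in the database are skipped.
    return [(i, rows_by_id[i]) for i in ids if i in rows_by_id]


# Hypothetical demo data standing in for the real backing store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(i, f"doc {i}") for i in range(1, 11)],
)

hits = [7, 2, 9]  # pretend these IDs came back from the index, best match first
print(fetch_by_ids(conn, hits, chunk_size=2))
# → [(7, 'doc 7'), (2, 'doc 2'), (9, 'doc 9')]
```

Chunking keeps each statement under the database's parameter limit, so the number of round trips grows with the number of chunks rather than with the number of IDs.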

I'm almost tempted to mirror everything into Lucene's index, so that I can periodically regenerate the index from the backing store but only need to access the index from the frontend.

Advice?

4 answers

#1


I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.
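
To make the idea concrete, here is a toy sketch in Python. It is not Lucene.NET's API; `TinyIndex` and its methods are hypothetical stand-ins that only show the shape of the approach: the display fields are stored next to the searchable terms, so a search returns everything a results page needs without touching the database.

```python
from collections import defaultdict


class TinyIndex:
    """Toy stand-in for a search index with stored fields (not a real Lucene API)."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> set of document IDs
        self._stored = {}                  # document ID -> display fields

    def add(self, doc_id, text, stored_fields):
        """Index `text` for searching and keep `stored_fields` for display."""
        for term in text.lower().split():
            self._postings[term].add(doc_id)
        self._stored[doc_id] = stored_fields

    def search(self, query):
        """Return stored fields for documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        matching = set.intersection(
            *(self._postings.get(term, set()) for term in terms)
        )
        return [{"id": doc_id, **self._stored[doc_id]} for doc_id in sorted(matching)]


index = TinyIndex()
index.add(1, "Lucene full text search", {"title": "Intro to Lucene"})
index.add(2, "relational database search", {"title": "SQL basics"})

print(index.search("lucene search"))
# → [{'id': 1, 'title': 'Intro to Lucene'}]
```

Only the detail page for a single record would then need a database query by ID.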

#2


When I encountered this problem, I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built-in full-text support, including stemming and a thesaurus). This way the database can be queried using both regular SQL and full-text commands. The downside is that you need a DB with full-text search capabilities, and those capabilities might be inferior to what Lucene can do.

#3


I guess the answer depends on what you are going to do with the results. If you are going to display the results in a grid and let the user choose the exact document they want to access, then you may want to add enough text to the index to help the user identify the document, such as a blurb of, say, 200 characters. Once the user selects a document, hit the DB to retrieve the whole thing.

This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.
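
As a rough illustration of the caching idea, the sketch below wraps a database read in Python's standard `functools.lru_cache`. The `documents` table and the `load_document` function are hypothetical, SQLite stands in for the real database, and a production cache would also need some invalidation policy, which is not shown here.

```python
import sqlite3
from functools import lru_cache

# Hypothetical demo database standing in for the real backing store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'full text of document 1')")


@lru_cache(maxsize=1024)
def load_document(doc_id):
    """Load a full document by ID; repeated calls for the same ID skip the DB."""
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    # Note: a missing document is cached as None as well.
    return row[0] if row else None


load_document(1)  # first call reads from the database
load_document(1)  # second call is served from the cache

info = load_document.cache_info()
print(info.hits, info.misses)
# → 1 1
```

With a least-recently-used policy, the most frequently requested documents tend to stay in memory, which matches the goal of sparing them the full cost of a DB access.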

#4


This may not be an option, depending on how much data is in your database, but what I have done is store the DB IDs in the search index along with the properties I wanted indexed. Then, in my service classes, I cache all the data needed to display search results for all the objects (e.g., name, DB ID, image URLs, description blurbs, social media info). The service class returns a Dictionary that can look up objects by DB ID, and I use the IDs returned by Lucene.NET to pull data from the in-memory cache.

You could also forgo the in-memory cache and store all the properties needed to display a search result in the search index itself. I didn't do this because the in-memory cache is also used in scenarios other than search.

The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).
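
A minimal sketch of that arrangement, written in Python rather than the answer's .NET code, might look like the following. The `SearchResultCache` class, the `load_all` loader, and the three-hour default are assumptions for illustration only; the answer does not say how its cache is refreshed.

```python
import time


class SearchResultCache:
    """In-memory lookup of display data by DB ID, reloaded once it is too old."""

    def __init__(self, loader, max_age_seconds=3 * 60 * 60, clock=time.monotonic):
        self._loader = loader  # hypothetical callable returning {db_id: display_data}
        self._max_age = max_age_seconds
        self._clock = clock
        self._data = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._data = self._loader()
            self._loaded_at = now

    def lookup(self, ids):
        """Return display data in the order of `ids`, skipping unknown IDs."""
        self._refresh_if_stale()
        return [self._data[i] for i in ids if i in self._data]


def load_all():
    # Stand-in for one bulk query that loads display data for every object.
    return {1: {"name": "Alpha"}, 2: {"name": "Beta"}}


cache = SearchResultCache(load_all)
ids_from_index = [2, 1, 99]  # pretend these IDs came back from the search index
print(cache.lookup(ids_from_index))
# → [{'name': 'Beta'}, {'name': 'Alpha'}]
```

The database is only touched when the cache is reloaded, or separately when a single object's detail page needs more data than the cache holds.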

