Querying an RDBMS database with Lucene

Date: 2021-07-17 05:52:36

I've skimmed the docs for the Java version of Lucene, but I can't really see the top-level "this is how it works" info so far (I'm aware I need to RTFM, I just can't see the wood for the trees).

I understand Lucene uses search indexes to return results. As far as I know, it only returns "hits" from those indexes. If I haven't added an item of data when building the index then it won't be returned.

That's fine, so now I want to check the following assumption:

Q: Does that mean that any data I want displayed on a search page needs to be added to the Lucene index?

I.e. if I want to search for Products by things like sku, description, category name, etc., but I also want to display the Customer they belong to in search results, do I:

  1. Make sure the Lucene index contains the denormalised Customer's name.

  2. Use the hits returned by Lucene to somehow query the database for the actual product records and use a JOIN to get the Customer's name.

I assume it's option 1, since I'm assuming there's no way to "join" the results of a Lucene query to an RDBMS, but I wanted to ask if my assumptions about the general usage are correct.

3 answers

#1


1  

Usually the index would only contain the fields you want to search on, not necessarily the ones you want to display. Indexes should be optimized to be as small as possible, to keep search performance good.

To be able to display more data, add a field to your index that allows you to retrieve your full document/data, i.e. a unique key for your Product (the product id, for example).
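A minimal sketch of that idea, using a toy in-memory inverted index as a stand-in for Lucene. This is not Lucene's API: `build_index`, `search`, and the sample `PRODUCTS` records are hypothetical names made up for illustration. Only the searchable fields are indexed, and each entry holds nothing but the product id, which is then used to fetch the full record for display.

```python
from collections import defaultdict

# Hypothetical product records, as they might sit in the database.
PRODUCTS = {
    101: {"sku": "AB-1", "description": "red garden chair", "customer": "Acme"},
    102: {"sku": "AB-2", "description": "blue garden table", "customer": "Globex"},
    103: {"sku": "CD-9", "description": "red office chair", "customer": "Acme"},
}


def build_index(products, searchable_fields):
    """Map each lower-cased token to the set of product ids containing it."""
    index = defaultdict(set)
    for product_id, record in products.items():
        for field in searchable_fields:
            for token in record[field].lower().split():
                index[token].add(product_id)
    return index


def search(index, term):
    """Return the ids of products matching a single term, sorted for stable output."""
    return sorted(index.get(term.lower(), set()))


# Only sku and description are searchable; the customer name is not indexed.
index = build_index(PRODUCTS, ["sku", "description"])
hit_ids = search(index, "chair")

# The index yields only keys; display data comes from the full records.
results = [(pid, PRODUCTS[pid]["customer"]) for pid in hit_ids]
```

The customer name never enters the index, so searching for it finds nothing, yet it can still be shown next to each hit because the key leads back to the full record.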

#2


1  

I have been trying to figure out the same problem, but I think that it's too much work. I'm thinking of this as an alternative. Please correct me if I'm wrong in my thinking!

Your situation is like this: RDBMS product (many) <------> (many) Customer

Instead of putting only the customer in the Lucene index to get product keys and then querying the RDBMS with an IN query, I'd suggest creating the Lucene index from the cartesian product of Product and Customer.

Like (customer_1, product_1), (customer_1, product_2), (customer_2, product_2), ...

This way, when you search for a product in Lucene, it gives you both the customer id and the product id. Instead of joining them in the RDBMS, you can simply look up those customers and products in the RDBMS for more information, if needed. If you are using caching, the cost of looking up the additional details will also go down.
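As a rough sketch of this approach, the snippet below builds one index entry per (customer, product) pair from an in-memory SQLite database. The schema, the `customer_product` link table, and the `search_pairs` helper are all assumptions made for illustration, and the plain list of entries stands in for a real Lucene index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE customer_product (customer_id INTEGER, product_id INTEGER);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO product VALUES (1, 'garden chair'), (2, 'office desk');
    INSERT INTO customer_product VALUES (1, 1), (1, 2), (2, 2);
""")

# One index entry per (customer, product) pair: searchable text plus both keys.
entries = [
    {"customer_id": c_id, "product_id": p_id, "text": description.lower()}
    for c_id, p_id, description in conn.execute(
        """SELECT cp.customer_id, cp.product_id, p.description
           FROM customer_product cp
           JOIN product p ON p.id = cp.product_id
           ORDER BY cp.customer_id, cp.product_id"""
    )
]


def search_pairs(entries, term):
    """Return (customer_id, product_id) for every entry whose text contains the term."""
    term = term.lower()
    return [
        (e["customer_id"], e["product_id"])
        for e in entries
        if term in e["text"].split()
    ]


pairs = search_pairs(entries, "desk")
```

Each hit carries both keys, so customers and products can be looked up (or served from a cache) independently. The trade-off is that the index grows with the number of pairs rather than the number of products.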

#3


0  

Based on BrokenGlass's answer, I've thought some more and am proposing the following to see if I'm on the right lines:

Basically, taking option 2 further, one could do the following:

  1. Put only the data you want to search on into the Lucene index, plus some sort of key value (e.g. the PK of a table in your database).

  2. Query Lucene to get a list of hits.

  3. Using your data access layer of choice, build a query for your database that includes an IN (value [, value]) predicate.

  4. Get the results for that query from your database (which may well include JOINs to other tables).

  5. Put those results in a dictionary, using the PK of the resultset as the key.

  6. Iterate the Lucene hits again in order, pulling the items from the dictionary using the PK so you can build a list of results in the order that Lucene returned the hits (i.e. sorted by relevance).

  7. Display that "sorted" list of results to the user.
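The database side of these steps can be sketched as follows. This is only an illustration under stated assumptions: the Lucene query is replaced by a hard-coded list of primary keys in relevance order, the schema is made up, and `fetch_in_hit_order` is a hypothetical helper name. SQLite is used so the example runs without any setup.

```python
import sqlite3


def fetch_in_hit_order(conn, hit_ids):
    """Load products (joined to their customer) and return them in hit order."""
    if not hit_ids:
        return []
    # Build the IN (?, ?, ...) predicate with one placeholder per hit.
    placeholders = ", ".join("?" for _ in hit_ids)
    rows = conn.execute(
        f"""SELECT p.id, p.sku, c.name
            FROM product p
            JOIN customer c ON c.id = p.customer_id
            WHERE p.id IN ({placeholders})""",
        hit_ids,
    ).fetchall()
    # Dictionary keyed by PK, so each hit can be resolved in constant time.
    by_pk = {row[0]: row for row in rows}
    # Walk the hits in relevance order; skip keys the database no longer has.
    return [by_pk[pk] for pk in hit_ids if pk in by_pk]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (id INTEGER PRIMARY KEY, sku TEXT, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO product VALUES (1, 'SKU-1', 1), (2, 'SKU-2', 2), (3, 'SKU-3', 2);
""")

# Stand-in for the keys read from the search hits, best match first.
# Key 99 simulates an index entry whose row has since been deleted.
hit_ids = [3, 1, 99]
results = fetch_in_hit_order(conn, hit_ids)
```

The database returns rows for an IN predicate in no guaranteed order, which is why the dictionary lookup is needed to restore the ranking. Skipping missing keys keeps the page working when the index is slightly out of date.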

Of course steps 5 and 6 could be better, but for the sake of explanation I put that verbose method in my description. If the Lucene hits include some sort of "relevance" value, then you could attribute that to the resultset and perform a standard sort, but that's an exercise for the reader. :)
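The relevance-based alternative might look like the sketch below. The `(pk, score)` pairs and the row dictionaries are invented sample data standing in for search hits and database rows; the only point is attaching each score to its row and then sorting.

```python
# Hypothetical (pk, score) pairs, as a search engine might report them.
hits = [(7, 0.42), (3, 0.91), (5, 0.67)]

# Rows as they might come back from the database, in arbitrary order.
rows = [
    {"id": 3, "sku": "SKU-3"},
    {"id": 5, "sku": "SKU-5"},
    {"id": 7, "sku": "SKU-7"},
]

# Attach each hit's score to the matching row, then sort best match first.
score_by_pk = dict(hits)
for row in rows:
    row["score"] = score_by_pk[row["id"]]

ranked = sorted(rows, key=lambda row: row["score"], reverse=True)
```

This gives the same ordering as walking the hits in order, but keeps the score on each row in case it needs to be displayed or combined with another sort key.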

Could this be it?
