Finding exact matches with the Lucene search API

I'm working on a company search API using Lucene. My Lucene company index contains two companies: (1) Abigail Adams National Bancorp, Inc. and (2) National Bancorp.

If the user types in National Bancorp, then only company #2 (i.e. National Bancorp) should be returned and not #1. In other words, only exact matches should be returned. How do I achieve this?

Thanks for reading.

4 solutions

#1

You can use KeywordAnalyzer to index and search on this field. KeywordAnalyzer emits the entire string as a single token.
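
To make the single-token idea concrete, here is a minimal sketch in plain Python. It is not Lucene code: `keyword_tokens`, `whitespace_tokens`, `build_index` and `search` are hypothetical helpers that only imitate the behaviour, assuming a simple inverted index with OR semantics.

```python
# Plain-Python illustration only; none of these helpers are Lucene APIs.

def keyword_tokens(text):
    # The whole field value becomes one token (the idea behind KeywordAnalyzer).
    return [text]

def whitespace_tokens(text):
    # One token per whitespace-separated word.
    return text.split()

def build_index(docs, tokenize):
    # Map each token to the set of document ids that contain it.
    index = {}
    for doc_id, name in enumerate(docs, start=1):
        for token in tokenize(name):
            index.setdefault(token, set()).add(doc_id)
    return index

def search(index, query, tokenize):
    # OR semantics: a document matches if it contains any query token.
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return sorted(hits)

companies = ["Abigail Adams National Bancorp, Inc.", "National Bancorp"]

keyword_index = build_index(companies, keyword_tokens)
word_index = build_index(companies, whitespace_tokens)

exact_hits = search(keyword_index, "National Bancorp", keyword_tokens)
loose_hits = search(word_index, "National Bancorp", whitespace_tokens)

print(exact_hits)  # only company 2
print(loose_hits)  # companies 1 and 2
```

Note that with a single token per field the comparison is literal, so differences in case or punctuation prevent a match unless you normalize the value yourself before indexing and searching.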

#2

You may want to reconsider your requirements, depending on whether I understood your question correctly. Please bear with me if I misunderstood you.

Just a little food for thought:

If you only want exact matches returned, then why are you searching in the first place?

Are you sure that the user expects exact matches? I typically search assuming that the search engine will accommodate missing words.

Suppose the user searched for National Bank but National Bank was no longer in your index. Would you still want Abigail Adams National Bancorp, Inc to be excluded from the results simply because it was not an exact match?

In light of this, I would suggest you continue to present all possible matches (exact or not) to the user and let them decide for themselves which is most appropriate for them. I say this simply because you may not be thinking the same way as all of your users. Lucene will take care of making sure the closest matches rank highest in the results, helping them make quicker choices.

#3

This may warrant the use of the shingle filter, which groups adjacent words together into multi-word tokens. For example, Abigail Adams National Bancorp run through a ShingleFilter with a maximum shingle size of 3 would produce (assuming a simple WhitespaceAnalyzer) [Abigail], [Abigail Adams], [Abigail Adams National], [Adams], [Adams National], [Adams National Bancorp], [National], [National Bancorp] and [Bancorp].
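
The token list above can be reproduced with a short sketch. This is plain Python that imitates the output rather than calling Lucene; the `shingles` function and its `max_size` parameter are hypothetical names, and it assumes whitespace tokenization with single words (unigrams) kept alongside the multi-word shingles.

```python
# Plain-Python illustration of shingling; not Lucene's ShingleFilter itself.

def shingles(text, max_size=3):
    # Split on whitespace, then emit every run of 1..max_size adjacent words.
    tokens = text.split()
    result = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result

indexed = shingles("Abigail Adams National Bancorp")
query = shingles("National Bancorp")

for token in indexed:
    print("[" + token + "]")

# The two-word shingle from the query is one of the indexed tokens.
print("National Bancorp" in indexed)
```

Because the field is four words long and the maximum size is 3, the full company name never appears as a single token; only runs of up to three words do.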

If a user then queries for National Bancorp, you will get an exact match on National Bancorp itself and a lower-scored match on Abigail Adams National Bancorp (lower-scored because that field contains many more tokens, which lowers its length normalization). I think it makes sense to return both documents for such a query.

You may want to apply the shingle filter at query time as well, depending on the use case.

#4

I searched a lot for the same problem without finding anything helpful. After scratching my head for a while, I found the solution: put the search string in double quotes, and that will solve your problem.

National Bancorp will return both #1 and #2 but "National Bancorp" will return only #2.
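
It is worth checking this against your own analyzer. In Lucene's classic query syntax, double quotes build a phrase query, which matches the quoted terms when they appear consecutively anywhere in the field, not only when they make up the whole field. The sketch below is plain Python, not Lucene, and assumes a hypothetical tokenizer that lowercases and strips punctuation; under that assumption the phrase also occurs inside company #1, so the outcome may depend on how the field was analyzed.

```python
# Plain-Python illustration of phrase matching versus whole-field matching.
import re

def tokens(text):
    # Assumed analysis: lowercase, keep only runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(field, phrase):
    # True if the phrase tokens appear consecutively anywhere in the field.
    doc, q = tokens(field), tokens(phrase)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

def whole_field_match(field, phrase):
    # True only if the field consists of exactly the phrase tokens.
    return tokens(field) == tokens(phrase)

companies = {1: "Abigail Adams National Bancorp, Inc.", 2: "National Bancorp"}

phrase_hits = [i for i, name in companies.items()
               if phrase_match(name, "National Bancorp")]
exact_hits = [i for i, name in companies.items()
              if whole_field_match(name, "National Bancorp")]

print(phrase_hits)
print(exact_hits)
```

If the phrase query alone turns out to be too permissive, combining it with a single-token field as described in answer #1 is one way to restrict results to whole-name matches.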
