Storing words with apostrophes in a Lucene index

Time: 2022-02-28 03:09:49

I have a company field in a Lucene index. One of the indexed company names is: Moody's

When a user types any of the following keywords, I want this company to come up in the search results: 1. Moo 2. Mood 3. Moodys 4. Moody's

How should I index this field in Lucene, and what type of Lucene query should I use to get this behaviour?

Thanks.

2 solutions

#1


Based on your clarifications, I want to divide your question into two, and answer each in turn:

  1. How do I index words with apostrophes as equivalent to similar words without an apostrophe? E.g. mapping Moodys and Moody's to the same index term.
  2. How do I implement auto-complete search in Lucene, i.e. given an index, find documents using word prefixes, e.g. mapping Moo to Moodys?

1 is relatively easy: use a StandardTokenizer to create a token combining the apostrophe and s with the previous word, then a StandardFilter to remove the apostrophe and s. This will convert Moody's to Moody. A StandardAnalyzer does this and much more (lowercasing and stop word removal), which may be more than you need. A stemmer should then map both Moodys and Moody to the same token; try SnowballFilter for this.
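
To make the normalization in point 1 concrete, here is a minimal plain-Java sketch of what the analysis chain is meant to achieve. It deliberately does not call Lucene: `PossessiveNormalizer` and `normalize` are hypothetical names, and the trailing-"s" rule is only a crude stand-in for a real stemmer such as the Snowball one. As far as I know, in newer Lucene releases the possessive stripping is done by ClassicFilter or EnglishPossessiveFilter rather than StandardFilter, so check the version you are running.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Lucene: imitates the normalization chain in plain Java. */
final class PossessiveNormalizer {

    /** Lowercases a single token and reduces "Moody's" and "Moodys" to the same term. */
    static String normalize(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        // Treat the typographic apostrophe (U+2019) like the ASCII one.
        t = t.replace('\u2019', '\'');
        if (t.endsWith("'s")) {
            // Drop the English possessive suffix: "moody's" -> "moody".
            t = t.substring(0, t.length() - 2);
        } else if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            // Assumption: crude plural stripping in place of a stemmer: "moodys" -> "moody".
            t = t.substring(0, t.length() - 1);
        }
        return t;
    }
}
```

Applying the same `normalize` call to both the indexed company name and the user's query term is what makes Moodys and Moody's meet at one index term.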

2 is harder: Lucene's PrefixQuery, to which Alan alluded, will only work when the company name is the first word in a field. You need something like the answer to this question about auto-complete in Lucene.
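
The idea behind prefix lookup on any word of a name can be sketched without Lucene. If each word of the company name is stored as its own term in a sorted dictionary, all terms sharing a prefix form one contiguous range, so a company whose name is not the first word can still be found. This is an illustrative sketch only: `PrefixIndexSketch`, `add` and `searchPrefix` are hypothetical names, and a tokenized Lucene field queried with PrefixQuery should behave similarly, though that depends on how the field is analyzed.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory index: one sorted term dictionary, every word is its own term. */
final class PrefixIndexSketch {
    // term -> ids of the documents containing it; sorted so a prefix maps to a key range
    private final TreeMap<String, Set<Integer>> terms = new TreeMap<>();

    /** Splits the name on whitespace, lowercases each word and strips a trailing 's or '. */
    void add(int docId, String companyName) {
        for (String raw : companyName.split("\\s+")) {
            String term = normalize(raw);
            if (!term.isEmpty()) {
                terms.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of all documents that contain a term starting with the prefix. */
    Set<Integer> searchPrefix(String prefix) {
        String p = normalize(prefix);
        Set<Integer> hits = new TreeSet<>();
        // Terms starting with p sort from p (inclusive) up to p + '\uffff' (exclusive).
        for (Set<Integer> docs : terms.subMap(p, p + Character.MAX_VALUE).values()) {
            hits.addAll(docs);
        }
        return hits;
    }

    private static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT).replaceAll("['\u2019]s?$", "");
    }
}
```

With "The Moody's Corporation" added as document 1, `searchPrefix("Moo")`, `searchPrefix("Mood")` and `searchPrefix("Moody's")` all return document 1, even though the company name is the second word.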

#2


StandardAnalyzer should work for 3 and 4, but it won't work for 1 and 2.

Without writing your own (complex) text analyzer, I would think about how you expect company names to be searched for. For example, basic Lucene search syntax means that you can find "Moody's" by searching with wildcards: "Moo*" and "Mood*". You might therefore consider appending a "*" to the search term before submitting it to Lucene. However, this may cause some confusion if the user isn't aware of the wildcard being added under the hood.
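
Appending the wildcard could look roughly like the sketch below. It only builds the query string and does not call Lucene; `WildcardQuerySketch` and `toPrefixQuery` are hypothetical names. It assumes the index stores lowercased terms with the possessive removed, and, because wildcard terms typically skip most of the analysis chain, it repeats that normalization by hand. Characters that are special in Lucene's classic query syntax are backslash-escaped so that user input cannot change the query structure.

```java
import java.util.Locale;

/** Hypothetical helper: turns raw user input into a prefix query string such as "moo*". */
final class WildcardQuerySketch {
    // Characters with a special meaning in Lucene's classic query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    static String toPrefixQuery(String userInput) {
        String term = userInput.trim().toLowerCase(Locale.ROOT);
        // Drop any wildcard the user already typed, so the result ends in exactly one "*".
        term = term.replaceAll("\\*+$", "");
        // Assumption: the index strips possessives, so do the same here ("moody's" -> "moody").
        term = term.replaceAll("['\u2019]s?$", "");
        if (term.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // escape so the character is matched literally
            }
            sb.append(c);
        }
        return sb.append('*').toString();
    }
}
```

One caveat with this design: if the index is also stemmed, the stored term may no longer start with what the user typed, so a prefix query can miss it. Keeping an unstemmed copy of the field for prefix search is one way around that.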
