Indexing multi-lingual content with Lucene.net

Time: 2021-06-18 03:10:17

I use Lucene.net to index content, documents, etc. on websites. The index is very simple and has this format:

LuceneId - unique id for Lucene (TypeId + ItemId)
TypeId   - the type of text (e.g. page content, product, public doc, etc.)
ItemId   - the web page id, document id, etc.
Text     - the text indexed
Title    - web page title, document name, etc. to display with the search results
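
The record layout above can be sketched in plain Python. This is only an illustration of the field set, not Lucene.net code: the class name `IndexedItem` and the "-" separator inside `lucene_id` are assumptions (the question only says LuceneId is built from TypeId + ItemId).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedItem:
    """Hypothetical stand-in for one document in the index."""
    type_id: int   # kind of text: page content, product, public doc, ...
    item_id: int   # web page id, document id, ...
    text: str      # the text that gets indexed
    title: str     # shown with the search results

    @property
    def lucene_id(self) -> str:
        # Assumption: join with "-" so that (1, 23) and (12, 3)
        # cannot produce the same id.
        return f"{self.type_id}-{self.item_id}"


page = IndexedItem(type_id=1, item_id=23, text="Welcome to our site", title="Home")
print(page.lucene_id)  # 1-23
```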

I've got these options to adapt it to serve multi-lingual content:

  1. Create a separate index for each language, e.g. Lucene-enGB, Lucene-frFR, etc.

  2. Keep the one index and add an additional 'language' field to it to filter the results.
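
To make option 2 concrete, here is a toy in-memory index in Python with a language value stored per document and applied as a filter at search time. `TinyIndex` and its methods are hypothetical and only stand in for a real Lucene index; the language codes and sample texts are made up.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index; not a Lucene API."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.language = {}                # document id -> language code

    def add(self, doc_id, text, language):
        self.language[doc_id] = language
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, language):
        # Match on the term, then keep only hits in the requested language.
        hits = self.postings.get(term.lower(), set())
        return sorted(d for d in hits if self.language[d] == language)


index = TinyIndex()
index.add("1-23", "Lunch menu", "en-GB")
index.add("1-24", "Menu du jour", "fr-FR")

print(index.search("menu", "en-GB"))  # ['1-23']
print(index.search("menu", "fr-FR"))  # ['1-24']
```

Option 1 would instead keep one such index object per language and pick the index before searching, so no filter is needed.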

Which is the best option - or is there another? I've not used multiple indexes before so I'm leaning toward the second.

2 solutions

#1


I do [2], but one problem is that I cannot use a different analyzer for each language. I've combined the stopwords of the languages I need, but I lose the more advanced features that a language-specific analyzer offers, such as stemming.
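
The trade-off described here can be sketched in a few lines of Python. The stopword sets and the `analyze` function are hypothetical and far smaller than real lists; the point is only that one merged stopword filter works across languages while nothing conflates word forms.

```python
# Tiny illustrative stopword lists (assumed, not taken from any library).
EN_STOP = {"the", "and", "of"}
FR_STOP = {"le", "la", "et", "de"}
COMBINED_STOP = EN_STOP | FR_STOP


def analyze(text):
    """Lowercase, split on whitespace, drop stopwords. No stemming."""
    return [t for t in text.lower().split() if t not in COMBINED_STOP]


english = analyze("the price of the products")
french = analyze("le prix de la page")

print(english)               # ['price', 'products']
print(french)                # ['prix', 'page']
print("product" in english)  # False: without stemming, "product" misses "products"
```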

#2


You can eliminate options 1 and 2.
Instead, use one index, and for each field that may contain Arabic words create two fields. For example, if the field "Text" might contain Arabic or English content:

  • Create two fields for "Text": one field, "Text", indexed and searched with your standard analyzer, and another one, "Text_AR", indexed and searched with the ArabicAnalyzer. To achieve that you can use PerFieldAnalyzerWrapper.
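
The idea behind a per-field analyzer wrapper can be sketched without Lucene. The Python below is a hypothetical model, not the Lucene.net API: `PerFieldAnalyzer`, `standard_analyzer` and `arabic_analyzer` are invented names, and the "Arabic" analyzer only strips the definite article, which is far less than a real one does.

```python
import re

AL = "\u0627\u0644"  # Arabic definite article "al-"


def standard_analyzer(text):
    # Stand-in for a general-purpose analyzer: lowercase word tokens.
    return [t.lower() for t in re.findall(r"\w+", text)]


def arabic_analyzer(text):
    # Stand-in for an Arabic analyzer: same tokens, minus a leading "al-".
    return [t[len(AL):] if t.startswith(AL) and len(t) > len(AL) else t
            for t in standard_analyzer(text)]


class PerFieldAnalyzer:
    """Routes each field name to its own analyzer, with a default fallback."""

    def __init__(self, default, per_field):
        self.default = default
        self.per_field = dict(per_field)

    def analyze(self, field, text):
        return self.per_field.get(field, self.default)(text)


analyzer = PerFieldAnalyzer(standard_analyzer, {"Text_AR": arabic_analyzer})

# "Lucene" followed by the Arabic word for "the book" (al-kitab).
content = "Lucene \u0627\u0644\u0643\u062a\u0627\u0628"

# The same content is stored twice, once per field, each analyzed differently.
doc = {field: analyzer.analyze(field, content) for field in ("Text", "Text_AR")}
```

With this arrangement `doc["Text"]` keeps the Arabic token unchanged, while `doc["Text_AR"]` holds it with the article removed. A query would need to go through the same wrapper so each field's search terms are analyzed the same way as its indexed terms.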
