How do I sort by a Lucene.Net field while ignoring common stop words such as 'a' and 'the'?

Time: 2022-05-16 03:06:55

I've found how to sort query results by a given field in a Lucene.Net index instead of by score; all it takes is a field that is indexed but not tokenized. However, what I haven't been able to figure out is how to sort that field while ignoring stop words such as "a" and "the", so that the following book titles, for example, would sort in ascending order like so:

  1. The Cat in the Hat
  2. Horton Hears a Who

Is such a thing possible, and if yes, how?

I'm using Lucene.Net 2.3.1.2.

5 solutions

#1


1  

I wrap the results returned by Lucene into my own collection of custom objects. Then I can populate it with extra info/context information (and use things like the highlighter class to pull out a snippet of the matches), plus add paging. If you took a similar route you could create a "result" class/object, add something like a SortBy property and grab whatever field you wanted to sort by, strip out any stop words, then save it in this property. Now just sort the collection based on that property instead.
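
To make the idea above concrete, here is a minimal sketch of a result wrapper that carries a precomputed sort key. It is written in Python purely as a library-neutral illustration and is not Lucene.Net code; the `SearchResult` class, the `sort_by` attribute, the `strip_stop_words` helper, and the three-word stop list are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed stop-word list; in practice it would mirror whatever list you use elsewhere.
STOP_WORDS = {"a", "an", "the"}


def strip_stop_words(title: str) -> str:
    """Lowercase the title and drop stop words, keeping the remaining words in order."""
    words = [w for w in title.lower().split() if w not in STOP_WORDS]
    return " ".join(words)


@dataclass
class SearchResult:
    """Hypothetical wrapper around one hit returned by the search."""
    title: str
    sort_by: str = field(init=False)

    def __post_init__(self) -> None:
        # The sort key is derived once, when the hit is wrapped.
        self.sort_by = strip_stop_words(self.title)


def sort_results(titles):
    """Wrap raw titles and return the wrappers ordered by their sort key."""
    results = [SearchResult(t) for t in titles]
    return sorted(results, key=lambda r: r.sort_by)


hits = ["Horton Hears a Who", "The Cat in the Hat"]
ordered = [r.title for r in sort_results(hits)]
```

One thing to keep in mind with this route: the sort happens in memory, so it only orders the hits you have actually loaded into the collection.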

#2


0  

When you create your index, create a field that only contains the words you wish to sort on, then when retrieving, sort on that field but display the full title.
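
As a rough sketch of this two-field approach, the snippet below models each document as a plain dictionary with a display field and a separate sort field. It is Python used as a stand-in, not Lucene.Net code; the field names `title` and `title_sort`, the helpers `make_sort_key` and `build_document`, and the stop list are assumptions for illustration. In an actual index, `title_sort` would be the field that is indexed but not tokenized, as described in the question.

```python
# Assumed stop-word list for the example.
STOP_WORDS = frozenset({"a", "an", "the"})


def make_sort_key(title):
    """Return the title lowercased with stop words removed."""
    return " ".join(w for w in title.lower().split() if w not in STOP_WORDS)


def build_document(title):
    """Return both values to store: the full title for display, the reduced key for sorting."""
    return {"title": title, "title_sort": make_sort_key(title)}


# Index time: every document gets both fields.
docs = [build_document(t) for t in ["Horton Hears a Who", "The Cat in the Hat"]]

# Retrieval time: order by the sort field, but show the full title.
docs.sort(key=lambda d: d["title_sort"])
displayed = [d["title"] for d in docs]
```

Because the key is computed once at index time, the search engine can do the ordering itself and no post-processing of the hits is needed.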

#3


0  

It's been a while since I used Lucene, but my guess would be to add an extra field for sorting and store the value in it with the stop words already stripped. You can probably use the same analyzers to generate this value.

#4


0  

There seems to be a catch-22 in that you must tokenize a field with an analyzer in order to strip punctuation and stop words, but you can't sort on tokenized fields. How then to strip the stop words without tokenizing?
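
One possible way around this catch-22 is to do the stripping in your own code before the value ever reaches the index, so the sort field can stay untokenized. The sketch below is Python, not Lucene.Net code; `normalize_for_sort`, the regular expression, and the stop list are assumptions for illustration, and the pattern only handles ASCII letters and digits.

```python
import re

# Assumed stop-word list for the example.
STOP_WORDS = frozenset({"a", "an", "the"})

# Runs of ASCII letters or digits; everything else (punctuation, spaces) acts as a separator.
_WORD = re.compile(r"[a-z0-9]+")


def normalize_for_sort(text):
    """Drop punctuation and stop words, then rejoin the words into ONE string.

    The words are split apart only temporarily; the result is a single value
    that can be stored in an untokenized field and sorted on.
    """
    words = _WORD.findall(text.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


key = normalize_for_sort("The Cat, in the Hat!")
```

Note that a title made up entirely of stop words would produce an empty key, so you may want to fall back to the original title in that case.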

#5


0  

For search, I found the "search lucene .net index with sort option" link interesting; it may help solve your problem.
