Fuzzy search in a Solr filter query

Time: 2021-07-08 01:30:01

I would be grateful if somebody could help me with my problem. I have this query:

select?q=city:Frankfurt am Main~&fq=street:Gerhart-Hauptmann-Str.~

This is not working for me. I want to use fuzzy search to catch some user input mistakes.

Here is what I want:

  • "Frankfurt am Main" should be searched as a whole in the field city, with fuzzy search.
  • "Gerhart-Hauptmann-Str." should be split into three terms, each with fuzzy search.

Debug output of what I actually get:

"debug": {
    "rawquerystring": "city:Frankfurt am Main~",
    "querystring": "city:Frankfurt am Main~",
    "parsedquery": "city:frankfurt text:am text:Main~2",
    "parsedquery_toString": "city:frankfurt text:am text:Main~2",
    "explain": {...},
    "QParser": "LuceneQParser",
    "filter_queries": [
      "street:Gerhart-Hauptmann-Str.~"
    ],
    "parsed_filter_queries": [
      "street:gerhart-hauptmann-str.~2"
    ],

I (think) I want this output:

 "debug": {
        "rawquerystring": "city:Frankfurt am Main~",
        "querystring": "city:Frankfurt am Main~",
        "parsedquery": "city:frankfurt~2 city:am~2 text:Main~2",
        "parsedquery_toString": "city:frankfurt~2 city:am~2 text:Main~2",
        "explain": {...},
        "QParser": "LuceneQParser",
        "filter_queries": [
          "street:Gerhart-Hauptmann-Str.~"
        ],
        "parsed_filter_queries": [
         # My analyser converts Str. to strasse
          "street:gerhart~2 street:hauptmann~2 strasse~2"
        ],

The definition of the fields in schema.xml:

<field name="city" type="admin_name" indexed="true" stored="true" />
<field name="street" type="street_name" indexed="true" stored="true" multiValued="false"/>

<fieldType name="admin_name" class="solr.TextField" >
       <analyzer>         
          <tokenizer class="solr.StandardTokenizerFactory"/>          
          <filter class="solr.LowerCaseFilterFactory" />
          <filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de_admin.txt"/>       
          <filter class="solr.ASCIIFoldingFilterFactory"/>
       </analyzer>   
    </fieldType>

    <fieldType name="street_name" class="solr.TextField" >
       <analyzer>         
          <tokenizer class="solr.StandardTokenizerFactory"/>          
          <filter class="solr.LowerCaseFilterFactory" />
          <!-- The StartEndSynonymFilter replaces synonyms which
               are at the start or the end of a term. The types
               START_SYNONYM or END_SYNONYM will be set. -->
          <filter class="my.StartEndSynonymFilterFactory" synonyms="lang/synonyms_de_street.txt"/>        
          <filter class="solr.ASCIIFoldingFilterFactory"/>
       </analyzer>   
    </fieldType>

Is this somehow possible?

If you need additional information to answer, please leave a hint in a comment.

1 answer

#1


  1. Tokenizing on Hyphens

Have a look at the WordDelimiterFilterFactory: https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

  2. Applying Fuzzy to every single term

DISCLAIMER: I have not yet used fuzzy search in my Solr setups.

You might have to be careful with tokenizing the city names and applying the fuzzy search to every single token. Your example "Frankfurt am Main" would in this case apply fuzzy search to "am" as well. Please try it with parentheses, (Frankfurt am Main)~, and check whether this gets you the intended result.
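The debug output in the question hints at why a client-side step may be needed: the field prefix and the tilde each bind only to the adjacent term (city:frankfurt text:am text:Main~2), and the fuzzy street term apparently is not run through the tokenizer (street:gerhart-hauptmann-str.~2). Below is a minimal Python sketch, assuming the client splits the input before sending the request. The helper names and the min_len cutoff are hypothetical and not part of Solr.

```python
import re

# Characters with special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')

def escape_term(term: str) -> str:
    """Backslash-escape Lucene query syntax characters in a single term."""
    return _SPECIAL.sub(r"\\\1", term)

def fuzzy_field_query(field: str, text: str, min_len: int = 3) -> str:
    """Build field:(t1~ t2~ ...) with one fuzzy clause per token."""
    # Split on whitespace and hyphens; drop punctuation at token edges.
    tokens = [t.strip(".,;") for t in re.split(r"[\s\-]+", text)]
    clauses = []
    for tok in tokens:
        if not tok:
            continue
        term = escape_term(tok)
        # Very short tokens such as "am" stay exact (hypothetical cutoff):
        # with up to two edits they would match almost any short word.
        clauses.append(term + "~" if len(tok) >= min_len else term)
    return f"{field}:({' '.join(clauses)})"

print(fuzzy_field_query("city", "Frankfurt am Main"))
# city:(Frankfurt~ am Main~)
print(fuzzy_field_query("street", "Gerhart-Hauptmann-Str."))
# street:(Gerhart~ Hauptmann~ Str~)
```

The grouped form field:(...) keeps every token in the intended field. Expanding abbreviations such as Str. to strasse would probably also have to happen on the client under this approach, since the token is sent as a fuzzy term; that is an assumption worth checking against the debug output.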

However, in the case of names (cities or streets), I'm not sure you should be tokenizing them at all. Maybe storing them as one case-insensitive token and applying the fuzzy search to the whole name is actually what you need. Note that in the standard Lucene query syntax a tilde after a quoted phrase, as in "Frankfurt am Main"~2, is read as a proximity operator rather than a fuzzy one, so for a single-token field the spaces would instead have to be escaped, as in Frankfurt\ am\ Main~.

Nevertheless, you should try to get it working the way you described, then look at the query results. In parallel, you could set up an index where you store the city and street names as single tokens (e.g. KeywordTokenizer with lowercasing and ASCII folding) and apply fuzzy search to them as single terms. I would guess that the results will be sharper, but it is best to try both and compare.
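For the single-token variant, the whole name has to reach the query parser as one term, which in the Lucene syntax means escaping the spaces. Here is a minimal Python sketch, assuming a hypothetical keyword-tokenized field called city_exact; the function name is also made up for illustration.

```python
import re

# Lucene query syntax characters plus whitespace, all of which are escaped
# so that the complete name is parsed as one term.
_SPECIAL_OR_SPACE = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')

def single_term_fuzzy(field: str, text: str, max_edits: int = 2) -> str:
    """Build a fuzzy query that treats the whole input as a single term."""
    term = _SPECIAL_OR_SPACE.sub(r"\\\1", text.strip())
    return f"{field}:{term}~{max_edits}"

print(single_term_fuzzy("city_exact", "Frankfurt am Main"))
# city_exact:Frankfurt\ am\ Main~2
```

Lucene's fuzzy queries allow at most two edits per term, so with this layout two typos are tolerated across the entire name rather than per word.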

In addition, I would suggest trying the DisMax query parser (extended or not) for the input, without even differentiating between cities and streets on the input side: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

With the DisMax parser processing the input, you can allow the user to enter search terms very freely (for example, a single search field where cities and streets can be entered in any order and format).
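A request along these lines could be assembled as follows. This is a minimal Python sketch, assuming the two fields from the question are searched together through the qf parameter; the function name is hypothetical.

```python
from urllib.parse import urlencode

def edismax_params(user_input: str, fields=("city", "street")) -> str:
    """URL-encode the parameters for a single-box eDisMax search."""
    return urlencode({
        "q": user_input,           # raw text from the single search field
        "defType": "edismax",      # select the Extended DisMax parser
        "qf": " ".join(fields),    # fields the input is matched against
    })

print("select?" + edismax_params("Gerhart-Hauptmann-Str. Frankfurt"))
# select?q=Gerhart-Hauptmann-Str.+Frankfurt&defType=edismax&qf=city+street
```

With qf set, the user input needs no field prefixes at all, which is what makes the free-form single search field possible.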
