ElasticSearch: searching for special characters with a pattern analyzer

Time: 2021-09-16 20:08:11

I'm currently using a custom analyzer whose tokenizer is the pattern (\W|_)+, so each term contains only letters and digits, and the text is split on everything else (underscores included). As an example, I have one document with the contents [dbo].[Material_Get] and another with dbo.Another_Material_Get. I want to be able to search for "Material_Get" and get a hit on both documents. But if I search for "[Material_Get]", it still hits dbo.Another_Material_Get even though that document has no brackets in it. Also, if I search for "Material Get" (as a quoted search), I shouldn't get any hits, since neither document contains that phrase.

I could settle for an analyzer/tokenizer that matches whenever the input string appears anywhere in the file, even if there are other characters next to it. For example, searching for "aterial_get" would match both documents. Is either of these cases possible?

1 solution

#1


From your explanation, I understand that you also want partial matches, such as searching for "aterial_get".

To satisfy all of your requirements, change the mapping of your field so that its analyzer uses an ngram token filter and does not strip the special characters. A sample analyzer could look like this:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "partialmatch":{
          "type":"custom",
          "tokenizer":"keyword",
          "filter":[ "lowercase", "ngram" ] 
        }
      },
      "filter":{
        "ngram":{
          "type":"ngram",
          "min_gram":2,
          "max_gram":15
        }
      }
    }
  }
}
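
To see why this analyzer handles the cases from the question, here is a rough Python simulation of what it indexes. It is only a sketch, not Elasticsearch itself: the helper names are made up for illustration, and Python's str.lower() stands in for the lowercase filter.

```python
def ngrams(text, min_gram=2, max_gram=15):
    """Return every substring of text with a length from min_gram to max_gram."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


def partialmatch_terms(value, min_gram=2, max_gram=15):
    """Hypothetical stand-in for the "partialmatch" analyzer above."""
    token = value            # keyword tokenizer: the whole value is one token
    token = token.lower()    # lowercase filter (approximated with str.lower)
    return ngrams(token, min_gram, max_gram)  # ngram filter


docs = ["[dbo].[Material_Get]", "dbo.Another_Material_Get"]
index = {doc: partialmatch_terms(doc) for doc in docs}

# A term query is an exact lookup against the indexed terms.
for query in ["aterial_get", "[material_get]", "material get"]:
    hits = [doc for doc, terms in index.items() if query in terms]
    print(repr(query), "->", hits)
```

Because the brackets and underscores are kept inside the grams, "[material_get]" is only found in the bracketed document, and "material get" (with a space) is found in neither.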

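Then set "partialmatch", defined above, as the analyzer for your_field in your mapping. You can change the values of min_gram and max_gram to suit your needs. Note that on Elasticsearch 7 and later, the difference between max_gram and min_gram is limited by the index.max_ngram_diff setting (default 1), so that setting has to be raised for the values used here.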
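
As a sketch of how the settings and the mapping might fit together, the snippet below builds a complete create-index request body as a Python dict and prints it as JSON. It assumes the typeless mapping format of Elasticsearch 7 or later and a field of type "text"; the max_ngram_diff entry is likewise an assumption for those versions. Adjust both for older releases.

```python
import json

MIN_GRAM = 2
MAX_GRAM = 15

create_index_body = {
    "settings": {
        "index": {
            # Assumption: Elasticsearch 7+, where the allowed gap between
            # max_gram and min_gram defaults to 1 and must be raised.
            "max_ngram_diff": MAX_GRAM - MIN_GRAM,
        },
        "analysis": {
            "analyzer": {
                "partialmatch": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "ngram"],
                }
            },
            "filter": {
                "ngram": {
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                }
            },
        },
    },
    "mappings": {
        "properties": {
            # "your_field" is the placeholder name used in the answer.
            "your_field": {"type": "text", "analyzer": "partialmatch"}
        }
    },
}

# The printed JSON is what would be sent when creating the index.
print(json.dumps(create_index_body, indent=2))
```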

With this mapping you can run an ordinary term query like the one below. A term query is not analyzed, so the search text should already be lowercase and no longer than max_gram:

{
    "term": {
        "your_field": "aterial_get"
    }
}
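
Since the term query skips analysis, it can help to normalize the user's input before sending it. The helper below is a hypothetical sketch (build_term_query is not part of any Elasticsearch client): it lowercases the text, rejects lengths that can never match any indexed gram, and wraps the clause in a full search body.

```python
MIN_GRAM, MAX_GRAM = 2, 15  # must mirror the ngram filter settings


def build_term_query(field, text):
    """Build a search body for a partial match on an ngram-analyzed field."""
    # The index only holds lowercased grams, so lowercase the input as well.
    value = text.lower()
    # No indexed gram is shorter than MIN_GRAM or longer than MAX_GRAM.
    if not MIN_GRAM <= len(value) <= MAX_GRAM:
        raise ValueError(
            f"search text must be {MIN_GRAM} to {MAX_GRAM} characters long"
        )
    return {"query": {"term": {field: value}}}


print(build_term_query("your_field", "Aterial_Get"))
```

A search for the full 24-character string dbo.Another_Material_Get would be rejected here; supporting it would mean raising max_gram at the cost of a larger index.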
