Multi-field feature and configuring a custom Analyzer in the Mapping

Date: 2021-09-24 05:22:22

Error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried

# Cause: the node lock is still held by a running Elasticsearch process
# Kill that process, then start Elasticsearch again
kill -9 `ps -ef | grep [e]lasticsearch | grep [j]ava | awk '{print $2}'`
elasticsearch

Multi-field feature

Exact matching on a vendor name
Add a keyword sub-field
Use different analyzers
Different languages
Searching on a pinyin field
Also supports specifying different analyzers for indexing and for search (see the mapping sketch after this list)
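A minimal mapping sketch of these points, assuming a hypothetical index my_products with made-up field names: company gets a keyword sub-field for exact matching, and comment uses one analyzer at index time and a different one at search time.

PUT my_products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "comment": {
        "type": "text",
        "analyzer": "english",
        "search_analyzer": "standard"
      }
    }
  }
}

Queries against company.keyword then match the full vendor name exactly, while company itself stays full-text searchable.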

Exact Values vs. Full Text

Exact Values: numbers, dates, and literal strings (e.g. "Apple Store")
keyword in Elasticsearch
Full Text: unstructured text data
text in Elasticsearch

Exact Values do not need to be analyzed

Elasticsearch builds an inverted index for every field
Exact values are indexed without any special analysis (see the sketch below)
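To see the difference, run the same string (made up here) through the keyword analyzer and the standard analyzer with the _analyze API:

POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}

The first request returns the whole string "Apple Store" as a single token; the second splits and lowercases it into "apple" and "store".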

Custom analysis

When Elasticsearch's built-in analyzers cannot meet your needs, you can define a custom analyzer by combining different components:
Character Filter
Tokenizer
Token Filter

Character Filter

Processes the text before the Tokenizer, e.g. by adding, removing, or replacing characters. Multiple Character Filters can be configured. They affect the position and offset information seen by the Tokenizer.
Some built-in Character Filters:
HTML strip - removes HTML tags
Mapping - string replacement
Pattern replace - regex-based replacement

Tokenizer

Splits the raw text into terms (tokens) according to certain rules
Built-in Tokenizers in Elasticsearch:
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy (a uax_url_email sketch follows below)
You can also write a Java plugin to implement your own Tokenizer
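For instance (the text here is made up), the uax_url_email tokenizer keeps URLs and e-mail addresses intact as single tokens, where the standard tokenizer would split them apart:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com or visit https://www.elastic.co"
}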

Token Filters

Adds, modifies, or removes the terms produced by the Tokenizer
Built-in Token Filters:
lowercase / stop / synonym (adds synonyms; a small sketch follows below)
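The synonym filter can be tried directly in _analyze by defining the filter inline (the synonym list below is made up):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "quick brown fox"
}

Both quick and fast are emitted at position 0, so a search for either term will match.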

Setting up a Custom Analyzer

Submit a request that strips HTML tags

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text":"<b>hello world</b>"
}

Response

{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

Use a char filter to replace hyphens

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type":"mapping",
      "mappings":["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

Response

{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}

Use a char filter to replace emoticons

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type":"mapping",
      "mappings":[":) => happy",":( => sad"]
    }
  ],
  "text": ["I am felling :)","Feeling :( today"]
}

Response

{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 106
    }
  ]
}

Regular expression replacement (pattern_replace)

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type":"pattern_replace",
      "pattern":"http://(.*)",
      "replacement":"$1"
    }
  ],
  "text": "http://www.elastic.co"
}

Response

  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

Split by path hierarchy

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/ymruan/a/b"
}

Response

{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    }
  ]
}

whitespace with stop

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

Response

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}

After adding lowercase, "The" is lowercased and therefore removed as a stopword

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

Response

{
  "tokens" : [
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}
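
To turn these pieces into an actual custom analyzer, declare it in the index's analysis settings and reference it from the mapping. This is only a sketch combining the char_filter, tokenizer, and filter examples above; the index name my_blog, analyzer name my_custom_analyzer, and field name content are made up:

PUT my_blog
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "emoticons"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

The analyzer can then be tested against the index with POST my_blog/_analyze, passing "analyzer": "my_custom_analyzer" and a sample text.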

Watch the 10-minute video for chapter 20, then continue these notes.