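Elasticsearch in Practice (4): IK Tokenization

Environment: Elasticsearch 6.2.4 + Kibana 6.2.4 + ik 6.2.4

Elasticsearch can tokenize Chinese text out of the box.

Let's first look at the result of the built-in Chinese tokenization:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "default","text": "今天天气真好"}'

Or, equivalently, in the Kibana console:

GET /_analyze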

{

  "analyzer": "default",

  "text": "今天天气真好"

}

Result:

{

  "tokens": [

    {

      "token": "今",

      "start_offset": 0,

      "end_offset": 1,

      "type": "<IDEOGRAPHIC>",

      "position": 0

    },

    {

      "token": "天",

      "start_offset": 1,

      "end_offset": 2,

      "type": "<IDEOGRAPHIC>",

      "position": 1

    },

    {

      "token": "天",

      "start_offset": 2,

      "end_offset": 3,

      "type": "<IDEOGRAPHIC>",

      "position": 2

    },

    {

      "token": "气",

      "start_offset": 3,

      "end_offset": 4,

      "type": "<IDEOGRAPHIC>",

      "position": 3

    },

    {

      "token": "真",

      "start_offset": 4,

      "end_offset": 5,

      "type": "<IDEOGRAPHIC>",

      "position": 4

    },

    {

      "token": "好",

      "start_offset": 5,

      "end_offset": 6,

      "type": "<IDEOGRAPHIC>",

      "position": 5

    }

  ]

}

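As we can see, the text is split into individual characters. In a real application this will rarely give the results you want. That said, for log search the built-in analyzer is sufficient.

analyzer=default actually invokes the standard analyzer.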
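The same request can also be issued from a script. Below is a minimal Python sketch using only the standard library. The helper names (`build_analyze_request`, `token_texts`, `analyze`) are hypothetical, and the host is simply the one from the curl example above. Actually sending the request requires a running node.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, host="http://localhost:9200"):
    """Build (but do not send) a request to the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        host + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Extract the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(analyzer, text, host="http://localhost:9200"):
    """Send the request and return the token strings (needs a live node)."""
    req = build_analyze_request(analyzer, text, host)
    with urllib.request.urlopen(req) as resp:
        return token_texts(json.load(resp))
```

With a node running, `analyze("standard", "今天天气真好")` should return the six single-character tokens shown in the result above.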

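Next, let's install the IK analysis plugin and use it for tokenization.

Installing IK

IK project: https://github.com/medcl/elasticsearch-analysis-ik

First of all, note that the IK plugin version must match the Elasticsearch version; otherwise the two are incompatible.

Installation method 1:

Download the archive from https://github.com/medcl/elasticsearch-analysis-ik/releases, create an analysis-ik subdirectory under the ES plugins directory, and copy the contents of the archive into it. The plugins/analysis-ik/ directory should end up containing: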

plugins/analysis-ik/

    commons-codec-1.9.jar

    commons-logging-1.2.jar

    elasticsearch-analysis-ik-6.2.4.jar

    httpclient-4.5.2.jar

    httpcore-4.4.4.jar

    plugin-descriptor.properties

Then restart Elasticsearch.

Installation method 2:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

If you have already downloaded the archive, install it directly:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-ik-6.2.4.zip

Then restart Elasticsearch.
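Because the plugin version has to match the Elasticsearch version, it can be worth checking the descriptor before restarting. The sketch below is only an illustration: it assumes that plugin-descriptor.properties contains an `elasticsearch.version` key, and the helper names `parse_properties` and `plugin_matches` are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def plugin_matches(descriptor_text, es_version):
    """Return True if the descriptor targets exactly es_version.

    Assumes the descriptor carries an elasticsearch.version key.
    """
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version
```

For example, read plugins/analysis-ik/plugin-descriptor.properties as text and call `plugin_matches(text, "6.2.4")`.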

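IK Tokenization

IK supports two tokenization modes:

ik_max_word: splits the text at the finest granularity, exhausting all possible combinations
ik_smart: splits the text at the coarsest granularity

Next, let's test how IK's tokenization differs from the built-in one: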

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "今天天气真好"}'

Result:

{

  "tokens": [

    {

      "token": "今天天气",

      "start_offset": 0,

      "end_offset": 4,

      "type": "CN_WORD",

      "position": 0

    },

    {

      "token": "真好",

      "start_offset": 4,

      "end_offset": 6,

      "type": "CN_WORD",

      "position": 1

    }

  ]

}

Now let's try ik_max_word (the same request with "analyzer" set to "ik_max_word"):

{

  "tokens": [

    {

      "token": "今天天气",

      "start_offset": 0,

      "end_offset": 4,

      "type": "CN_WORD",

      "position": 0

    },

    {

      "token": "今天",

      "start_offset": 0,

      "end_offset": 2,

      "type": "CN_WORD",

      "position": 1

    },

    {

      "token": "天天",

      "start_offset": 1,

      "end_offset": 3,

      "type": "CN_WORD",

      "position": 2

    },

    {

      "token": "天气",

      "start_offset": 2,

      "end_offset": 4,

      "type": "CN_WORD",

      "position": 3

    },

    {

      "token": "真好",

      "start_offset": 4,

      "end_offset": 6,

      "type": "CN_WORD",

      "position": 4

    }

  ]

}
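The ik_max_word output contains overlapping tokens, which is what "exhausting all possible combinations" means in practice. The toy sketch below illustrates that idea only; it is not IK's actual algorithm. The function name `all_dictionary_matches` and the five-word dictionary are made up for the illustration.

```python
def all_dictionary_matches(text, dictionary):
    """Return (token, start, end) for every dictionary word found in text.

    Matches are ordered by start offset, longer words first. For this
    sample that is the same order as the ik_max_word result above.
    """
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                matches.append((text[start:end], start, end))
    matches.sort(key=lambda m: (m[1], -(m[2] - m[1])))
    return matches


# Toy dictionary holding only the words seen in the output above.
toy_dictionary = {"今天天气", "今天", "天天", "天气", "真好"}
result = all_dictionary_matches("今天天气真好", toy_dictionary)
for token, start, end in result:
    print(token, start, end)
```

Every dictionary word present in the text is emitted, even when it overlaps another match. ik_smart, by contrast, keeps a single non-overlapping segmentation.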

Setting the default analyzer in the mapping

Example:

{

    "properties": {

        "content": {

            "type": "text",

            "analyzer": "ik_max_word",

            "search_analyzer": "ik_max_word"

        }

    }

}

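Note: search_analyzer is set to the same value as analyzer here so that searching and indexing use the same analyzer, which ensures that the terms in a query have the same format as the terms in the inverted index. If search_analyzer is not set, it defaults to the value of analyzer. For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html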
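The properties fragment above has to be wrapped in a full request body when an index is created. The following sketch builds such a body in Python. The helper name `build_index_body` is hypothetical, and the mapping type name `_doc` is an assumption (Elasticsearch 6.x mappings are nested under a type name; substitute your own if it differs).

```python
import json


def build_index_body(field, analyzer, search_analyzer=None, doc_type="_doc"):
    """Build a create-index body that sets the analyzer for one text field.

    doc_type is an assumed mapping type name for Elasticsearch 6.x.
    """
    field_mapping = {"type": "text", "analyzer": analyzer}
    if search_analyzer is not None:
        # Optional: when omitted, Elasticsearch falls back to `analyzer`.
        field_mapping["search_analyzer"] = search_analyzer
    return {"mappings": {doc_type: {"properties": {field: field_mapping}}}}


body = build_index_body("content", "ik_max_word", "ik_max_word")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a PUT request to an index URL, for example a hypothetical /my_index.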

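Copyright notice: this is an original article, published on the WeChat official account 飞鸿影的博客 (fhyblog) and on cnblogs (博客园). Reproduction requires the author's permission.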

Custom Dictionaries

We can also define our own dictionary for IK to use. For example:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "去朝阳公园"}'

Result:

{

  "tokens": [

    {

      "token": "去",

      "start_offset": 0,

      "end_offset": 1,

      "type": "CN_CHAR",

      "position": 0

    },

    {

      "token": "朝阳",

      "start_offset": 1,

      "end_offset": 3,

      "type": "CN_WORD",

      "position": 1

    },

    {

      "token": "公园",

      "start_offset": 3,

      "end_offset": 5,

      "type": "CN_WORD",

      "position": 2

    }

  ]

}

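We want 朝阳公园 (Chaoyang Park) to be treated as a single token. To achieve that, we can add the word to our own dictionary.

Creating a custom dictionary takes only a few simple steps:

1. Add a file named my.dic to the elasticsearch-6.2.4/config/analysis-ik/ directory: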

$ touch my.dic

$ echo 朝阳公园 > my.dic

$ cat my.dic

朝阳公园

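A .dic file is a dictionary file, which is just a plain text file with one word per line. Note that it must be UTF-8 encoded. Let's look at the bundled dictionary file: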

$ head -n 5 main.dic

一一列举

一一对应

一一道来

一丁

一丁不识

2. Then edit the elasticsearch-6.2.4/config/analysis-ik/IKAnalyzer.cfg.xml file:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

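	<comment>IK Analyzer extension configuration</comment>

	<!-- Users can configure their own extension dictionaries here -->

	<entry key="ext_dict">my.dic</entry>

	<!-- Users can configure their own extension stopword dictionaries here -->

	<entry key="ext_stopwords"></entry>

	<!-- Users can configure a remote extension dictionary here -->

	<!-- <entry key="remote_ext_dict">words_location</entry> -->

	<!-- Users can configure a remote extension stopword dictionary here -->

	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->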

</properties>

With my.dic added, restart ES. Let's check the result again:

GET /_analyze

{

  "analyzer": "ik_smart",

  "text": "去朝阳公园"

}

Result:

{

  "tokens": [

    {

      "token": "去",

      "start_offset": 0,

      "end_offset": 1,

      "type": "CN_CHAR",

      "position": 0

    },

    {

      "token": "朝阳公园",

      "start_offset": 1,

      "end_offset": 5,

      "type": "CN_WORD",

      "position": 1

    }

  ]

}

This shows that the custom dictionary has taken effect. To use multiple dictionaries, separate them with semicolons:

<entry key="ext_dict">my.dic;custom/single_word_low_freq.dic</entry>
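To double-check which dictionaries a configuration file refers to, the entry can be read back programmatically. This is a small sketch using Python's standard XML parser. The function name `ext_dict_files` is hypothetical, and the sample string is a shortened copy of the configuration shown earlier.

```python
import xml.etree.ElementTree as ET


def ext_dict_files(config_xml, key="ext_dict"):
    """Return the dictionary paths listed under the given entry key."""
    root = ET.fromstring(config_xml.encode("utf-8"))
    for entry in root.iter("entry"):
        if entry.get("key") == key:
            value = entry.text or ""
            # Paths are separated by semicolons; ignore empty pieces.
            return [p.strip() for p in value.split(";") if p.strip()]
    return []


sample_config = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer</comment>
    <entry key="ext_dict">my.dic;custom/single_word_low_freq.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
"""

print(ext_dict_files(sample_config))
```

Commented-out entries such as remote_ext_dict are skipped by the parser, so only active settings are reported.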

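In addition, the configuration has an entry for an extension stopword dictionary, which helps with segmentation. Let's look at one of the bundled extension stopword dictionaries: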

$ head -n 5 extra_stopword.dic

也

了

仍

从

以

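In other words, when the IK tokenizer encounters one of these words, it assumes that the preceding text does not combine with it to form a word. Stop words themselves are also dropped from the token output.

IK also supports remote dictionaries, whose advantage is that they can be hot-updated. The format is the same as for local dictionaries, one word per line (with \n as the line break), and the configured URL must also satisfy the following:

The HTTP response must include two headers, Last-Modified and ETag, both of which are strings. Whenever either of them changes, the plugin fetches the new word list and updates its dictionary.

For details, see the hot-update section of https://github.com/medcl/elasticsearch-analysis-ik.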
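As an illustration of the header requirement, here is a minimal sketch of a dictionary endpoint built on Python's standard http.server. Everything in it is an assumption made for the example: the helper names, the choice of a SHA-256 digest as the ETag, and the decision to answer both GET and HEAD in case the plugin probes with either.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path


def dictionary_headers(content, mtime):
    """Compute response headers for a dictionary payload (bytes)."""
    return {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(content)),
        # Changes whenever the file's modification time changes.
        "Last-Modified": formatdate(mtime, usegmt=True),
        # Changes whenever the file's content changes.
        "ETag": '"%s"' % hashlib.sha256(content).hexdigest(),
    }


def make_handler(dict_path):
    """Create a handler class that serves the file at dict_path."""

    class DictHandler(BaseHTTPRequestHandler):
        def _send(self, include_body):
            path = Path(dict_path)
            content = path.read_bytes()
            headers = dictionary_headers(content, path.stat().st_mtime)
            self.send_response(200)
            for name, value in headers.items():
                self.send_header(name, value)
            self.end_headers()
            if include_body:
                self.wfile.write(content)

        def do_GET(self):
            self._send(include_body=True)

        def do_HEAD(self):
            self._send(include_body=False)

    return DictHandler


def serve(dict_path, host="127.0.0.1", port=8000):
    """Serve the dictionary until interrupted (this call blocks)."""
    HTTPServer((host, port), make_handler(dict_path)).serve_forever()
```

Calling `serve("my.dic")` would expose the file at http://127.0.0.1:8000/, and that URL could then be placed in the remote_ext_dict entry. Editing the file changes both headers on the next response.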

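Note: in the examples above we edited files under elasticsearch-6.2.4/config/analysis-ik/ because IK was installed with elasticsearch-plugin (method 2). If you installed it by unpacking the archive, the IK configuration lives under the plugins directory instead, that is, elasticsearch-6.2.4/plugins/analysis-ik/config. In other words, a plugin's configuration can be placed either in the plugin's own directory or in Elasticsearch's config directory.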

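ES Built-in Analyzers

ES ships with many built-in analyzers that can be used in any index without configuration:

Standard analyzer (standard): splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports stop words.
Simple analyzer (simple): splits text whenever it encounters a non-letter character and lowercases all terms.
Whitespace analyzer (whitespace): splits text whenever it encounters a whitespace character.
Stop analyzer (stop): like the simple analyzer, but also supports removing stop words.
Keyword analyzer (keyword): a no-op analyzer that outputs the input unchanged as a single term.
Pattern analyzer (pattern): uses a regular expression to split text into terms. It supports lowercasing and stop words.
Language analyzers (language): many language-specific analyzers, such as english or french.
Fingerprint analyzer (fingerprint): a specialist analyzer that creates a fingerprint, which can be used for duplicate detection.
Custom analyzer: if the built-in analyzers do not meet your needs, you can define a custom analyzer that combines character filters, a tokenizer, and token filters. Third-party analyzers such as IK are supplied by plugins instead.

See the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html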
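To make the differences more concrete, the sketch below imitates three of the simpler analyzers in plain Python. These are rough approximations written for illustration, not the Lucene implementations, and the function names are made up.

```python
import re


def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyze(text):
    """Split on any non-letter character and lowercase each term."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def keyword_analyze(text):
    """Emit the whole input unchanged as a single term."""
    return [text]


sample = "The 2 QUICK Brown-Foxes."
print(whitespace_analyze(sample))
print(simple_analyze(sample))
print(keyword_analyze(sample))
```

Note how the approximated simple analyzer drops the digit and splits on the hyphen, while the whitespace version keeps both.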

References

1. medcl/elasticsearch-analysis-ik: The IK Analysis plugin integrates Lucene IK analyzer into elasticsearch, support customized dictionary.

https://github.com/medcl/elasticsearch-analysis-ik

2. ElesticSearch IK中文分词使用详解, xsdxs's blog, CSDN

https://blog.csdn.net/xsdxs/article/details/72853288
