Installing the IK Chinese Analysis Plugin for Elasticsearch (Part 4)

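I. Introduction to IK

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through four major versions. It started as a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis. Starting with version 3.0, IK became a general-purpose segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene. The 2012 version added a simple disambiguation algorithm, marking IK's move from pure dictionary-based segmentation toward simulated semantic segmentation.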

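Features of IK Analyzer 2012:

Uses its own "forward iterative finest-grained segmentation algorithm" and supports two segmentation modes: fine-grained and smart.
Tested on an ordinary PC (Core2 i7 3.4G dual-core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK2012 processes 1.6 million characters per second (3000 KB/s).
The smart mode in the 2012 version supports simple disambiguation and merges numerals with their measure words in the output.
Uses multiple sub-processors for analysis, handling English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters.
Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported. In particular, dictionaries in the 2012 version accept terms that mix Chinese, English, and digits.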

II. Setting Up the Build Environment

The IK analyzer downloaded from GitHub is a source package, so it has to be built with Maven.

1. Download Maven

# wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz

2. Extract

# tar zxf apache-maven-3.3.9-bin.tar.gz -C /usr/local/

3. Configure environment variables

# vi /etc/profile

    export MAVEN_HOME=/usr/local/apache-maven-3.3.9

    export PATH=$PATH:$MAVEN_HOME/bin

# source /etc/profile
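
Before building, it is worth confirming that the shell can actually find Maven after the profile has been reloaded. The helper below is a minimal sketch; `check_tool` is a hypothetical name introduced here, not part of Maven or Elasticsearch.

```shell
#!/bin/sh
# Report whether a command is reachable through PATH.
# Prints the resolved location and returns 0 if found; returns 1 otherwise.
check_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        echo "$1 found at $tool_path"
        return 0
    fi
    echo "$1 not found on PATH" >&2
    return 1
}

# Typical use after 'source /etc/profile':
#   check_tool mvn && mvn -v
```

If `check_tool mvn` fails, the usual cause is that MAVEN_HOME does not point at the directory the archive was extracted into.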

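III. Installing the IK Analysis Plugin

1. Download

Download the IK release that matches your Elasticsearch version from GitHub: https://github.com/medcl/elasticsearch-analysis-ik. Alternatively, fetch the analyzer source with git clone https://github.com/medcl/elasticsearch-analysis-ik.

2. Extract and build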

# unzip elasticsearch-analysis-ik-master.zip

# cd elasticsearch-analysis-ik-master/

# mvn clean package
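
The zip file name produced by the build depends on the plugin version, so it can help to list what actually landed under target/releases before copying anything. This is a small sketch; `list_release_zips` is a hypothetical helper name, and the target/releases location is taken from the copy command in the next step.

```shell
#!/bin/sh
# Print the plugin zip(s) that the build left under <source-dir>/target/releases.
# Returns 1 if none is present (for example, because the build failed).
list_release_zips() {
    src_dir=${1:-.}
    found=1
    for zip in "$src_dir"/target/releases/elasticsearch-analysis-ik-*.zip; do
        # An unmatched glob stays literal in sh, so skip entries that do not exist.
        [ -e "$zip" ] || continue
        echo "$zip"
        found=0
    done
    return $found
}

# Typical use from inside the source directory:
#   list_release_zips .
```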

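3. Copy the built IK plugin into the Elasticsearch plugins directory ($elasticsearch below stands for the Elasticsearch installation directory)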

# mkdir $elasticsearch/plugins/ik

# cp target/releases/elasticsearch-analysis-ik-1.9..zip $elasticsearch/plugins/ik/

# cd $elasticsearch/plugins/ik/

# unzip elasticsearch-analysis-ik-1.9..zip

4. Restart Elasticsearch so the IK plugin takes effect

# /etc/init.d/elasticsearch restart
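
After the restart, one way to confirm that the plugin was picked up is to look at the node's plugin listing. The sketch below assumes the `_cat/plugins` endpoint is available on your Elasticsearch version and that the listing contains the word "ik" (guessed from the plugin directory name used above); `has_plugin` is a hypothetical helper.

```shell
#!/bin/sh
# Succeeds if the plugin listing read from stdin mentions the given plugin name.
# $1 = name to look for; the default "ik" is an assumption, adjust it if your
# listing shows a different plugin name.
has_plugin() {
    grep -qiw -- "${1:-ik}"
}

# Typical use:
#   curl -s 'http://localhost:9200/_cat/plugins?v' | has_plugin ik && echo "IK plugin loaded"
```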

IV. Testing IK Segmentation

1. Create an index named "index"

# curl -XPUT http://localhost:9200/index

2. Create a mapping for "index"

# curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost":
            }
        }
    }
}'
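
The mapping stores term vectors with positions and offsets on the content field, which is what highlighting relies on. As an illustration, the sketch below builds a request body for a match query with highlighting on that field. `search_body` is a hypothetical helper, and the query shape is an assumption based on the standard Elasticsearch search DSL rather than something taken from this article.

```shell
#!/bin/sh
# Print a search request body: match query on "content" plus highlighting on the same field.
# $1 = search term (assumed to contain no double quotes or backslashes).
search_body() {
    printf '{"query":{"match":{"content":"%s"}},"highlight":{"fields":{"content":{}}}}\n' "$1"
}

# Typical use, once some documents have been indexed:
#   curl -XPOST http://localhost:9200/index/fulltext/_search -d "$(search_body 国歌)"
```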

3. Test

# curl 'http://10.10.10.26:9200/index/_analyze?analyzer=ik&pretty=true' -d '{"text":"中华人民共和国国歌"}'

The output is:

{
  "tokens" : [ {
    "token" : "中华人民共和国",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "中华人民",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "中华",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "华人",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "人民共和国",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "人民",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "共和国",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "共和",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  }, {
    "token" : "国",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_CHAR",
    "position" :
  }, {
    "token" : "国歌",
    "start_offset" : ,
    "end_offset" : ,
    "type" : "CN_WORD",
    "position" :
  } ]
}
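
The overlapping tokens in this output are characteristic of fine-grained segmentation. To compare it with the smart mode, the same request can be sent once per analyzer. The sketch below only builds the request URL; `analyze_url` is a hypothetical helper, and the analyzer name `ik_smart` is an assumption based on the analyzers this plugin usually registers alongside `ik_max_word`.

```shell
#!/bin/sh
# Build the _analyze URL for one analyzer, in the same form as the test command above.
# $1 = host:port, $2 = index name, $3 = analyzer name
analyze_url() {
    printf 'http://%s/%s/_analyze?analyzer=%s&pretty=true\n' "$1" "$2" "$3"
}

# Typical use: run the same text through both analyzers and compare the token lists.
#   for a in ik_max_word ik_smart; do
#       curl "$(analyze_url 10.10.10.26:9200 index "$a")" -d '{"text":"中华人民共和国国歌"}'
#   done
```

If the smart mode is available, it should return fewer, coarser tokens than the fine-grained listing.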

GitHub repository for elasticsearch-analysis-ik: https://github.com/medcl/elasticsearch-analysis-ik
