Setting Up an ElasticSearch Chinese Word-Segmentation Search Environment

ElasticSearch is a powerful search tool and an important part of the ELK stack.

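The palest ink beats the best memory, so here are my notes on setting up an ES Chinese word-segmentation search test environment on Windows. The steps are as follows.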

1. Install JDK 1.8 and configure the environment variables.

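2. Download ElasticSearch 7.1.1. Versions change quickly, and when I checked just now the latest release was already 7.2.0, but this environment is built on 7.1.1. The download page is https://www.elastic.co/cn/downloads/elasticsearch. You get a zip archive; after extracting it, run the following command in cmd from the extracted directory to start ES (cmd does not accept the ./ prefix, so the path uses backslashes):

bin\elasticsearch.bat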

If it starts normally, log output is printed at the prompt.


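Open http://localhost:9200/ in a browser to test whether the service is reachable. Normally it returns summary information about the node, which means ES has been set up successfully.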
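
The same check can be scripted. The following is a minimal sketch in Python using only the standard library. The helper names `fetch_root_info` and `describe_node` are made up for illustration, and the field names assume the usual layout of the root response (`name`, `cluster_name`, `version.number`).

```python
import json
from urllib.request import urlopen


def fetch_root_info(base_url="http://localhost:9200/"):
    """GET the root endpoint and return the decoded JSON (needs a running node)."""
    with urlopen(base_url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def describe_node(info):
    """Turn the root response into a one-line summary."""
    version = info.get("version", {}).get("number", "unknown")
    return "{} @ {} (Elasticsearch {})".format(
        info.get("name", "?"), info.get("cluster_name", "?"), version
    )

# With ES running locally:
#   print(describe_node(fetch_root_info()))
```

If the call raises a connection error, ES is most likely not running or is listening on a different port.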


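3. ElasticSearch provides a powerful RESTful interface, but without a UI it is not very intuitive to work with. elasticsearch-head solves this nicely: it is a Node-based tool that connects to the ES service and provides a visual interface. See https://github.com/mobz/elasticsearch-head for details. Installation is simple: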

git clone https://github.com/mobz/elasticsearch-head.git

cd elasticsearch-head

npm install

npm run start

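Once the service has started normally, open http://localhost:9100/ in a browser to see the UI. If the page cannot connect to ES, note that the elasticsearch-head README asks for CORS to be enabled in config/elasticsearch.yml (http.cors.enabled: true and http.cors.allow-origin: "*"), followed by a restart of ES.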

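4. The Chinese word-segmentation plugin is described in detail at https://github.com/medcl/elasticsearch-analysis-ik. Be careful to pick the right version, otherwise the installation will fail; for ES 7.1.1 choose the matching 7.1.1 release. Install it as follows, then restart ES so the plugin is loaded:

bin\elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.1/elasticsearch-analysis-ik-7.1.1.zip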
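
Since a version mismatch is the usual cause of a failed install, it can help to derive the download URL from the ES version instead of typing it by hand. This is only a sketch: `ik_plugin_url` is a hypothetical helper, and it assumes other IK releases follow the same naming pattern as the 7.1.1 URL above, which is not guaranteed for every version.

```python
IK_RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_plugin_url(es_version):
    """Build the IK download URL for a given Elasticsearch version string."""
    parts = es_version.split(".")
    # Accept only plain three-part versions such as 7.1.1.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError("expected a version like 7.1.1, got %r" % es_version)
    return "{0}/v{1}/elasticsearch-analysis-ik-{1}.zip".format(IK_RELEASES, es_version)


print(ik_plugin_url("7.1.1"))
```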

5. Test Chinese word-segmented search. First create an index by sending the following requests from Postman or elasticsearch-head:

-- Create the index

curl -XPUT http://localhost:9200/news 

-- Add the index mapping (do this before indexing any document, because the analyzer of a field that already exists cannot be changed)

curl -XPOST http://localhost:9200/news/_mapping -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
}'

-- Add data to the index

curl -XPOST http://localhost:9200/news/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'
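
The index setup can also be scripted. The sketch below builds the three requests with Python's standard library without sending them. The helper names `json_request` and `setup_requests` are hypothetical. The mapping is placed before the document on purpose, since the analyzer of an existing field cannot be changed once a document has created it.

```python
import json
from urllib.request import Request

BASE = "http://localhost:9200"


def json_request(method, path, body=None):
    """Build (but do not send) a JSON request against the local node."""
    data = None if body is None else json.dumps(body, ensure_ascii=False).encode("utf-8")
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(BASE + path, data=data, headers=headers, method=method)


def setup_requests(index="news"):
    """Requests in the order they should be sent: index, mapping, document."""
    mapping = {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart",
            }
        }
    }
    doc = {"content": "美国留给伊拉克的是个烂摊子吗"}
    return [
        json_request("PUT", "/" + index),
        json_request("POST", "/" + index + "/_mapping", mapping),
        json_request("POST", "/" + index + "/_create/1", doc),
    ]

# To actually send them (requires a running node with the IK plugin installed):
#   from urllib.request import urlopen
#   for req in setup_requests():
#       print(urlopen(req).read().decode("utf-8"))
```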

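The difference between ik_max_word and ik_smart:

ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination. It is suited to Term Query.

ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌”. It is suited to Phrase queries.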
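
To make the contrast concrete, here is a toy illustration in Python. It is not IK's actual algorithm: the fine-grained function simply lists every dictionary word found in the text, and the coarse-grained one uses plain forward maximum matching. The small `WORDS` dictionary is an assumption made up for this example, so the fine-grained output is only a subset of what ik_max_word returns.

```python
# Toy dictionary; the real IK plugin ships a far larger one.
WORDS = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def fine_grained(text, words=WORDS):
    """Emit every dictionary word found at every start position."""
    found = []
    for start in range(len(text)):
        # Try the longest candidate first at each start position.
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                found.append(text[start:end])
    return found


def coarse_grained(text, words=WORDS):
    """Forward maximum matching: take the longest word, then jump past it."""
    result, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in words:
                break
        else:
            end = pos + 1  # no dictionary word here: fall back to one character
        result.append(text[pos:end])
        pos = end
    return result


print(fine_grained("中华人民共和国国歌"))
print(coarse_grained("中华人民共和国国歌"))
```

The fine-grained list contains overlapping words, which helps recall at index time, while the coarse-grained list covers the text exactly once, which keeps query-side matching precise. That is the reasoning behind pairing ik_max_word for indexing with ik_smart for searching.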

Test example:

Send a request to http://localhost:9200/_analyze to tokenize with ik_max_word. The result is as follows.

Input

{"text":"中华人民共和国人民大会堂","analyzer":"ik_max_word" }

Output

{

    "tokens": [

        {

            "token": "中华人民共和国",

            "start_offset": 0,

            "end_offset": 7,

            "type": "CN_WORD",

            "position": 0

        },

        {

            "token": "中华人民",

            "start_offset": 0,

            "end_offset": 4,

            "type": "CN_WORD",

            "position": 1

        },

        {

            "token": "中华",

            "start_offset": 0,

            "end_offset": 2,

            "type": "CN_WORD",

            "position": 2

        },

        {

            "token": "华人",

            "start_offset": 1,

            "end_offset": 3,

            "type": "CN_WORD",

            "position": 3

        },

        {

            "token": "人民共和国",

            "start_offset": 2,

            "end_offset": 7,

            "type": "CN_WORD",

            "position": 4

        },

        {

            "token": "人民",

            "start_offset": 2,

            "end_offset": 4,

            "type": "CN_WORD",

            "position": 5

        },

        {

            "token": "共和国",

            "start_offset": 4,

            "end_offset": 7,

            "type": "CN_WORD",

            "position": 6

        },

        {

            "token": "共和",

            "start_offset": 4,

            "end_offset": 6,

            "type": "CN_WORD",

            "position": 7

        },

        {

            "token": "国人",

            "start_offset": 6,

            "end_offset": 8,

            "type": "CN_WORD",

            "position": 8

        },

        {

            "token": "人民大会堂",

            "start_offset": 7,

            "end_offset": 12,

            "type": "CN_WORD",

            "position": 9

        },

        {

            "token": "人民大会",

            "start_offset": 7,

            "end_offset": 11,

            "type": "CN_WORD",

            "position": 10

        },

        {

            "token": "人民",

            "start_offset": 7,

            "end_offset": 9,

            "type": "CN_WORD",

            "position": 11

        },

        {

            "token": "大会堂",

            "start_offset": 9,

            "end_offset": 12,

            "type": "CN_WORD",

            "position": 12

        },

        {

            "token": "大会",

            "start_offset": 9,

            "end_offset": 11,

            "type": "CN_WORD",

            "position": 13

        },

        {

            "token": "会堂",

            "start_offset": 10,

            "end_offset": 12,

            "type": "CN_WORD",

            "position": 14

        }

    ]

}

If the input is

{"text":"中华人民共和国人民大会堂","analyzer":"ik_smart" }

Output

{

    "tokens": [

        {

            "token": "中华人民共和国",

            "start_offset": 0,

            "end_offset": 7,

            "type": "CN_WORD",

            "position": 0

        },

        {

            "token": "人民大会堂",

            "start_offset": 7,

            "end_offset": 12,

            "type": "CN_WORD",

            "position": 1

        }

    ]

}
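
When comparing the two analyzers it is handy to reduce an _analyze response to its tokens and offsets. The sketch below assumes the response has already been parsed into a Python dict. `token_spans` and `overlapping` are hypothetical helper names, and the sample is a three-token excerpt of the ik_max_word output above.

```python
def token_spans(analyze_response):
    """Return (token, start_offset, end_offset) for each token in an _analyze response."""
    return [
        (t["token"], t["start_offset"], t["end_offset"])
        for t in analyze_response.get("tokens", [])
    ]


def overlapping(spans):
    """Pairs of tokens whose character ranges overlap."""
    pairs = []
    for i, (tok_a, s_a, e_a) in enumerate(spans):
        for tok_b, s_b, e_b in spans[i + 1:]:
            if s_a < e_b and s_b < e_a:
                pairs.append((tok_a, tok_b))
    return pairs


sample = {
    "tokens": [
        {"token": "人民大会堂", "start_offset": 7, "end_offset": 12},
        {"token": "人民大会", "start_offset": 7, "end_offset": 11},
        {"token": "大会堂", "start_offset": 9, "end_offset": 12},
    ]
}
print(overlapping(token_spans(sample)))
```

Overlapping pairs are typical of ik_max_word output; in the ik_smart response each token starts where the previous one ends, so the list of overlaps is empty.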

Search using the word-segmented query syntax. Request URL: http://localhost:9200/news/_search

Input:

{

    "query" : { "match" : { "content" : "中华人民共和国国歌" }},

    "highlight" : {

        "pre_tags" : ["<tag1>", "<tag2>"],

        "post_tags" : ["</tag1>", "</tag2>"],

        "fields" : {

            "content" : {}

        }

    }

}
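
If the query is issued from code rather than Postman, the body can be assembled as a dict and serialized. This is a minimal sketch: `highlight_query` is a hypothetical helper, and the search text "国歌" is just an example value.

```python
import json


def highlight_query(field, text):
    """Build a match query on `field` with the same highlight tags as the request above."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {field: {}},
        },
    }


# ensure_ascii=False keeps the Chinese text readable in the serialized body.
body = json.dumps(highlight_query("content", "国歌"), ensure_ascii=False)
print(body)
```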

Output:

{

    "took": 11,

    "timed_out": false,

    "_shards": {

        "total": 5,

        "successful": 5,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": {

            "value": 2,

            "relation": "eq"

        },

        "max_score": 1.6810182,

        "hits": [

            {

                "_index": "news",

                "_type": "_doc",

                "_id": "6",

                "_score": 1.6810182,

                "_source": {

                    "content": "中华民族国歌"

                },

                "highlight": {

                    "content": [

                        "<tag1>中华</tag1>民族<tag1>国歌</tag1>"

                    ]

                }

            },

            {

                "_index": "news",

                "_type": "_doc",

                "_id": "5",

                "_score": 0.9426802,

                "_source": {

                    "content": "人民公社"

                },

                "highlight": {

                    "content": [

                        "<tag1>人民</tag1>公社"

                    ]

                }

            }

        ]

    }

}
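
A response like the one above is easier to read once it is reduced to the id, score, and highlighted fragments of each hit. The following sketch assumes the response is already parsed into a dict and that the highlighted field is `content`, as in this article. `summarize_hits` is a hypothetical helper, and the sample keeps only the keys it reads.

```python
def summarize_hits(search_response):
    """Return (id, score, highlighted fragments) per hit, best score first."""
    rows = []
    for hit in search_response.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get("content", [])
        rows.append((hit["_id"], hit["_score"], fragments))
    return sorted(rows, key=lambda row: row[1], reverse=True)


sample = {
    "hits": {
        "hits": [
            {"_id": "5", "_score": 0.9426802,
             "highlight": {"content": ["<tag1>人民</tag1>公社"]}},
            {"_id": "6", "_score": 1.6810182,
             "highlight": {"content": ["<tag1>中华</tag1>民族<tag1>国歌</tag1>"]}},
        ]
    }
}
for doc_id, score, fragments in summarize_hits(sample):
    print(doc_id, score, fragments)
```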

