Elasticsearch 1.x: Chinese, Pinyin, and Synonym Search with lc-pinyin and the IK Analyzer

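I. Introduction

Projects sometimes need to support search by both Chinese text and pinyin. The IK analyzer is currently one of the better options for Chinese word segmentation. For pinyin, lc-pinyin is a good fit: it handles both first-letter and full-pinyin queries well, but it does no Chinese word segmentation and can only split Chinese text into single characters. Combining IK with lc-pinyin covers both needs. This post shows how to use the two together so that search supports both Chinese and pinyin input.

Environment: Elasticsearch 1.4.5, elasticsearch-analysis-lc-pinyin 1.4.5, elasticsearch-analysis-ik 1.3.0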

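II. Configuring the lc-pinyin and IK analyzers

1. First install the lc-pinyin and IK plugins. Installation is not covered here; if you need help, see: http://blog.csdn.net/chennanymy/article/details/52336368

2. Once both plugins are installed, configure the analyzers: open config/elasticsearch.yml and append the following configuration.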

index:
  analysis:
    analyzer:
      ik_max_word:
        type: ik
        use_smart: false
      ik_smart:
        type: ik
        use_smart: true
      ik_syno:
        tokenizer: ik
        filter: [ik_synonym_filter]
      ik_syno_smart:
        tokenizer: ik_tk_smart
        filter: [ik_synonym_filter]
      lc:
        alias: [lc_analyzer]
        type: org.elasticsearch.index.analysis.LcPinyinAnalyzerProvider
      lc_index:
        type: lc
        analysisMode: index
      lc_search:
        type: lc
        analysisMode: search
    tokenizer:
      ik_tk_smart:
        type: ik
        use_smart: true
    filter:
      ik_synonym_filter:
        type: synonym
        synonyms_path: analysis/synonym.txt

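The configuration above defines a synonym filter named "ik_synonym_filter" and points it at a synonym file. Because synonyms_path is resolved relative to the config directory, the file belongs at config/analysis/synonym.txt.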

If you are not familiar with synonyms, see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/analysis-synonym-tokenfilter.html

That completes the analyzer configuration.

III. Testing synonyms

Now let's check that the synonyms actually work. Open the synonym file and add two lines: an explicit mapping and an equivalence group.

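The first line is an explicit mapping: when the text being indexed contains 蜡烛, 园丁, 师傅, or 先生, the term is converted to 老师 before it is written to the index.

The second line is an equivalence group: when any one of 中文, 汉语, or 汉字 appears, all three terms are written to the index. This form therefore uses more index space.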
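The difference between the two rule types can be pictured with a small Python sketch. It is only an illustration, not how Elasticsearch implements the synonym filter. The rule strings in `RULES` are reconstructed from the description above (the file itself is not reproduced in this post), and `parse_rules` and `expand` are hypothetical helper names.

```python
def parse_rules(lines):
    """Parse Solr-style synonym lines into a {term: [replacement terms]} map."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: each left-hand term is replaced by the right-hand terms.
            left, right = line.split("=>")
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for term in left.split(","):
                table[term.strip()] = targets
        else:
            # Equivalence group: each term expands to the whole group.
            group = [t.strip() for t in line.split(",") if t.strip()]
            for term in group:
                table[term] = group
    return table


def expand(tokens, table):
    """Replace each token by its synonym list; unknown tokens pass through."""
    out = []
    for tok in tokens:
        out.extend(table.get(tok, [tok]))
    return out


# Assumed rules, reconstructed from the prose description of the two lines.
RULES = [
    "蜡烛,园丁,师傅,先生 => 老师",
    "中文,汉语,汉字",
]

table = parse_rules(RULES)
print(expand(["园丁"], table))          # explicit mapping: one token in, one token out
print(expand(["汉语", "大学"], table))  # equivalence group: one token becomes three
```

The second call shows why the equivalence form costs more index space: a single source token produces three indexed terms.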

The analysis output confirms that the configured synonyms take effect.

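IV. Search test

First create the index and its mapping. Here we use multi-fields (the fields parameter) to apply different analyzers to the same field.

The content field uses the pinyin analyzer, and content.cn uses the IK analyzer.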

curl -XPUT http://localhost:9200/index  
curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'  
{
    "fulltext": {
        "properties": {
            "content": {
                "type": "string",
                "index_analyzer": "lc_index",
                "search_analyzer": "lc_search",
                "fields": {
                   "cn": {
                        "type": "string",
                        "index_analyzer": "ik_syno",
                        "search_analyzer": "ik_syno_smart"
                    }
                }
            }
        }
    }
}'

Then index a few documents:

curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"湖北工业大学"}
'

curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"华中科技大学"}
'

curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"武汉大学"}
'

curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"武汉理工大学"}
'
curl -XPOST http://localhost:9200/index/fulltext/5 -d'
{"content":"武汉中文大学"}
'

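Run the query

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.MatchQueryBuilder;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.junit.Test;

// Test method from the author's test class. elasticIndexOperateHelper is the
// author's own helper whose getClient() returns the Elasticsearch client;
// it is not shown in this post.
@Test
public void testMultiMatch() {
    final String index = "index";
    final String type = "fulltext";
    SearchRequestBuilder requestBuilder = elasticIndexOperateHelper.getClient().prepareSearch(index).setTypes(type);

    String input = "中文大学";

    // Pinyin field: phrase match, analyzed with the lc-pinyin search analyzer.
    QueryBuilder pinyinSearch = QueryBuilders
            .matchQuery("content", input)
            .type(MatchQueryBuilder.Type.PHRASE)
            .analyzer("lc_search")
            .boost(4)
            .zeroTermsQuery(MatchQueryBuilder.ZeroTermsQuery.NONE);

    // Chinese sub-field: boolean match on IK tokens with synonyms, boosted higher.
    QueryBuilder chineseSearch = QueryBuilders
            .matchQuery("content.cn", input)
            .type(MatchQueryBuilder.Type.BOOLEAN)
            .analyzer("ik_syno_smart")
            .boost(8)
            .zeroTermsQuery(MatchQueryBuilder.ZeroTermsQuery.NONE);

    // A document matches if at least one of the two clauses matches.
    QueryBuilder mixQueryBuilder = QueryBuilders.boolQuery()
            .should(pinyinSearch)
            .should(chineseSearch)
            .minimumNumberShouldMatch(1);

    // Highlight both fields, but only where the field itself matched.
    requestBuilder = requestBuilder
            .setQuery(mixQueryBuilder)
            .setHighlighterPreTags("<tag1>", "<tag2>")
            .setHighlighterPostTags("</tag1>", "</tag2>")
            .addHighlightedField("content")
            .addHighlightedField("content.cn")
            .setHighlighterRequireFieldMatch(true);

    SearchResponse response = requestBuilder.execute().actionGet();
    System.out.println(requestBuilder); // prints the generated query
    System.out.println(response);
}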
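For readers who are not using the Java client, roughly the same request can be written as a search body for the REST API. The Python sketch below only builds and prints that body; it is an approximation of what the builder above produces, assuming the Elasticsearch 1.x match query options (`type`, `analyzer`, `boost`, `zero_terms_query`). `build_search_body` is a hypothetical helper name, and the JSON that the Java client generates may differ in detail.

```python
import json


def build_search_body(text):
    """Build a bool query that ORs a pinyin phrase match with an IK term match."""
    pinyin_clause = {
        "match": {
            "content": {
                "query": text,
                "type": "phrase",
                "analyzer": "lc_search",
                "boost": 4,
                "zero_terms_query": "none",
            }
        }
    }
    chinese_clause = {
        "match": {
            "content.cn": {
                "query": text,
                "type": "boolean",
                "analyzer": "ik_syno_smart",
                "boost": 8,
                "zero_terms_query": "none",
            }
        }
    }
    return {
        "query": {
            "bool": {
                "should": [pinyin_clause, chinese_clause],
                "minimum_should_match": 1,
            }
        },
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "require_field_match": True,
            "fields": {"content": {}, "content.cn": {}},
        },
    }


body = build_search_body("中文大学")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be posted to http://localhost:9200/index/fulltext/_search with curl, in the same way as the mapping request earlier.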

Response for the input 中文大学:

{
  "took" : 734,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 1.7481089,
    "hits" : [ {
      "_index" : "index",
      "_type" : "fulltext",
      "_id" : "5",
      "_score" : 1.7481089,
      "_source":
{"content":"武汉中文大学"}
,
      "highlight" : {
        "content.cn" : [ "武汉<tag1>中文</tag1><tag1>大学</tag1>" ],
        "content" : [ "武汉<tag1>中</tag1><tag1>文</tag1><tag1>大</tag1><tag1>学</tag1>" ]
      }
    }, {
      "_index" : "index",
      "_type" : "fulltext",
      "_id" : "3",
      "_score" : 0.014395926,
      "_source":
{"content":"武汉大学"}
,
      "highlight" : {
        "content.cn" : [ "武汉<tag1>大学</tag1>" ]
      }
    }, {
      "_index" : "index",
      "_type" : "fulltext",
      "_id" : "1",
      "_score" : 0.009597284,
      "_source":
{"content":"湖北工业大学"}
,
      "highlight" : {
        "content.cn" : [ "湖北工业<tag1>大学</tag1>" ]
      }
    }, {
      "_index" : "index",
      "_type" : "fulltext",
      "_id" : "2",
      "_score" : 0.0077423635,
      "_source":
{"content":"华中科技大学"}
,
      "highlight" : {
        "content.cn" : [ "华中科技<tag1>大学</tag1>" ]
      }
    }, {
      "_index" : "index",
      "_type" : "fulltext",
      "_id" : "4",
      "_score" : 0.0061938907,
      "_source":
{"content":"武汉理工大学"}
,
      "highlight" : {
        "content.cn" : [ "武汉理工<tag1>大学</tag1>" ]
      }
    } ]
  }
}

Search: gongyedaxue

{
  "took" : 105,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9900317,
    "hits" : [ {
      "_index" : "index",
      "_type" : "fulltext",
      "_id" : "1",
      "_score" : 0.9900317,
      "_source":
{"content":"湖北工业大学"}
,
      "highlight" : {
        "content" : [ "湖北<tag1>工</tag1><tag1>业</tag1><tag1>大</tag1><tag1>学</tag1>" ]
      }
    } ]
  }
}
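Why does gongyedaxue hit only document 1? Judging from the highlight, where every character is tagged separately, one way to picture it is that each Chinese character is indexed as its own pinyin token and the query is matched as a phrase of consecutive syllables. The Python sketch below illustrates that idea only; it is not lc-pinyin's actual algorithm and it does not model first-letter search. The small `PINYIN` table and the function names are made up for this example.

```python
# Hand-written pinyin for just the characters used in four of the sample documents.
PINYIN = {
    "湖": "hu", "北": "bei", "工": "gong", "业": "ye", "大": "da", "学": "xue",
    "华": "hua", "中": "zhong", "科": "ke", "技": "ji", "武": "wu", "汉": "han",
    "理": "li",
}


def split_syllables(query):
    """Split a pinyin string into known syllables; return None if it cannot be split."""
    syllables = sorted(set(PINYIN.values()), key=len, reverse=True)

    def walk(rest):
        if not rest:
            return []
        for syl in syllables:  # try longer syllables first, backtrack on failure
            if rest.startswith(syl):
                tail = walk(rest[len(syl):])
                if tail is not None:
                    return [syl] + tail
        return None

    return walk(query)


def phrase_match(doc, query):
    """True if the query syllables occur as consecutive characters in doc."""
    wanted = split_syllables(query)
    if not wanted:
        return False
    tokens = [PINYIN.get(ch, ch) for ch in doc]  # one token per character
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


docs = ["湖北工业大学", "华中科技大学", "武汉大学", "武汉理工大学"]
print([d for d in docs if phrase_match(d, "gongyedaxue")])
```

Under this model 武汉理工大学 contains gong, da, and xue but no ye between them, so the phrase does not match, which is consistent with the single hit in the response above.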
