Elasticsearch: Forward and Inverted Indexes and Text Analysis Explained

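1. Introduction to Forward and Inverted Indexes

In a search engine, a forward index maps a document ID to the document's content and terms, so the content of a document can be retrieved by its ID.

An inverted index maps a term to the IDs of the documents that contain it, so document IDs can be found by searching for a term.

A query against an inverted index proceeds as follows: first look up the keyword to find the matching document IDs, then use the forward index to fetch the full content for those IDs, and finally return the results to the user.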

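2. Inverted Index

The inverted index is the core of a search engine. It has two main parts:

• Term Dictionary: records every term produced by analyzing all documents.

• Posting List: records the set of documents that contain a term; it is made up of postings.

The term dictionary is usually implemented with a structure such as a B+ tree so that term lookups stay efficient.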

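2.1 A posting mainly contains the following information:

• Document ID: used to fetch the original document.

• Term Frequency (TF): the number of times the term occurs in the document; used later for relevance scoring.

• Position: the position(s) of the term among the document's tokens (there may be several); used for phrase queries.

• Offset: the start and end character offsets of the term in the document; used for highlighting.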
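The four pieces of information listed above can be illustrated with a minimal Python sketch. The `Posting` class, the `build_postings` helper, and the regex-based word splitting are all assumptions made for illustration; they are not how Lucene or Elasticsearch lay out postings internally.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """Hypothetical posting: one entry per (term, document) pair."""
    doc_id: int
    term_freq: int = 0                              # TF, used for scoring
    positions: list = field(default_factory=list)   # token positions, for phrase queries
    offsets: list = field(default_factory=list)     # (start, end) offsets, for highlighting


def build_postings(docs):
    """Map each term to its posting list (a list of Posting objects)."""
    index = defaultdict(dict)  # term -> {doc_id: Posting}
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            posting = index[term].setdefault(doc_id, Posting(doc_id))
            posting.term_freq += 1
            posting.positions.append(position)
            posting.offsets.append((match.start(), match.end()))
    return {term: list(by_doc.values()) for term, by_doc in index.items()}


postings = build_postings({1: "to be or not to be"})
be = postings["be"][0]
print(be.doc_id, be.term_freq, be.positions, be.offsets)
# 1 2 [1, 5] [(3, 5), (16, 18)]
```

The term "be" occurs twice in document 1, so its single posting carries a frequency of 2 together with both positions and both offset pairs.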

Elasticsearch stores each document as JSON. A document contains many fields, and each field has its own inverted index.

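3. Forward Index

Searching relies on the inverted index; sorting relies on the forward index, which lets Elasticsearch read each field of each document and order the results by it. In Elasticsearch the forward index is doc values.

When a document is indexed, an inverted index is built for searching and, alongside it, a forward index (doc values) is built for sorting, aggregations, filtering, and similar operations.

Doc values are stored on disk. If there is enough memory, the OS automatically caches them in memory and performance stays high; if there is not, they are read from disk as needed.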

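• Example:

doc1: hello world you and me
doc2: hi, world, how are you

Build the inverted index:
word     doc1         doc2
hello      *
world      *            *
you        *            *
and        *
me         *
hi                      *
how                     *
are                     *

Searching for "hello you" splits the query into two terms and looks up each one:
hello you --> hello, you
hello      doc1
you        doc1, doc2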
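The example above can be reproduced with a small Python sketch. It assumes a simple regex word splitter in place of a real analyzer and uses a plain dict as the ID-to-content mapping; the `tokenize` and `search` names are illustrative only.

```python
import re
from collections import defaultdict

# Document ID -> content (this dict plays the role of the forward lookup).
docs = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}


def tokenize(text):
    """Simplified stand-in for an analyzer: lowercase, split on word characters."""
    return re.findall(r"\w+", text.lower())


# Term -> set of IDs of the documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted[term].add(doc_id)


def search(query):
    """Find documents containing any query term, then fetch their content by ID."""
    hits = set()
    for term in tokenize(query):
        hits |= inverted.get(term, set())
    return {doc_id: docs[doc_id] for doc_id in sorted(hits)}


print(sorted(inverted["hello"]))  # ['doc1']
print(sorted(inverted["you"]))    # ['doc1', 'doc2']
print(search("hello you"))
```

The search runs in two steps, matching the query flow described in section 1: the inverted index yields document IDs, and the IDs are then used to fetch the content.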

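Sorting: sort by age
Re-reading the terms of every document from the inverted index in order to sort is not feasible, so a forward index is used for sorting.

doc1: {"name":"jack","age":27}
doc2: {"name":"tom","age":30}

Build the forward index, one row per document:
document   name     age
doc1       jack     27
doc2       tom      30

Indexing performs both of the steps above: one builds the inverted index, the other builds the forward index.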
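As a rough illustration of why a per-field, document-to-value layout suits sorting and aggregation, here is a minimal Python sketch using the two documents above. The `doc_values` dict and the `sort_hits` and `average` helpers are hypothetical names; the real doc values format is an on-disk columnar structure, not a Python dict.

```python
docs = {
    "doc1": {"name": "jack", "age": 27},
    "doc2": {"name": "tom", "age": 30},
}

# Column-oriented layout: one mapping per field, document ID -> value.
doc_values = {}
for doc_id, source in docs.items():
    for field_name, value in source.items():
        doc_values.setdefault(field_name, {})[doc_id] = value


def sort_hits(hits, field_name, descending=False):
    """Order matching document IDs by one field, reading only that column."""
    column = doc_values[field_name]
    return sorted(hits, key=lambda doc_id: column[doc_id], reverse=descending)


def average(hits, field_name):
    """A simple aggregation over the same column."""
    column = doc_values[field_name]
    return sum(column[doc_id] for doc_id in hits) / len(hits)


hits = ["doc1", "doc2"]
print(sort_hits(hits, "age", descending=True))  # ['doc2', 'doc1']
print(average(hits, "age"))                     # 28.5
```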

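4. Analysis and Analyzers

Tokenization is the process of converting text into a series of terms. It is also called text analysis, and Elasticsearch refers to it as Analysis. For example, the text "elasticsearch是最流行的搜索引擎" ("elasticsearch is the most popular search engine") might be analyzed into "elasticsearch", "流行" (popular), and "搜索引擎" (search engine).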

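An analyzer is the Elasticsearch component dedicated to analysis. It is made up of three parts:

• Character Filters: preprocess the raw text, for example by stripping HTML tags.

• Tokenizer: splits the text into terms.

• Token Filters: post-process the terms, for example by lowercasing them or removing stop words.

The parts are invoked in a fixed order: Character Filters first, then the Tokenizer, then Token Filters.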
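The ordering of the analyzer stages can be sketched in Python as a simple function pipeline. All function names here (`analyze`, `strip_tags`, `word_tokenizer`, `lowercase`) are made up for illustration, and the regexes are only rough approximations of what the real components do.

```python
import re


def analyze(text, char_filters, tokenizer, token_filters):
    """Apply character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)      # str -> str
    tokens = tokenizer(text)          # str -> list of terms
    for token_filter in token_filters:
        tokens = token_filter(tokens)  # list -> list
    return tokens


def strip_tags(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text):
    """Tokenizer: split on runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every term."""
    return [token.lower() for token in tokens]


print(analyze("<b>Hello</b> World", [strip_tags], word_tokenizer, [lowercase]))
# ['hello', 'world']
```

Character filters work on the whole string, while token filters work on the list of terms, which is why the tokenizer has to sit between them.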

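4.1 The Default Analyzer

• standard, which is composed of:

• standard tokenizer: splits text on word boundaries

• standard token filter: does nothing

• lowercase token filter: converts all letters to lowercase

• stop token filter (disabled by default): removes stop words such as a, the, and it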

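4.2 Changing Analyzer Settings

Example: enable the stop-word token filter of the standard analyzer, using the English stop-word list. Here es_std is the name given to this analyzer.

Example request: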

PUT /index0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}

A successful request is acknowledged with "acknowledged": true in the response.

Test: analyze the text "a little dog" with the standard analyzer.

GET /index0/_analyze
{
  "analyzer":"standard",
  "text":"a little dog"
}

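Result: all three terms, a, little, and dog, are returned, because the standard analyzer does not remove stop words by default.

Now analyze "a little dog" with the es_std analyzer configured earlier. In the result, the stop word has been filtered out: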

GET /index0/_analyze
{
  "analyzer":"es_std",
  "text":"a little dog"
}

Result: only little and dog are returned; the stop word a has been removed.

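4.3 Building a Custom Analyzer

For example:

char_filter: of type mapping, defines a custom replacement filter. Here it converts & to and, and the filter is named &_to_and.

my_stopwords: of type stop, defines a custom stop-word list. Here two stop words are set: a and the.

my_analyzer: of type custom, the custom analyzer. Before tokenization, html_strip removes HTML tags and &_to_and (the custom character filter that replaces & with and) is applied; the standard tokenizer then splits the text; finally every term is lowercased and the stop words in my_stopwords are removed.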

PUT /index0
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and":{
          "type":"mapping",
          "mappings":["&=> and"]
        }
      },
      "filter":{
        "my_stopwords":{
          "type":"stop",
          "stopwords":["a","the"]
        }
      },
      "analyzer":{
        "my_analyzer":{
          "type":"custom",
          "char_filter":["html_strip","&_to_and"],
          "tokenizer":"standard",
          "filter":["lowercase","my_stopwords"]
        }
      }
    }
  }
}

Running the request fails because the index already exists:

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [index0/JZZuLDa8R3uDOnPt_qhDHw] already exists",
        "index_uuid": "JZZuLDa8R3uDOnPt_qhDHw",
        "index": "index0"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [index0/JZZuLDa8R3uDOnPt_qhDHw] already exists",
    "index_uuid": "JZZuLDa8R3uDOnPt_qhDHw",
    "index": "index0"
  },
  "status": 400
}

Delete the index first with DELETE /index0, then run the request again. This time it succeeds:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "index0"
}

Test the my_analyzer analyzer.

Sample text: tom and jery in the a house <a> & me HAHA

The result shows that a and the are filtered out, HAHA is lowercased, & is converted to and, and the <a> tag is removed:

GET /index0/_analyze
{
  "analyzer": "my_analyzer",
  "text":"tom and jery in the a house <a> & me HAHA"
}

Result:

{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jery",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "house",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "and",
      "start_offset": 32,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "me",
      "start_offset": 34,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "haha",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}
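The positions in this result jump from 3 to 6 because the removed stop words (the, a) still occupy positions 4 and 5. The following Python sketch imitates the my_analyzer chain to make that visible. It is only an approximation built on regexes (it does not reproduce Lucene's tokenizer or the character offsets), and the function name `my_analyzer_sketch` is made up for illustration.

```python
import re

STOPWORDS = {"a", "the"}  # same list as my_stopwords


def my_analyzer_sketch(text):
    """Return (term, position) pairs, roughly imitating my_analyzer."""
    # Character filters: strip HTML tags, then map "&" to "and".
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.replace("&", "and")

    tokens = []
    # Tokenizer assigns positions; token filters then lowercase and drop stop words.
    for position, word in enumerate(re.findall(r"\w+", text)):
        word = word.lower()
        if word in STOPWORDS:
            continue  # the term is dropped, but its position is not reused
        tokens.append((word, position))
    return tokens


result = my_analyzer_sketch("tom and jery in the a house <a> & me HAHA")
for term, position in result:
    print(position, term)
```

Keeping the gaps means a phrase query can still tell that in and house were not adjacent in the original text.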

4.4 Using the Custom Analyzer in an Index

Configure the content field of my_type to use the custom analyzer my_analyzer:

PUT /index0/_mapping/my_type
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"my_analyzer"
        }
    }
}

Response:

{
  "acknowledged": true
}

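4.5 Analyze API

Elasticsearch provides an API for testing analysis, which makes it easy to verify how text is analyzed. Its endpoint is _analyze, and it can be used in two ways:

• Test a named analyzer directly

• Test an ad hoc combination of tokenizer and filters

4.5.1 Testing a Named Analyzer

Example request: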

POST _analyze
{
    "analyzer": "standard",
    "text": "hello world"
}

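analyzer specifies the analyzer to use, here the built-in standard analyzer, and text specifies the text to analyze.

The result shows that the analyzer splits the text into two terms, hello and world. When no analyzer is specified, standard is used by default.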

4.5.2 Testing an Ad Hoc Analyzer

• Example request:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "Hello World"
}

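Following the analysis flow, the text is first split by the standard tokenizer named in tokenizer, and the resulting terms then pass through filter, which converts uppercase letters to lowercase.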
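Both styles of _analyze request can also be built from a script. The sketch below uses only the Python standard library and constructs the request without sending it. The helper name `build_analyze_request` and the address http://localhost:9200 are assumptions; adjust them to your own cluster.

```python
import json
from urllib.request import Request


def build_analyze_request(base_url, text, analyzer=None, tokenizer=None,
                          filters=None, index=None):
    """Build a POST request for _analyze, by analyzer name or by tokenizer + filters."""
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    elif tokenizer is not None:
        body["tokenizer"] = tokenizer
        body["filter"] = list(filters or [])
    else:
        raise ValueError("either analyzer or tokenizer is required")

    # An index name is only needed for analyzers defined in that index's settings.
    path = f"/{index}/_analyze" if index else "/_analyze"
    return Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local node address; nothing is sent at this point.
request = build_analyze_request(
    "http://localhost:9200", "Hello World",
    tokenizer="standard", filters=["lowercase"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it to a running cluster:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(json.load(response))
```

Passing index="index0" together with analyzer="my_analyzer" would target the custom analyzer from section 4.3 instead.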
