ES Series, Part 14: ES Aggregation Analysis (Introduction to Aggregations, Metric Aggregations, Bucket Aggregations)

Time: 2022-12-03 12:53:06

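I. Introduction to Aggregation Analysis

1. What is ES aggregation analysis?

Aggregation analysis is an important database feature: it performs aggregate computations over the data set returned by a query, such as finding the maximum or minimum of a field (or of a computed expression), or computing a sum or an average. As both a search engine and a database, ES also provides powerful aggregation capabilities.

Aggregations that compute metrics such as max, min, sum, and avg over a data set are called metric aggregations (metric) in ES.

Besides aggregate functions, relational databases can also group the queried data with group by and then compute metrics per group. In ES, the equivalent of group by is called bucket aggregation (bucketing).

ES also provides matrix aggregations (matrix) and pipeline aggregations (pipeline), which were still being refined at the time of writing.

2. How to write an ES aggregation query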

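Aggregations are defined under the aggregations node of the search request body, using the following syntax:

"aggregations" : {
"<aggregation_name>" : { <!-- name of the aggregation -->
"<aggregation_type>" : { <!-- type of the aggregation -->
<aggregation_body> <!-- aggregation body: which fields to aggregate on -->
}
[,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
[,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
}
[,"<aggregation_name_2>" : { ... } ]* <!-- further aggregations -->
}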

Note:

aggregations can also be abbreviated as aggs.
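The syntax above maps directly onto nested JSON objects. As a minimal sketch, the hypothetical Python helper agg below (it is not part of any ES client) only assembles a dict for one named aggregation with optional sub-aggregations, using the short key aggs. The aggregation names by_age and max_age are made up for illustration.

```python
import json

def agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation definition: {<type>: <body>, "meta": ..., "aggs": ...}.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    definition = {agg_type: body}
    if meta is not None:
        definition["meta"] = meta
    if sub_aggs:
        definition["aggs"] = sub_aggs  # "aggs" is the short form of "aggregations"
    return definition

# Group by age, then compute the maximum age inside each group.
request_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        "by_age": agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_age": agg("max", {"field": "age"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted as the body of a _search request; the helper itself makes no network call.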

3. Value sources for aggregations

The values being aggregated can come from a field, or from the result of a script.

II. Metric Aggregations

1. max min sum avg

Example 1: find the maximum age across all records

POST /book1/_search?pretty

{
"size": 0,
"aggs": {
"maxage": {
"max": {
"field": "age"
}
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"maxage": {
"value": 54
}
}
}

Example 2: add a query condition and find the maximum age among documents whose name contains 'test':

POST /book1/_search?pretty

{
"query":{
"term":{
"name":"test"
}
},
"size": 2,
"sort": [
{
"age": {
"order": "desc"
}
}
],
"aggs": {
"maxage": {
"max": {
"field": "age"
}
}
}
}

Result 2:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": null,
"hits": [
{
"_index": "book1",
"_type": "english",
"_id": "6IUkUmUBRzBxBrDgFok2",
"_score": null,
"_source": {
"name": "test goog my money",
"age": [
,
,
, ],
"class": "dsfdsf",
"addr": "中国"
},
"sort": [ ]
},
{
"_index": "book1",
"_type": "english",
"_id": "54UiUmUBRzBxBrDgfIl9",
"_score": null,
"_source": {
"name": "test goog my money",
"age": [
,
, ],
"class": "dsfdsf",
"addr": "中国"
},
"sort": [ ]
}
]
},
"aggregations": {
"maxage": {
"value": 54
}
}
}

Example 3: the value comes from a script; compute the average age of all records, and the same average with 10 added to each age

POST /book1/_search?pretty
{
"size": 0,
"aggs": {
"avg_age": {
"avg": {
"script": {
"source": "doc.age.value"
}
}
},
"avg_age10": {
"avg": {
"script": {
"source": "doc.age.value + 10"
}
}
}
}
}

Result 3:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"avg_age": {
"value": 7.585365853658536
},
"avg_age10": {
"value": 17.585365853658537
}
}
}
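To make the script-based value source concrete, the pure-Python sketch below mimics what the two avg aggregations do: evaluate an expression once per document, then average the results. The documents are made up and single-valued; the lambdas stand in for the Painless sources and are not ES code.

```python
# Made-up documents; each has a single-valued age for simplicity.
docs = [{"age": 3}, {"age": 12}, {"age": 9}]

def avg_with_script(documents, script):
    """Evaluate `script` once per document and average the results."""
    values = [script(doc) for doc in documents]
    return sum(values) / len(values)

# Stand-in for "doc.age.value"
avg_age = avg_with_script(docs, lambda doc: doc["age"])
# Stand-in for "doc.age.value + 10"
avg_age10 = avg_with_script(docs, lambda doc: doc["age"] + 10)

print(avg_age, avg_age10)
```

Adding a constant inside the script shifts the average by that constant, which is the same relationship visible between avg_age and avg_age10 in Result 3.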

Example 4: specify a field and read its value in the script with _value

POST  /book1/_search?pretty
{
"size": 0,
"aggs": {
"sun_age": {
"sum": {
"field":"age",
"script": {
"source": "_value * 2"
}
}
}
}
}

Result 4:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"sun_age": {
"value":
}
}
}

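Example 5: supply a value for documents that lack the field. If missing is not specified, documents without a value for the field are ignored:

POST /book1/_search?pretty

{
"size": 0,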
"aggs": {
"sun_age": {
"avg": {
"field":"age",
"missing":
}
}
}
}

Result 5:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"sun_age": {
"value": 12.847826086956522
}
}
}
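The effect of missing can be sketched in plain Python. This is only an illustration on made-up, single-valued documents, not how ES implements it: without a default, documents lacking the field are skipped; with a default, they contribute that value.

```python
def avg_with_missing(documents, field, missing=None):
    """Average `field`; substitute `missing` for documents that lack it."""
    values = []
    for doc in documents:
        if field in doc:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # the default stands in for the absent value
        # otherwise the document is ignored
    return sum(values) / len(values) if values else None

docs = [{"age": 10}, {"age": 20}, {"name": "no age here"}]

ignored = avg_with_missing(docs, "age")               # third document skipped
substituted = avg_with_missing(docs, "age", missing=0)  # third document counts as 0

print(ignored, substituted)
```

The choice of default changes the result, so it should reflect what an absent value actually means in the data.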

2. Document count (count)

Example 1: count the documents in index book1 whose age is 12

POST book1/english/_count
{
"query":{
"match":{
"age": 12
}
}
}

Result 1:

{
"count": ,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
}
}

3. Value count: the number of values present for a field

Example 1:

POST /book1/_search?size=0
{
"aggs":{
"age_count":{
"value_count":{
"field":"age"
} }
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_count": {
"value":
}
}
}

4. cardinality: count of distinct values

Example 1:

POST  /book1/_search?size=0
{
"aggs":{
"age_count":{
"value_count":{
"field":"age"
} },
"name_count":{
"cardinality":{
"field":"age"
}
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"name_count": {
"value": 11
},
"age_count": {
"value": 38

}
}
}

Note: there are 38 values in total, and 11 distinct values once duplicates are removed.
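The difference between the two counts can be sketched in plain Python on made-up documents. One caveat: this sketch counts distinct values exactly with a set, whereas the ES cardinality aggregation is approximate (it is based on the HyperLogLog++ algorithm), so on large data sets its result may deviate slightly.

```python
# Made-up documents; age may be multi-valued or absent.
docs = [
    {"age": [12, 13]},
    {"age": [12]},
    {"age": [1, 54]},
    {"name": "no age here"},
]

def field_values(documents, field):
    """Collect every value of `field` across all documents."""
    values = []
    for doc in documents:
        values.extend(doc.get(field, []))
    return values

ages = field_values(docs, "age")
value_count = len(ages)       # every value counts, duplicates included
cardinality = len(set(ages))  # distinct values only (exact here)

print(value_count, cardinality)
```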

5. stats: returns five values: count, max, min, avg, sum

Example 1:

POST  /book1/_search?size=0
{
"aggs":{
"age_count":{
"stats":{
"field":"age"
} }
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_count": {
"count": 38,
"min": 1,
"max": 54,
"avg": 12.394736842105264,
"sum": 471

}
}
}

6. Extended stats

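Extended statistics. Compared with stats, it adds four more results: sum of squares, variance, standard deviation, and the interval of the mean plus/minus two standard deviations.

Example 1:

POST /book1/_search?size=0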

{
"aggs":{
"age_stats":{
"extended_stats":{
"field":"age"
} }
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_stats": {
"count": 38,
"min": 1,
"max": 54,
"avg": 12.394736842105264,
"sum": 471,
"sum_of_squares": 11049,
"variance": 137.13365650969527,
"std_deviation": 11.710408041981085,
"std_deviation_bounds": {
"upper": 35.81555292606743,
"lower": -11.026079241856905

}
}
}
}
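The extra fields follow from three running totals. As a sketch, the hypothetical function below derives them from the count (38) and sum (471) reported by stats and the sum_of_squares (11049) reported above, assuming a population variance and bounds at two standard deviations; those assumptions reproduce the figures in the result.

```python
import math

def extended_from_sums(count, total, sum_of_squares, sigma=2.0):
    """Derive avg, variance, std deviation and bounds from running totals."""
    avg = total / count
    variance = sum_of_squares / count - avg * avg  # population variance
    std_deviation = math.sqrt(variance)
    return {
        "avg": avg,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }

stats = extended_from_sums(count=38, total=471, sum_of_squares=11049)
print(stats)
```

Because only count, sum, and sum of squares are needed, such statistics can be combined across partial results by adding the totals first.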

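7. Percentiles: the values at given percentiles

Example 1:

For the values of the specified field (or script), accumulate, in ascending order of value, the share of documents for each value (as a percentage of all matching documents), and return the value at each requested percentile. By default the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles are returned. The middle entry of the result below can be read as: 50% of the documents have age <= 12, or conversely, documents with age <= 12 make up 50% of all matching documents.

POST /book1/_search?size=0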
{
"aggs":{
"age_percentiles":{
"percentiles":{
"field":"age"
} }
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_percentiles": {
"values": {
"1.0": 1,
"5.0": 1,
"25.0": 1,
"50.0": 12,
"75.0": 13,
"95.0": 40.600000000000016,
"99.0": 54

}
}
}
}

Example 2: request specific percentiles (the values at 50%, 96%, and 99%)

POST /book1/_search?size=0
{
"aggs":{
"age_percentiles":{
"percentiles":{
"field":"age",
"percents" : [50, 96, 99]
} }
}
}

Result 2:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_percentiles": {
"values": {
"50.0": 12,
"96.0": 44.779999999999966,
"99.0": 54
}
}
}
}

Note: 50% of the values are <= 12, 96% of the values are <= 44.78, and 99% of the values are <= 54.
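The idea of "the value at a percentile" can be sketched with the simple nearest-rank method on made-up data. This is not the algorithm ES uses: ES computes percentiles approximately (TDigest by default) and interpolates, which is why values such as 44.78 can appear that are not present in the data.

```python
import math

def percentile_nearest_rank(values, percent):
    """Smallest value such that at least `percent` % of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up ages, not the contents of the book1 index.
ages = [1, 1, 1, 12, 12, 13, 13, 40, 54, 54]

p25 = percentile_nearest_rank(ages, 25)
p50 = percentile_nearest_rank(ages, 50)
p99 = percentile_nearest_rank(ages, 99)

print(p25, p50, p99)
```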

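8. Percentile ranks: the share of documents whose value is less than or equal to a given value

Example 1: compute the share of documents whose age is at most 12 and at most 35; this is the inverse of item 7

POST /book1/_search?size=0
{
"aggs":{
"aggs_perc_rank":{
"percentile_ranks":{
"field":"age",
"values" : [12, 35]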
} }
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"aggs_perc_rank": {
"values": {
"12.0": 71.05263157894737,
"35.0": 92.76315789473685
}
}
}
}

Result note: documents with age <= 12 account for about 71%, and documents with age <= 35 account for about 92.8%.
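The inverse lookup can be sketched the same way: given a threshold, compute the percentage of values at or below it. The data is made up, and the computation is exact, whereas ES estimates percentile ranks approximately, so its figures need not be simple fractions of the document count.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to `threshold`."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# Made-up ages, not the contents of the book1 index.
ages = [1, 1, 1, 12, 12, 13, 13, 40, 54, 54]

rank_12 = percentile_rank(ages, 12)
rank_35 = percentile_rank(ages, 35)

print(rank_12, rank_35)
```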

9. Geo Bounds aggregation: the bounding box of the geo points in the document set

See the official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html

10. Geo Centroid aggregation: the centroid of the geo points

See the official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html

III. Bucket Aggregations

1. Terms Aggregation: group by the terms of a field

Example 1:

POST /book1/_search?size=0

{
"aggs":{
"age_terms":{
"terms":{
"field":"age"
}
}
}
}

Note: equivalent to group by age

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}

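Result notes:

"doc_count_error_upper_bound": 0: the upper bound of the error in the document counts

"sum_other_doc_count": 1: the number of documents that fall outside the returned buckets

By default, the top 10 groups are returned, ordered by document count from high to low.

Example 2: size sets how many groups are returned

POST /book1/_search?size=0
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size": 5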
} }
}
}

Result 2:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}
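What size and sum_other_doc_count mean can be sketched with a counter over made-up, single-valued ages. This models a single node only: doc_count_error_upper_bound comes from merging per-shard top lists and is not modeled here, and the tie-break on key is an assumption of this sketch.

```python
from collections import Counter

def terms_aggregation(values, size=10):
    """Group values, keep the `size` largest groups, and sum up the rest."""
    counts = Counter(values)
    # Order by doc_count descending; ties broken by key ascending (assumed).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "sum_other_doc_count": sum(count for _, count in ordered[size:]),
        "buckets": [{"key": key, "doc_count": count} for key, count in ordered[:size]],
    }

# Made-up ages, not the contents of the book1 index.
ages = [12, 12, 12, 1, 1, 13, 54]
result = terms_aggregation(ages, size=2)

print(result)
```

With size=2 only the groups for 12 and 1 are returned, and the two remaining documents are reported in sum_other_doc_count.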

Example 3: show the error bound on each group

POST /book1/_search?size=0
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size": 5,
"show_term_doc_count_error": true
} }
}
}

Result 3:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
}
]
}
}
}

Example 4: shard_size sets how many groups each shard returns

POST /book1/_search?size=0
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"shard_size":
} }
}
}

Result 4:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}

order: sets the ordering of the groups

Example 5: order by the group value "_key"

POST /book1/_search?size=0
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"order":{"_key":"desc"}
} }
}
}

Result 5:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}

Example 6: order by the document count "_count"

POST /book1/_search?size=0
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"order":{"_count":"desc"}
} }
}
}

Result 6:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}

Example 7: order by a metric computed on each group

POST /book1/_search?size=0
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"order":{"max_age":"desc"}
},
"aggs":{
"max_age":{
"max":{
"field":"age"
}
},
"min_age":{
"min":{
"field":"age"
}
}
} } }
}

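Note: first group by age, then compute the maximum and minimum of each group, and finally order the groups by the maximum in descending order.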

Example 8: filter groups by matching their values against a regular expression

POST book1/_search?size=0
{
"aggs":{
"tags":{
"terms":{
"field":"name",
"include":"里*",
"exclude":"test*"
} } }
}

Result 8:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"tags": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": "里",
"doc_count":
}
]
}
}
}

Example 9: filter groups with an explicit list of values

POST book1/_search?size=0
{
"aggs":{
"Chinese":{
"terms":{
"field":"name",
"include":["里","国"]
} },
"Test":{
"terms":{
"field":"name",
"exclude":["test","the"]
}
} }
}

Result 9:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"Test": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": "里",
"doc_count":
},
{
"key": "否",
"doc_count":
},
{
"key": "a",
"doc_count":
},
{
"key": "default",
"doc_count":
},
{
"key": "document",
"doc_count":
},
{
"key": "for",
"doc_count":
},
{
"key": "absolute",
"doc_count":
},
{
"key": "account",
"doc_count":
},
{
"key": "accurate",
"doc_count":
},
{
"key": "documents",
"doc_count":
}
]
},
"Chinese": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": "国",
"doc_count":
}
]
}
}
}

Example 10: group by a value computed by a script

POST book1/_search?size=0
{
"aggs":{
"name":{
"terms":{
"script":{
"source":"doc['age'].value + doc.age.value",
"lang": "painless"
}
}
}
}
}

Note: in a script, the field value can be read as doc['age'].value or doc.age.value

Result 10:

{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 41,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "name": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "24",
                    "doc_count": 16
                },
                {
                    "key": "2",
                    "doc_count": 11
                },
                {
                    "key": "0",
                    "doc_count": 8
                },
                {
                    "key": "22",
                    "doc_count": 1
                },
                {
                    "key": "26",
                    "doc_count": 1
                },
                {
                    "key": "28",
                    "doc_count": 1
                },
                {
                    "key": "32",
                    "doc_count": 1
                },
                {
                    "key": "42",
                    "doc_count": 1
                },
                {
                    "key": "66",
                    "doc_count": 1
                }
            ]
        }
    }
}

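2. filter Aggregation: aggregate over the documents that match a filter query

Example 1: among the documents matched by the query, select those that satisfy the filter and aggregate over them, that is, filter first and then aggregate (contrast with Example 9 above, which filters groups: aggregate first, then filter)

POST book1/_search?size=0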
{
"aggs":{
"age_terms":{
"filter":{
"match":{"name":"test"}
},
"aggs":{
"avg_age":{
"avg":{"field":"age" }
}
}
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count": ,
"avg_age": {
"value": 19.9
}
}
}
}

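3. Filters Aggregation: aggregate over several filter groups

Example 1: count the documents whose name contains 'test' and those whose name contains '里', separately

POST book1/_search?size=0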
{
"aggs":{
"age_terms":{
"filters":{
"filters":{
"test":{
"match":{"name":"test"}
},
"china":{
"match":{"name":"里"}
}
}
}
}
}
}

Result:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"buckets": {
"china": {
"doc_count":
},
"test": {
"doc_count":
}
}
}
}
}

For example: count the error and warning entries in logs, for log alerting

GET logs/_search
{
"size": 0,
"aggs": {
"messages": {
"filters": {
"filters": {
"errors": {
"match": {
"body": "error"
}
},
"warnings": {
"match": {
"body": "warning"
}
}
}
}
}
}
}

Example 2: specify a key for the bucket that collects all other documents

POST book1/_search?size=0
{
"aggs":{
"age_terms":{
"filters":{
"other_bucket_key": "other_messages",
"filters":{
"test":{
"match":{"name":"test"}
},
"china":{
"match":{"name":"里"}
}
}
}
}
}
}

Result 2:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"buckets": {
"china": {
"doc_count":
},
"test": {
"doc_count":
},
"other_messages": {
"doc_count":
}
}
}
}
}
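The bucketing logic of filters with an "other" bucket can be sketched in plain Python. The documents are made up, and the predicates use naive whitespace tokenization as a stand-in for a match query, which is an assumption of this sketch and not how ES analyzes text.

```python
def filters_aggregation(documents, filters, other_bucket_key=None):
    """Count documents per named filter; unmatched ones go to the other bucket."""
    buckets = {name: 0 for name in filters}
    other = 0
    for doc in documents:
        matched = False
        for name, predicate in filters.items():
            if predicate(doc):
                buckets[name] += 1  # a document may land in several buckets
                matched = True
        if not matched:
            other += 1
    if other_bucket_key is not None:
        buckets[other_bucket_key] = other
    return buckets

# Made-up documents, not the contents of the book1 index.
docs = [{"name": "test one"}, {"name": "test 里"}, {"name": "里"}, {"name": "hello"}]
filters = {
    "test": lambda doc: "test" in doc["name"].split(),
    "china": lambda doc: "里" in doc["name"].split(),
}

buckets = filters_aggregation(docs, filters, other_bucket_key="other_messages")
print(buckets)
```

Note that the filter buckets are not mutually exclusive: the second document is counted under both test and china, so bucket counts can add up to more than the number of documents.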

4. Range Aggregation: group by ranges

Example 1:

POST book1/_search?size=0

{
"aggs":{
"age_range":{
"range":{
"field":"age",
"keyed":true,
"ranges":[
{
"to":,
"key":"TW"
},
{
"from":,
"to":,
"key":"TH"
},
{
"from":,
"key":"SIX"
}
]
}
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_range": {
"buckets": {
"TW": {
"to": ,
"doc_count":
},
"TH": {
"from": ,
"to": ,
"doc_count":
},
"SIX": {
"from": ,
"doc_count":
}
}
}
}
}
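Range bucketing can be sketched in plain Python. In the ES range aggregation each range includes its from value and excludes its to value, which the sketch mirrors. The bounds 20 and 40, the keys low/mid/high, and the ages are hypothetical values chosen for illustration; they are not the ones from the request above.

```python
def range_aggregation(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = {}
    for spec in ranges:
        lower = spec.get("from")
        upper = spec.get("to")
        buckets[spec["key"]] = sum(
            1
            for v in values
            if (lower is None or v >= lower) and (upper is None or v < upper)
        )
    return buckets

# Hypothetical bounds and keys, for illustration only.
ranges = [
    {"to": 20, "key": "low"},
    {"from": 20, "to": 40, "key": "mid"},
    {"from": 40, "key": "high"},
]
ages = [1, 12, 20, 39, 40, 54]

buckets = range_aggregation(ages, ranges)
print(buckets)
```

Because the upper bound is exclusive, the value 20 falls into mid rather than low, and adjacent ranges that share a boundary do not double count.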

5. Date Range Aggregation: group by date ranges

Example 1:

POST /bank/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{
"to": "now-10M/M"
},
{
"from": "now-10M/M"
}
]
}
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"range": {
"buckets": [
{
"key": "*-2017-08-01T00:00:00.000Z",
"to": ,
"to_as_string": "2017-08-01T00:00:00.000Z",
"doc_count":
},
{
"key": "2017-08-01T00:00:00.000Z-*",
"from": ,
"from_as_string": "2017-08-01T00:00:00.000Z",
"doc_count":
}
]
}
}
}

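6. Date Histogram Aggregation: time-based histogram (bar chart) aggregation

This aggregates by day, month, year, and so on. You can aggregate by year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), or second (1s) intervals, or by a custom time interval.

Example 1:

POST /bank/_search?size=0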
{
"aggs": {
"sales_over_time": {
"date_histogram": {
"field": "date",
"interval": "month"
}
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"sales_over_time": {
"buckets": []
}
}
}
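Since the result above has no buckets, a small plain-Python sketch may help show what a monthly histogram produces. The dates are made up. As a simplification, this sketch only emits months that contain documents, whereas ES by default also returns empty buckets for the gaps between them.

```python
from collections import Counter
from datetime import date

def monthly_histogram(dates):
    """Count dates per calendar month, ordered by month."""
    counts = Counter(d.strftime("%Y-%m") for d in dates)
    return [{"key_as_string": key, "doc_count": counts[key]} for key in sorted(counts)]

# Made-up dates, not the contents of the bank index.
dates = [date(2017, 8, 1), date(2017, 8, 15), date(2017, 10, 3)]

histogram = monthly_histogram(dates)
print(histogram)
```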

7. Missing Aggregation: bucket of documents with a missing value

Example: count the documents that have no value for the field

POST /book/_search?size=0
{
"aggs" : {
"account_without_age" : {
"missing" : { "field" : "age" }
}
}
}

Result 1:

{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"account_without_age": {
"doc_count":
}
}
}

8. Geo Distance Aggregation: group by geographic distance

See the official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html