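I. Introduction to Aggregation Analysis
1. What is aggregation analysis in ES?
Aggregation analysis is an important database feature: it performs aggregate computations over the data set returned by a query, such as finding the maximum or minimum of a field (or of a computed expression), or computing a sum or an average. ES, being both a search engine and a database, provides powerful aggregation capabilities as well.
Aggregations that compute indicators such as max, min, sum and average over a data set are called metric aggregations in ES.
Relational databases, besides aggregate functions, can also group the queried data with group by and then compute metrics on each group. In ES, group by is called bucketing, i.e. bucket aggregation.
ES also provides matrix aggregations and pipeline aggregations, which are still being refined.
2. How to write an aggregation query in ES
In the search request body, define aggregations under the aggregations node using the following syntax: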
"aggregations" : {
"<aggregation_name>" : { <!-- name of the aggregation -->
"<aggregation_type>" : { <!-- type of the aggregation -->
<aggregation_body> <!-- aggregation body: which fields to aggregate -->
}
[,"meta" : { [<meta_data_body>] } ]? <!-- metadata -->
[,"aggregations" : { [<sub_aggregation>]+ } ]? <!-- sub-aggregations defined inside this aggregation -->
}
[,"<aggregation_name_2>" : { ... } ]* <!-- name of another aggregation -->
}
Note:
aggregations can also be abbreviated as aggs.
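The syntax above can be sketched as a plain Python dictionary that is later serialized to JSON. The helper name build_search_body and the default size of 0 (to suppress hits) are assumptions made for illustration; they are not part of any Elasticsearch client.

```python
import json


def build_search_body(aggs, query=None, size=0):
    """Assemble a search request body that carries an "aggs" section.

    `aggs` maps each aggregation name to {aggregation_type: aggregation_body}.
    """
    body = {"size": size, "aggs": aggs}
    if query is not None:
        body["query"] = query
    return body


# One metric aggregation named "maxage" of type "max" over the "age" field.
body = build_search_body({"maxage": {"max": {"field": "age"}}})
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the request bodies used in the examples that follow.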
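3. Value sources for aggregations
The values being aggregated can come from a field or from the result of a script.
II. Metric Aggregations
1. max min sum avg
Example 1: find the maximum age across all documents
POST /book1/_search?pretty
{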
"size": ,
"aggs": {
"maxage": {
"max": {
"field": "age"
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"maxage": {
"value": 54
}
}
}
Example 2: add a query condition and find the maximum age among documents whose name contains 'test':
POST /book1/_search?pretty
{
"query":{
"term":{
"name":"test"
}
},
"size": ,
"sort": [
{
"age": {
"order": "desc"
}
}
],
"aggs": {
"maxage": {
"max": {
"field": "age"
}
}
}
}
Result 2:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": null,
"hits": [
{
"_index": "book1",
"_type": "english",
"_id": "6IUkUmUBRzBxBrDgFok2",
"_score": null,
"_source": {
"name": "test goog my money",
"age": [
,
,
, ],
"class": "dsfdsf",
"addr": "中国"
},
"sort": [ ]
},
{
"_index": "book1",
"_type": "english",
"_id": "54UiUmUBRzBxBrDgfIl9",
"_score": null,
"_source": {
"name": "test goog my money",
"age": [
,
, ],
"class": "dsfdsf",
"addr": "中国"
},
"sort": [ ]
}
]
},
"aggregations": {
"maxage": {
"value": 54
}
}
}
Example 3: the value comes from a script. Compute the average age over all documents, and also the average of age plus 10
POST /book1/_search?pretty
{
"size":,
"aggs": {
"avg_age": {
"avg": {
"script": {
"source": "doc.age.value"
}
}
},
"avg_age10": {
"avg": {
"script": {
"source": "doc.age.value + 10"
}
}
}
}
}
Result 3:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"avg_age": {
"value": 7.585365853658536
},
"avg_age10": {
"value": 17.585365853658537
}
}
}
Example 4: specify a field and use _value in the script to read the field's value
POST /book1/_search?pretty
{
"size":,
"aggs": {
"sun_age": {
"sum": {
"field":"age",
"script": {
"source": "_value * 2"
}
}
}
}
}
Result 4:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"sun_age": {
"value":
}
}
}
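Example 5: supply a value for documents that lack the field. If missing is not specified, documents without a value for the field are ignored:
POST /book1/_search?pretty
{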
"size":,
"aggs": {
"sun_age": {
"avg": {
"field":"age",
"missing":
}
}
}
}
Result 5:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"sun_age": {
"value": 12.847826086956522
}
}
}
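To make the effect of missing concrete, here is a minimal pure-Python sketch of the two behaviours: skipping documents without the field versus substituting a default. The function name avg_with_missing and the sample documents are hypothetical; this illustrates the semantics and is not how Elasticsearch computes the result internally.

```python
def avg_with_missing(docs, field, missing=None):
    """Average `field` over docs.

    Documents lacking the field are skipped unless a `missing` default is given,
    in which case they are counted as if they held that default.
    """
    values = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # no default: the document is ignored
            value = missing  # default supplied: the document takes part
        values.append(value)
    return sum(values) / len(values) if values else None


docs = [{"age": 10}, {"age": 20}, {"name": "no age here"}]
ignored = avg_with_missing(docs, "age")               # (10 + 20) / 2
defaulted = avg_with_missing(docs, "age", missing=0)  # (10 + 20 + 0) / 3
print(ignored, defaulted)
```

Supplying a default changes the denominator as well as the sum, which is why the average in Result 5 differs from the plain average in Result 3.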
2. Document count (count)
Example 1: count the documents in the book1 index whose age is 12
POST book1/english/_count
{
"query":{
"match":{
"age": 12
}
}
}
Result 1:
{
"count": ,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
}
}
3. Value count: counts the values present for a field (documents with no value contribute nothing)
Example 1:
POST /book1/_search?size=
{
"aggs":{
"age_count":{
"value_count":{
"field":"age"
} }
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_count": {
"value":
}
}
}
4. cardinality: count of distinct values
Example 1:
POST /book1/_search?size=
{
"aggs":{
"age_count":{
"value_count":{
"field":"age"
} },
"name_count":{
"cardinality":{
"field":"age"
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"name_count": {
"value": 11
},
"age_count": {
"value": 38
}
}
}
Note: 38 values are present; after removing duplicates, 11 distinct values remain.
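The difference between value_count and cardinality can be sketched in a few lines of Python. The helper field_values and the sample documents are hypothetical. Note that Elasticsearch's cardinality is an approximate count (based on HyperLogLog++), whereas the set used here is exact.

```python
def field_values(docs, field):
    """Flatten the values of `field`; a document may hold one value, a list, or none."""
    values = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field contribute nothing
        values.extend(value if isinstance(value, list) else [value])
    return values


docs = [{"age": 12}, {"age": [12, 13]}, {"age": 1}, {"name": "no age"}]
ages = field_values(docs, "age")
value_count = len(ages)       # every value counts, duplicates included
cardinality = len(set(ages))  # distinct values only
print(value_count, cardinality)
```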
5. stats: returns five values: count, max, min, avg, sum
Example 1:
POST /book1/_search?size=
{
"aggs":{
"age_count":{
"stats":{
"field":"age"
} }
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_count": {
"count": 38,
"min": 1,
"max": 54,
"avg": 12.394736842105264,
"sum": 471
}
}
}
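6. Extended stats
Advanced statistics. Compared with stats, it returns four more results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.
Example 1:
POST /book1/_search?size=
{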
"aggs":{
"age_stats":{
"extended_stats":{
"field":"age"
} }
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_stats": {
"count": 38,
"min": 1,
"max": 54,
"avg": 12.394736842105264,
"sum": 471,
"sum_of_squares": 11049,
"variance": 137.13365650969527,
"std_deviation": 11.710408041981085,
"std_deviation_bounds": {
"upper": 35.81555292606743,
"lower": -11.026079241856905
}
}
}
}
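The four extra figures follow from the basic ones by simple arithmetic. The sketch below assumes the population variance (sum of squares divided by count, minus the squared mean) and a band of two standard deviations; the function name extended_stats and the sample values are illustrative only.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute stats plus sum of squares, population variance, std deviation and bounds."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    variance = sum_of_squares / count - avg * avg  # population variance
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["variance"], stats["std_deviation"], stats["std_deviation_bounds"])
```

The same arithmetic is consistent with the result above: taking the count of 38 and sum of 471 from the stats example together with the sum of squares 11049, 11049/38 - (471/38)^2 is approximately 137.13, the reported variance.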
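7. Percentiles: the values found at given percentile ranks
Example 1:
The values of the specified field (or script) are ordered from smallest to largest, the share of documents accumulated up to each value (as a percentage of all matching documents) is tracked, and the value reached at each requested percentage is returned. By default the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles are returned. The middle entry of the result below can be read as: 50% of the documents have age <= 12, or conversely, documents with age <= 12 make up 50% of all matching documents.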
POST /book1/_search?size=
{
"aggs":{
"age_percentiles":{
"percentiles":{
"field":"age"
} }
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_percentiles": {
"values": {
"1.0": 1,
"5.0": 1,
"25.0": 1,
"50.0": 12,
"75.0": 13,
"95.0": 40.600000000000016,
"99.0": 54
}
}
}
}
Example 2: specify the percentiles (what are the values at 50%, 96% and 99%?)
POST /book1/_search?size=
{
"aggs":{
"age_percentiles":{
"percentiles":{
"field":"age",
"percents" : [50, 96, 99]
} }
}
}
Result 2:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_percentiles": {
"values": {
"50.0": 12,
"96.0": 44.779999999999966,
"99.0": 54
}
}
}
}
Note: 50% of the values are <= 12, 96% of the values are <= 44.78, and 99% of the values are <= 54.
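As a rough illustration of what a percentile is, the following sketch computes it exactly by linear interpolation over the sorted values. This is an assumption made for clarity: Elasticsearch computes percentiles approximately (by default with the TDigest algorithm), so its figures can differ from an exact computation. The function name percentile is hypothetical.

```python
def percentile(values, percent):
    """Exact percentile by linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = percent / 100 * (len(ordered) - 1)  # fractional position in the sorted list
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


sample = [1, 2, 3, 4, 5]
median = percentile(sample, 50)
top = percentile(sample, 100)
quarter = percentile([10, 20], 25)
print(median, top, quarter)
```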
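8. Percentile ranks: the share of documents whose value is less than or equal to a given value
Example 1: compute the share of documents whose age is at most 12 and at most 35; this is the inverse of item 7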
POST /book1/_search?size=
{
"aggs":{
"aggs_perc_rank":{
"percentile_ranks":{
"field":"age",
"values" : [12, 35]
} }
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"aggs_perc_rank": {
"values": {
"12.0": 71.05263157894737,
"35.0": 92.76315789473685
}
}
}
}
Result note: documents with age at most 12 make up about 71% of the total, and documents with age at most 35 about 93%.
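A percentile rank is the inverse question of a percentile: given a value, what share of the data lies at or below it? A minimal exact sketch follows; the function name percentile_rank is hypothetical, and Elasticsearch itself returns an approximate, interpolated figure.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to `threshold`."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100 * at_or_below / len(values)


sample = [1, 2, 3, 4]
half = percentile_rank(sample, 2)
everything = percentile_rank(sample, 10)
print(half, everything)
```

The first figure in the result above, 71.05263157894737, equals 100 * 27 / 38, which is what this computation would give if 27 of the 38 values were at most 12. That reading is an inference from the numbers, not something the result states.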
9. Geo Bounds aggregation: the bounding box of the geo points in the document set
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html
10. Geo Centroid aggregation: the centroid of the geo points
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html
III. Bucket Aggregations
1. Terms Aggregation: group by the terms (values) of a field
Example 1:
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age"
}
}
}
}
Note: this is equivalent to group by age
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}
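Result notes:
"doc_count_error_upper_bound": 0: the upper bound of the error in the document counts
"sum_other_doc_count": 1: the total number of documents in buckets that were not returned
By default, the top 10 buckets are returned, ordered by document count from highest to lowest.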
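The bucket list and sum_other_doc_count can be mimicked with collections.Counter. The sketch below runs in a single process, so there is no counting error to report; in Elasticsearch the error bound arises because each shard contributes only its own top buckets. The function name terms_agg and the tie-break by ascending key are assumptions of this sketch.

```python
from collections import Counter


def terms_agg(values, size=10):
    """Group values, keep the `size` largest groups, and total up the rest."""
    counts = Counter(values)
    # Highest document count first; equal counts are ordered by ascending key.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {
        "sum_other_doc_count": sum(count for _, count in ordered[size:]),
        "buckets": [{"key": key, "doc_count": count} for key, count in ordered[:size]],
    }


result = terms_agg([1, 1, 1, 2, 2, 3], size=2)
print(result)
```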
Example 2: size specifies how many buckets to return
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":
} }
}
}
Result 2:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}
Example 3: show the error bound on each bucket
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"show_term_doc_count_error": true
} }
}
}
Result 3:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
},
{
"key": ,
"doc_count": ,
"doc_count_error_upper_bound":
}
]
}
}
}
Example 4: shard_size specifies how many buckets each shard returns
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"shard_size":
} }
}
}
Result 4:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}
order specifies how the buckets are sorted
Example 5: sort by the bucket key "_key"
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"order":{"_key":"desc"}
} }
}
}
Result 5:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}
Example 6: sort by the document count "_count"
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"size":,
"order":{"_count":"desc"}
} }
}
}
Result 6:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
},
{
"key": ,
"doc_count":
}
]
}
}
}
Example 7: sort by a metric computed on each bucket
POST /book1/_search?size=
{
"aggs":{
"age_terms":{
"terms":{
"field":"age",
"order":{"max_age":"desc"}
},
"aggs":{
"max_age":{
"max":{
"field":"age"
}
},
"min_age":{
"min":{
"field":"age"
}
}
} } }
}
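Note: documents are first grouped by age, then the max and min of each group are computed, and finally the groups are sorted by max in descending order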
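The group, compute, then sort sequence described in the note can be sketched as follows. To make the ordering visible, the sketch groups by one field and measures another; the field names class and age, the sample documents, and the function name terms_ordered_by_max are hypothetical.

```python
from collections import defaultdict


def terms_ordered_by_max(docs, group_field, metric_field):
    """Bucket docs by `group_field`, compute max/min per bucket, sort by max descending."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "max_age": max(vals), "min_age": min(vals)}
        for key, vals in groups.items()
    ]
    buckets.sort(key=lambda bucket: bucket["max_age"], reverse=True)
    return buckets


docs = [
    {"class": "a", "age": 12},
    {"class": "a", "age": 30},
    {"class": "b", "age": 54},
    {"class": "c", "age": 1},
]
buckets = terms_ordered_by_max(docs, "class", "age")
print([bucket["key"] for bucket in buckets])
```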
Example 8: filtering buckets by matching values against regular expressions
POST book1/_search?size=
{
"aggs":{
"tags":{
"terms":{
"field":"name",
"include":"里*",
"exclude":"test*"
} } }
}
Result 8:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"tags": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": "里",
"doc_count":
}
]
}
}
}
Example 9: filtering buckets with an exact list of values
POST book1/_search?size=
{
"aggs":{
"Chinese":{
"terms":{
"field":"name",
"include":["里","国"]
} },
"Test":{
"terms":{
"field":"name",
"exclude":["test","the"]
}
} }
}
Result 9:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"Test": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": "里",
"doc_count":
},
{
"key": "否",
"doc_count":
},
{
"key": "a",
"doc_count":
},
{
"key": "default",
"doc_count":
},
{
"key": "document",
"doc_count":
},
{
"key": "for",
"doc_count":
},
{
"key": "absolute",
"doc_count":
},
{
"key": "account",
"doc_count":
},
{
"key": "accurate",
"doc_count":
},
{
"key": "documents",
"doc_count":
}
]
},
"Chinese": {
"doc_count_error_upper_bound": ,
"sum_other_doc_count": ,
"buckets": [
{
"key": "国",
"doc_count":
}
]
}
}
}
Example 10: group by a value computed by a script
POST book1/_search?size=
{
"aggs":{
"name":{
"terms":{
"script":{
"source":"doc['age'].value + doc.age.value",
"lang": "painless"
}
}
}
}
}
Note: a script can read the field value either as doc['age'].value or as doc.age.value
Result 10:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 41,
"max_score": 0,
"hits": []
},
"aggregations": {
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "24",
"doc_count": 16
},
{
"key": "2",
"doc_count": 11
},
{
"key": "0",
"doc_count": 8
},
{
"key": "22",
"doc_count": 1
},
{
"key": "26",
"doc_count": 1
},
{
"key": "28",
"doc_count": 1
},
{
"key": "32",
"doc_count": 1
},
{
"key": "42",
"doc_count": 1
},
{
"key": "66",
"doc_count": 1
}
]
}
}
}
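2. Filter Aggregation: aggregate over the documents that satisfy a filter query
Example 1: among the documents matched by the query, select those that satisfy the filter and aggregate over them, i.e. filter first and then aggregate (contrast this with Example 9 above, bucket filtering, which aggregates first and then filters)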
POST book1/_search?size=
{
"aggs":{
"age_terms":{
"filter":{
"match":{"name":"test"}
},
"aggs":{
"avg_age":{
"avg":{"field":"age" }
}
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count": ,
"avg_age": {
"value": 19.9
}
}
}
}
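The filter-then-aggregate order can be sketched in plain Python: narrow the documents with a predicate, then compute the metric on what remains. The function name filter_avg, the whitespace-token predicate standing in for a match query, and the sample documents are assumptions for illustration.

```python
def filter_avg(docs, predicate, field):
    """Keep docs satisfying `predicate`, then average `field` over the kept docs."""
    matched = [doc for doc in docs if predicate(doc)]
    values = [doc[field] for doc in matched if field in doc]
    return {
        "doc_count": len(matched),
        "avg": sum(values) / len(values) if values else None,
    }


docs = [
    {"name": "test one", "age": 10},
    {"name": "test two", "age": 30},
    {"name": "something else", "age": 50},
]
# Rough stand-in for match {"name": "test"}: the token "test" appears in the name.
result = filter_avg(docs, lambda doc: "test" in doc["name"].split(), "age")
print(result)
```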
3. Filters Aggregation: aggregate over several filter groups
Example 1: count the documents containing 'test' and those containing '里' separately
POST book1/_search?size=
{
"aggs":{
"age_terms":{
"filters":{
"filters":{
"test":{
"match":{"name":"test"}
},
"china":{
"match":{"name":"里"}
}
}
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"buckets": {
"china": {
"doc_count":
},
"test": {
"doc_count":
}
}
}
}
}
For instance, count the error and warning entries in a log index to drive log alerting:
GET logs/_search
{
"size": ,
"aggs": {
"messages": {
"filters": {
"filters": {
"errors": {
"match": {
"body": "error"
}
},
"warnings": {
"match": {
"body": "warning"
}
}
}
}
}
}
}
Example 2: specify a key for the bucket that collects all other documents
POST book1/_search?size=
{
"aggs":{
"age_terms":{
"filters":{
"other_bucket_key": "other_messages",
"filters":{
"test":{
"match":{"name":"test"}
},
"china":{
"match":{"name":"里"}
}
}
}
}
}
}
Result 2:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_terms": {
"buckets": {
"china": {
"doc_count":
},
"test": {
"doc_count":
},
"other_messages": {
"doc_count":
}
}
}
}
}
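A minimal sketch of named filter buckets with an optional catch-all bucket: each document is tested against every filter, so it may be counted in more than one bucket, and documents matching none go to the other bucket. The function name filters_agg, the substring predicates, and the sample documents are hypothetical.

```python
def filters_agg(docs, filters, other_bucket_key=None):
    """Count docs per named predicate; unmatched docs go to `other_bucket_key` if set."""
    buckets = {name: 0 for name in filters}
    other = 0
    for doc in docs:
        matched = False
        for name, predicate in filters.items():
            if predicate(doc):
                buckets[name] += 1
                matched = True
        if not matched:
            other += 1
    if other_bucket_key is not None:
        buckets[other_bucket_key] = other
    return buckets


docs = [{"name": "test a"}, {"name": "test 里"}, {"name": "plain"}]
filters = {
    "test": lambda doc: "test" in doc["name"],
    "china": lambda doc: "里" in doc["name"],
}
counts = filters_agg(docs, filters, other_bucket_key="other_messages")
print(counts)
```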
4. Range Aggregation: group by value ranges
Example 1:
POST book1/_search?size=
{
"aggs":{
"age_range":{
"range":{
"field":"age",
"keyed":true,
"ranges":[
{
"to":,
"key":"TW"
},
{
"from":,
"to":,
"key":"TH"
},
{
"from":,
"key":"SIX"
}
]
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"age_range": {
"buckets": {
"TW": {
"to": ,
"doc_count":
},
"TH": {
"from": ,
"to": ,
"doc_count":
},
"SIX": {
"from": ,
"doc_count":
}
}
}
}
}
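Range buckets include the from value and exclude the to value, which the sketch below makes explicit. The boundaries 10 and 20 and the keys low, mid and high are hypothetical, since the boundaries in the example above are not shown; range_agg is likewise an illustrative name.

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = {}
    for spec in ranges:
        lower = spec.get("from", float("-inf"))  # open-ended when "from" is absent
        upper = spec.get("to", float("inf"))     # open-ended when "to" is absent
        buckets[spec["key"]] = sum(1 for v in values if lower <= v < upper)
    return buckets


ranges = [
    {"to": 10, "key": "low"},
    {"from": 10, "to": 20, "key": "mid"},
    {"from": 20, "key": "high"},
]
counts = range_agg([1, 9, 10, 19, 20, 54], ranges)
print(counts)
```

Because the upper bound is exclusive, the value 10 lands in "mid" and 20 lands in "high", so adjacent ranges never double-count a value.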
5. Date Range Aggregation: group by date ranges
Example 1:
POST /bank/_search?size=
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{
"to": "now-10M/M"
},
{
"from": "now-10M/M"
}
]
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"range": {
"buckets": [
{
"key": "*-2017-08-01T00:00:00.000Z",
"to": ,
"to_as_string": "2017-08-01T00:00:00.000Z",
"doc_count":
},
{
"key": "2017-08-01T00:00:00.000Z-*",
"from": ,
"from_as_string": "2017-08-01T00:00:00.000Z",
"doc_count":
}
]
}
}
}
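6. Date Histogram Aggregation: time-based histogram (bar chart) aggregation
This aggregates by day, month, year and so on. Buckets can use the intervals year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), second (1s), or a custom time interval.
Example 1: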
POST /bank/_search?size=
{
"aggs": {
"sales_over_time": {
"date_histogram": {
"field": "date",
"interval": "month"
}
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"sales_over_time": {
"buckets": []
}
}
}
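Since the result above has no buckets to look at, here is a small sketch of what monthly bucketing does: each timestamp is truncated to its month and the months are counted. The function name monthly_histogram and the sample dates are hypothetical, and this sketch lists only months that contain at least one document.

```python
from collections import Counter
from datetime import datetime


def monthly_histogram(timestamps):
    """Count timestamps per calendar month, ordered chronologically."""
    counts = Counter(ts.strftime("%Y-%m") for ts in timestamps)
    return [{"key_as_string": month, "doc_count": n} for month, n in sorted(counts.items())]


timestamps = [
    datetime(2018, 1, 5),
    datetime(2018, 1, 20),
    datetime(2018, 3, 1),
]
buckets = monthly_histogram(timestamps)
print(buckets)
```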
7. Missing Aggregation: a bucket of the documents with a missing value
Example 1: count the documents that have no value for the field
POST /book/_search?size=
{
"aggs" : {
"account_without_age" : {
"missing" : { "field" : "age" }
}
}
}
Result 1:
{
"took": ,
"timed_out": false,
"_shards": {
"total": ,
"successful": ,
"skipped": ,
"failed":
},
"hits": {
"total": ,
"max_score": ,
"hits": []
},
"aggregations": {
"account_without_age": {
"doc_count":
}
}
}
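The missing bucket simply counts documents for which the field yields no value. A minimal sketch, treating an absent field, a null, and an empty list all as "no value"; the function name missing_count and the sample documents are hypothetical.

```python
def missing_count(docs, field):
    """Number of docs where `field` is absent, None, or an empty list."""
    return sum(1 for doc in docs if doc.get(field) in (None, []))


docs = [{"age": 12}, {"age": None}, {"name": "no age"}, {"age": []}, {"age": [1, 2]}]
without_age = missing_count(docs, "age")
print(without_age)
```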
8. Geo Distance Aggregation: bucket by geographic distance
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html