Querying or Reading Data
OpenTSDB offers several ways to extract data: CLI tools, an HTTP API, and GnuPlot graphs. Querying with OpenTSDB's tag-based system can be a bit tricky, so read through this document and check out the following pages for more detail. Example queries on this page follow the HTTP API format.
Query Components
OpenTSDB's query language is fairly simple but flexible. Each query has the following components:
Parameter | Data Type | Required | Description | Example |
---|---|---|---|---|
Start Time | String or Integer | Yes | Starting time for the query. This may be an absolute or relative time. See Dates and Times for details. | 24h-ago |
End Time | String or Integer | No | An end time for the query. If the end time is not supplied, the current time on the TSD will be used. See Dates and Times for details. | 1h-ago |
Metric | String | Yes | The full name of a metric in the system. Must be the complete name. Case sensitive. | sys.cpu.user |
Aggregation Function | String | Yes | A mathematical function used to combine multiple time series. | sum |
Tags | String | No | An optional set of tags for filtering or grouping. | host=*,dc=lax |
Downsampler | String | No | An optional interval and function to reduce the number of data points returned. | 1h-avg |
Rate | String | No | An optional flag to calculate the rate of change for the result. | rate |
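To show how these components fit together, here is a minimal Python sketch that assembles a query string from the example values in the table. The helper names build_m and build_query are hypothetical. The ordering inside the m parameter (aggregator, then rate, then downsampler, then metric and tags) is an assumption based on the HTTP API format used by the examples on this page, so check it against the API reference before relying on it.

```python
from urllib.parse import urlencode


def build_m(aggregator, metric, tags=None, downsample=None, rate=False):
    """Assemble the m= sub-query (assumed layout: agg[:rate][:downsample]:metric[{tags}])."""
    parts = [aggregator]
    if rate:
        parts.append("rate")
    if downsample:
        parts.append(downsample)
    tag_str = ""
    if tags:
        tag_str = "{" + ",".join(f"{k}={v}" for k, v in tags.items()) + "}"
    parts.append(metric + tag_str)
    return ":".join(parts)


def build_query(start, m, end=None):
    """Combine the required start time, optional end time and the m= sub-query."""
    params = {"start": start}
    if end is not None:
        params["end"] = end
    params["m"] = m
    # Keep the characters used by the query syntax readable instead of percent-encoding them.
    return urlencode(params, safe=":{}=*,|")


m = build_m("sum", "sys.cpu.user", tags={"host": "*", "dc": "lax"},
            downsample="1h-avg", rate=True)
query = build_query("24h-ago", m, end="1h-ago")
print(query)
```

Some HTTP clients reject unencoded braces in a URL; in that case drop the safe argument and let the characters be percent-encoded.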
Times
Absolute timestamps are supported in human-readable format or as Unix-style integers. Relative times may be used for refreshing dashboards. Currently, every query covers a single time span. In the future we hope to provide an offset query parameter that would allow aggregating or graphing a metric over different time periods, such as comparing last week to one year ago. See Dates and Times for details on what is permissible.
While OpenTSDB can store data with millisecond resolution, most queries return data with second resolution to stay backwards compatible with existing tools. Unless a down sampling algorithm has been specified with a query, the data is automatically down sampled to 1 second using the same aggregation function specified in the query. This way, if multiple data points are stored for a given second, they are aggregated and returned correctly in a normal query.
To extract data with millisecond resolution, use the /api/query endpoint and specify the msResolution JSON parameter or the ms query string flag. The query will then bypass down sampling (unless a downsampler is specified) and return all timestamps in Unix epoch millisecond resolution. Also, the scan command-line utility will return the timestamp as written in storage.
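As an illustration, the following Python sketch builds a JSON body that requests millisecond resolution. Only msResolution comes from the text above. The other field names (start, queries, aggregator, metric, tags) are written from memory of the 2.x HTTP API and should be treated as assumptions to verify against the /api/query reference.

```python
import json

# Request body for a POST to /api/query (field names other than
# msResolution are assumptions, see the note above).
body = {
    "start": 1356998400,
    "msResolution": True,  # ask for Unix epoch millisecond timestamps
    "queries": [
        {
            "aggregator": "sum",
            "metric": "sys.cpu.user",
            "tags": {"host": "webserver01"},
        }
    ],
}

payload = json.dumps(body)
print(payload)
```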
Tags
Every time series consists of a metric and one or more tag name/value pairs. Since tags are optional in queries, if you request only the metric name, then every time series for that metric, with any number or value of tags, will be included in the aggregated results. For example, if we have a stored data set:
sys.cpu.user host=webserver01,cpu=0 1356998400 1
sys.cpu.user host=webserver01,cpu=1 1356998400 4
sys.cpu.user host=webserver02,cpu=0 1356998400 2
sys.cpu.user host=webserver02,cpu=1 1356998400 1
and simply craft a query start=1356998400&m=sum:sys.cpu.user, we will get a value of 8 at 1356998400 that incorporates all 4 time series.
If we want to aggregate the results for a specific group, we can filter on the host tag. The query start=1356998400&m=sum:sys.cpu.user{host=webserver01} will return a value of 5, incorporating only the time series where host=webserver01. To drill down to a specific time series, you must include all of the tags for the series, e.g. start=1356998400&m=sum:sys.cpu.user{host=webserver01,cpu=0} will return 1.
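The three results above (8, 5 and 1) can be reproduced with a small in-memory simulation. This is only a sketch of the filtering semantics, not how the TSD is implemented; DATA and sum_at are hypothetical names.

```python
# The stored data set from the example: (metric, tags, timestamp, value).
DATA = [
    ("sys.cpu.user", {"host": "webserver01", "cpu": "0"}, 1356998400, 1),
    ("sys.cpu.user", {"host": "webserver01", "cpu": "1"}, 1356998400, 4),
    ("sys.cpu.user", {"host": "webserver02", "cpu": "0"}, 1356998400, 2),
    ("sys.cpu.user", {"host": "webserver02", "cpu": "1"}, 1356998400, 1),
]


def sum_at(data, metric, timestamp, tags=None):
    """Sum every series of `metric` at `timestamp` whose tags match the filter."""
    tags = tags or {}
    return sum(
        value
        for name, series_tags, ts, value in data
        if name == metric
        and ts == timestamp
        and all(series_tags.get(k) == v for k, v in tags.items())
    )


print(sum_at(DATA, "sys.cpu.user", 1356998400))                          # no tags
print(sum_at(DATA, "sys.cpu.user", 1356998400, {"host": "webserver01"}))  # one host
print(sum_at(DATA, "sys.cpu.user", 1356998400,
             {"host": "webserver01", "cpu": "0"}))                        # one series
```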
Grouping
A query can also aggregate time series with multiple tags into groups based on a tag value. Two special characters can be passed to the right of the equals symbol in a query:
- * - The asterisk will return a separate result for each unique tag value
- | - The pipe will return a separate result only for the exact tag values specified
Let's take the following data set as an example:
sys.cpu.user host=webserver01,cpu=0 1356998400 1
sys.cpu.user host=webserver01,cpu=1 1356998400 4
sys.cpu.user host=webserver02,cpu=0 1356998400 2
sys.cpu.user host=webserver02,cpu=1 1356998400 1
sys.cpu.user host=webserver03,cpu=0 1356998400 5
sys.cpu.user host=webserver03,cpu=1 1356998400 3
If we want to query for the average CPU time across each server, we can craft a query like start=1356998400&m=avg:sys.cpu.user{host=*}. This will give us three results, one for each unique value of the host tag:
- The aggregated average for sys.cpu.user host=webserver01,cpu=0 and sys.cpu.user host=webserver01,cpu=1
- The aggregated average for sys.cpu.user host=webserver02,cpu=0 and sys.cpu.user host=webserver02,cpu=1
- The aggregated average for sys.cpu.user host=webserver03,cpu=0 and sys.cpu.user host=webserver03,cpu=1
However, if we have many web servers in the system, this could create a ton of results. To filter on only the hosts we want, you can use the pipe operator to select a subset of time series. For example, start=1356998400&m=avg:sys.cpu.user{host=webserver01|webserver03} will return results only for webserver01 and webserver03. The pipe acts as a logical OR between the listed values.
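The grouping behaviour of the asterisk and the pipe can be sketched in Python as follows. The function group_avg and the ROWS list are hypothetical and only mimic the semantics described above for the example data set.

```python
from collections import defaultdict

# The example data set, all at timestamp 1356998400: (tags, value).
ROWS = [
    ({"host": "webserver01", "cpu": "0"}, 1),
    ({"host": "webserver01", "cpu": "1"}, 4),
    ({"host": "webserver02", "cpu": "0"}, 2),
    ({"host": "webserver02", "cpu": "1"}, 1),
    ({"host": "webserver03", "cpu": "0"}, 5),
    ({"host": "webserver03", "cpu": "1"}, 3),
]


def group_avg(rows, tag, spec):
    """Average values per tag value; '*' keeps every value, 'a|b' keeps only a and b."""
    wanted = None if spec == "*" else set(spec.split("|"))
    groups = defaultdict(list)
    for tags, value in rows:
        key = tags.get(tag)
        if key is None:
            continue  # series without the tag cannot be grouped on it
        if wanted is None or key in wanted:
            groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


print(group_avg(ROWS, "host", "*"))
print(group_avg(ROWS, "host", "webserver01|webserver03"))
```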
Aggregation
A powerful feature of OpenTSDB is the ability to perform on-the-fly aggregations of multiple time series into a single set of data points. The original data is always available in storage, but we can quickly extract the data in meaningful ways. An aggregation function merges two or more data points that share a timestamp into a single value. See Aggregators for details.
Interpolation
When performing an aggregation, what happens if the timestamps of the data points for each time series fail to line up? Say we record the temperature every 5 minutes in different regions around the world. A sensor in Paris may send a temperature of 27c at 1356998400. Then a sensor in San Francisco may send a value of 18c at 1356998430, 30 seconds later. Antarctica may report -29c at 1356998529. If we run a query requesting the average temperature, we want all of the data points averaged together into a single point. This is where interpolation comes into play. See Aggregators for details.
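One way to picture this is linear interpolation, sketched below in Python. This assumes linear interpolation between neighbouring points, which the text above does not specify (the Aggregators page is the authority on what each function does). The second Paris reading of 29 at 1356998700 is a made-up value for illustration only.

```python
def lerp(t, p0, p1):
    """Linearly interpolate a value at time t between points p0 and p1."""
    (t0, v0), (t1, v1) = p0, p1
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Paris: real point from the example, plus a hypothetical reading 5 minutes later.
paris = [(1356998400, 27), (1356998700, 29)]
# San Francisco reports 30 seconds after the first Paris reading.
san_francisco = (1356998430, 18)

# Estimate what Paris would have reported at San Francisco's timestamp,
# then average the two series at that shared timestamp.
paris_at_sf = lerp(san_francisco[0], paris[0], paris[1])
average = (paris_at_sf + san_francisco[1]) / 2
print(round(paris_at_sf, 3), round(average, 3))
```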
Downsampling
OpenTSDB can ingest a large amount of data, even a data point every second for a given time series. Queries may therefore return a large number of data points. Fetching the results of such a query from the API can eat up bandwidth. High-frequency data can easily overwhelm JavaScript graphing libraries, hence the choice to use GnuPlot. Graphs created by the GUI can be difficult to read, resulting in thick lines such as the graph below:
Down sampling can be used at query time to reduce the number of data points returned, so that you can extract better information from a graph or pass less data over a connection. Down sampling requires an aggregation function and a time interval. The aggregation function computes a single new data point from all of the data points in each interval. For example, if the aggregation sum is used, all of the data points within the interval are summed into a single value. If avg is chosen, the average of all data points within the interval is returned.
Intervals are specified by a number and a unit of time. For example, 30m will aggregate data points every 30 minutes, and 1h will aggregate across an hour. See Dates and Times for valid relative time units. Do not add the -ago suffix to a down sampling interval.
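The interval-plus-function idea can be sketched in Python as below. The names parse_downsampler and downsample are hypothetical, the unit table covers only a few units (the full list lives in Dates and Times), and aligning buckets with a modulus follows the 2.1 normalization behaviour described in this document rather than anything confirmed for other versions.

```python
import re
from statistics import mean

# Assumed subset of units; see Dates and Times for the full list.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
FUNCS = {"sum": sum, "avg": mean}


def parse_downsampler(spec):
    """Split a spec such as '30m-avg' into (interval_seconds, function_name)."""
    match = re.fullmatch(r"(\d+)([smhd])-(\w+)", spec)
    if match is None:
        raise ValueError(f"invalid downsampler: {spec!r}")
    count, unit, fn = match.groups()
    return int(count) * UNIT_SECONDS[unit], fn


def downsample(points, spec):
    """Group (timestamp_seconds, value) points into buckets and aggregate each bucket."""
    interval, fn = parse_downsampler(spec)
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % interval, []).append(value)
    return [(start, FUNCS[fn](values)) for start, values in sorted(buckets.items())]


# One point every 10 minutes, starting at 1356998400.
points = [(1356998400 + 600 * i, i + 1) for i in range(5)]
print(downsample(points, "30m-sum"))
print(downsample(points, "30m-avg"))
```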
Using down sampling we can clean up the previous graph to arrive at something much more useful:
As of 2.1, downsampled timestamps are normalized based on the remainder of the original data point timestamp divided by the downsampling interval in milliseconds, i.e. the modulus. In Java the code is timestamp - (timestamp % interval_ms). For example, given a timestamp of 1388550980000, or 1/1/2014 04:36:20 UTC and an hourly interval that equates to 3600000 milliseconds, the resulting timestamp will be rounded to 1388548800000. All data points between 4 and 5 UTC will wind up in the 4 AM bucket. If you query for a day's worth of data downsampling on 1 hour, you will receive 24 data points (assuming there is data for all 24 hours).
Normalization works very well for common queries such as a day's worth of data downsampled to 1 minute or 1 hour. However, if you try to downsample on an odd interval, such as 36 minutes, the timestamps may look a little strange due to the nature of the modulus calculation. Given an interval of 36 minutes and our example above, the interval would be 2160000 milliseconds and the resulting timestamp 1388549520000, or 04:12:00 UTC. All data points between 04:12 and 04:48 would wind up in a single bucket. Also note that OpenTSDB cannot currently normalize on non-UTC times and it cannot normalize on weekly or monthly boundaries.
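The modulus arithmetic can be checked with a few lines of Python. The function name normalize is hypothetical; the formula is the one quoted above, timestamp - (timestamp % interval_ms).

```python
from datetime import datetime, timezone


def normalize(timestamp_ms, interval_ms):
    """Round a millisecond timestamp down to the start of its downsampling bucket."""
    return timestamp_ms - (timestamp_ms % interval_ms)


def utc_time(timestamp_ms):
    """Format a millisecond timestamp as HH:MM:SS in UTC."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).strftime("%H:%M:%S")


ts = 1388550980000                  # 2014-01-01 04:36:20 UTC
hourly = normalize(ts, 3600000)     # 1 hour interval
odd = normalize(ts, 2160000)        # 36 minute interval

print(hourly, utc_time(hourly))
print(odd, utc_time(odd))
```

The hourly bucket starts on the hour, while the 36 minute bucket starts at 04:12 because bucket boundaries are counted from the Unix epoch, not from the start of the query.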
Rate
A number of data sources return values as constantly incrementing counters. One example is a web site hit counter. When you start a web server, it may have a hit counter of 0. After five minutes the value may be 1,024. After another five minutes it may be 2,048. The graph for a counter will be a somewhat straight line angling up to the right, which isn't always very useful. OpenTSDB provides the rate keyword, which calculates the rate of change in values over time. This transforms counters into lines with spikes that show when activity occurred, which can be much more useful.
The rate is the first derivative of the values. It is defined as (v2 - v1) / (t2 - t1), so you get the rate of change per second. Currently the rate of change between millisecond values defaults to a per second calculation.
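Applied to the hit counter example (0, then 1,024 after five minutes, then 2,048 after another five), the formula can be sketched in Python as follows. The function name rates is hypothetical, and timestamps are in seconds.

```python
def rates(points):
    """Return (t2, (v2 - v1) / (t2 - t1)) for each pair of consecutive points."""
    return [
        (t2, (v2 - v1) / (t2 - t1))
        for (t1, v1), (t2, v2) in zip(points, points[1:])
    ]


# Hit counter sampled every 5 minutes (300 seconds).
hits = [(0, 0), (300, 1024), (600, 2048)]
for timestamp, per_second in rates(hits):
    print(timestamp, round(per_second, 3))
```

A steadily climbing counter becomes a flat line of roughly 3.4 hits per second, so changes in activity show up as deviations from that line.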
OpenTSDB 2.0 provides special handling for monotonically increasing counter data, including the ability to set a "rollover" value and to suppress anomalous fluctuations. When the counterMax value is specified in a query, if a data point approaches this value and the following point is less than the previous one, the max value is used to calculate an accurate rate from the two points. For example, if we were recording an integer counter on 2 bytes, the maximum value would be 65,535. If the value at t0 is 64000 and the value at t1 is 1000, the difference would usually be calculated as -63000. However, we know that the counter most likely rolled over, so we can set the max to 65535. The calculation then becomes 65535 - 64000 + 1000 (the max, minus the value at t0, plus the value at t1), which gives us 2535.
Systems that track data in counters often revert to 0 when restarted. When that happens, we could get a spurious result when using the max counter feature. For example, if the counter has reached 2000 at t0 and someone reboots the server, the next value may be 500 at t1. If we set our max to 65535, the result would be 65535 - 2000 + 500, which gives us 64035. If the normal rate is a few points per second, this particular jump, with 30s between points, would create a rate spike of 2,134.5! To avoid this, we can set the resetValue: when the rate exceeds this value, a data point of 0 is returned instead, so as to avoid spikes in either direction. For the example above, if we know that our rate almost never exceeds 100, we could configure a resetValue of 100, and when the data point above is calculated it will return 0 instead of 2,134.5. The default value of 0 means the reset value is ignored and no rates are suppressed.
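The rollover and reset rules can be sketched in Python as below. This mirrors only the arithmetic given in the text; the function name counter_rate and its parameters are hypothetical, and the real TSD may differ in details (for example, whether a wrap adds one extra count).

```python
def counter_rate(t0, v0, t1, v1, counter_max=None, reset_value=0):
    """Per-second rate between two counter samples, with rollover and reset handling."""
    delta = v1 - v0
    if delta < 0 and counter_max is not None:
        # The counter went backwards: assume it rolled over at counter_max.
        delta = counter_max - v0 + v1
    rate = delta / (t1 - t0)
    if reset_value and rate > reset_value:
        # Rate is implausibly high: treat it as a counter reset and suppress it.
        return 0.0
    return rate


# Rollover: 64000 -> 1000 one second apart.
print(counter_rate(0, 64000, 1, 1000))                     # no counter_max
print(counter_rate(0, 64000, 1, 1000, counter_max=65535))  # rollover handled

# Restart: 2000 -> 500 thirty seconds apart.
print(counter_rate(0, 2000, 30, 500, counter_max=65535))
print(counter_rate(0, 2000, 30, 500, counter_max=65535, reset_value=100))
```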
Order of operations
Understanding the order of operations is important. When returning query results, processing takes place in the following order:
- Grouping
- Down Sampling
- Interpolation
- Aggregation
- Rate Calculations