I'd like to ask a question about how to improve performance of a big MySQL table that uses the InnoDB engine:
There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:
CREATE TABLE sns_value (
value_id int(11) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
KEY idx_sensor_id (sensor_id),
KEY idx_date (date),
KEY idx_type_id (type_id) );
At first, I thought of partitioning the table by month, but due to the steady addition of new sensors it would reach its current size in about a month.
Another solution I came up with was partitioning the table by sensor. However, MySQL's limit of 1024 partitions ruled that option out.
I believe that the right solution would be using a table with the same structure for each of the sensors:
sns_value_XXXXX
sns_value_XXXXX
This way there would be more than 1,000 tables, each with an estimated size of 30 million rows per year. These tables could, at the same time, be partitioned by month for faster access to the data.
What problems would result from this solution? Is there a more normalized solution?
Edit with additional information
I consider the table to be big in relation to my server:
- Cloud server with 2 CPUs and 8 GB of memory
- LAMP (CentOS 6.5 and MySQL 5.1.73)
Each sensor may have more than one variable type (CO, CO2, etc.).
I mainly have two slow queries:
1) Daily summary for each sensor and type (avg, max, min):
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id LIMIT 2000;
This takes more than 5 min.
2) Vertical to Horizontal view and export:
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date LIMIT 4500;
This also takes more than 5 min.
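The `1 - abs(sign(type_id - K))` factors in the second query are just arithmetic indicators: each evaluates to 1 when type_id equals K and to 0 otherwise, so the SUMs pivot one type per output column. Here is a minimal sketch of that equivalence, plus the same pivot written with the more conventional CASE form. SQLite in memory is used as a stand-in for MySQL, and the rows are made up for illustration:

```python
import sqlite3

# 1 - abs(sign(t - k)) is 1 iff t == k, i.e. an equality indicator.
def indicator(type_id, k):
    diff = type_id - k
    sign = (diff > 0) - (diff < 0)   # -1, 0, or 1
    return 1 - abs(sign)

assert indicator(101, 101) == 1
assert indicator(141, 101) == 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sns_value (sensor_id INT, type_id INT, ts TEXT, value INT)")
rows = [(1, 101, "2014-10-28 12:30:00", 10),
        (1, 141, "2014-10-28 12:30:00", 20),
        (1, 151, "2014-10-28 12:30:00", 30),
        (1, 101, "2014-10-28 12:31:00", 40)]
conn.executemany("INSERT INTO sns_value VALUES (?,?,?,?)", rows)

# CASE-based pivot, equivalent to the sign/abs arithmetic in the question
# (grouping by timestamp only, since the sensor is fixed in the WHERE clause).
pivot = conn.execute("""
    SELECT ts,
           SUM(CASE WHEN type_id = 101 THEN value ELSE 0 END) AS one,
           SUM(CASE WHEN type_id = 141 THEN value ELSE 0 END) AS two,
           SUM(CASE WHEN type_id = 151 THEN value ELSE 0 END) AS three
    FROM sns_value
    WHERE sensor_id = 1
    GROUP BY ts
    ORDER BY ts
""").fetchall()
```

Either spelling computes the same sums; the CASE form is easier to read and is the idiom most optimizers are written to recognize.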
Other considerations
- Timestamps may be repeated due to the characteristics of the inserts.
- Periodic inserts must coexist with selects.
- No updates or deletes are performed on the table.
Assumptions made for the "one table for each sensor" approach
- Tables for each sensor would be much smaller so access would be faster.
- Selects would be performed on only one table per sensor.
- Selects mixing data from different sensors are not time-critical.
Update 02/02/2015
We have created a new table for each year of data, which we have also partitioned on a daily basis. Each table has around 250 million rows with 365 partitions. The new index used is as Ollie suggested (sensor_id, date, type_id, value), but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).
In order to be able to partition the table, the primary key had to be removed.
Are we missing something? Is there a way to improve the performance?
Many thanks!
3 Answers
#1
1
Edited based on changes to the question
One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:
- MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many simultaneous file accesses.
- You'll have to create tables each time you add (or delete) sensors.
- Queries that involve data from multiple sensors will be slow and convoluted.
My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id
primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
(Avoid the column name date if you can: it's a reserved word and you'll have lots of trouble writing queries. Instead I suggest you use ts, meaning timestamp.)
Beware: int(11) values aren't big enough for your value_id column. You're going to run out of ids. Use bigint(20) for that column.
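A quick back-of-the-envelope check of that limit (plain arithmetic; the current row count is the ~200 million figure from the question):

```python
# MySQL's signed INT is 4 bytes regardless of the (11) display width,
# so an AUTO_INCREMENT value_id tops out at 2**31 - 1.
int_max = 2**31 - 1
current_rows = 200_000_000        # row count given in the question
headroom = int_max - current_rows  # ids left before overflow
```

At the ingestion rates described, under two billion remaining ids can disappear surprisingly quickly, which is why bigint(20) is the safer choice.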
You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
type_id
FROM sns_value
WHERE sensor_id=1
AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id LIMIT 2000;
For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id. Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you, certainly faster than 5 minutes even with a large table.
In your second query, a similar indexing strategy will work.
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date
LIMIT 4500;
Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four-column index I mentioned should work for you.
CREATE TABLE sns_value (
value_id bigint(20) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
INDEX query_opt (sensor_id, ts, type_id, value)
);
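The covering behavior can be sketched end to end with SQLite in memory as a stand-in for MySQL/InnoDB (the rows are made up, and SQLite's planner output differs from MySQL's EXPLAIN, but the idea is the same: every column the query touches lives in the index, so the engine never has to visit the table rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sns_value (
    value_id INTEGER PRIMARY KEY AUTOINCREMENT,
    sensor_id INT NOT NULL,
    type_id INT NOT NULL,
    ts TEXT NOT NULL,
    value INT NOT NULL)""")
# The four-column compound index from the answer above.
conn.execute("CREATE INDEX query_opt ON sns_value (sensor_id, ts, type_id, value)")

rows = [(1, 101, "2014-10-29 01:00:00", 10),
        (1, 101, "2014-10-29 02:00:00", 30),
        (1, 141, "2014-10-29 03:00:00", 5),
        (1, 101, "2014-10-29 13:00:00", 99),   # outside the date range
        (2, 101, "2014-10-29 01:00:00", 77)]   # different sensor
conn.executemany(
    "INSERT INTO sns_value (sensor_id, type_id, ts, value) VALUES (?,?,?,?)", rows)

query = """SELECT round(avg(value)) AS mean, min(value) AS min,
                  max(value) AS max, type_id
           FROM sns_value
           WHERE sensor_id = 1
             AND ts BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
           GROUP BY type_id ORDER BY type_id"""
results = conn.execute(query).fetchall()

# The plan names the compound index; because the index contains every
# column referenced, the scan is "covering" and skips the base table.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
```

Only the in-range rows for sensor 1 contribute, so the summary comes out as type 101 (mean 20, min 10, max 30) and type 141 (all 5s).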
#2
0
Creating separate table for a range of sensors would be an idea.
Do not use auto_increment for the primary key if you don't have to. The DB engine usually clusters the data by its primary key.
Use a composite key instead; depending on your use case, the sequence of columns may differ.
EDIT: I also added the type into the PK. Considering the queries, I would do it like this. The choice of field names is intentional: they should be descriptive, and always watch out for reserved words.
CREATE TABLE snsXX_readings (
sensor_id int(11) NOT NULL,
reading int(11) NOT NULL,
reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
type_id int(11) NOT NULL,
PRIMARY KEY (reading_time, sensor_id, type_id),
KEY idx_reading_time (reading_time),
KEY idx_type_id (type_id)
);
Also, consider summarizing the readings or grouping them into a single field.
#3
0
You can try getting randomized summary data.
I have a similar table: MyISAM engine (smallest table size), 10 million records, and no index on the table, because indexing proved useless when tested. Fetching across the whole date range, this query returns in about 10 seconds:
SELECT * FROM (
SELECT sensor_id, value, date
FROM sns_value l
WHERE l.sensor_id= 123 AND
(l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29')
ORDER BY RAND() LIMIT 2000
) as tmp
ORDER BY tmp.date;
This query on first step get between dates and sorting randomize first 2k data, on the second step sort data. the query every time get 2k result for different data.
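The shape of that sampling query (randomize inside a subquery, then re-sort by date outside) can be sketched like this, again with SQLite in memory as a stand-in and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sns_value (sensor_id INT, value INT, date TEXT)")
rows = [(123, 10, "2014-01-01 00:00:00"),
        (123, 20, "2014-02-01 00:00:00"),
        (123, 30, "2014-03-01 00:00:00"),
        (123, 40, "2014-04-01 00:00:00"),
        (123, 50, "2014-05-01 00:00:00"),
        (999, 60, "2014-06-01 00:00:00")]   # different sensor, must not appear
conn.executemany("INSERT INTO sns_value VALUES (?,?,?)", rows)

# Inner query: filter by sensor and date range, keep 3 rows at random.
# Outer query: put the sample back in chronological order.
# (SQLite spells the function RANDOM(); MySQL uses RAND().)
sample = conn.execute("""
    SELECT * FROM (
        SELECT sensor_id, value, date
        FROM sns_value l
        WHERE l.sensor_id = 123
          AND l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29'
        ORDER BY RANDOM() LIMIT 3
    ) AS tmp
    ORDER BY tmp.date""").fetchall()
```

Which 3 rows come back varies per run, but they are always drawn from the filtered range and returned in date order, which is the property the answer relies on.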