I'm building a web application that is connected to a MySQL database. I've got two huge tables, each containing about 40 million rows at the moment, and they receive new rows every day (roughly 500,000-1,000,000 rows per day).
The process that adds new rows runs during the night, while no one can use the application, and the new rows' content depends on the results of some basic SELECT queries on the current database. In order to get the results of those SELECT statements fast enough, I'm using simple indexes (one column per index) on each column that appears at least once in a WHERE clause.
The thing is, during the day, some totally different queries are run against those tables, including some with a "range WHERE clause" (SELECT * FROM t1 WHERE a = a1 AND b = b1 AND (date BETWEEN d1 AND d2)). I found on Stack this very helpful mini-cookbook that advises which INDEXes to use depending on how the database is queried: http://mysql.rjweb.org/doc.php/index_cookbook_mysql It advises using a compound index: for my example query above, that would give INDEX(a, b, date).
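For reference, creating that compound index would look roughly like this (the index name is just illustrative):
ALTER TABLE t1 ADD INDEX idx_a_b_date (a, b, date);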
It indeed increased the speed of the queries run during the day (from 1 minute down to 8 seconds, so I was truly happy).
However, with those compound indexes, the time required to add new rows during the night exploded (it would take more than a day to load the daily content).
Here is my question: would it be OK to drop all the indexes every night, add the new content, and then recreate the daytime indexes? Or would that be dangerous, since indexes are not meant to be rebuilt every day, especially on such big tables? I know such an operation would take approximately two hours in total (dropping and recreating the INDEXes).
I am aware of the existence of ALTER TABLE table_name DISABLE KEYS, but I'm using InnoDB and I believe it has no effect on InnoDB tables.
Any senior advice would be welcome! Thanks in advance.
2 Answers
#1
I believe you have answered your own question: You need the indexes during the day, but not at night. Given what you describe, you should drop the indexes for the bulk inserts at night and re-create them afterwards. Dropping indexes for data loads is not unheard of, and seems appropriate in your case.
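As a rough sketch of that nightly cycle, using the index from the example query (the index name is a placeholder):
ALTER TABLE t1 DROP INDEX idx_a_b_date;
-- ... run the nightly bulk load here ...
ALTER TABLE t1 ADD INDEX idx_a_b_date (a, b, date);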
I would ask how you are inserting the new data. One method is to insert the values one row at a time. Another is to put the values into a temporary table (with no indexes) and do a bulk insert:
insert into bigtable( . . .)
select . . .
from smalltable;
These have different performance characteristics. You might find that using a single insert (if you are not already doing so) is fast enough for your purposes.
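Concretely, a staging-table load might look something like this (table and column names are hypothetical):
-- staging table with no secondary indexes
CREATE TABLE staging_rows (a INT, b INT, date DATE, payload VARCHAR(255));
-- load the nightly data into staging_rows (e.g. with LOAD DATA INFILE), then:
INSERT INTO bigtable (a, b, date, payload)
SELECT a, b, date, payload
FROM staging_rows;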
#2
A digression... PARTITIONing by date should be very useful for you, since you are deleting rows that are over a year old. I would recommend PARTITION BY RANGE(TO_DAYS(...)) and breaking the table into 14 or 54 partitions (months or weeks, plus some overhead). This will eliminate the time it takes to delete the old rows, since DROP PARTITION is almost instantaneous.
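A hedged sketch of monthly partitioning on the example table (partition names and dates are arbitrary; note that the partitioning column must be part of every unique key, including the PRIMARY KEY):
ALTER TABLE t1
PARTITION BY RANGE (TO_DAYS(date)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    -- ... one partition per month, plus a catch-all ...
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- retiring a month that is over a year old is then nearly instantaneous:
ALTER TABLE t1 DROP PARTITION p202401;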
More details are in my partition blog. Your situation sounds like both Use case #1 and Use case #3.
But back to your clever idea of dropping and rebuilding indexes. For other readers, I point out the caveat that you have the luxury of leaving the table otherwise untouched for long enough to do the rebuild.
With PARTITIONing, all the rows being inserted will go into the 'latest' partition, correct? That partition is a lot smaller than the entire table, so there is a better chance that the indexes will fit in RAM and thereby be roughly 10 times as fast to update (without rebuilding the indexes). If you provide SHOW CREATE TABLE, SHOW TABLE STATUS, innodb_buffer_pool_size, and your RAM size, I can help you do the arithmetic to see whether your 'latest' partition will fit in RAM.
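To gather those numbers, something along these lines should do (with t1 standing in for your table name):
SHOW CREATE TABLE t1;
SHOW TABLE STATUS LIKE 't1';
SELECT @@innodb_buffer_pool_size;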
A note about index updates in InnoDB -- they are 'delayed' by sitting in the "change buffer", which is a portion of the buffer_pool. See innodb_change_buffer_max_size, available since 5.6. Are you using that version, or newer? (If not, you ought to upgrade, for many reasons.)
The default for that setting is 25, meaning that 25% of the buffer_pool is set aside for pending updates to indexes, as caused by INSERT, etc. That acts like a "cache", such that multiple updates to the same index block are held there until they get bumped out. A higher setting should make index updates hit the disk less often, and hence finish faster.
Where I am heading with this... By increasing this setting, you would make the inserts (direct, not rebuild) more efficient. I'm thinking that this might speed it up:
Just before the nightly INSERTs:
innodb_change_buffer_max_size = 70
innodb_old_blocks_pct = 10
Soon after the nightly INSERTs:
innodb_change_buffer_max_size = 25
innodb_old_blocks_pct = 37
(I am not sure about that other setting, but it seems reasonable to push it out of the way.)
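Both variables are dynamic, so the nightly toggle could be scripted roughly as follows (note that the documented maximum for innodb_change_buffer_max_size is 50, so the 70 suggested above would effectively be capped):
-- just before the nightly load
SET GLOBAL innodb_change_buffer_max_size = 50;
SET GLOBAL innodb_old_blocks_pct = 10;
-- soon after the load finishes
SET GLOBAL innodb_change_buffer_max_size = 25;
SET GLOBAL innodb_old_blocks_pct = 37;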
Meanwhile, what is the setting of innodb_buffer_pool_size? Typically, it should be 70% of available RAM.
In a similar application, I had big hourly dumps to load into a table, and a 90-day retention. I stretched my partitioning rules by having 90 daily partitions plus 24 hourly partitions. Every night, I spent a lot of time (but less than an hour) doing REORGANIZE PARTITION to turn the 24 hourly partitions into a new daily partition (and dropping the 90-day-old partition). During each hour, the load had the added advantage that nothing else was touching the 1-hour partition -- I could do normalization, summarization, and loading all in 7 minutes. The entire 90 days fit in 400GB. (Side note: a large number of partitions was a performance killer before 8.0, so don't even consider daily partitions for your 1-year retention.)
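To illustrate that kind of nightly roll-up (a hypothetical table partitioned by TO_SECONDS on a DATETIME column; only the last two hourly partitions are shown, and all names are invented):
-- merge the hourly partitions of 2024-01-15 into a single daily partition
ALTER TABLE facts REORGANIZE PARTITION h22, h23 INTO (
    PARTITION d20240115 VALUES LESS THAN (TO_SECONDS('2024-01-16 00:00:00'))
);
-- and drop the partition that has aged out of the 90-day retention
ALTER TABLE facts DROP PARTITION d20231017;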
The summary tables made it so that 50-minute queries (in the prototype) shrank to only 2 seconds. Perhaps you need a summary table with PRIMARY KEY (a, b, date)? That would let you get rid of such an index on the 'Fact' table. Oops, that eliminates the entire premise of your original question! See the links at the bottom of my blogs; look for "Summary Tables". A general rule: don't have any indexes (other than the PRIMARY KEY) on the Fact table; use summary tables for things that need messier indexes.
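For illustration, such a summary table might be sketched like this (the aggregate column is a pure guess about what the daytime queries need):
CREATE TABLE t1_summary (
    a INT NOT NULL,
    b INT NOT NULL,
    date DATE NOT NULL,
    row_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (a, b, date)
);
-- refreshed after each nightly load, e.g.:
INSERT INTO t1_summary (a, b, date, row_count)
SELECT a, b, date, COUNT(*)
FROM t1
WHERE date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY a, b, date;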