Replication with lots of temporary table writes

Time: 2022-03-17 03:49:06

I've got a database which I intend to replicate for backup reasons (performance is not a problem at the moment).

We've set up the replication correctly and tested it and all was fine.

Then we realized that it replicates all the writes to the temporary tables, which in effect means that replicating one day's worth of data takes almost two hours on an otherwise idle slave.

The reason is that we recompute some of the data in our db via a cronjob every 15 minutes to ensure it's in sync. The recomputation takes ~3 minutes in total, so it is unacceptable to do it during a web request; instead, the web request just stores the modifications without attempting to recompute anything, and the cronjob then does all of the work in bulk. In order to process that data efficiently, we use temporary tables (as there are lots of interdependencies).

Now, the first problem is that temporary tables do not persist if we restart the slave while it's in the middle of processing transactions that use that temp table. That can be avoided by not using temporary tables, although this has its own issues.

The more serious problem is that the slave could easily catch up in less than half an hour if it weren't for all that recomputation (which it replays one run after the other, so there's no benefit in rebuilding the data every 15 minutes... you can literally see it stuck at, say, 11:15, only to quickly catch up and get stuck again at 11:30, etc.).

One solution we came up with is to move all that recomputation out of the replicated db, so that the slave doesn't replicate it. The disadvantage is that we'd also have to exclude the tables it eventually updates, making our slave in effect "castrated", i.e. we'd have to recompute everything on it before we could actually use it.

Has anyone had a similar problem, and how would you solve it? Am I missing something obvious?

2 answers

#1


3  

I've come up with a solution. It makes use of replicate-do-db mentioned by Nick. Writing it down here in case somebody has a similar problem.

The problem with just using replicate-(wild-)do* options in this case (like I said, we use temp tables to repopulate a central table) is that either you ignore temp tables and repopulate the central one with no data (which causes further problems as all the queries relying on the central table being up-to-date will produce different results) or you ignore the central table, which has a similar problem. Not to mention, you have to restart mysql after adding any of those options to my.cnf. We wanted something that would cover all those cases (and future ones) without the need for any further restart.

So, what we decided to do is to split the database into a "real" database and a "workarea" database. Only the "real" database is replicated (I guess you could instead decide on a table naming convention and use it with the replicate-wild-do-table syntax).

All the temporary table work happens in the "workarea" db. To avoid the dependency problem mentioned above, we don't populate the central table (which sits in the "real" db) by INSERT ... SELECT or RENAME TABLE, but rather query the tmp tables to generate a sort of diff against the live table (i.e. generate INSERT statements for new rows, DELETE statements for the old ones, and UPDATE statements where necessary).

This way the only queries that are replicated are exactly the updates that are required and nothing else, i.e. some (most?) of the recomputation queries happening every fifteen minutes might not even make their way to the slave, and the ones that do will be minimal and not computationally expensive at all, just simple INSERTs and DELETEs.
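To make the diff step a bit more concrete, here is a minimal Python sketch of the idea, not our actual code. It assumes both the live table and the recomputed data fit in memory as dicts keyed by primary key. The function names and the `live_db.central_table` table and its columns are made up for illustration.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal.

    Illustration only: real code should use parameterized queries
    instead of building SQL strings by hand.
    """
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def diff_statements(table, key, columns, live_rows, new_rows):
    """Return the statements that turn live_rows into new_rows.

    Both row arguments map a primary-key value to a tuple of column
    values, in the order given by `columns`.
    """
    statements = []

    # Rows that no longer exist in the recomputed data.
    for k in sorted(live_rows.keys() - new_rows.keys()):
        statements.append(
            f"DELETE FROM {table} WHERE {key} = {sql_literal(k)}"
        )

    for k in sorted(new_rows):
        values = new_rows[k]
        if k not in live_rows:
            # Brand new row.
            cols = ", ".join([key, *columns])
            vals = ", ".join(sql_literal(v) for v in (k, *values))
            statements.append(f"INSERT INTO {table} ({cols}) VALUES ({vals})")
        elif live_rows[k] != values:
            # Existing row: only touch the columns that changed.
            changed = [
                (col, new)
                for col, old, new in zip(columns, live_rows[k], values)
                if old != new
            ]
            assignments = ", ".join(
                f"{col} = {sql_literal(new)}" for col, new in changed
            )
            statements.append(
                f"UPDATE {table} SET {assignments} "
                f"WHERE {key} = {sql_literal(k)}"
            )
    return statements


# Hypothetical data: row 2 changed, row 3 vanished, row 4 is new.
live = {1: ("alice", 10), 2: ("bob", 5), 3: ("carol", 7)}
new = {1: ("alice", 10), 2: ("bob", 6), 4: ("dave", 1)}
for stmt in diff_statements(
    "live_db.central_table", "id", ["name", "score"], live, new
):
    print(stmt)
```

Unchanged rows (like row 1 here) produce no statement at all, which is the whole point: only these few statements hit the replicated db, while the expensive recomputation stays in the unreplicated one.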

#2


2  

In MySQL, as of 5.0 I believe, you can use table wildcards to replicate specific tables. There are a number of command-line options for this, but you can also set them in your MySQL config file.

[mysqld]
replicate-do-db         = db1
replicate-do-table      = db2.mytbl2
replicate-wild-do-table = database_name.%
replicate-wild-do-table = another_db.%

The idea is that you tell the slave not to replicate any tables other than the ones you specify.
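As a rough illustration of which tables the wildcard patterns above would select, here is a small Python sketch. It is a simplified model, not MySQL's actual filter evaluation. It only mimics the LIKE-style matching of replicate-wild-do-table patterns (% matches any run of characters, _ matches any single character), and it ignores backslash escapes, case sensitivity and the interaction with replicate-do-db and the other rules. The function names and the `workarea.tmp_totals` table are made up.

```python
import re


def wild_pattern_to_regex(pattern):
    """Translate a LIKE-style pattern into an anchored regex.

    % becomes "any run of characters", _ becomes "any single
    character"; everything else is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def is_replicated(qualified_name, wild_do_patterns):
    """True if db_name.table_name matches at least one pattern."""
    return any(
        wild_pattern_to_regex(p).match(qualified_name)
        for p in wild_do_patterns
    )


patterns = ["database_name.%", "another_db.%"]
for name in ["database_name.users", "another_db.orders", "workarea.tmp_totals"]:
    print(name, is_replicated(name, patterns))
```

Note that an unescaped _ in a pattern is itself a wildcard, so in this model `database_name.%` also matches a database called `databaseXname`.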

