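Detecting master-replica data inconsistencies with pt-table-checksum

Test environment: master-replica replication, running on Linux

Before running pt-table-checksum, install the following dependency packages: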

yum install perl-IO-Socket-SSL perl-DBD-MySQL perl-Time-HiRes -y

1. Simulate a master-replica inconsistency:

On the master, create a new table and insert a few rows:

mysql> create table t1(id int primary key  not null,name char() not null );

Query OK,  rows affected (0.00 sec)

mysql> insert into t1(id,name) values(,'a'),(,'b'),(,'c');

Query OK,  rows affected (0.00 sec)

Records:   Duplicates:   Warnings: 

mysql> select * from t1;

+----+------+

| id | name |

+----+------+

|   | a    |

|   | b    |

|   | c    |

+----+------+

 rows in set (0.00 sec)

By now the rows have been replicated to the replica:

mysql> select * from t1;

+----+------+

| id | name |

+----+------+

|   | a    |

|   | b    |

|   | c    |

+----+------+

Then insert two rows directly on the replica, so that its data no longer match the master:

mysql> insert into t1(id,name) values(,'d'),(,'e');

Query OK,  rows affected (0.00 sec)

Records:   Duplicates:   Warnings: 

mysql> select * from t1;

+----+------+

| id | name |

+----+------+

|   | a    |

|   | b    |

|   | c    |

|   | d    |

|   | e    |

+----+------+

 rows in set (0.00 sec)

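2. Detect the inconsistency with pt-table-checksum:

How pt-table-checksum works: you run pt-table-checksum on the master, where it uses a series of MySQL functions to compute a hash value for each table. Through the replication link, the same computation is replayed on the replica, so the master and the replica each end up with their own hash value, and comparing the two shows whether the data match. For this reason one account must have the required privileges on both the master and the replica: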

mysql> GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE ON *.* TO 'root'@'%' IDENTIFIED BY '******';

Query OK,  rows affected (0.00 sec)

mysql> flush privileges;

Query OK,  rows affected (0.00 sec)
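The checksum-and-compare idea described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: zlib.crc32 stands in for the MySQL functions the tool uses, and the function name chunk_checksum and the id values are hypothetical.

```python
import zlib

def chunk_checksum(rows):
    """Return (crc, row_count) for a list of (id, name) rows.

    Rows are sorted first so both servers hash them in the same order.
    """
    crc = 0
    for row_id, name in sorted(rows):
        crc = zlib.crc32(f"{row_id}#{name}".encode(), crc)
    return format(crc, "08x"), len(rows)

# Hypothetical data: the replica has two extra rows, as in section 1.
master_rows = [(1, "a"), (2, "b"), (3, "c")]
replica_rows = master_rows + [(4, "d"), (5, "e")]

master_crc, master_cnt = chunk_checksum(master_rows)
this_crc, this_cnt = chunk_checksum(replica_rows)

# The data are considered in sync only if both checksum and count match.
in_sync = (master_crc, master_cnt) == (this_crc, this_cnt)
print("in sync:", in_sync)
```

The names master_crc, master_cnt, this_crc and this_cnt mirror the columns of the checksums table queried later in this article.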

Then run the following command on the master:

[root@darren]# /data/mysql/bin/pt-table-checksum --nocheck-binlog-format --nocheck-replication-filters  --replicate=test.checksums  --databases=report --user=root --password='******' --port=

Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE

-29T12::                                     0.003 report.t1

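--nocheck-replication-filters : do not check for replication filters; recommended. You can then use --databases to specify the databases to check.

--no-check-binlog-format      : do not check the binlog format. Without this option the tool reports an error when the binlog format is ROW.

--replicate-check-only : show only the out-of-sync information.

--replicate=test.checksums   : write the checksum information to the specified table.

--databases=   : the databases to check; separate multiple names with commas.

--tables=      : the tables to check; separate multiple names with commas.

h=127.0.0.1    : address of the master

u=root         : user name

p=******       : password

P=         : port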
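When the check is scripted, the options described above can be assembled programmatically. The following is a minimal sketch, assuming pt-table-checksum is on the PATH; the helper name build_checksum_cmd is hypothetical, and the user, password and port options are deliberately left out because their values are site-specific.

```python
import shlex

def build_checksum_cmd(databases, tables=None, replicate="test.checksums"):
    """Build a pt-table-checksum argument list from the options in this article."""
    cmd = [
        "pt-table-checksum",
        "--nocheck-binlog-format",
        "--nocheck-replication-filters",
        f"--replicate={replicate}",
        "--databases=" + ",".join(databases),
    ]
    if tables:
        # --tables is optional; multiple names are comma-separated.
        cmd.append("--tables=" + ",".join(tables))
    return cmd

cmd = build_checksum_cmd(["report"], tables=["t1"])
# shlex.join produces a copy-pasteable shell command for review.
print(shlex.join(cmd))
```

Keeping the command as a list, rather than one string, avoids quoting problems if it is later passed to subprocess.run.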

Error message: Diffs cannot be detected because no slaves were found. Please read the --recursion-method documentation for information.

This message means the tool could not find any replica. The --recursion-method option selects how replicas are discovered; its possible values are:

METHOD       USES

===========  =============================================

processlist  SHOW PROCESSLIST

hosts        SHOW SLAVE HOSTS

cluster      SHOW STATUS LIKE 'wsrep_incoming_addresses'

dsn=DSN      DSNs from a table

none         Do not find slaves

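By default the tool looks up replica hosts through SHOW PROCESSLIST. Since that finds nothing here, we switch to another method, hosts. It requires adding the following two parameters to my.cnf on the replica; to be safe, restart the MySQL service afterwards: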

report_host=slave_ip

report_port=slave_port

Then add the --recursion-method=hosts option and rerun the command on the master:

[root@--- backup]# /data/mysql/bin/pt-table-checksum --nocheck-binlog-format  --nocheck-replication-filters --recursion-method=hosts  --replicate=test.checksums  --databases=report --user=root --password='******' --port=

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE

-29T13::                                     0.008 report.t1

This time the run succeeds and the inconsistency is detected: DIFFS=1. The output columns mean the following:

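TS            : time at which the check finished.

ERRORS        : number of errors and warnings raised during the check.

DIFFS         : 0 means consistent; a value greater than 0 means inconsistent. With --no-replicate-check it is always 0; with --replicate-check-only only the differing entries are shown.

ROWS          : number of rows in the table.

CHUNKS        : number of chunks the table was divided into.

SKIPPED       : number of chunks skipped because of errors, warnings, or excessive size.

TIME          : execution time.

TABLE         : name of the checked table.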
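For monitoring scripts it can help to turn one result line into named fields. The sketch below assumes the eight whitespace-separated columns listed above; the function name parse_result_line is hypothetical, and the sample line is made up for illustration (its timestamp and row count are not taken from the runs in this article).

```python
COLUMNS = ["TS", "ERRORS", "DIFFS", "ROWS", "CHUNKS", "SKIPPED", "TIME", "TABLE"]

def parse_result_line(line):
    """Parse one pt-table-checksum result line into a dict with typed values."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    for key in ("ERRORS", "DIFFS", "ROWS", "CHUNKS", "SKIPPED"):
        row[key] = int(row[key])
    row["TIME"] = float(row["TIME"])
    return row

# Hypothetical sample line, not real output.
sample = "01-01T00:00:00      0      1        3       1       0   0.008 report.t1"
result = parse_result_line(sample)
if result["DIFFS"] > 0:
    print(f"{result['TABLE']} is out of sync ({result['DIFFS']} differing chunk(s))")
```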

Where exactly is the inconsistency? Query the table named in --replicate=test.checksums on the replica:

mysql> select * from checksums;

+--------+-----+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+

| db     | tbl | chunk | chunk_time | chunk_index | lower_boundary | upper_boundary | this_crc | this_cnt | master_crc | master_cnt | ts                  |

+--------+-----+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+

| report | t1  |      |   0.000416 | NULL        | NULL           | NULL           | 5ef4701a |         | 28312abb   |           | -- :: |

+--------+-----+-------+------------+-------------+----------------+----------------+----------+----------+------------+------------+---------------------+

 row in set (0.00 sec)

/* If there are many records, use the following SQL: */
mysql> SELECT db, tbl, SUM(master_cnt) AS master_rows,SUM(this_cnt) AS slave_rows, COUNT(*) AS chunks FROM test.checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
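The filter and grouping in that query can be expressed in Python for rows that have already been fetched from the checksums table. This is a rough sketch: the function name summarize_diffs is hypothetical, the two CRC values are the ones shown above, and the row counts (3 and 5) are assumed for illustration. Python's != already treats None as different from any value, which covers the ISNULL clause of the SQL.

```python
from collections import defaultdict

def summarize_diffs(rows):
    """Group differing chunks by (db, tbl), like the SQL summary query."""
    summary = defaultdict(lambda: {"master_rows": 0, "slave_rows": 0, "chunks": 0})
    for r in rows:
        differs = (r["master_cnt"] != r["this_cnt"]
                   or r["master_crc"] != r["this_crc"])
        if differs:
            entry = summary[(r["db"], r["tbl"])]
            entry["master_rows"] += r["master_cnt"] or 0
            entry["slave_rows"] += r["this_cnt"] or 0
            entry["chunks"] += 1
    return dict(summary)

# Hypothetical rows; only the first one differs.
rows = [
    {"db": "report", "tbl": "t1", "this_crc": "5ef4701a", "this_cnt": 5,
     "master_crc": "28312abb", "master_cnt": 3},
    {"db": "report", "tbl": "t2", "this_crc": "aaaaaaaa", "this_cnt": 10,
     "master_crc": "aaaaaaaa", "master_cnt": 10},
]
diffs = summarize_diffs(rows)
print(diffs)
```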

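The this_cnt and master_cnt columns show the row counts on the replica and on the master, which differ here. How do we repair this? With pt-table-sync, as shown next.

3. Repair the inconsistency: pt-table-sync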

Usage: pt-table-sync [OPTIONS] DSN [DSN]

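Common options:

--sync-to-master    : takes a single DSN, the address of the replica; the tool then finds the master automatically through SHOW PROCESSLIST or SHOW SLAVE STATUS.

--replicate=        : the table produced by pt-table-checksum; the two tools are almost always used together.

--databases=        : the databases to sync; separate multiple names with commas.

--tables=           : the tables to sync; separate multiple names with commas.

--host=127.0.0.1    : server address. When a command contains two addresses, the first is the master and the second is the replica.

--user=root         : account.

--password=*****    : password.

--port=3306         : port.

--print             : print the statements without executing them.

--execute           : execute the statements.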

Run the repair command:

[root@DARREN]# /data/mysql/bin/pt-table-sync --replicate=test.checksums --sync-to-master  h=10.10.101.11,u=root,p='******',P=  --print

DELETE FROM `report`.`t1` WHERE `id`='' LIMIT  /*percona-toolkit src_db:report src_tbl:t1 src_dsn:P=3306,h=10.10.101.11,p=...,u=root dst_db:report dst_tbl:t1 dst_dsn:P=3307,h=10.10.101.11,p=...,u=root lock:1 transaction:1 changing_src:test.checksums replicate:test.checksums bidirectional:0 pid:5901 user:root host:10-10-101-11*/;

DELETE FROM `report`.`t1` WHERE `id`='' LIMIT  /*percona-toolkit src_db:report src_tbl:t1 src_dsn:P=3306,h=10.10.101.11,p=...,u=root dst_db:report dst_tbl:t1 dst_dsn:P=3307,h=10.10.101.11,p=...,u=root lock:1 transaction:1 changing_src:test.checksums replicate:test.checksums bidirectional:0 pid:5901 user:root host:10-10-101-11*/;

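This prints the repair plan: the replica has two extra rows, so the tool generates two DELETE statements. With a large data set, be patient: one of our production databases is about 15 GB with many inconsistent rows, and the run took about 40 minutes. You can execute the statements by hand, or let the tool apply them with --execute.

[root@DARREN]# /data/mysql/bin/pt-table-sync --replicate=test.checksums --sync-to-master  h=10.10.101.11,u=root,p='******',P=  --execute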

[root@DARREN]#

The command finished without reporting any error. Run pt-table-checksum again to verify that the data are now consistent:

[root@--- backup]# /data/mysql/bin/pt-table-checksum --nocheck-binlog-format  --nocheck-replication-filters --recursion-method=hosts  --replicate=test.checksums  --databases=report --user=root --password='*****' --port=

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE

-29T14::                                     0.005 report.t1

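Now DIFFS=0, which means the data are consistent again. Quite a handy tool, isn't it?

[PS]: Production data sets are large and inconsistencies can be complicated, so simply running the tool does not always give good results. It is better to print the repair statements with --print first and then execute them by hand; automatic execution cannot be monitored and goes wrong more easily.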
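To make the printed statements easier to review before running them by hand, the trailing percona-toolkit comment can be stripped. A small sketch follows, assuming each statement ends with a /*percona-toolkit ...*/; trailer as in the output shown earlier; the function name clean_statements is hypothetical, and the sample lines use made-up id and LIMIT values and a shortened trailer.

```python
import re

# Matches the trailing "/*percona-toolkit ...*/;" comment on a printed statement.
TRAILER = re.compile(r"\s*/\*percona-toolkit .*?\*/;\s*$")

def clean_statements(print_output):
    """Return the SQL statements from --print output without the trailer."""
    statements = []
    for line in print_output.splitlines():
        line = line.strip()
        if line:
            statements.append(TRAILER.sub(";", line))
    return statements

# Hypothetical sample modelled on the output format shown above.
sample_output = (
    "DELETE FROM `report`.`t1` WHERE `id`='4' LIMIT 1 "
    "/*percona-toolkit src_db:report src_tbl:t1*/;\n"
    "\n"
    "DELETE FROM `report`.`t1` WHERE `id`='5' LIMIT 1 "
    "/*percona-toolkit src_db:report src_tbl:t1*/;\n"
)
for stmt in clean_statements(sample_output):
    print(stmt)
```

The cleaned list can be saved to a file and inspected, for example to count how many DELETE statements would run, before anything is applied to the replica.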

4. Production test:

A production database with about 15 GB of data; the pt-table-checksum result was as follows:

Elapsed time: about 10 minutes

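This shows how capable pt-table-checksum is; it is well ahead of comparable tools, and it handles this data volume without strain. The main reason is that it compares table data chunk by chunk. For the same reason DIFFS is not necessarily 1: a large table may be split into dozens or hundreds of chunks, and DIFFS then reflects how many of those chunks differ.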
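Why DIFFS depends on chunking can be seen in a small simulation. This is a simplified sketch, not the tool's real chunking algorithm: rows are grouped into fixed-size id ranges, zlib.crc32 stands in for the MySQL checksum functions, and all names and data are hypothetical.

```python
import zlib

def chunk_rows(rows, chunk_size):
    """Group (id, name) rows into chunks by id range: 1..n, n+1..2n, ..."""
    chunks = {}
    for row in sorted(rows):
        chunks.setdefault((row[0] - 1) // chunk_size, []).append(row)
    return chunks

def checksum(rows):
    """Return (crc, count) for one chunk."""
    crc = 0
    for row in rows:
        crc = zlib.crc32(repr(row).encode(), crc)
    return crc, len(rows)

def count_diffs(master, replica, chunk_size):
    """Count the chunks whose checksum or row count differs."""
    m = chunk_rows(master, chunk_size)
    r = chunk_rows(replica, chunk_size)
    return sum(1 for idx in set(m) | set(r)
               if checksum(m.get(idx, [])) != checksum(r.get(idx, [])))

master = [(i, f"name{i}") for i in range(1, 11)]
# The replica is missing the rows with id 2 and id 9.
replica = [row for row in master if row[0] not in (2, 9)]

print(count_diffs(master, replica, chunk_size=5))   # two chunks are affected
print(count_diffs(master, replica, chunk_size=10))  # one chunk covers everything
```

The same two missing rows give a different count depending on chunk size, which is why DIFFS should be read as a number of differing chunks rather than a number of differing rows.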

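5. Finally, a summary of the problems encountered while using these tools:

1) A table without a unique index or primary key causes an error: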

Can't make changes on the master because no unique index exists at /usr/local/bin/pt-table-sync line 10591.

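2) If the binlog format is ROW (the sample error below was raised for MIXED), add the --no-check-binlog-format option; otherwise the following error is reported: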

Replica MySQL- has binlog_format MIXED which could cause pt-table-checksum to break replication. Please read "Replicas using row-based replication" 
in the LIMITATIONS section of the tool's documentation. If you understand the risks, specify --no-check-binlog-format to disable this check.
