周五同事监控报警,有个MySQL从库复制状态异常,让我帮忙排查下,经过排查发现是MySQL5.6并行复制的一个Bug所致,具体处理过程如下:
一、错误信息
登录mysql从库服务器,检查复制状态
mysql> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: xx.xx.xx.xx
Master_User: repuser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.001689
Read_Master_Log_Pos: 145325388
Relay_Log_File: mysql2-relay-bin.004390
Relay_Log_Pos: 147850306
Relay_Master_Log_File: mysql-bin.001673
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1755
Last_Error: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name ./mysql2-relay-bin.004392, position 238 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master.
Skip_Counter: 0
Exec_Master_Log_Pos: 147850096
Relay_Log_Space: 4440639819
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1755
Last_SQL_Error: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name ./mysql2-relay-bin.004392, position 238 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 46
Master_UUID: f6d575b8-6a5b-11e5-8313-0050568d23c9
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 180314 01:01:17
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: f6d575b8-6a5b-11e5-8313-0050568d23c9:9393159-70288025
Executed_Gtid_Set: 33a835bb-6a5a-11e5-8308-0050568d685b:1-146,
f6d575b8-6a5b-11e5-8313-0050568d23c9:1-69584557
Auto_Position: 1
1 row in set (0.01 sec)
mysql错误日志内容如下
018-03-14 01:01:16 26349 [Note] Event Scheduler: Loaded 0 events
2018-03-14 01:01:16 26349 [Note] /data/mysql/bin/mysqld: ready for connections.
Version: '5.6.16-log' socket: '/tmp/mysql.sock' port: 3306 Source distribution
2018-03-14 01:01:17 26349 [ERROR] Slave SQL: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name ./mysql2-relay-bin.004392, position 238 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master. Error_code: 1755
2018-03-14 01:01:17 26349 [Warning] Slave: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name ./mysql2-relay-bin.004392, position 238 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master. Error_code: 1755
2018-03-14 01:01:17 26349 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.001673' position 147850096
二、错误原因
从错误日志字面意思理解,出现这个错误是因为数据库开启了并行复制,一些event group在并行复制下,不支持从而导致复制失败,主从状态异常。
检查数据库版本:
mysql> select version();
+------------+
| version() |
+------------+
| 5.6.16-log |
+------------+
1 row in set (0.07 sec)
检查数据库并行复制信息:
mysql> show variables like '%parall%';
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| slave_parallel_workers | 1 |
+------------------------+-------+
1 row in set (0.05 sec)
官方搜索错误信息,发现此错误是一bug,具体bug编号为Bug #71495、Bug #72537
https://bugs.mysql.com/bug.php?id=71495
https://bugs.mysql.com/bug.php?id=72537
三、解决方案
1、既然是bug,通过分析binlog内容,然后跳过的方式跳过事物
(非gtid模式,通过set sql_slave_skip_counter=1;跳过事务)
主库分析binlog:
mysqlbinlog -vv --base64-output='decode-rows' --start-position=147850096 --stop-position=147850096 mysql-bin.001674 >a.sql
SHOW BINLOG EVENTS in 'mysql-bin.001673' from 147850096 limit 10;
从库跳过事物(gtid模式)
stop slave;
SET @@SESSION.GTID_NEXT= 'f6d575b8-6a5b-11e5-8313-0050568d23c9:9393159-70288025';
begin;
commit;
SET SESSION GTID_NEXT = AUTOMATIC;
start slave;
2、通过关闭并行复制来恢复(最佳方法)
stop slave;
set global slave_parallel_workers = 0 ;
start slave;
( 更改my.cnf中并行复制参数slave_parallel_workers = 0)
3、如果上述方法都不行,可以使用重新部署从库的方式来恢复,恢复后关闭并行复制(5.6并行复制存在bug,不推荐使用)
如果使用mysqldump备份,最好指定参数--set-gtid-purged=off,否则需要充值gtid
主库备份数据库:
innobackupexp方式:innobackupex --socket=/tmp/mysql.sock /20180316/
从库恢复数据库:
innobackupex --redo-only --apply-log /soft/2018-03-16_17-45-30
innobackupex --copy-back /soft/2018-03-16_17-45-30
查看事物应用binlog信息:cat xtrabackup_info
binlog_pos = filename 'mysql-bin.001690', position 46997344, GTID of the last change 'f6d575b8-6a5b-11e5-8313-0050568d23c9:1-70316241'
innodb_from_lsn = 0
innodb_to_lsn = 84446557453
设置从库gtid信息
set global gtid_purged='f6d575b8-6a5b-11e5-8313-0050568d23c9:1-70316241';
重新部署从库:
change master to master_host='xx.xx.xx.xx',master_user='usernamer',master_password='password',MASTER_AUTO_POSITION=84446557453;
(上述使用了reset master,否则可以直接指定从MASTER_AUTO_POSITION=1开始)