背景
线上某核心业务采用了自建MySQL 主从架构 ,因这个业务库主要是和第三方合作项目,无监控权限,DB权限需要临时申请才可操作
现象分析
- 项目组里反馈 业务数据最近两天无新数据产生。
- 推测如下:
- 此业务这两天确实无实际数据产生
- 主从同步异常
诊断
- 经查业务主库的binlog有变化,说明最近两天有数据
- 目前业务库拓扑结构如下:
排查与处理
本次无数据是mysql07-bi-prod-bj1和主库不同步导致。不同步告警如下:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.23.70.101
Master_User: repl_user
Master_Port: 3306
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1396
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '89d024cb-8865-11e9-babc-fa83d2adf000:33387663' at master log mysql_bin.000407, end_log_pos 427666467. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
详细查看告警信息:
mysql> select * from performance_schema.replication_applier_status_by_worker limit 1\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_ERROR_NUMBER: 1396
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '89d024cb-8865-11e9-babc-fa83d2adf000:33387663' at master log mysql_bin.000407, end_log_pos 427666467; Error 'Operation CREATE USER failed for 'app_sqoop_r'@'10.33.75.%'' on query. Default database: ''. Query: 'CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED WITH 'mysql_native_password' AS '*463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA''
1 row in set (0.00 sec)
根据告警提示,查看master log mysql_bin.000407, end_log_pos 427666467前执行了什么。
解析主库mysql_bin.000407如下:
mysqlbinlog --no-defaults -vv --base64-output=decode-rows \
> /data/mysql/mysql57_3306/binlog/mysql_bin.000407 --stop-position=427666467
# at 427666247
#200224 11:59:51 server id 70101 end_log_pos 427666467 CRC32 0x3c9eaf42 Query thread_id=76726 exec_time=0 error_code=0 Xid = 57596658
SET TIMESTAMP=1582516791.092605/*!*/;
CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED WITH 'mysql_native_password' AS '*463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA'
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
DELIMITER /*!*/;
# at 427666467
GRANT SELECT ON `xyy_crm`.* TO 'app_bi_w'@'10.33.75.%'
/*!*/;
# at 427666705
如上在create user时,从库2与主库不同步,检查从库2确实已经有'app_sqoop_r'@'10.33.75.%'用户,且权限与主库不同
查看主库的用户权限
mysql> show grants for app_sqoop_r@'10.33.75.%';
+-----------------------------------------------------------+
| Grants for app_sqoop_r@10.33.75.% |
+-----------------------------------------------------------+
| GRANT USAGE ON *.* TO `app_sqoop_r`@`10.33.75.%` |
| GRANT SELECT ON `xyy_crm`.* TO `app_sqoop_r`@`10.33.75.%` |
mysql07-bi-prod-bj1从库2的用户权限
mysql> show grants for app_sqoop_r@'10.33.75.%';
| GRANT SELECT ON *.* TO `app_sqoop_r`@`10.33.75.%` |
至此 已经定位具体的问题:用户是直接在从库2建了'app_sqoop_r'@'10.33.75.%'用户所导致的不同步,接下来只要跳过该事物即可
[root@localhost] {10:48:04} ((none)) [2]> stop slave;
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] {10:48:57} ((none)) [5]> set gtid_next='89d024cb-8865-11e9-babc-fa83d2adf000:33387663';
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] {10:49:11} ((none)) [6]> begin;commit;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] {10:49:15} ((none)) [7]> set gtid_next='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] {10:49:23} ((none)) [8]> start slave;
Query OK, 0 rows affected (0.04 sec)
再查看主从同步:
[root@localhost] {10:49:26} ((none)) [9]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.23.70.101
Master_User: repl_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql_bin.000411
Read_Master_Log_Pos: 302997670
Relay_Log_File: relay.000340
Relay_Log_Pos: 448617544
Relay_Master_Log_File: mysql_bin.000407
Slave_IO_Running: Yes ### 此时查看 主从同步已正常
Slave_SQL_Running: Yes ### 此时查看 主从同步已正常
Seconds_Behind_Master: 166878
测试验证
针对以上问题做出以下测试与验证
先说结论如下:
- 验证1:若先在从库create,然后grant;主库create用户查看主从关系。(结论是主从异常)
- 验证2:若在从库grant方式创建并授权用户;主库grant,查看主从关系(结论是主从正常)
- 验证3:若从库grant方式创建并授权用户;主库create用户,查看主从关系(结论是主从异常)
验证1:
当从库已经创建了用户,主库创建用户是否会出现Last_Errno: 1396
构造场景1
从库执行:
root@localhost] {16:04:33} ((none)) [12]> CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED by password '*463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA';
Query OK, 0 rows affected, 1 warning (0.07 sec)
[root@localhost] {16:05:18} ((none)) [13]> grant select on *.* to 'app_sqoop_r'@'10.33.75.%';
Query OK, 0 rows affected (0.00 sec)
主库执行
mysql> CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED by '*463A80AB6D77E0D99CD';
Query OK, 0 rows affected (0.00 sec)
此时查看主从关系
[root@localhost] {16:05:34} ((none)) [14]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: mha2-db-test-bj2
Master_User: repl_user
Master_Port: 3300
Connect_Retry: 60
Master_Log_File: mysql_bin.000005
Read_Master_Log_Pos: 192344
Relay_Log_File: relay.000003
Relay_Log_Pos: 192176
Relay_Master_Log_File: mysql_bin.000005
Slave_IO_Running: Yes
Slave_SQL_Running: No #### 主从关系是异常的
Last_Errno: 1396
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'fad94351-9975-11e9-9171-fa1b8143c800:5' at master log mysql_bin.000005, end_log_pos 192344. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论1:
主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396
验证2
通过grant创建并授权用户
构造场景2
从库执行
[root@localhost] {16:21:22} ((none)) [36]> grant select on ost_test.* to app_test_r@'10.%' identified by '3A80AB6D77E0D99CD';
Query OK, 0 rows affected, 1 warning (0.00 sec)
主库执行
mysql> grant insert on `ost_test`.* TO 'app_test_r'@'10.%' identified by '5A80AB6D77E0D99CD';
Query OK, 0 rows affected, 1 warning (0.00 sec)
此时查看主从关系
[root@localhost] {16:21:53} ((none)) [37]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: mha2-db-test-bj2
Master_User: repl_user
Master_Port: 3300
Connect_Retry: 60
Master_Log_File: mysql_bin.000005
Read_Master_Log_Pos: 193391
Relay_Log_File: relay.000004
Relay_Log_Pos: 1501
Relay_Master_Log_File: mysql_bin.000005
Slave_IO_Running: Yes ### 主从同步正常
Slave_SQL_Running: Yes ### 主从同步正常
结论2
主从同步正常,不影响数据同步
验证3
在从库grant方式创建并授权用户;主库create用户
构造场景3
从库执行
[root@localhost] {16:30:45} ((none)) [41]> grant select on ost_test.* to app_test_r@'10.%' identified by '3A80AB6D77E0D99CD';
Query OK, 0 rows affected, 1 warning (0.00 sec)
主库执行
mysql> create user app_test_r@'10.%' identified by '3A80AB6D77E0D99';
Query OK, 0 rows affected (0.00 sec)
查看主从关系
[root@localhost] {16:30:59} ((none)) [42]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: mha2-db-test-bj2
Master_User: repl_user
Master_Port: 3300
Connect_Retry: 60
Slave_IO_Running: Yes
Slave_SQL_Running: No ### 主从异常
Last_Errno: 1396
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'fad94351-9975-11e9-9171-fa1b8143c800:11' at master log mysql_bin.000005, end_log_pos 193808. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论3
主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396
总结
- 针对生产环境的情况在不清楚主从关系情况下,
- 先用show slave status\G检测是否是主库
- 用select * from information_schema.
PROCESSLIST
where state like '%binlog%'\G检测从库IP端口等信息。
- 保证执行所有的语句一定要在主库执行!!!