https://yq.aliyun.com/articles/238882?spm=5176.8067842.tagmain.18.73PjU3


MHA failover with GTID

This article uses masterha_master_switch to walk through the failover scenarios you may run into.

Assumed environment (classic three-node topology)

host_1(host_1:3306) (current master)

 +--host_2(host_2:3306 slave[candidate master])

 +--host_3(host_3:3306 etl)

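I. Master: MySQL down

1.1 The etl slave is delayed by 8 hours

Add no_check_delay=0 to the configuration file to ignore the resulting error.

1.2 The slave (candidate master) lags even further behind than etl

1.2.1 MySQL on the master crashes before some of its logs have reached either slave

### Simulated scenario: GTID state of the three databases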

* master host_2

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| host_1.000002 |     2885 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave (candidate master) host_1

           Retrieved_Gtid_Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446353

                Auto_Position: 1

* etl (other slave) host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353-446356

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446356

                Auto_Position: 1
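To make the gap between the three servers explicit, the Executed_Gtid_Set values above can be subtracted from one another. The following is a minimal sketch, not part of MHA; the function names `parse_gtid_set`, `subtract`, and `missing` are made up for illustration, and the parser assumes the plain `uuid:start-end` form shown in the output above.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) transaction numbers


def parse_gtid_set(text: str) -> Dict[str, List[Interval]]:
    """Parse 'uuid:1-16,uuid2:1-5:7' into {uuid: [(1, 16)], uuid2: [(1, 5), (7, 7)]}."""
    result: Dict[str, List[Interval]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        for r in ranges:
            start, _, end = r.partition("-")
            result.setdefault(uuid, []).append((int(start), int(end or start)))
    return result


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the intervals covered by a but not by b."""
    out: List[Interval] = []
    for start, end in sorted(a):
        cur = start
        for bs, be in sorted(b):
            if be < cur or bs > end:
                continue  # no overlap with the part still to be covered
            if bs > cur:
                out.append((cur, bs - 1))
            cur = max(cur, be + 1)
            if cur > end:
                break
        if cur <= end:
            out.append((cur, end))
    return out


def missing(master: str, slave: str) -> Dict[str, List[Interval]]:
    """Per source UUID, the transactions executed on master but not on slave."""
    m, s = parse_gtid_set(master), parse_gtid_set(slave)
    gaps = {uuid: subtract(iv, s.get(uuid, [])) for uuid, iv in m.items()}
    return {uuid: iv for uuid, iv in gaps.items() if iv}


# Executed_Gtid_Set values copied from the output above.
MASTER = "0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,\nebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362"
CANDIDATE = "0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,\nebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446353"
ETL = "0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,\nebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446356"

if __name__ == "__main__":
    print("candidate is missing:", missing(MASTER, CANDIDATE))
    print("etl is missing:", missing(MASTER, ETL))
```

With these values the candidate master is missing transactions 446354-446362 and etl is missing 446357-446362 from the ebd9ff93-... source, so the candidate is behind etl, and both are behind the dead master.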

### Failover log

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 10:43:49 2017 - [info] MHA::MasterFailover version 0.56.

Thu Nov  9 10:43:49 2017 - [info] Starting master failover.

Thu Nov  9 10:43:49 2017 - [info]

Thu Nov  9 10:43:49 2017 - [info] * Phase 1: Configuration Check Phase..

Thu Nov  9 10:43:49 2017 - [info]

Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Thu Nov  9 10:43:50 2017 - [info] Binlog server host_2 is reachable.

Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Thu Nov  9 10:43:50 2017 - [info] Binlog server host_1 is reachable.

Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to host_3 is reachable.

Thu Nov  9 10:43:50 2017 - [info] Binlog server host_3 is reachable.

Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)

Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Thu Nov  9 10:43:51 2017 - [info] GTID failover mode = 1

Thu Nov  9 10:43:51 2017 - [info] Dead Servers:

Thu Nov  9 10:43:51 2017 - [info]   host_2(host_2:3306)

Thu Nov  9 10:43:51 2017 - [info] Checking master reachability via MySQL(double check)...

Thu Nov  9 10:43:51 2017 - [info]  ok.

Thu Nov  9 10:43:51 2017 - [info] Alive Servers:

Thu Nov  9 10:43:51 2017 - [info]   host_1(host_1:3306)

Thu Nov  9 10:43:51 2017 - [info]   host_3(host_3:3306)

Thu Nov  9 10:43:51 2017 - [info] Alive Slaves:

Thu Nov  9 10:43:51 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:43:51 2017 - [info]     GTID ON

Thu Nov  9 10:43:51 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 10:43:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 10:43:51 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:43:51 2017 - [info]     GTID ON

Thu Nov  9 10:43:51 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 10:43:51 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..

Thu Nov  9 10:43:51 2017 - [info]   done.

Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Thu Nov  9 10:43:51 2017 - [info]   done.

Thu Nov  9 10:43:51 2017 - [info] Starting GTID based failover.

Thu Nov  9 10:43:51 2017 - [info]

Thu Nov  9 10:43:51 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Thu Nov  9 10:43:51 2017 - [info]

Thu Nov  9 10:43:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Thu Nov  9 10:43:51 2017 - [info]

Thu Nov  9 10:43:51 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Thu Nov  9 10:43:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Thu Nov  9 10:43:51 2017 - [info] Executing master IP deactivation script:

Thu Nov  9 10:43:51 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root

Thu Nov  9 10:43:53 2017 - [info]  done.

Thu Nov  9 10:43:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Thu Nov  9 10:43:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info] * Phase 3: Master Recovery Phase..

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info] The latest binary log file/position on all slaves is host_1.000002:1115

Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,

Thu Nov  9 10:43:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Thu Nov  9 10:43:53 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:43:53 2017 - [info]     GTID ON

Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 10:43:53 2017 - [info] The oldest binary log file/position on all slaves is host_1.000002:230

Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353

Thu Nov  9 10:43:53 2017 - [info] Oldest slaves:

Thu Nov  9 10:43:53 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:43:53 2017 - [info]     GTID ON

Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: Determining New Master Phase..

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info] Searching new master from slaves..

Thu Nov  9 10:43:53 2017 - [info]  Candidate masters from the configuration file:

Thu Nov  9 10:43:53 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:43:53 2017 - [info]     GTID ON

Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 10:43:53 2017 - [info]  Non-candidate masters:

Thu Nov  9 10:43:53 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:43:53 2017 - [info]     GTID ON

Thu Nov  9 10:43:53 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 10:43:53 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Thu Nov  9 10:43:53 2017 - [info]   Not found.

Thu Nov  9 10:43:53 2017 - [info]  Searching from all candidate_master slaves..

Thu Nov  9 10:43:53 2017 - [info] New master is host_1(host_1:3306)

Thu Nov  9 10:43:53 2017 - [info] Starting master failover..

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Thu Nov  9 10:43:53 2017 - [info]

Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied..

Thu Nov  9 10:43:53 2017 - [info]   done.

Thu Nov  9 10:43:53 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..

Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied on the latest slave..

Thu Nov  9 10:43:53 2017 - [info]  Resetting slave host_1(host_1:3306) and starting replication from the new master host_3(host_3:3306)..

Thu Nov  9 10:43:53 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 10:43:54 2017 - [info]  Slave started.

Thu Nov  9 10:43:54 2017 - [info]  Waiting to execute all relay logs on host_1(host_1:3306)..

Thu Nov  9 10:43:54 2017 - [info]  master_pos_wait(host_3.000049:18041) completed on host_1(host_1:3306). Executed 0 events.

Thu Nov  9 10:43:54 2017 - [info]   done.

Thu Nov  9 10:43:54 2017 - [info]   done.

Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host host_2 started, pid: 150294

Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host host_1 started, pid: 150295

Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host host_3 started, pid: 150297

Thu Nov  9 10:43:54 2017 - [info]

Thu Nov  9 10:43:54 2017 - [info] Log messages from host_1 ...

Thu Nov  9 10:43:54 2017 - [info]

Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_1..

Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Thu Nov  9 10:43:54 2017 - [info] End of log messages from host_1.

Thu Nov  9 10:43:54 2017 - [warning] Got error from host_1.

Thu Nov  9 10:43:54 2017 - [info]

Thu Nov  9 10:43:54 2017 - [info] Log messages from host_3 ...

Thu Nov  9 10:43:54 2017 - [info]

Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_3..

Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Thu Nov  9 10:43:54 2017 - [info] End of log messages from host_3.

Thu Nov  9 10:43:54 2017 - [warning] Got error from host_3.

Thu Nov  9 10:43:55 2017 - [info]

Thu Nov  9 10:43:55 2017 - [info] Log messages from host_2 ...

Thu Nov  9 10:43:55 2017 - [info]

Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_2..

Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 10:43:55 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109104349.binlog succeeded.

Thu Nov  9 10:43:55 2017 - [info] End of log messages from host_2.

Thu Nov  9 10:43:55 2017 - [info] Saved mysqlbinlog size from host_2 is 6047 bytes.

Thu Nov  9 10:43:55 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109104349.binlog ..

Thu Nov  9 10:43:55 2017 - [info] Differential log apply from binlog server succeeded.

Thu Nov  9 10:43:55 2017 - [info] Getting new master's binlog name and position..

Thu Nov  9 10:43:55 2017 - [info]  tjtx-126-164.000053:3624

Thu Nov  9 10:43:55 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Thu Nov  9 10:43:55 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000053, 3624, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

Thu Nov  9 10:43:55 2017 - [info] Executing master IP activate script:

Thu Nov  9 10:43:55 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Thu Nov  9 10:43:57 2017 - [info]  OK.

Thu Nov  9 10:43:57 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..

Thu Nov  9 10:43:57 2017 - [info]  ok.

Thu Nov  9 10:43:57 2017 - [info] ** Finished master recovery successfully.

Thu Nov  9 10:43:57 2017 - [info] * Phase 3: Master Recovery Phase completed.

Thu Nov  9 10:43:57 2017 - [info]

Thu Nov  9 10:43:57 2017 - [info] * Phase 4: Slaves Recovery Phase..

Thu Nov  9 10:43:57 2017 - [info]

Thu Nov  9 10:43:57 2017 - [info]

Thu Nov  9 10:43:57 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Thu Nov  9 10:43:57 2017 - [info]

Thu Nov  9 10:43:57 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 155162. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109104349.log if it takes time..

Thu Nov  9 10:43:58 2017 - [info]

Thu Nov  9 10:43:58 2017 - [info] Log messages from host_3 ...

Thu Nov  9 10:43:58 2017 - [info]

Thu Nov  9 10:43:57 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..

Thu Nov  9 10:43:57 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 10:43:58 2017 - [info]  Slave started.

Thu Nov  9 10:43:58 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

Thu Nov  9 10:43:58 2017 - [info] End of log messages from host_3.

Thu Nov  9 10:43:58 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Thu Nov  9 10:43:58 2017 - [info] All new slave servers recovered successfully.

Thu Nov  9 10:43:58 2017 - [info]

Thu Nov  9 10:43:58 2017 - [info] * Phase 5: New master cleanup phase..

Thu Nov  9 10:43:58 2017 - [info]

Thu Nov  9 10:43:58 2017 - [info] Resetting slave info on the new master..

Thu Nov  9 10:43:58 2017 - [info]  host_1: Resetting slave info succeeded.

Thu Nov  9 10:43:58 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.

Thu Nov  9 10:43:58 2017 - [info]

Thu Nov  9 10:43:58 2017 - [info] Sending mail..
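A log of this length is easier to review once the phase markers and the warning/error lines are pulled out. The sketch below is one possible way to do that and is not an MHA tool; the regular expressions are assumptions derived only from the line layout visible in the log above (`<timestamp> - [level] message`, phase starts ending in `..`), and `summarize` is a hypothetical helper name.

```python
import re
from datetime import datetime
from typing import List, Tuple

# Assumed layout: "Thu Nov  9 10:43:49 2017 - [info] message",
# with an optional "[file, lnNNN]" tag right after the level on error lines.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) - "
    r"\[(?P<level>\w+)\](?:\[[^\]]*\])?\s*(?P<msg>.*)$"
)
# Phase start lines look like "* Phase 3.1: Getting Latest Slaves Phase.."
PHASE_RE = re.compile(r"^\*+ (Phase [\d.]+: .+?)\.\.$")


def summarize(log_text: str) -> Tuple[List[Tuple[datetime, str]], List[Tuple[str, str]]]:
    """Return (phase starts with timestamps, warning/error messages)."""
    phases: List[Tuple[datetime, str]] = []
    problems: List[Tuple[str, str]] = []
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # blank lines, the command line itself, etc.
        ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        msg = m["msg"].strip()
        phase = PHASE_RE.match(msg)
        if phase:
            phases.append((ts, phase.group(1)))
        elif m["level"] in ("warning", "error"):
            problems.append((m["level"], msg))
    return phases, problems


# A few lines copied from the failover log above.
SAMPLE = """\
Thu Nov  9 10:43:49 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Thu Nov  9 10:43:51 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov  9 10:43:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov  9 10:43:53 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
"""

if __name__ == "__main__":
    phases, problems = summarize(SAMPLE)
    for ts, name in phases:
        print(ts.time(), name)
    for level, msg in problems:
        print(level.upper(), msg)
```

Run against the full log, this kind of summary makes it easy to see that the two "Failed to save binary log events" errors come from the surviving hosts (host_1 and host_3), while the save from the dead master host_2 succeeds and its differential binlog is applied.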

1.2.2 MySQL on the master crashes after all of its logs have reached one slave (etl)

### Simulated scenario: GTID state of the three databases

* master host_1

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| tjtx-126-164.000053 |     5229 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave (candidate master) host_2

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362

                Auto_Position: 1

* etl (other slave) host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446357-446362

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362

                Auto_Position: 1

### Failover log

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 10:59:14 2017 - [info] MHA::MasterFailover version 0.56.

Thu Nov  9 10:59:14 2017 - [info] Starting master failover.

Thu Nov  9 10:59:14 2017 - [info]

Thu Nov  9 10:59:14 2017 - [info] * Phase 1: Configuration Check Phase..

Thu Nov  9 10:59:14 2017 - [info]

Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Thu Nov  9 10:59:15 2017 - [info] Binlog server host_2 is reachable.

Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Thu Nov  9 10:59:15 2017 - [info] Binlog server host_1 is reachable.

Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to host_3 is reachable.

Thu Nov  9 10:59:16 2017 - [info] Binlog server host_3 is reachable.

Thu Nov  9 10:59:16 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)

Thu Nov  9 10:59:16 2017 - [info] GTID failover mode = 1

Thu Nov  9 10:59:16 2017 - [info] Dead Servers:

Thu Nov  9 10:59:16 2017 - [info]   host_1(host_1:3306)

Thu Nov  9 10:59:16 2017 - [info] Checking master reachability via MySQL(double check)...

Thu Nov  9 10:59:16 2017 - [info]  ok.

Thu Nov  9 10:59:16 2017 - [info] Alive Servers:

Thu Nov  9 10:59:16 2017 - [info]   host_2(host_2:3306)

Thu Nov  9 10:59:16 2017 - [info]   host_3(host_3:3306)

Thu Nov  9 10:59:16 2017 - [info] Alive Slaves:

Thu Nov  9 10:59:16 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:59:16 2017 - [info]     GTID ON

Thu Nov  9 10:59:16 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 10:59:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 10:59:16 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:59:16 2017 - [info]     GTID ON

Thu Nov  9 10:59:16 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 10:59:16 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 10:59:16 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..

Thu Nov  9 10:59:16 2017 - [info]   done.

Thu Nov  9 10:59:16 2017 - [info] Starting GTID based failover.

Thu Nov  9 10:59:16 2017 - [info]

Thu Nov  9 10:59:16 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Thu Nov  9 10:59:16 2017 - [info]

Thu Nov  9 10:59:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Thu Nov  9 10:59:16 2017 - [info]

Thu Nov  9 10:59:16 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Thu Nov  9 10:59:16 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Thu Nov  9 10:59:16 2017 - [info] Executing master IP deactivation script:

Thu Nov  9 10:59:16 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root

Thu Nov  9 10:59:20 2017 - [info]  done.

Thu Nov  9 10:59:20 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Thu Nov  9 10:59:20 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info] * Phase 3: Master Recovery Phase..

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000053:5229

Thu Nov  9 10:59:20 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,

Thu Nov  9 10:59:20 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Thu Nov  9 10:59:20 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:59:20 2017 - [info]     GTID ON

Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 10:59:20 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000053:3624

Thu Nov  9 10:59:20 2017 - [info] Oldest slaves:

Thu Nov  9 10:59:20 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:59:20 2017 - [info]     GTID ON

Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: Determining New Master Phase..

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info] Searching new master from slaves..

Thu Nov  9 10:59:20 2017 - [info]  Candidate masters from the configuration file:

Thu Nov  9 10:59:20 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:59:20 2017 - [info]     GTID ON

Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 10:59:20 2017 - [info]  Non-candidate masters:

Thu Nov  9 10:59:20 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 10:59:20 2017 - [info]     GTID ON

Thu Nov  9 10:59:20 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 10:59:20 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Thu Nov  9 10:59:20 2017 - [info]   Not found.

Thu Nov  9 10:59:20 2017 - [info]  Searching from all candidate_master slaves..

Thu Nov  9 10:59:20 2017 - [info] New master is host_2(host_2:3306)

Thu Nov  9 10:59:20 2017 - [info] Starting master failover..

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Thu Nov  9 10:59:20 2017 - [info]

Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied..

Thu Nov  9 10:59:20 2017 - [info]   done.

Thu Nov  9 10:59:20 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..

Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied on the latest slave..

Thu Nov  9 10:59:20 2017 - [info]  Resetting slave host_2(host_2:3306) and starting replication from the new master host_3(host_3:3306)..

Thu Nov  9 10:59:20 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 10:59:21 2017 - [info]  Slave started.

Thu Nov  9 10:59:21 2017 - [info]  Waiting to execute all relay logs on host_2(host_2:3306)..

Thu Nov  9 10:59:21 2017 - [info]  master_pos_wait(host_3.000049:22035) completed on host_2(host_2:3306). Executed 0 events.

Thu Nov  9 10:59:21 2017 - [info]   done.

Thu Nov  9 10:59:21 2017 - [info]   done.

Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host host_2 started, pid: 184482

Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host host_1 started, pid: 184483

Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host host_3 started, pid: 184487

Thu Nov  9 10:59:21 2017 - [info]

Thu Nov  9 10:59:21 2017 - [info] Log messages from host_2 ...

Thu Nov  9 10:59:21 2017 - [info]

Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_2..

Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Thu Nov  9 10:59:21 2017 - [info] End of log messages from host_2.

Thu Nov  9 10:59:21 2017 - [warning] Got error from host_2.

Thu Nov  9 10:59:21 2017 - [info]

Thu Nov  9 10:59:21 2017 - [info] Log messages from host_3 ...

Thu Nov  9 10:59:21 2017 - [info]

Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_3..

Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Thu Nov  9 10:59:21 2017 - [info] End of log messages from host_3.

Thu Nov  9 10:59:21 2017 - [warning] Got error from host_3.

Thu Nov  9 10:59:22 2017 - [info]

Thu Nov  9 10:59:22 2017 - [info] Log messages from host_1 ...

Thu Nov  9 10:59:22 2017 - [info]

Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_1..

Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 10:59:22 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171109105914.binlog succeeded.

Thu Nov  9 10:59:22 2017 - [info] End of log messages from host_1.

Thu Nov  9 10:59:22 2017 - [info] Saved mysqlbinlog size from host_1 is 800 bytes.

Thu Nov  9 10:59:22 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171109105914.binlog ..

Thu Nov  9 10:59:22 2017 - [info] Differential log apply from binlog server succeeded.

Thu Nov  9 10:59:22 2017 - [info] Getting new master's binlog name and position..

Thu Nov  9 10:59:22 2017 - [info]  host_1.000003:1680

Thu Nov  9 10:59:22 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Thu Nov  9 10:59:22 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000003, 1680, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,

Thu Nov  9 10:59:22 2017 - [info] Executing master IP activate script:

Thu Nov  9 10:59:22 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Thu Nov  9 10:59:24 2017 - [info]  OK.

Thu Nov  9 10:59:24 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..

Thu Nov  9 10:59:24 2017 - [info]  ok.

Thu Nov  9 10:59:24 2017 - [info] ** Finished master recovery successfully.

Thu Nov  9 10:59:24 2017 - [info] * Phase 3: Master Recovery Phase completed.

Thu Nov  9 10:59:24 2017 - [info]

Thu Nov  9 10:59:24 2017 - [info] * Phase 4: Slaves Recovery Phase..

Thu Nov  9 10:59:24 2017 - [info]

Thu Nov  9 10:59:24 2017 - [info]

Thu Nov  9 10:59:24 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Thu Nov  9 10:59:24 2017 - [info]

Thu Nov  9 10:59:24 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 189393. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109105914.log if it takes time..

Thu Nov  9 10:59:25 2017 - [info]

Thu Nov  9 10:59:25 2017 - [info] Log messages from host_3 ...

Thu Nov  9 10:59:25 2017 - [info]

Thu Nov  9 10:59:24 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..

Thu Nov  9 10:59:24 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 10:59:25 2017 - [info]  Slave started.

Thu Nov  9 10:59:25 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,

Thu Nov  9 10:59:25 2017 - [info] End of log messages from host_3.

Thu Nov  9 10:59:25 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Thu Nov  9 10:59:25 2017 - [info] All new slave servers recovered successfully.

Thu Nov  9 10:59:25 2017 - [info]

Thu Nov  9 10:59:25 2017 - [info] * Phase 5: New master cleanup phase..

Thu Nov  9 10:59:25 2017 - [info]

Thu Nov  9 10:59:25 2017 - [info] Resetting slave info on the new master..

Thu Nov  9 10:59:25 2017 - [info]  host_2: Resetting slave info succeeded.

Thu Nov  9 10:59:25 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.

Thu Nov  9 10:59:25 2017 - [info]

Thu Nov  9 10:59:25 2017 - [info] Sending mail..
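The "Determining New Master Phase" lines in the two logs above suggest a search order: first a candidate_master slave that has received the latest relay log events, then any candidate_master slave, with no_master slaves never chosen. The following is a simplified sketch of just that order as read from the log messages. It is not MHA's actual implementation, which applies further checks that are not modeled here; the `Slave` class and `pick_new_master` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    name: str
    candidate_master: bool = False  # "candidate_master is set" in the log
    no_master: bool = False         # "no_master is set" in the log
    is_latest: bool = False         # listed under "Latest slaves" in the log


def pick_new_master(slaves: List[Slave]) -> Optional[Slave]:
    """Pick a new master using only the two search steps visible in the log."""
    # Slaves flagged no_master are never eligible.
    candidates = [s for s in slaves if s.candidate_master and not s.no_master]
    # Step 1: a candidate_master that also received the latest relay log events.
    for s in candidates:
        if s.is_latest:
            return s
    # Step 2: any candidate_master, even one that is behind.
    if candidates:
        return candidates[0]
    return None  # later fallbacks are intentionally not modeled


if __name__ == "__main__":
    # Scenario 1.2.2: host_2 is the candidate but behind; host_3 (etl) is latest.
    slaves = [
        Slave("host_2", candidate_master=True),
        Slave("host_3", no_master=True, is_latest=True),
    ]
    chosen = pick_new_master(slaves)
    print("New master:", chosen.name if chosen else None)
```

Because the chosen candidate is behind, the log shows it first replicating from the latest slave host_3 and only then applying the differential binlog saved from the dead master.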

1.3 The slave (candidate master) has the latest logs and is ahead of etl

1.3.1 MySQL on the master crashes before some of its logs have reached either slave

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Tue Nov  7 17:11:29 2017 - [info] MHA::MasterFailover version 0.56.

Tue Nov  7 17:11:29 2017 - [info] Starting master failover.

Tue Nov  7 17:11:29 2017 - [info]

Tue Nov  7 17:11:29 2017 - [info] * Phase 1: Configuration Check Phase..

Tue Nov  7 17:11:29 2017 - [info]

Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Tue Nov  7 17:11:29 2017 - [info] Binlog server host_2 is reachable.

Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Tue Nov  7 17:11:30 2017 - [info] Binlog server host_1 is reachable.

Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to host_3 is reachable.

Tue Nov  7 17:11:30 2017 - [info] Binlog server host_3 is reachable.

Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)

Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Tue Nov  7 17:11:30 2017 - [info] GTID failover mode = 1

Tue Nov  7 17:11:30 2017 - [info] Dead Servers:

Tue Nov  7 17:11:30 2017 - [info]   host_1(host_1:3306)

Tue Nov  7 17:11:30 2017 - [info] Checking master reachability via MySQL(double check)...

Tue Nov  7 17:11:30 2017 - [info]  ok.

Tue Nov  7 17:11:30 2017 - [info] Alive Servers:

Tue Nov  7 17:11:30 2017 - [info]   host_2(host_2:3306)

Tue Nov  7 17:11:30 2017 - [info]   host_3(host_3:3306)

Tue Nov  7 17:11:30 2017 - [info] Alive Slaves:

Tue Nov  7 17:11:30 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 17:11:30 2017 - [info]     GTID ON

Tue Nov  7 17:11:30 2017 - [info]     Replicating from host_1(host_1:3306)

Tue Nov  7 17:11:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Tue Nov  7 17:11:30 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 17:11:30 2017 - [info]     GTID ON

Tue Nov  7 17:11:30 2017 - [info]     Replicating from host_1(host_1:3306)

Tue Nov  7 17:11:30 2017 - [info]     Not candidate for the new Master (no_master is set)

Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..

Tue Nov  7 17:11:30 2017 - [info]   done.

Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Tue Nov  7 17:11:30 2017 - [info]   done.

Tue Nov  7 17:11:30 2017 - [info] Starting GTID based failover.

Tue Nov  7 17:11:30 2017 - [info]

Tue Nov  7 17:11:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Tue Nov  7 17:11:30 2017 - [info]

Tue Nov  7 17:11:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Tue Nov  7 17:11:30 2017 - [info]

Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Tue Nov  7 17:11:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Tue Nov  7 17:11:31 2017 - [info] Executing master IP deactivation script:

Tue Nov  7 17:11:31 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root

Tue Nov  7 17:11:33 2017 - [info]  done.

Tue Nov  7 17:11:33 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Tue Nov  7 17:11:33 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] * Phase 3: Master Recovery Phase..

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000051:13508

Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:3-8

Tue Nov  7 17:11:33 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Tue Nov  7 17:11:33 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 17:11:33 2017 - [info]     GTID ON

Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)

Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Tue Nov  7 17:11:33 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000051:11918

Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:2-3,

Tue Nov  7 17:11:33 2017 - [info] Oldest slaves:

Tue Nov  7 17:11:33 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 17:11:33 2017 - [info]     GTID ON

Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)

Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: Determining New Master Phase..

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Searching new master from slaves..

Tue Nov  7 17:11:33 2017 - [info]  Candidate masters from the configuration file:

Tue Nov  7 17:11:33 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 17:11:33 2017 - [info]     GTID ON

Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)

Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Tue Nov  7 17:11:33 2017 - [info]  Non-candidate masters:

Tue Nov  7 17:11:33 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 17:11:33 2017 - [info]     GTID ON

Tue Nov  7 17:11:33 2017 - [info]     Replicating from host_1(host_1:3306)

Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)

Tue Nov  7 17:11:33 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Tue Nov  7 17:11:33 2017 - [info] New master is host_2(host_2:3306)

Tue Nov  7 17:11:33 2017 - [info] Starting master failover..

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info]  Waiting all logs to be applied..

Tue Nov  7 17:11:33 2017 - [info]   done.

Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host host_2 started, pid: 54677

Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host host_1 started, pid: 54681

Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host host_3 started, pid: 54683

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Log messages from host_3 ...

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_3..

Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Tue Nov  7 17:11:33 2017 - [info] End of log messages from host_3.

Tue Nov  7 17:11:33 2017 - [warning] Got error from host_3.

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Log messages from host_2 ...

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_2..

Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Tue Nov  7 17:11:33 2017 - [info] End of log messages from host_2.

Tue Nov  7 17:11:33 2017 - [warning] Got error from host_2.

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Log messages from host_1 ...

Tue Nov  7 17:11:33 2017 - [info]

Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_1..

Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 17:11:33 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171107171129.binlog succeeded.

Tue Nov  7 17:11:33 2017 - [info] End of log messages from host_1.

Tue Nov  7 17:11:33 2017 - [info] Saved mysqlbinlog size from host_1 is 8578 bytes.

Tue Nov  7 17:11:33 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_1_binlog2_20171107171129.binlog ..

Tue Nov  7 17:11:33 2017 - [info] Differential log apply from binlog server succeeded.

Tue Nov  7 17:11:33 2017 - [info] Getting new master's binlog name and position..

Tue Nov  7 17:11:33 2017 - [info]  host_1.000001:5048

Tue Nov  7 17:11:33 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Tue Nov  7 17:11:33 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000001, 5048, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

Tue Nov  7 17:11:33 2017 - [info] Executing master IP activate script:

Tue Nov  7 17:11:33 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Tue Nov  7 17:11:36 2017 - [info]  OK.

Tue Nov  7 17:11:36 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..

Tue Nov  7 17:11:36 2017 - [info]  ok.

Tue Nov  7 17:11:36 2017 - [info] ** Finished master recovery successfully.

Tue Nov  7 17:11:36 2017 - [info] * Phase 3: Master Recovery Phase completed.

Tue Nov  7 17:11:36 2017 - [info]

Tue Nov  7 17:11:36 2017 - [info] * Phase 4: Slaves Recovery Phase..

Tue Nov  7 17:11:36 2017 - [info]

Tue Nov  7 17:11:36 2017 - [info]

Tue Nov  7 17:11:36 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Tue Nov  7 17:11:36 2017 - [info]

Tue Nov  7 17:11:36 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 58422. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171107171129.log if it takes time..

Tue Nov  7 17:11:37 2017 - [info]

Tue Nov  7 17:11:37 2017 - [info] Log messages from host_3 ...

Tue Nov  7 17:11:37 2017 - [info]

Tue Nov  7 17:11:36 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..

Tue Nov  7 17:11:36 2017 - [info]  Executed CHANGE MASTER.

Tue Nov  7 17:11:37 2017 - [info]  Slave started.

Tue Nov  7 17:11:37 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,

Tue Nov  7 17:11:37 2017 - [info] End of log messages from host_3.

Tue Nov  7 17:11:37 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Tue Nov  7 17:11:37 2017 - [info] All new slave servers recovered successfully.

Tue Nov  7 17:11:37 2017 - [info]

Tue Nov  7 17:11:37 2017 - [info] * Phase 5: New master cleanup phase..

Tue Nov  7 17:11:37 2017 - [info]

Tue Nov  7 17:11:37 2017 - [info] Resetting slave info on the new master..

Tue Nov  7 17:11:37 2017 - [info]  host_2: Resetting slave info succeeded.

Tue Nov  7 17:11:37 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.

Tue Nov  7 17:11:37 2017 - [info]

Tue Nov  7 17:11:37 2017 - [info] Sending mail..
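In the log above, saving binlogs from host_2 and host_3 failed, while host_1 returned 8578 bytes that were then applied to the new master. A minimal sketch of pulling that per-host outcome out of a failover log follows. It is not part of MHA: the function name `summarize_binlog_saves` is hypothetical, and the patterns are simply copied from the `Got error from ...` and `Saved mysqlbinlog size from ... is N bytes.` lines shown here, so they are only assumed to hold for this MHA 0.56 output.

```python
import re

# Patterns copied from the MHA 0.56 log lines shown above (an assumption
# for any other version).
ERROR_RE = re.compile(r"\[warning\] Got error from (\S+)\.$")
SAVED_RE = re.compile(r"Saved mysqlbinlog size from (\S+) is (\d+) bytes\.$")


def summarize_binlog_saves(log_text):
    """Return {host: saved_bytes or None} for each binlog server in the log.

    None means the log reported an error for that host.
    """
    result = {}
    for line in log_text.splitlines():
        line = line.strip()
        match = ERROR_RE.search(line)
        if match:
            result[match.group(1)] = None
            continue
        match = SAVED_RE.search(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


sample = """\
Tue Nov  7 17:11:33 2017 - [warning] Got error from host_3.
Tue Nov  7 17:11:33 2017 - [warning] Got error from host_2.
Tue Nov  7 17:11:33 2017 - [info] Saved mysqlbinlog size from host_1 is 8578 bytes.
"""
print(summarize_binlog_saves(sample))
```

A host mapped to `None` corresponds to the `[error] ... Failed to save binary log events from the binlog server` entries; with `--ignore_binlog_server_error` on the command line, the failover in this log continued past them.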

1.3.2 MySQL on the master goes down after all of the master's logs have already reached the slave

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Tue Nov  7 15:56:11 2017 - [info] MHA::MasterFailover version 0.56.

Tue Nov  7 15:56:11 2017 - [info] Starting master failover.

Tue Nov  7 15:56:11 2017 - [info]

Tue Nov  7 15:56:11 2017 - [info] * Phase 1: Configuration Check Phase..

Tue Nov  7 15:56:11 2017 - [info]

Tue Nov  7 15:56:11 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Tue Nov  7 15:56:12 2017 - [info] Binlog server host_2 is reachable.

Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Tue Nov  7 15:56:12 2017 - [info] Binlog server host_1 is reachable.

Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.

Tue Nov  7 15:56:13 2017 - [info] Binlog server host_3 is reachable.

Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)

Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Tue Nov  7 15:56:13 2017 - [info] GTID failover mode = 1

Tue Nov  7 15:56:13 2017 - [info] Dead Servers:

Tue Nov  7 15:56:13 2017 - [info]   host_2(host_2:3306)

Tue Nov  7 15:56:13 2017 - [info] Checking master reachability via MySQL(double check)...

Tue Nov  7 15:56:13 2017 - [info]  ok.

Tue Nov  7 15:56:13 2017 - [info] Alive Servers:

Tue Nov  7 15:56:13 2017 - [info]   host_1(host_1:3306)

Tue Nov  7 15:56:13 2017 - [info]   host_3(host_3:3306)

Tue Nov  7 15:56:13 2017 - [info] Alive Slaves:

Tue Nov  7 15:56:13 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 15:56:13 2017 - [info]     GTID ON

Tue Nov  7 15:56:13 2017 - [info]     Replicating from host_2(host_2:3306)

Tue Nov  7 15:56:13 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Tue Nov  7 15:56:13 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 15:56:13 2017 - [info]     GTID ON

Tue Nov  7 15:56:13 2017 - [info]     Replicating from host_2(host_2:3306)

Tue Nov  7 15:56:13 2017 - [info]     Not candidate for the new Master (no_master is set)

Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..

Tue Nov  7 15:56:13 2017 - [info]   done.

Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Tue Nov  7 15:56:13 2017 - [info]   done.

Tue Nov  7 15:56:13 2017 - [info] Starting GTID based failover.

Tue Nov  7 15:56:13 2017 - [info]

Tue Nov  7 15:56:13 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Tue Nov  7 15:56:13 2017 - [info]

Tue Nov  7 15:56:13 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Tue Nov  7 15:56:13 2017 - [info]

Tue Nov  7 15:56:13 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Tue Nov  7 15:56:13 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Tue Nov  7 15:56:13 2017 - [info] Executing master IP deactivation script:

Tue Nov  7 15:56:13 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root

Tue Nov  7 15:56:16 2017 - [info]  done.

Tue Nov  7 15:56:16 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Tue Nov  7 15:56:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] * Phase 3: Master Recovery Phase..

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] The latest binary log file/position on all slaves is host_1.000049:11291

Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446352

Tue Nov  7 15:56:16 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Tue Nov  7 15:56:16 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 15:56:16 2017 - [info]     GTID ON

Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)

Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Tue Nov  7 15:56:16 2017 - [info] The oldest binary log file/position on all slaves is host_1.000049:10703

Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446350

Tue Nov  7 15:56:16 2017 - [info] Oldest slaves:

Tue Nov  7 15:56:16 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 15:56:16 2017 - [info]     GTID ON

Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)

Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: Determining New Master Phase..

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Searching new master from slaves..

Tue Nov  7 15:56:16 2017 - [info]  Candidate masters from the configuration file:

Tue Nov  7 15:56:16 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 15:56:16 2017 - [info]     GTID ON

Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)

Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Tue Nov  7 15:56:16 2017 - [info]  Non-candidate masters:

Tue Nov  7 15:56:16 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Tue Nov  7 15:56:16 2017 - [info]     GTID ON

Tue Nov  7 15:56:16 2017 - [info]     Replicating from host_2(host_2:3306)

Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)

Tue Nov  7 15:56:16 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Tue Nov  7 15:56:16 2017 - [info] New master is host_1(host_1:3306)

Tue Nov  7 15:56:16 2017 - [info] Starting master failover..

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Tue Nov  7 15:56:16 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info]  Waiting all logs to be applied..

Tue Nov  7 15:56:16 2017 - [info]   done.

Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host host_2 started, pid: 79759

Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host host_1 started, pid: 79768

Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host host_3 started, pid: 79770

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:17 2017 - [info] Log messages from host_1 ...

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_1..

Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_1.

Tue Nov  7 15:56:17 2017 - [warning] Got error from host_1.

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:17 2017 - [info] Log messages from host_3 ...

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_3..

Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_3.

Tue Nov  7 15:56:17 2017 - [warning] Got error from host_3.

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:17 2017 - [info] Log messages from host_2 ...

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_2..

Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 15:56:17 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog succeeded.

Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_2.

Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from host_2 is 768 bytes.

Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog ..

Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.

Tue Nov  7 15:56:17 2017 - [info] Getting new master's binlog name and position..

Tue Nov  7 15:56:17 2017 - [info]  tjtx-126-164.000051:11449

Tue Nov  7 15:56:17 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Tue Nov  7 15:56:17 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000051, 11449, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,

Tue Nov  7 15:56:17 2017 - [info] Executing master IP activate script:

Tue Nov  7 15:56:17 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Tue Nov  7 15:56:20 2017 - [info]  OK.

Tue Nov  7 15:56:20 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..

Tue Nov  7 15:56:20 2017 - [info]  ok.

Tue Nov  7 15:56:20 2017 - [info] ** Finished master recovery successfully.

Tue Nov  7 15:56:20 2017 - [info] * Phase 3: Master Recovery Phase completed.

Tue Nov  7 15:56:20 2017 - [info]

Tue Nov  7 15:56:20 2017 - [info] * Phase 4: Slaves Recovery Phase..

Tue Nov  7 15:56:20 2017 - [info]

Tue Nov  7 15:56:20 2017 - [info]

Tue Nov  7 15:56:20 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Tue Nov  7 15:56:20 2017 - [info]

Tue Nov  7 15:56:20 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 85941. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171107155611.log if it takes time..

Tue Nov  7 15:56:21 2017 - [info]

Tue Nov  7 15:56:21 2017 - [info] Log messages from host_3 ...

Tue Nov  7 15:56:21 2017 - [info]

Tue Nov  7 15:56:20 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..

Tue Nov  7 15:56:20 2017 - [info]  Executed CHANGE MASTER.

Tue Nov  7 15:56:21 2017 - [info]  Slave started.

Tue Nov  7 15:56:21 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,

Tue Nov  7 15:56:21 2017 - [info] End of log messages from host_3.

Tue Nov  7 15:56:21 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Tue Nov  7 15:56:21 2017 - [info] All new slave servers recovered successfully.

Tue Nov  7 15:56:21 2017 - [info]

Tue Nov  7 15:56:21 2017 - [info] * Phase 5: New master cleanup phase..

Tue Nov  7 15:56:21 2017 - [info]

Tue Nov  7 15:56:21 2017 - [info] Resetting slave info on the new master..

Tue Nov  7 15:56:21 2017 - [info]  host_1: Resetting slave info succeeded.

Tue Nov  7 15:56:21 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.

Tue Nov  7 15:56:21 2017 - [info]

Tue Nov  7 15:56:21 2017 - [info] Sending mail..
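The timestamps in the log above also show how long the whole failover took. As a rough sketch (the helper names `parse_ts` and `failover_seconds` are hypothetical, and the timestamp layout is assumed to match the lines shown here), the elapsed time between the `Starting master failover.` line and the `completed successfully.` line can be computed like this:

```python
from datetime import datetime

# Timestamp layout at the start of every MHA log line shown above,
# e.g. "Tue Nov  7 15:56:11 2017".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_ts(line):
    """Parse the leading timestamp of an MHA log line."""
    stamp = line.split(" - [", 1)[0]
    return datetime.strptime(stamp, TS_FORMAT)


def failover_seconds(log_lines):
    """Seconds between 'Starting master failover.' and the completion line."""
    start = end = None
    for raw in log_lines:
        line = raw.rstrip()
        if line.endswith("Starting master failover."):
            start = parse_ts(line)
        elif "Master failover to" in line and line.endswith("completed successfully."):
            end = parse_ts(line)
    if start is None or end is None:
        raise ValueError("start or completion line not found")
    return (end - start).total_seconds()


sample_lines = [
    "Tue Nov  7 15:56:11 2017 - [info] Starting master failover.",
    "Tue Nov  7 15:56:21 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.",
]
print(failover_seconds(sample_lines))
```

The same function can be pointed at the individual phase markers instead, for example to see how much of the total is spent in the master IP deactivation and activation scripts.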

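1.4 A large transaction is running on the slave (candidate master)

A large query running for 1000s

No impact; the failover proceeds normally.

flush tables with read lock

No impact; the failover proceeds normally.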

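1.5 Testing different binlog server scenarios

The last part of the logs on dead_master has not reached the slave or the etl, and the slave's logs also lag behind the etl (this is the most demanding case)

binlog server configured on all 3 hosts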

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Tue Nov  7 15:56:17 2017 - [info] Log messages from host_1 ...

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_1..

Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_1.

Tue Nov  7 15:56:17 2017 - [warning] Got error from host_1.

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:17 2017 - [info] Log messages from host_3 ...

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_3..

Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_3.

Tue Nov  7 15:56:17 2017 - [warning] Got error from host_3.

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:17 2017 - [info] Log messages from host_2 ...

Tue Nov  7 15:56:17 2017 - [info]

Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_2..

Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Tue Nov  7 15:56:17 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog succeeded.

Tue Nov  7 15:56:17 2017 - [info] End of log messages from host_2.

Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from host_2 is 768 bytes.

Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171107155611.binlog ..

Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.

binlog server configured on the master only

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 11:20:04 2017 - [info] -- Saving binlog from host host_2 started, pid: 117389

Thu Nov  9 11:20:05 2017 - [info]

Thu Nov  9 11:20:05 2017 - [info] Log messages from host_2 ...

Thu Nov  9 11:20:05 2017 - [info]

Thu Nov  9 11:20:04 2017 - [info] Fetching binary logs from binlog server host_2..

Thu Nov  9 11:20:04 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000004  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 11:20:05 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog to local:/var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109111957.binlog succeeded.

Thu Nov  9 11:20:05 2017 - [info] End of log messages from host_2.

Thu Nov  9 11:20:05 2017 - [info] Saved mysqlbinlog size from host_2 is 4444 bytes.

Thu Nov  9 11:20:05 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_host_2_binlog1_20171109111957.binlog ..

Thu Nov  9 11:20:05 2017 - [info] Differential log apply from binlog server succeeded.

binlog server configured on the slave only

### GTID state of the 3 servers

* master  host_1

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| tjtx-126-164.000055 |     6016 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave host_2

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369

                Auto_Position: 1

* etl host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446366-446369

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369

                Auto_Position: 1

### Failover log

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Thu Nov  9 15:00:09 2017 - [info] MHA::MasterFailover version 0.56.

Thu Nov  9 15:00:09 2017 - [info] Starting master failover.

Thu Nov  9 15:00:09 2017 - [info]

Thu Nov  9 15:00:09 2017 - [info] * Phase 1: Configuration Check Phase..

Thu Nov  9 15:00:09 2017 - [info]

Thu Nov  9 15:00:09 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Thu Nov  9 15:00:09 2017 - [info] Binlog server host_2 is reachable.

Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)

Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Thu Nov  9 15:00:10 2017 - [info] GTID failover mode = 1

Thu Nov  9 15:00:10 2017 - [info] Dead Servers:

Thu Nov  9 15:00:10 2017 - [info]   host_1(host_1:3306)

Thu Nov  9 15:00:10 2017 - [info] Checking master reachability via MySQL(double check)...

Thu Nov  9 15:00:10 2017 - [info]  ok.

Thu Nov  9 15:00:10 2017 - [info] Alive Servers:

Thu Nov  9 15:00:10 2017 - [info]   host_2(host_2:3306)

Thu Nov  9 15:00:10 2017 - [info]   host_3(host_3:3306)

Thu Nov  9 15:00:10 2017 - [info] Alive Slaves:

Thu Nov  9 15:00:10 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 15:00:10 2017 - [info]     GTID ON

Thu Nov  9 15:00:10 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 15:00:10 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 15:00:10 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 15:00:10 2017 - [info]     GTID ON

Thu Nov  9 15:00:10 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 15:00:10 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..

Thu Nov  9 15:00:10 2017 - [info]   done.

Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Thu Nov  9 15:00:10 2017 - [info]   done.

Thu Nov  9 15:00:10 2017 - [info] Starting GTID based failover.

Thu Nov  9 15:00:10 2017 - [info]

Thu Nov  9 15:00:10 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Thu Nov  9 15:00:10 2017 - [info]

Thu Nov  9 15:00:10 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Thu Nov  9 15:00:10 2017 - [info]

Thu Nov  9 15:00:10 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Thu Nov  9 15:00:10 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Thu Nov  9 15:00:10 2017 - [info] Executing master IP deactivation script:

Thu Nov  9 15:00:10 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root

Thu Nov  9 15:00:17 2017 - [info]  done.

Thu Nov  9 15:00:17 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Thu Nov  9 15:00:17 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info] * Phase 3: Master Recovery Phase..

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000055:4090

Thu Nov  9 15:00:17 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,

Thu Nov  9 15:00:17 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Thu Nov  9 15:00:17 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 15:00:17 2017 - [info]     GTID ON

Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 15:00:17 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000055:2806

Thu Nov  9 15:00:17 2017 - [info] Oldest slaves:

Thu Nov  9 15:00:17 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 15:00:17 2017 - [info]     GTID ON

Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: Determining New Master Phase..

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info] Searching new master from slaves..

Thu Nov  9 15:00:17 2017 - [info]  Candidate masters from the configuration file:

Thu Nov  9 15:00:17 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 15:00:17 2017 - [info]     GTID ON

Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 15:00:17 2017 - [info]  Non-candidate masters:

Thu Nov  9 15:00:17 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 15:00:17 2017 - [info]     GTID ON

Thu Nov  9 15:00:17 2017 - [info]     Replicating from host_1(host_1:3306)

Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 15:00:17 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Thu Nov  9 15:00:17 2017 - [info]   Not found.

Thu Nov  9 15:00:17 2017 - [info]  Searching from all candidate_master slaves..

Thu Nov  9 15:00:17 2017 - [info] New master is host_2(host_2:3306)

Thu Nov  9 15:00:17 2017 - [info] Starting master failover..

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Thu Nov  9 15:00:17 2017 - [info]

Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied..

Thu Nov  9 15:00:17 2017 - [info]   done.

Thu Nov  9 15:00:17 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..

Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied on the latest slave..

Thu Nov  9 15:00:17 2017 - [info]  Resetting slave host_2(host_2:3306) and starting replication from the new master host_3(host_3:3306)..

Thu Nov  9 15:00:17 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 15:00:18 2017 - [info]  Slave started.

Thu Nov  9 15:00:18 2017 - [info]  Waiting to execute all relay logs on host_2(host_2:3306)..

Thu Nov  9 15:00:18 2017 - [info]  master_pos_wait(host_3.000049:25843) completed on host_2(host_2:3306). Executed 0 events.

Thu Nov  9 15:00:18 2017 - [info]   done.

Thu Nov  9 15:00:18 2017 - [info]   done.

Thu Nov  9 15:00:18 2017 - [info] -- Saving binlog from host host_2 started, pid: 175683

Thu Nov  9 15:00:18 2017 - [info]

Thu Nov  9 15:00:18 2017 - [info] Log messages from host_2 ...

Thu Nov  9 15:00:18 2017 - [info]

Thu Nov  9 15:00:18 2017 - [info] Fetching binary logs from binlog server host_2..

Thu Nov  9 15:00:18 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000055  --start_pos=4090 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109150009.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Thu Nov  9 15:00:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Thu Nov  9 15:00:18 2017 - [info] End of log messages from host_2.

Thu Nov  9 15:00:18 2017 - [warning] Got error from host_2.

Thu Nov  9 15:00:18 2017 - [info] Getting new master's binlog name and position..

Thu Nov  9 15:00:18 2017 - [info]  host_1.000005:1390

Thu Nov  9 15:00:18 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Thu Nov  9 15:00:18 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000005, 1390, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,

Thu Nov  9 15:00:18 2017 - [info] Executing master IP activate script:

Thu Nov  9 15:00:18 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Thu Nov  9 15:00:22 2017 - [info]  OK.

Thu Nov  9 15:00:22 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..

Thu Nov  9 15:00:22 2017 - [info]  ok.

Thu Nov  9 15:00:22 2017 - [info] ** Finished master recovery successfully.

Thu Nov  9 15:00:22 2017 - [info] * Phase 3: Master Recovery Phase completed.

Thu Nov  9 15:00:22 2017 - [info]

Thu Nov  9 15:00:22 2017 - [info] * Phase 4: Slaves Recovery Phase..

Thu Nov  9 15:00:22 2017 - [info]

Thu Nov  9 15:00:22 2017 - [info]

Thu Nov  9 15:00:22 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Thu Nov  9 15:00:22 2017 - [info]

Thu Nov  9 15:00:22 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 180681. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109150009.log if it takes time..

Thu Nov  9 15:00:23 2017 - [info]

Thu Nov  9 15:00:23 2017 - [info] Log messages from host_3 ...

Thu Nov  9 15:00:23 2017 - [info]

Thu Nov  9 15:00:22 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..

Thu Nov  9 15:00:22 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 15:00:23 2017 - [info]  Slave started.

Thu Nov  9 15:00:23 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,

Thu Nov  9 15:00:23 2017 - [info] End of log messages from host_3.

Thu Nov  9 15:00:23 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Thu Nov  9 15:00:23 2017 - [info] All new slave servers recovered successfully.

Thu Nov  9 15:00:23 2017 - [info]

Thu Nov  9 15:00:23 2017 - [info] * Phase 5: New master cleanup phase..

Thu Nov  9 15:00:23 2017 - [info]

Thu Nov  9 15:00:23 2017 - [info] Resetting slave info on the new master..

Thu Nov  9 15:00:23 2017 - [info]  host_2: Resetting slave info succeeded.

Thu Nov  9 15:00:23 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.

Thu Nov  9 15:00:23 2017 - [info]

Thu Nov  9 15:00:23 2017 - [info] Sending mail..

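Conclusion: because the master is not configured as a binlog server, the transactions that the master had not yet sent to any slave are lost.
Fortunately, the slave and the etl node can CHANGE MASTER to each other, so even though the slave (candidate master) was behind, the etl node's logs were used to fill in the transactions the slave was missing.

binlog server not configured at all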
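
The conclusion above can be checked with a little GTID set arithmetic. The following is a minimal sketch, not part of MHA: the helper names (`parse_gtid_set`, `gtid_subtract`, `format_gtid_set`) are hypothetical, and the input strings are the Executed_Gtid_Set values listed for this scenario. It mimics what MySQL's `GTID_SUBTRACT()` function computes.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]            # inclusive (start, end)
GtidSet = Dict[str, List[Interval]]   # server UUID -> sorted intervals


def parse_gtid_set(text: str) -> GtidSet:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(1, 5), (7, 7)], ...}."""
    result: GtidSet = {}
    for entry in text.replace("\n", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        uuid, *ranges = entry.split(":")
        intervals = result.setdefault(uuid, [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
        intervals.sort()
    return result


def gtid_subtract(a: GtidSet, b: GtidSet) -> GtidSet:
    """Return the GTIDs that are in a but not in b."""
    missing: GtidSet = {}
    for uuid, intervals in a.items():
        remaining = list(intervals)
        for b_start, b_end in b.get(uuid, []):
            next_remaining = []
            for start, end in remaining:
                if b_end < start or b_start > end:
                    next_remaining.append((start, end))   # no overlap
                    continue
                if start < b_start:
                    next_remaining.append((start, b_start - 1))
                if end > b_end:
                    next_remaining.append((b_end + 1, end))
            remaining = next_remaining
        if remaining:
            missing[uuid] = remaining
    return missing


def format_gtid_set(gtids: GtidSet) -> str:
    """Render a GtidSet back into MySQL's textual form."""
    parts = []
    for uuid in sorted(gtids):
        ranges = ":".join(str(s) if s == e else f"{s}-{e}" for s, e in gtids[uuid])
        parts.append(f"{uuid}:{ranges}")
    return ",".join(parts)


# Executed_Gtid_Set values from the three servers in this scenario.
U1 = "0923e916-3c36-11e6-82a5-ecf4bbf1f518"
U2 = "ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c"
master = parse_gtid_set(f"{U1}:1-31,\n{U2}:1-446369")
slave = parse_gtid_set(f"{U1}:1-21,\n{U2}:1-446369")
etl = parse_gtid_set(f"{U1}:1-25,\n{U2}:1-446369")

print("slave fetches from etl:", format_gtid_set(gtid_subtract(etl, slave)))
print("lost with the dead master:", format_gtid_set(gtid_subtract(master, etl)))
```

The first difference is what the candidate master can recover from the etl node. The second is what only the dead master held, which is why the new master ends at `:1-25` in the log.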

### GTID state of the three DBs

* master host_2

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| host_1.000005 |     5785 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446378 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave host_1

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369

                Auto_Position: 1

* etl host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372

                Auto_Position: 1
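
Before reading the failover log, it helps to work out which replica is furthest ahead and how many transactions each one is missing. The sketch below does that from the Executed_Gtid_Set values above. It is an illustration only: MHA itself compares the master binlog file/position each slave has read, as the log shows, and the helper names here are hypothetical. It also assumes every UUID has a single contiguous range starting at 1, which holds for the output above but not in general.

```python
from typing import Dict


def high_water_marks(executed_gtid_set: str) -> Dict[str, int]:
    """Map each source UUID to its highest executed transaction number.

    Assumption: one contiguous range per UUID (e.g. 'uuid:1-31').
    """
    marks: Dict[str, int] = {}
    for entry in executed_gtid_set.replace("\n", "").split(","):
        fields = entry.strip().split(":")
        marks[fields[0]] = int(fields[-1].split("-")[-1])
    return marks


def lag_behind(master: Dict[str, int], replica: Dict[str, int]) -> int:
    """Count the transactions the master executed that the replica has not."""
    return sum(top - replica.get(uuid, 0) for uuid, top in master.items())


U1 = "0923e916-3c36-11e6-82a5-ecf4bbf1f518"
U2 = "ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c"

master = high_water_marks(f"{U1}:1-31,\n{U2}:1-446378")
replicas = {
    "host_1": high_water_marks(f"{U1}:1-31,\n{U2}:1-446369"),  # candidate master
    "host_3": high_water_marks(f"{U1}:1-31,\n{U2}:1-446372"),  # etl
}

lags = {host: lag_behind(master, marks) for host, marks in replicas.items()}
latest = min(lags, key=lags.get)

print(lags)
print("latest replica:", latest)
```

With these inputs the candidate master is 9 transactions behind the master and the etl node is 6 behind, so the candidate can recover 3 transactions from the etl node and the remaining 6 exist only on the dead master.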

### Failover log

Thu Nov  9 16:22:41 2017 - [info] MHA::MasterFailover version 0.56.

Thu Nov  9 16:22:41 2017 - [info] Starting master failover.

Thu Nov  9 16:22:41 2017 - [info]

Thu Nov  9 16:22:41 2017 - [info] * Phase 1: Configuration Check Phase..

Thu Nov  9 16:22:41 2017 - [info]

Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)

Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Thu Nov  9 16:22:41 2017 - [info] GTID failover mode = 1

Thu Nov  9 16:22:41 2017 - [info] Dead Servers:

Thu Nov  9 16:22:41 2017 - [info]   host_2(host_2:3306)

Thu Nov  9 16:22:41 2017 - [info] Checking master reachability via MySQL(double check)...

Thu Nov  9 16:22:41 2017 - [info]  ok.

Thu Nov  9 16:22:41 2017 - [info] Alive Servers:

Thu Nov  9 16:22:41 2017 - [info]   host_1(host_1:3306)

Thu Nov  9 16:22:41 2017 - [info]   host_3(host_3:3306)

Thu Nov  9 16:22:41 2017 - [info] Alive Slaves:

Thu Nov  9 16:22:41 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 16:22:41 2017 - [info]     GTID ON

Thu Nov  9 16:22:41 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 16:22:41 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 16:22:41 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 16:22:41 2017 - [info]     GTID ON

Thu Nov  9 16:22:41 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 16:22:41 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..

Thu Nov  9 16:22:41 2017 - [info]   done.

Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Thu Nov  9 16:22:41 2017 - [info]   done.

Thu Nov  9 16:22:41 2017 - [info] Starting GTID based failover.

Thu Nov  9 16:22:41 2017 - [info]

Thu Nov  9 16:22:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Thu Nov  9 16:22:41 2017 - [info]

Thu Nov  9 16:22:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Thu Nov  9 16:22:41 2017 - [info]

Thu Nov  9 16:22:42 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Thu Nov  9 16:22:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Thu Nov  9 16:22:42 2017 - [info] Executing master IP deactivation script:

Thu Nov  9 16:22:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root

===================    swift vip :  vip from host_2 is deleted  ==============================

--2017-11-09 16:22:42--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi

Connecting to tgw_server:80... connected.

HTTP request sent, awaiting response... 200 OK

Length: unspecified [text/html]

Saving to: “STDOUT”

     0K                                                        9.79M=0s

2017-11-09 16:22:44 (9.79 MB/s) - written to stdout [38]

Thu Nov  9 16:22:44 2017 - [info]  done.

Thu Nov  9 16:22:44 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Thu Nov  9 16:22:44 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info] * Phase 3: Master Recovery Phase..

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info] The latest binary log file/position on all slaves is host_1.000005:4015

Thu Nov  9 16:22:44 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372

Thu Nov  9 16:22:44 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Thu Nov  9 16:22:44 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 16:22:44 2017 - [info]     GTID ON

Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 16:22:44 2017 - [info] The oldest binary log file/position on all slaves is host_1.000005:3130

Thu Nov  9 16:22:44 2017 - [info] Oldest slaves:

Thu Nov  9 16:22:44 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 16:22:44 2017 - [info]     GTID ON

Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: Determining New Master Phase..

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info] Searching new master from slaves..

Thu Nov  9 16:22:44 2017 - [info]  Candidate masters from the configuration file:

Thu Nov  9 16:22:44 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 16:22:44 2017 - [info]     GTID ON

Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Nov  9 16:22:44 2017 - [info]  Non-candidate masters:

Thu Nov  9 16:22:44 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Thu Nov  9 16:22:44 2017 - [info]     GTID ON

Thu Nov  9 16:22:44 2017 - [info]     Replicating from host_2(host_2:3306)

Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)

Thu Nov  9 16:22:44 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Thu Nov  9 16:22:44 2017 - [info]   Not found.

Thu Nov  9 16:22:44 2017 - [info]  Searching from all candidate_master slaves..

Thu Nov  9 16:22:44 2017 - [info] New master is host_1(host_1:3306)

Thu Nov  9 16:22:44 2017 - [info] Starting master failover..

Thu Nov  9 16:22:44 2017 - [info]

From:

host_2(host_2:3306) (current master)

 +--host_1(host_1:3306)

 +--host_3(host_3:3306)

To:

host_1(host_1:3306) (new master)

 +--host_3(host_3:3306)

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Thu Nov  9 16:22:44 2017 - [info]

Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied..

Thu Nov  9 16:22:44 2017 - [info]   done.

Thu Nov  9 16:22:44 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..

Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied on the latest slave..

Thu Nov  9 16:22:44 2017 - [info]  Resetting slave host_1(host_1:3306) and starting replication from the new master host_3(host_3:3306)..

Thu Nov  9 16:22:44 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 16:22:45 2017 - [info]  Slave started.

Thu Nov  9 16:22:45 2017 - [info]  Waiting to execute all relay logs on host_1(host_1:3306)..

Thu Nov  9 16:22:45 2017 - [info]  master_pos_wait(host_3.000049:28663) completed on host_1(host_1:3306). Executed 0 events.

Thu Nov  9 16:22:45 2017 - [info]   done.

Thu Nov  9 16:22:45 2017 - [info]   done.

Thu Nov  9 16:22:45 2017 - [info] Getting new master's binlog name and position..

Thu Nov  9 16:22:45 2017 - [info]  tjtx-126-164.000056:1170

Thu Nov  9 16:22:45 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Thu Nov  9 16:22:45 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000056, 1170, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372

Thu Nov  9 16:22:45 2017 - [info] Executing master IP activate script:

Thu Nov  9 16:22:45 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Unknown option: new_master_user

Unknown option: new_master_password

===================    swift vip :  vip to host_1  is added  ==============================

Thu Nov  9 16:22:47 2017 - [info]  OK.

Thu Nov  9 16:22:47 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..

Thu Nov  9 16:22:47 2017 - [info]  ok.

Thu Nov  9 16:22:47 2017 - [info] ** Finished master recovery successfully.

Thu Nov  9 16:22:47 2017 - [info] * Phase 3: Master Recovery Phase completed.

Thu Nov  9 16:22:47 2017 - [info]

Thu Nov  9 16:22:47 2017 - [info] * Phase 4: Slaves Recovery Phase..

Thu Nov  9 16:22:47 2017 - [info]

Thu Nov  9 16:22:47 2017 - [info]

Thu Nov  9 16:22:47 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Thu Nov  9 16:22:47 2017 - [info]

Thu Nov  9 16:22:47 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 112317. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171109162241.log if it takes time..

Thu Nov  9 16:22:48 2017 - [info]

Thu Nov  9 16:22:48 2017 - [info] Log messages from host_3 ...

Thu Nov  9 16:22:48 2017 - [info]

Thu Nov  9 16:22:47 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..

Thu Nov  9 16:22:47 2017 - [info]  Executed CHANGE MASTER.

Thu Nov  9 16:22:48 2017 - [info]  Slave started.

Thu Nov  9 16:22:48 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372) completed on host_3(host_3:3306). Executed 0 events.

Thu Nov  9 16:22:48 2017 - [info] End of log messages from host_3.

Thu Nov  9 16:22:48 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Thu Nov  9 16:22:48 2017 - [info] All new slave servers recovered successfully.

Thu Nov  9 16:22:48 2017 - [info]

Thu Nov  9 16:22:48 2017 - [info] * Phase 5: New master cleanup phase..

Thu Nov  9 16:22:48 2017 - [info]

Thu Nov  9 16:22:48 2017 - [info] Resetting slave info on the new master..

Thu Nov  9 16:22:49 2017 - [info]  host_1: Resetting slave info succeeded.

Thu Nov  9 16:22:49 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.

Thu Nov  9 16:22:49 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded

Master host_2(host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.

Invalidated master IP address on host_2(host_2:3306)

Selected host_1(host_1:3306) as a new master.

host_1(host_1:3306): OK: Applying all logs succeeded.

host_1(host_1:3306): OK: Activated master IP address.

host_3(host_3:3306): OK: Slave started, replicating from host_1(host_1:3306)

host_1(host_1:3306): Resetting slave info succeeded.

Master failover to host_1(host_1:3306) completed successfully.

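1.6 If MHA fails partway through, can the failover be run again?

In roughly 99% of cases the failover can simply be re-run.
In the remaining 1% it cannot; running it again fails with an error.

That case is typically when the failover had already reached the final CHANGE MASTER stage: the replication topology has already changed, so MHA cannot walk through the procedure a second time.
Even so, a failure at that step means the new master has already been brought up to date. Because replication is GTID-based, you can manually CHANGE MASTER each slave to the new master, then activate the new master's IP and set read_only=0 on the new master.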

Thu Nov  9 16:49:39 2017 - [info] MHA::MasterFailover version 0.56.

Thu Nov  9 16:49:39 2017 - [info] Starting master failover.

Thu Nov  9 16:49:39 2017 - [info]

Thu Nov  9 16:49:39 2017 - [info] * Phase 1: Configuration Check Phase..

Thu Nov  9 16:49:39 2017 - [info]

Thu Nov  9 16:49:39 2017 - [info] GTID failover mode = 1

Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master host_1(host_1:3306) does not match with specified dead master host_2(host_2:3306)!

Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53
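
When MHA refuses to run again like this, the remaining steps can be done by hand. The sketch below only builds the SQL statements. It is not an MHA tool, and `manual_repoint_statements` is a hypothetical helper. The CHANGE MASTER text follows the statement MHA prints in the logs above. The host name, the `repl` user and the `xxx` password are placeholders taken from those logs, and `read_only=0` mirrors the "Setting read_only=0" log line.

```python
from typing import Dict, List


def manual_repoint_statements(
    new_master_host: str,
    new_master_port: int = 3306,
    repl_user: str = "repl",        # placeholder, as printed in the MHA log
    repl_password: str = "xxx",     # placeholder, as printed in the MHA log
) -> Dict[str, List[str]]:
    """Build the statements to finish a GTID failover manually."""
    on_each_slave = [
        "STOP SLAVE;",
        (
            f"CHANGE MASTER TO MASTER_HOST='{new_master_host}', "
            f"MASTER_PORT={new_master_port}, MASTER_AUTO_POSITION=1, "
            f"MASTER_USER='{repl_user}', MASTER_PASSWORD='{repl_password}';"
        ),
        "START SLAVE;",
    ]
    # The new master must accept writes once the VIP points at it.
    on_new_master = ["SET GLOBAL read_only=0;"]
    return {"on_each_slave": on_each_slave, "on_new_master": on_new_master}


plan = manual_repoint_statements("host_1")
for target, statements in plan.items():
    print(f"-- {target}")
    for sql in statements:
        print(sql)
```

Because MASTER_AUTO_POSITION=1 is used, no binlog file name or position has to be worked out by hand; each slave asks the new master for whatever GTIDs it is missing. Activating the new master's IP still has to be done with your own VIP script.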

1.7 Summary for Master : MySQL down

1. Final failover command

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

2. binlog server recommendation

Configuring the master as the binlog server is sufficient:

[binlog1]

$master_ip

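If only the slave is configured as a binlog server, or none is configured at all, the transactions that had not yet been sent from the master are lost.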
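
This recommendation can be turned into a quick pre-flight check on the MHA configuration. The sketch below is hypothetical: it assumes the config uses INI-style `[serverN]` and `[binlogN]` sections with a `hostname` key, the sample configs are made up for illustration, and the function names are not part of MHA.

```python
import configparser
from typing import List

# Illustrative configs only; the real files are not shown in this article.
SERVERS = """
[server1]
hostname=host_2

[server2]
hostname=host_1
candidate_master=1

[server3]
hostname=host_3
no_master=1
"""

SLAVE_ONLY_CONF = SERVERS + "\n[binlog1]\nhostname=host_1\n"
MASTER_CONF = SERVERS + "\n[binlog1]\nhostname=host_2\n"


def binlog_server_hosts(conf_text: str) -> List[str]:
    """Return the hostnames of all [binlogN] sections."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return [
        parser[section]["hostname"]
        for section in parser.sections()
        if section.startswith("binlog")
    ]


def covers_master(conf_text: str, master_host: str) -> bool:
    """True if the current master is listed as a binlog server."""
    return master_host in binlog_server_hosts(conf_text)


print(covers_master(SLAVE_ONLY_CONF, "host_2"))
print(covers_master(MASTER_CONF, "host_2"))
```

A check like this would flag the slave-only and empty setups tested above, both of which lost the transactions that had not yet left the master.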

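II. Master : Server down

2.1 etl delayed by 8 hours

 Same conclusion as 1.1

2.2 The slave (candidate master) is further behind than the etl node

2.2.1 Some of the master's logs have not yet reached either slave when the master server goes down

### GTID state of the three DBs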

* master host_2

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| host_1.000008 |     5445 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave host_1

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446385

                Auto_Position: 1

* etl host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388

                Auto_Position: 1

### Simulating the failure

* Isolate the master's network so that it is equivalent to the machine being down

master> iptables -A INPUT -p tcp -s monitor_ip --dport 22 -j ACCEPT

master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP

### Failover log

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Fri Nov 10 11:12:38 2017 - [info] MHA::MasterFailover version 0.56.

Fri Nov 10 11:12:38 2017 - [info] Starting master failover.

Fri Nov 10 11:12:38 2017 - [info]

Fri Nov 10 11:12:38 2017 - [info] * Phase 1: Configuration Check Phase..

Fri Nov 10 11:12:38 2017 - [info]

Fri Nov 10 11:13:28 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.

Fri Nov 10 11:13:28 2017 - [warning] Failed to SSH to binlog server host_2

Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to host_1 is reachable.

Fri Nov 10 11:13:29 2017 - [info] Binlog server host_1 is reachable.

Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to host_3 is reachable.

Fri Nov 10 11:13:29 2017 - [info] Binlog server host_3 is reachable.

Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)

Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Fri Nov 10 11:13:29 2017 - [info] GTID failover mode = 1

Fri Nov 10 11:13:29 2017 - [info] Dead Servers:

Fri Nov 10 11:13:29 2017 - [info]   host_2(host_2:3306)

Fri Nov 10 11:13:29 2017 - [info] Checking master reachability via MySQL(double check)...

Fri Nov 10 11:13:30 2017 - [info]  ok.

Fri Nov 10 11:13:30 2017 - [info] Alive Servers:

Fri Nov 10 11:13:30 2017 - [info]   host_1(host_1:3306)

Fri Nov 10 11:13:30 2017 - [info]   host_3(host_3:3306)

Fri Nov 10 11:13:30 2017 - [info] Alive Slaves:

Fri Nov 10 11:13:30 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 11:13:30 2017 - [info]     GTID ON

Fri Nov 10 11:13:30 2017 - [info]     Replicating from host_2(host_2:3306)

Fri Nov 10 11:13:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Fri Nov 10 11:13:30 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 11:13:30 2017 - [info]     GTID ON

Fri Nov 10 11:13:30 2017 - [info]     Replicating from host_2(host_2:3306)

Fri Nov 10 11:13:30 2017 - [info]     Not candidate for the new Master (no_master is set)

Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on host_1(host_1:3306) ..

Fri Nov 10 11:13:30 2017 - [info]   done.

Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Fri Nov 10 11:13:30 2017 - [info]   done.

Fri Nov 10 11:13:30 2017 - [info] Starting GTID based failover.

Fri Nov 10 11:13:30 2017 - [info]

Fri Nov 10 11:13:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Fri Nov 10 11:13:30 2017 - [info]

Fri Nov 10 11:13:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Fri Nov 10 11:13:30 2017 - [info]

Fri Nov 10 11:14:20 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.

Fri Nov 10 11:14:20 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Fri Nov 10 11:14:20 2017 - [info] Executing master IP deactivation script:

Fri Nov 10 11:14:20 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stop

ssh: connect to host host_2 port 22: Connection timed out

===================    swift vip :  vip from host_2 is deleted  ==============================

--2017-11-10 11:14:27--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi

Connecting to tgw_server:80... connected.

HTTP request sent, awaiting response... 200 OK

Length: unspecified [text/html]

Saving to: “STDOUT”

     0K                                                        11.4M=0s

2017-11-10 11:16:27 (11.4 MB/s) - written to stdout [38]

Fri Nov 10 11:16:27 2017 - [info]  done.

Fri Nov 10 11:16:27 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Fri Nov 10 11:16:27 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info] * Phase 3: Master Recovery Phase..

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info] The latest binary log file/position on all slaves is host_1.000008:4265

Fri Nov 10 11:16:27 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388

Fri Nov 10 11:16:27 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Fri Nov 10 11:16:27 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 11:16:27 2017 - [info]     GTID ON

Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)

Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)

Fri Nov 10 11:16:27 2017 - [info] The oldest binary log file/position on all slaves is host_1.000008:3380

Fri Nov 10 11:16:27 2017 - [info] Oldest slaves:

Fri Nov 10 11:16:27 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 11:16:27 2017 - [info]     GTID ON

Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)

Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: Determining New Master Phase..

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info] Searching new master from slaves..

Fri Nov 10 11:16:27 2017 - [info]  Candidate masters from the configuration file:

Fri Nov 10 11:16:27 2017 - [info]   host_1(host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 11:16:27 2017 - [info]     GTID ON

Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)

Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Fri Nov 10 11:16:27 2017 - [info]  Non-candidate masters:

Fri Nov 10 11:16:27 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 11:16:27 2017 - [info]     GTID ON

Fri Nov 10 11:16:27 2017 - [info]     Replicating from host_2(host_2:3306)

Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)

Fri Nov 10 11:16:27 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Fri Nov 10 11:16:27 2017 - [info]   Not found.

Fri Nov 10 11:16:27 2017 - [info]  Searching from all candidate_master slaves..

Fri Nov 10 11:16:27 2017 - [info] New master is host_1(host_1:3306)

Fri Nov 10 11:16:27 2017 - [info] Starting master failover..

Fri Nov 10 11:16:27 2017 - [info]

From:

host_2(host_2:3306) (current master)

 +--host_1(host_1:3306)

 +--host_3(host_3:3306)

To:

host_1(host_1:3306) (new master)

 +--host_3(host_3:3306)

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Fri Nov 10 11:16:27 2017 - [info]

Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied..

Fri Nov 10 11:16:27 2017 - [info]   done.

Fri Nov 10 11:16:27 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..

Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied on the latest slave..

Fri Nov 10 11:16:27 2017 - [info]  Resetting slave host_1(host_1:3306) and starting replication from the new master host_3(host_3:3306)..

Fri Nov 10 11:16:27 2017 - [info]  Executed CHANGE MASTER.

Fri Nov 10 11:16:28 2017 - [info]  Slave started.

Fri Nov 10 11:16:28 2017 - [info]  Waiting to execute all relay logs on host_1(host_1:3306)..

Fri Nov 10 11:16:28 2017 - [info]  master_pos_wait(host_3.000049:40136) completed on host_1(host_1:3306). Executed 0 events.

Fri Nov 10 11:16:28 2017 - [info]   done.

Fri Nov 10 11:16:28 2017 - [info]   done.

Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_2 started, pid: 43038

Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_1 started, pid: 43039

Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_3 started, pid: 43041

Fri Nov 10 11:16:28 2017 - [info]

Fri Nov 10 11:16:28 2017 - [info] Log messages from host_2 ...

Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_2.

Fri Nov 10 11:16:28 2017 - [warning] SSH is not reachable on host_2. Skipping

Fri Nov 10 11:16:28 2017 - [info]

Fri Nov 10 11:16:28 2017 - [info] Log messages from host_1 ...

Fri Nov 10 11:16:28 2017 - [info]

Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server host_1..

Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.

 at /usr/bin/save_binary_logs line 123

    eval {...} called at /usr/bin/save_binary_logs line 70

    main::main() called at /usr/bin/save_binary_logs line 66

Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_1.

Fri Nov 10 11:16:28 2017 - [warning] Got error from host_1.

Fri Nov 10 11:16:28 2017 - [info]

Fri Nov 10 11:16:28 2017 - [info] Log messages from host_3 ...

Fri Nov 10 11:16:28 2017 - [info]

Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server host_3..

Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=host_1.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.

 at /usr/bin/save_binary_logs line 123

    eval {...} called at /usr/bin/save_binary_logs line 70

    main::main() called at /usr/bin/save_binary_logs line 66

Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_3.

Fri Nov 10 11:16:28 2017 - [warning] Got error from host_3.

Fri Nov 10 11:16:28 2017 - [info] Getting new master's binlog name and position..

Fri Nov 10 11:16:28 2017 - [info]  tjtx-126-164.000058:4059

Fri Nov 10 11:16:28 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Fri Nov 10 11:16:28 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: tjtx-126-164.000058, 4059, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388

Fri Nov 10 11:16:28 2017 - [info] Executing master IP activate script:

Fri Nov 10 11:16:28 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Unknown option: new_master_user

Unknown option: new_master_password

===================    swift vip :  vip to host_1  is added  ==============================

Fri Nov 10 11:16:30 2017 - [info]  OK.

Fri Nov 10 11:16:30 2017 - [info] ** Finished master recovery successfully.

Fri Nov 10 11:16:30 2017 - [info] * Phase 3: Master Recovery Phase completed.

Fri Nov 10 11:16:30 2017 - [info]

Fri Nov 10 11:16:30 2017 - [info] * Phase 4: Slaves Recovery Phase..

Fri Nov 10 11:16:30 2017 - [info]

Fri Nov 10 11:16:30 2017 - [info]

Fri Nov 10 11:16:30 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Fri Nov 10 11:16:30 2017 - [info]

Fri Nov 10 11:16:30 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 46878. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171110111238.log if it takes time..

Fri Nov 10 11:16:31 2017 - [info]

Fri Nov 10 11:16:31 2017 - [info] Log messages from host_3 ...

Fri Nov 10 11:16:31 2017 - [info]

Fri Nov 10 11:16:30 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..

Fri Nov 10 11:16:30 2017 - [info]  Executed CHANGE MASTER.

Fri Nov 10 11:16:31 2017 - [info]  Slave started.

Fri Nov 10 11:16:31 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388) completed on host_3(host_3:3306). Executed 0 events.

Fri Nov 10 11:16:31 2017 - [info] End of log messages from host_3.

Fri Nov 10 11:16:31 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Fri Nov 10 11:16:31 2017 - [info] All new slave servers recovered successfully.

Fri Nov 10 11:16:31 2017 - [info]

Fri Nov 10 11:16:31 2017 - [info] * Phase 5: New master cleanup phase..

Fri Nov 10 11:16:31 2017 - [info]

Fri Nov 10 11:16:31 2017 - [info] Resetting slave info on the new master..

Fri Nov 10 11:16:31 2017 - [info]  host_1: Resetting slave info succeeded.

Fri Nov 10 11:16:31 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.

Fri Nov 10 11:16:31 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded

Master host_2(host_2:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.

Invalidated master IP address on host_2(host_2:3306)

Selected host_1(host_1:3306) as a new master.

host_1(host_1:3306): OK: Applying all logs succeeded.

host_1(host_1:3306): OK: Activated master IP address.

host_3(host_3:3306): OK: Slave started, replicating from host_1(host_1:3306)

host_1(host_1:3306): Resetting slave info succeeded.

Master failover to host_1(host_1:3306) completed successfully.

Fri Nov 10 11:16:31 2017 - [info] Sending mail..

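### The last step is important

If the dead master later comes back up, this step must be run on it:

dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md  ==> described in detail in the TGW section

Conclusion: The master is down and its most recent logs never reached the other servers, so the transactions the master had not yet sent are lost.
Fortunately, the slave and the etl host can CHANGE MASTER to each other, so even though the slave (candidate master) was behind, the etl host's logs filled in what the slave was missing.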
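The election and catch-up seen in the log above can be modeled roughly as follows. This is a simplified sketch, not MHA's actual code: `Slave` and `plan_failover` are hypothetical names, positions are compared as plain integers on the assumption that every slave is reading the same binlog file of the dead master (host_1.000008 in the log), and the last two search tiers are an assumption, since the log only shows the first two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slave:
    name: str
    received_pos: int            # position received from the dead master's binlog
    candidate_master: bool = False
    no_master: bool = False


def plan_failover(slaves):
    """Return (new_master, donor); donor is None when no catch-up is needed."""
    if not slaves:
        raise ValueError("no alive slaves")
    latest_pos = max(s.received_pos for s in slaves)
    latest = [s for s in slaves if s.received_pos == latest_pos]
    eligible = [s for s in slaves if not s.no_master]
    if not eligible:
        raise ValueError("every slave is marked no_master")
    tiers = [
        # 1. candidate_master slaves that received the latest relay log events
        [s for s in latest if s.candidate_master and not s.no_master],
        # 2. all candidate_master slaves
        [s for s in eligible if s.candidate_master],
        # 3. and 4. are assumptions: any eligible latest slave, then any eligible slave
        [s for s in latest if not s.no_master],
        eligible,
    ]
    new_master = next(tier[0] for tier in tiers if tier)
    # A new master that is behind first replicates from a latest slave.
    donor = None if new_master.received_pos == latest_pos else latest[0]
    return new_master, donor


# Positions taken from the log: host_3 is at 4265 (latest), host_1 at 3380 (oldest).
new_master, donor = plan_failover([
    Slave("host_1", 3380, candidate_master=True),
    Slave("host_3", 4265, no_master=True),
])
print(new_master.name, "catches up from", donor.name)
```

With the positions from the log, the first tier is empty (matching the "Not found." line), the second tier selects host_1, and host_3 becomes the temporary replication source.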

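2.2.2 The master server dies after all of its logs have reached the etl host

Test omitted; it is essentially the same as 2.2.1.

Conclusion: All of the master's logs reached the etl host, so in the end no data from the master is lost.

2.3 The slave (candidate master) has the newest logs and is ahead of etl

2.3.1 The master server dies before part of its logs have reached the two slaves

Test omitted; it is essentially the same as 2.2.1.

2.3.2 The master server dies after all of its logs have reached the slave

Test omitted; it is essentially the same as 2.2.1.

Conclusion: All of the master's logs reached the slave, so in the end no data from the master is lost.

2.4 A large transaction is running on the slave (candidate master)

A 1000s long-running query

Same conclusion as 1.4

flush tables with read lock



Same conclusion as 1.4

2.5 Binlog server tests in different scenarios

The last part of the dead master's logs has reached neither the slave nor etl, and the slave is also behind etl (this is the harshest case)

binlog server lists all 3 hosts

2.2.1 tests exactly this case; see 2.2.1 for the detailed failover log.

Conclusion: Although three binlog servers are configured, the master server is down, so logs cannot be fetched from the binlog server on the master, and the transactions the master had not yet sent are lost.

binlog server lists only the master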



### GTID state of the 3 DBs

* master host_1

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| tjtx-126-164.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave host_2

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392

                Auto_Position: 1

* etl host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392

                Auto_Position: 1

### Simulating the failure

master> iptables -A INPUT -p tcp -s monitor_ip --dport 22 -j ACCEPT

master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP

### Failover

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

Fri Nov 10 14:15:51 2017 - [info] MHA::MasterFailover version 0.56.

Fri Nov 10 14:15:51 2017 - [info] Starting master failover.

Fri Nov 10 14:15:51 2017 - [info]

Fri Nov 10 14:15:51 2017 - [info] * Phase 1: Configuration Check Phase..

Fri Nov 10 14:15:51 2017 - [info]

Fri Nov 10 14:16:41 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.

Fri Nov 10 14:16:41 2017 - [warning] Failed to SSH to binlog server host_1

Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln239] Binlog Server is defined but there is no alive server.

Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/share/perl5/vendor_perl/MHA/MasterFailover.pm line 2082

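Conclusion: The binlog server list must contain at least one alive server. If only the master is listed and the master dies, no binlog server is left, and MHA will not fail over.

binlog server lists only the slaves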



### GTID state of the 3 DBs

* master host_1

dba:lc> show master status;

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

| tjtx-126-164.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |

+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+

1 row in set (0.00 sec)

* slave host_2

           Retrieved_Gtid_Set:

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392

                Auto_Position: 1

* etl host_3

           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392

            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392

                Auto_Position: 1

### Simulating the failure

master> iptables -A INPUT -p tcp -s monitor_ip --dport 22 -j ACCEPT

master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP

### Failover

Fri Nov 10 14:29:50 2017 - [info] MHA::MasterFailover version 0.56.

Fri Nov 10 14:29:50 2017 - [info] Starting master failover.

Fri Nov 10 14:29:50 2017 - [info]

Fri Nov 10 14:29:50 2017 - [info] * Phase 1: Configuration Check Phase..

Fri Nov 10 14:29:50 2017 - [info]

Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.

Fri Nov 10 14:29:50 2017 - [info] Binlog server host_2 is reachable.

Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to host_3 is reachable.

Fri Nov 10 14:29:50 2017 - [info] Binlog server host_3 is reachable.

Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)

Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)

Fri Nov 10 14:29:50 2017 - [info] GTID failover mode = 1

Fri Nov 10 14:29:50 2017 - [info] Dead Servers:

Fri Nov 10 14:29:50 2017 - [info]   host_1(host_1:3306)

Fri Nov 10 14:29:50 2017 - [info] Checking master reachability via MySQL(double check)...

Fri Nov 10 14:29:51 2017 - [info]  ok.

Fri Nov 10 14:29:51 2017 - [info] Alive Servers:

Fri Nov 10 14:29:51 2017 - [info]   host_2(host_2:3306)

Fri Nov 10 14:29:51 2017 - [info]   host_3(host_3:3306)

Fri Nov 10 14:29:51 2017 - [info] Alive Slaves:

Fri Nov 10 14:29:51 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 14:29:51 2017 - [info]     GTID ON

Fri Nov 10 14:29:51 2017 - [info]     Replicating from host_1(host_1:3306)

Fri Nov 10 14:29:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Fri Nov 10 14:29:51 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 14:29:51 2017 - [info]     GTID ON

Fri Nov 10 14:29:51 2017 - [info]     Replicating from host_1(host_1:3306)

Fri Nov 10 14:29:51 2017 - [info]     Not candidate for the new Master (no_master is set)

Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on host_2(host_2:3306) ..

Fri Nov 10 14:29:51 2017 - [info]   done.

Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on host_3(host_3:3306) ..

Fri Nov 10 14:29:52 2017 - [info]   done.

Fri Nov 10 14:29:52 2017 - [info] Starting GTID based failover.

Fri Nov 10 14:29:52 2017 - [info]

Fri Nov 10 14:29:52 2017 - [info] ** Phase 1: Configuration Check Phase completed.

Fri Nov 10 14:29:52 2017 - [info]

Fri Nov 10 14:29:52 2017 - [info] * Phase 2: Dead Master Shutdown Phase..

Fri Nov 10 14:29:52 2017 - [info]

Fri Nov 10 14:30:42 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.

Fri Nov 10 14:30:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..

Fri Nov 10 14:30:42 2017 - [info] Executing master IP deactivation script:

Fri Nov 10 14:30:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop

ssh: connect to host host_1 port 22: Connection timed out

===================    swift vip :  vip from host_1 is deleted  ==============================

--2017-11-10 14:30:49--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi

Connecting to tgw_server:80... connected.

HTTP request sent, awaiting response... 200 OK

Length: unspecified [text/html]

Saving to: ‘STDOUT’

     0K                                                        12.1M=0s

2017-11-10 14:32:47 (12.1 MB/s) - written to stdout [38]

Fri Nov 10 14:32:47 2017 - [info]  done.

Fri Nov 10 14:32:47 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.

Fri Nov 10 14:32:47 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info] * Phase 3: Master Recovery Phase..

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info] The latest binary log file/position on all slaves is tjtx-126-164.000058:6912

Fri Nov 10 14:32:47 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392

Fri Nov 10 14:32:47 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):

Fri Nov 10 14:32:47 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 14:32:47 2017 - [info]     GTID ON

Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)

Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)

Fri Nov 10 14:32:47 2017 - [info] The oldest binary log file/position on all slaves is tjtx-126-164.000058:5307

Fri Nov 10 14:32:47 2017 - [info] Oldest slaves:

Fri Nov 10 14:32:47 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 14:32:47 2017 - [info]     GTID ON

Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)

Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: Determining New Master Phase..

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info] Searching new master from slaves..

Fri Nov 10 14:32:47 2017 - [info]  Candidate masters from the configuration file:

Fri Nov 10 14:32:47 2017 - [info]   host_2(host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 14:32:47 2017 - [info]     GTID ON

Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)

Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)

Fri Nov 10 14:32:47 2017 - [info]  Non-candidate masters:

Fri Nov 10 14:32:47 2017 - [info]   host_3(host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabled

Fri Nov 10 14:32:47 2017 - [info]     GTID ON

Fri Nov 10 14:32:47 2017 - [info]     Replicating from host_1(host_1:3306)

Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)

Fri Nov 10 14:32:47 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..

Fri Nov 10 14:32:47 2017 - [info]   Not found.

Fri Nov 10 14:32:47 2017 - [info]  Searching from all candidate_master slaves..

Fri Nov 10 14:32:47 2017 - [info] New master is host_2(host_2:3306)

Fri Nov 10 14:32:47 2017 - [info] Starting master failover..

Fri Nov 10 14:32:47 2017 - [info]

From:

host_1(host_1:3306) (current master)

 +--host_2(host_2:3306)

 +--host_3(host_3:3306)

To:

host_2(host_2:3306) (new master)

 +--host_3(host_3:3306)

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: New Master Recovery Phase..

Fri Nov 10 14:32:47 2017 - [info]

Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied..

Fri Nov 10 14:32:47 2017 - [info]   done.

Fri Nov 10 14:32:47 2017 - [info]  Replicating from the latest slave host_3(host_3:3306) and waiting to apply..

Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied on the latest slave..

Fri Nov 10 14:32:47 2017 - [info]  Resetting slave host_2(host_2:3306) and starting replication from the new master host_3(host_3:3306)..

Fri Nov 10 14:32:47 2017 - [info]  Executed CHANGE MASTER.

Fri Nov 10 14:32:48 2017 - [info]  Slave started.

Fri Nov 10 14:32:48 2017 - [info]  Waiting to execute all relay logs on host_2(host_2:3306)..

Fri Nov 10 14:32:48 2017 - [info]  master_pos_wait(host_3.000049:42954) completed on host_2(host_2:3306). Executed 0 events.

Fri Nov 10 14:32:48 2017 - [info]   done.

Fri Nov 10 14:32:48 2017 - [info]   done.

Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host host_2 started, pid: 76664

Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host host_3 started, pid: 76665

Fri Nov 10 14:32:48 2017 - [info]

Fri Nov 10 14:32:48 2017 - [info] Log messages from host_2 ...

Fri Nov 10 14:32:48 2017 - [info]

Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server host_2..

Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.

 at /usr/bin/save_binary_logs line 123

    eval {...} called at /usr/bin/save_binary_logs line 70

    main::main() called at /usr/bin/save_binary_logs line 66

Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Fri Nov 10 14:32:48 2017 - [info] End of log messages from host_2.

Fri Nov 10 14:32:48 2017 - [warning] Got error from host_2.

Fri Nov 10 14:32:48 2017 - [info]

Fri Nov 10 14:32:48 2017 - [info] Log messages from host_3 ...

Fri Nov 10 14:32:48 2017 - [info]

Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server host_3..

Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=tjtx-126-164.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.bin

Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.

 at /usr/bin/save_binary_logs line 123

    eval {...} called at /usr/bin/save_binary_logs line 70

    main::main() called at /usr/bin/save_binary_logs line 66

Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?

Fri Nov 10 14:32:48 2017 - [info] End of log messages from host_3.

Fri Nov 10 14:32:48 2017 - [warning] Got error from host_3.

Fri Nov 10 14:32:48 2017 - [info] Getting new master's binlog name and position..

Fri Nov 10 14:32:48 2017 - [info]  host_1.000008:6895

Fri Nov 10 14:32:48 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';

Fri Nov 10 14:32:48 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000008, 6895, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392

Fri Nov 10 14:32:48 2017 - [info] Executing master IP activate script:

Fri Nov 10 14:32:48 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='xxx' --new_master_password='xxx'

Unknown option: new_master_user

Unknown option: new_master_password

===================    swift vip :  vip to host_2  is added  ==============================

Fri Nov 10 14:32:51 2017 - [info]  OK.

Fri Nov 10 14:32:51 2017 - [info] ** Finished master recovery successfully.

Fri Nov 10 14:32:51 2017 - [info] * Phase 3: Master Recovery Phase completed.

Fri Nov 10 14:32:51 2017 - [info]

Fri Nov 10 14:32:51 2017 - [info] * Phase 4: Slaves Recovery Phase..

Fri Nov 10 14:32:51 2017 - [info]

Fri Nov 10 14:32:51 2017 - [info]

Fri Nov 10 14:32:51 2017 - [info] * Phase 4.1: Starting Slaves in parallel..

Fri Nov 10 14:32:51 2017 - [info]

Fri Nov 10 14:32:51 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 80398. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171110142950.log if it takes time..

Fri Nov 10 14:32:52 2017 - [info]

Fri Nov 10 14:32:52 2017 - [info] Log messages from host_3 ...

Fri Nov 10 14:32:52 2017 - [info]

Fri Nov 10 14:32:51 2017 - [info]  Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..

Fri Nov 10 14:32:51 2017 - [info]  Executed CHANGE MASTER.

Fri Nov 10 14:32:52 2017 - [info]  Slave started.

Fri Nov 10 14:32:52 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,

ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392) completed on host_3(host_3:3306). Executed 0 events.

Fri Nov 10 14:32:52 2017 - [info] End of log messages from host_3.

Fri Nov 10 14:32:52 2017 - [info] -- Slave on host host_3(host_3:3306) started.

Fri Nov 10 14:32:52 2017 - [info] All new slave servers recovered successfully.

Fri Nov 10 14:32:52 2017 - [info]

Fri Nov 10 14:32:52 2017 - [info] * Phase 5: New master cleanup phase..

Fri Nov 10 14:32:52 2017 - [info]

Fri Nov 10 14:32:52 2017 - [info] Resetting slave info on the new master..

Fri Nov 10 14:32:52 2017 - [info]  host_2: Resetting slave info succeeded.

Fri Nov 10 14:32:52 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.

Fri Nov 10 14:32:52 2017 - [info]

----- Failover Report -----

bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded

Master host_1(host_1:3306) is down!

Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.

Started automated(non-interactive) failover.

Invalidated master IP address on host_1(host_1:3306)

Selected host_2(host_2:3306) as a new master.

host_2(host_2:3306): OK: Applying all logs succeeded.

host_2(host_2:3306): OK: Activated master IP address.

host_3(host_3:3306): OK: Slave started, replicating from host_2(host_2:3306)

host_2(host_2:3306): Resetting slave info succeeded.

Master failover to host_2(host_2:3306) completed successfully.

Fri Nov 10 14:32:52 2017 - [info] Sending mail..

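Conclusion: Configuring the binlog server list as multiple slaves is the correct approach. Because the master is down, the binlog events it had not yet sent are lost, and nothing can be done about that. Fortunately, the remaining slaves automatically fill each other in from the logs that do exist.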
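To make the loss concrete, the GTID sets listed for this test can be compared directly. The sketch below is illustrative only: `parse_gtid_set` and `missing` are hypothetical helper names, and only the first server UUID is compared because the second (`ebd9ff93-...:1-446392`) is identical on all three hosts.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def missing(source, target):
    """Transactions present in source but absent from target, per server UUID."""
    diff = {}
    for uuid, numbers in source.items():
        gap = numbers - target.get(uuid, set())
        if gap:
            diff[uuid] = sorted(gap)
    return diff


UUID = "0923e916-3c36-11e6-82a5-ecf4bbf1f518"
master = parse_gtid_set(UUID + ":1-60")      # dead master host_1
candidate = parse_gtid_set(UUID + ":1-50")   # slave host_2 (candidate master)
etl = parse_gtid_set(UUID + ":1-55")         # etl host_3 (latest slave)

recovered_from_etl = missing(etl, candidate)  # transactions 51-55
lost_with_master = missing(master, etl)       # transactions 56-60

print("recovered from etl:", recovered_from_etl)
print("lost with the dead master:", lost_with_master)
```

This agrees with the failover log, where the new master ends with `0923e916-...:1-55`: transactions 51-55 were recovered from the etl host, while 56-60 existed only on the dead master.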

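binlog server not configured at all



The failover succeeds. Because the master is down, the binlog events it had not yet sent are lost.

Fortunately, the remaining slaves automatically fill each other in from the logs that do exist.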
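The four binlog server scenarios in 2.5 come down to one rule that the logs suggest: if binlog servers are defined, at least one of them must be reachable, otherwise the failover aborts in Phase 1. The following is a minimal sketch of that rule only; `usable_binlog_servers` is a hypothetical name and this is not MHA's implementation.

```python
def usable_binlog_servers(binlog_servers, alive_hosts):
    """Return the reachable binlog servers, or raise if some are defined but none is alive."""
    alive = [host for host in binlog_servers if host in alive_hosts]
    if binlog_servers and not alive:
        # Mirrors the error message in the "master only" log.
        raise RuntimeError("Binlog Server is defined but there is no alive server.")
    return alive


# host_1 is the dead master; host_2 and host_3 are still reachable.
alive_hosts = {"host_2", "host_3"}

scenarios = {
    "all three hosts": ["host_1", "host_2", "host_3"],
    "master only": ["host_1"],
    "slaves only": ["host_2", "host_3"],
    "not configured": [],
}

for label, servers in scenarios.items():
    try:
        print(label, "-> proceed with", usable_binlog_servers(servers, alive_hosts))
    except RuntimeError as err:
        print(label, "-> abort:", err)
```

Only the "master only" layout aborts. In every other layout the failover proceeds, but the events that never left the dead master are still lost.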

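2.6 If MHA fails partway through, can the MHA failover be run again?

Same conclusion as 1.6

III. Pitfalls encountered

3.1 In interactive mode, the failover is aborted if 'YES' is not typed in time

IV. Summary

For MHA + GTID mode, the key configuration and usage are as follows:

1. command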

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host=host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
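If the failover is triggered from a script, the same command line can be assembled as an argument list. This is only a sketch: `build_switch_command` is a hypothetical helper, and every flag it emits is copied from the command above. Note that `--interactive=0` also avoids the pitfall in 3.1, since no 'YES' prompt has to be answered.

```python
import shlex


def build_switch_command(global_conf, conf, dead_master_host, dead_master_port=3306):
    """Build the non-interactive dead-master failover command as an argument list."""
    return [
        "masterha_master_switch",
        f"--global_conf={global_conf}",
        f"--conf={conf}",
        f"--dead_master_host={dead_master_host}",
        f"--dead_master_port={dead_master_port}",
        "--master_state=dead",
        "--interactive=0",
        "--ignore_last_failover",
        "--ignore_binlog_server_error",
    ]


cmd = build_switch_command(
    "/data/online/agent/MHA/conf/masterha_default.cnf",
    "/data/online/agent/MHA/conf/bak_mha_test.cnf",
    "host_1",
)
# Print the command for review; it could be passed to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```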

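2. binlog server

List the master, slave, and etl hosts all as binlog servers in the configuration file. Taking both the MySQL-down and the DB-server-down cases into account, this is the recommended configuration.

3. tgw cleanup

If the dead master can still be brought back up, this must be run on it: /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c

For the reason, see: http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW section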

