转https://blogs.oracle.com/jsmyth/what-causes-replication-lag
What Causes Replication Lag?
Mark LewinMYSQL CURRICULUM DEVELOPER
Replication lag occurs when the slaves (or secondaries) cannot keep up with the updates occuring on the master (or primary). Unapplied changes accumulate in the slaves‘ relay logs and the version of the database on the slaves becomes increasingly different from that of the master.
To work out what‘s causing the lag, you must determine which replication thread is getting backed up. Replication relies on three threads per master/slave connection: one is created on the master and two are created on the slave.
- The Slave I/O Thread. When you issue START SLAVE on a slave server, the slave creates this thread which connects to the master and requests a copy of the master‘s binary log.
- The Binlog Dump Thread. When the slave connects to the master, the master uses this thread to send the slave the contents of its binary log.
- The Slave SQL Thread. The slaves creates this SQL (or applier) thread to read the contents of the retrieved binary log and apply its contents.
Replication lag is caused when either the I/O Thread or SQL Thread cannot cope with the demands placed upon it.
If the I/O Thread is suffering, this means that the network connection between the master and its slaves is slow. You might want to consider enabling the slave_compressed_protocol to compress network traffic or speaking to your network administrator.
If it‘s the SQL thread then your problem is probably due to poorly-optimized queries that are taking the slave too long to apply. There may be long-running transactions or too much I/O activity. Having no primary key on the slave tables when using the ROW or MIXED replication format is also a common cause of lag on this thread: check that both your master and slave versions of tables have a primary key.
But how do you know which thread is affected? This depends on whether you are using binary log file names and positions or GTIDs.
If you are using binary log file names and positions, then executing SHOW MASTER STATUS and SHOW SLAVE STATUS gives you all you need to work out which thread is causing the problems:
master> SHOW MASTER STATUS; --------------- ----------- | File | Position |... --------------- ----------- | binlog.000002 | 121752008 |... --------------- ----------- slave> SHOW SLAVE STATUSG ********************* 1. row ********************* Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: replication Master_Port: 22808 Connect_Retry: 60 Master_Log_File: binlog.000002 Read_Master_Log_Pos: 121409852 Relay_Log_File: relaylog.000002 Relay_Log_Pos: 119819329 Relay_Master_Log_File: binlog.000002 Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Exec_Master_Log_Pos: 120003004 Relay_Log_Space: 121226377
You can calculate any I/O thread lag by subtracting the value of Read_Master_Log_Pos from Position.
Bytes behind master = Position - Read_Master_Log_Pos = 121752008 - 121409852 = 342156 bytes
Note that this assumes that the binary log file name is the same on both the slave and the master, otherwise the calculation becomes more complex because you must take into account the size of the binary log files.
The following calculation tells you if the slave SQL thread is lagging behind the replication master:
Bytes behind master = Position - Exec_Master_Log_Pos = 121752008 - 120003004 = 1749004 bytes
Or, if the slave‘s SQL thread is lagging behind its I/O thread, then this will be visible in the results of the following calculation:
Bytes behind master = Read_Master_Log_Pos - Exec_Master_Log_Pos = 121409852 - 120003004 = 1406848 bytes
The Seconds_Behind_Master column in the output of SHOW SLAVE STATUS also gives you a rough idea of the extent of the delay. Don‘t rely on it too much though, because it doesn‘t take into account long-running transactions or any operations that it performs on the master that is not required on the slave. Its value is best monitored over a period.
If your replication setup uses GTIDs, then you‘ll use the following information to diagnose replication lag:
- Executed GTIDs. Shows which transactions have been executed on the instance. Available on both master and slave. You can also get the GTID of the latest transaction executed from the global gtid_executed status variable.
- Retrieved GTIDs. Shows which transactions the I/O thread has received. Available only on the slave.
- Purged GTIDs. Shows which transactions have been purged from the binary log after completion. You‘re only interested in the transactions purged by the slave in this context.
Once you have the GTIDs you can use the GTID_SUBTRACT() function to calculate the difference in the slave‘s view of the world from that of the master. For example, the following query on the slave shows the GTIDs read from the binary log that have not yet been applied (SQL thread lag):
slave> SELECT GTID_SUBTRACT(‘96985d6f-2ebc-11e7-84df-08002715584a:5-133‘, ‘96985d6f-2ebc-11e7-84df-08002715584a:26-132‘) AS MissingGTIDs; ----------------------------------------------- | MissingGTIDs | ----------------------------------------------- | 96985d6f-2ebc-11e7-84df-08002715584a:5-25:133 | ----------------------------------------------- 1 row in set (0.00 sec)