作者:任坤
现居珠海,先后担任专职 Oracle 和 MySQL DBA,现在主要负责 MySQL、mongoDB 和 Redis 维护工作。
本文来源:原创投稿
背景
线上有一套6节点 redis cluster ,6分片 * 2副本,每个节点上2个实例,端口号分别为7000和7001。
诊断
登录该节点,查看 redis 的日志
22996:S 20 Jan 2023 07:27:15.091 * Connecting to MASTER x.x.x.46:7001
22996:S 20 Jan 2023 07:27:15.091 * MASTER <‐> REPLICA sync started
22996:S 20 Jan 2023 07:27:15.106 * Non blocking connect for SYNC fired the event.
22996:S 20 Jan 2023 07:27:15.106 * Master replied to PING, replication can continue...
22996:S 20 Jan 2023 07:27:15.106 * Trying a partial resynchronization (request 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180528579709).
22996:S 20 Jan 2023 07:35:29.263 * Full resync from master: 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180734380451
22996:S 20 Jan 2023 07:35:29.263 * Discarding previously cached master state.
22996:S 20 Jan 2023 07:44:47.717 * MASTER <‐> REPLICA sync: receiving 22930214160 bytes from master
实例启动后和主库进行连接,先尝试 partial resync 失败,后进行 full resync
22996:S 20 Jan 2023 07:48:07.305 * MASTER <‐> REPLICA sync: Flushing old data
22996:S 20 Jan 2023 07:53:24.576 * MASTER <‐> REPLICA sync: Loading DB in memory
22996:S 20 Jan 2023 07:59:59.491 * MASTER <‐> REPLICA sync: Finished with success
耗时11分钟完成旧数据清理和新 rdb 加载,此时却发现和主库的连接中断
22996:S 20 Jan 2023 07:59:59.521 # Connection with master lost.
22996:S 20 Jan 2023 07:59:59.521 * Caching the disconnected master state.
于是乎又发起和主库的连接
22996:S 20 Jan 2023 08:00:00.404 * Connecting to MASTER x.x.x.46:7001
22996:S 20 Jan 2023 08:00:00.404 * MASTER <‐> REPLICA sync started
22996:S 20 Jan 2023 08:00:00.405 * Non blocking connect for SYNC fired the event.
22996:S 20 Jan 2023 08:00:00.406 * Master replied to PING, replication can continue...
22996:S 20 Jan 2023 08:00:00.408 * Trying a partial resynchronization (request 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180736029100).
22996:S 20 Jan 2023 08:08:21.849 * Full resync from master: 174e5c92c731090d3c9a05f6364ffff5a70e61d9:7180922115631
有2个疑问:
查看主库日志,每9分钟发起1次 bgsave ,每次 bgsave 期间新产生的内存有2600M之多,repl-backlog-size 默认只有100M,而从库节点从宕机到完成启动耗时大约15分钟,此时缓冲区中的复制位点早被覆盖,难怪 partial resync 会失败。
38241:C 20 Jan 2023 07:35:25.836 * DB saved on disk
38241:C 20 Jan 2023 07:35:26.552 * RDB: 2663 MB of memory used by copy‐on‐write
40362:M 20 Jan 2023 07:35:27.950 * Background saving terminated with success
40362:M 20 Jan 2023 07:35:27.950 * Starting BGSAVE for SYNC with target:disk
40362:M 20 Jan 2023 07:35:29.263 * Background saving started by pid 11680
11680:C 20 Jan 2023 07:44:44.585 * DB saved on disk
11680:C 20 Jan 2023 07:44:45.811 * RDB: 2681 MB of memory used by copy‐on‐write
继续看日志,
40362:M 20 Jan 2023 07:48:03.104 * Synchronization with replica x.x.x.45:7000 succeeded
40362:M 20 Jan 2023 07:48:17.100 * FAIL message received from 8e2a54fbaac768a5cc0e717f4aa93c6be8683ffe about ccb7589e3240bc95557ffb282435afd5dc13e4c9
40362:M 20 Jan 2023 07:50:17.109 # Disconnecting timedout replica:x.x.x.45:7000
40362:M 20 Jan 2023 07:50:17.109 # Connection with replica x.x.x.45:7000 lost.
40362:M 20 Jan 2023 07:53:26.114 * Clear FAIL state for node ccb7589e3240bc95557ffb282435afd5dc13e4c9: replica is reachable again.
40362:M 20 Jan 2023 08:00:00.408 * Replica x.x.x.45:7000 asks for sync hronization
解决方案
本文关键字:#repl-timeout# #连接超时#