Fixing the Redis/Codis "Connection with master lost" (replication timeout) problem

Today a codis-server alert fired in our production environment. I handled it with our routine procedure, which goes as follows:

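First, remove the codis-slave's rdb file and restart the codis-slave.
In codis-dashboard, remove the codis-slave from the affected codis group.
Add the codis-slave back to the codis group, then write some data on the codis-master and check that it can be read from the codis-slave.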

Unexpectedly, while the re-added slave was syncing data, the master logged the following errors:

[13029] 15 Oct 13:56:29.063 # Client id=8443510 addr=10.24.193.69:30377 fd=6 name= age=187 idle=187 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16365 oll=3917 omem=100541448 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

[13029] 15 Oct 13:56:29.160 # Connection with slave 10.24.193.69:6379 lost.

[13029] 15 Oct 13:56:30.607 * Slave 10.24.193.69:6379 asks for synchronization

[13029] 15 Oct 13:56:30.607 * Full resync requested by slave 10.24.193.69:6379

[13029] 15 Oct 13:56:30.607 * Starting BGSAVE for SYNC with target: disk

[13029] 15 Oct 13:56:30.856 * Background saving started by pid 17765

[17765] 15 Oct 13:58:26.910 * DB saved on disk

[17765] 15 Oct 13:58:27.093 * RDB: 969 MB of memory used by copy-on-write

[13029] 15 Oct 13:58:27.492 * Background saving terminated with success
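
To make the first log line easier to read, here is a minimal Python sketch that splits the client description into its key=value fields and converts ```omem``` (the output buffer memory in bytes) to megabytes. The helper name ```parse_client_fields``` is hypothetical; only the log text comes from the output above.

```python
import re

# Client description copied from the first log line above.
LOG_LINE = (
    "Client id=8443510 addr=10.24.193.69:30377 fd=6 name= age=187 idle=187 "
    "flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16365 "
    "oll=3917 omem=100541448 events=rw cmd=psync"
)


def parse_client_fields(line: str) -> dict:
    """Split the space-separated key=value pairs of a client description.

    Hypothetical helper for illustration; values are kept as strings and
    an empty value (such as name=) becomes an empty string.
    """
    return dict(re.findall(r"(\w[\w-]*)=(\S*)", line))


fields = parse_client_fields(LOG_LINE)
omem_mb = int(fields["omem"]) / (1024 * 1024)

# flags=S marks the slave connection; omem is what the master had buffered.
print(f"flags={fields['flags']} age={fields['age']}s omem={omem_mb:.1f} MB")
# prints: flags=S age=187s omem=95.9 MB
```

Roughly 96 MB was buffered for a connection that was 187 seconds old, which is the figure to compare against the buffer limits in the configuration.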

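Two settings are involved in this kind of failure. The first is ```repl-timeout```, which defaults to 60s in codis/redis: if no data arrives on the replication link for longer than this, the link is considered timed out and is dropped, which the slave reports as ```Connection with master lost```. The second is ```client-output-buffer-limit```, which caps how much memory the master may use for a client's output buffer, and the slave connection counts as such a client. The log above shows this second limit being hit: the slave client had ```omem=100541448``` (about 96 MB) buffered at ```age=187``` seconds, so the master scheduled it to be closed "for overcoming of output buffer limits", and the slave then requested another full resync. Put simply, replication is bounded by how long it takes and by how much buffer memory it consumes on the master in the meantime, so we can resolve the problem by adjusting these two parameters.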

Fixing the Redis/Codis sync timeout

The relevant part of our codis configuration file is:
repl-timeout 60

client-output-buffer-limit slave 256mb 64mb 60
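The line above is the output buffer limit the master applies to slave clients, shown here with its default values. Once the buffer exceeds 256mb (the hard limit), the master closes the connection as soon as possible. If the buffer stays above the 64mb soft limit (while still below 256mb) for 60 seconds continuously, the master also closes the connection as soon as possible.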
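
The hard and soft limit rule can be illustrated with a small Python model. This is a simplified sketch of the behaviour described above, not the server's actual code; the names ```OutputBufferLimit``` and ```first_close_time``` are hypothetical, and the buffer samples are made-up inputs.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class OutputBufferLimit:
    hard_bytes: int    # close as soon as the buffer reaches this size
    soft_bytes: int    # close if the buffer stays at or above this size...
    soft_seconds: int  # ...for longer than this many seconds in a row


def first_close_time(limit, samples):
    """Return the second at which the client would be closed, or None.

    samples is a list of (second, buffered_bytes) pairs in time order.
    A limit of 0 is treated as "disabled" in this model.
    """
    soft_since = None
    for second, used in samples:
        if limit.hard_bytes and used >= limit.hard_bytes:
            return second
        if limit.soft_bytes and used >= limit.soft_bytes:
            if soft_since is None:
                soft_since = second          # soft limit first reached here
            elif second - soft_since > limit.soft_seconds:
                return second                # above soft limit for too long
        else:
            soft_since = None                # dropped below: timer resets
    return None


# The values from "client-output-buffer-limit slave 256mb 64mb 60".
default_limit = OutputBufferLimit(256 * MB, 64 * MB, 60)

# A buffer held at 96 MB, sampled every 10 seconds (illustrative data).
steady_96mb = [(second, 96 * MB) for second in range(0, 121, 10)]
print(first_close_time(default_limit, steady_96mb))
```

With a buffer held at about 96 MB, as in the log, the hard limit is never reached, but the soft limit timer expires shortly after 60 seconds, which matches the connection being closed during a slow full sync.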

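To summarize

There are two ways to resolve the timeout:

Increase the replication timeout, repl-timeout (60 in our configuration)

Raise the output buffer limit, client-output-buffer-limit

Once the data sync has completed, it is best to restore the original configuration, so that excessive resource usage on the server does not cause other problems.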
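
As a rough illustration of the raise-then-restore approach, the Python sketch below only builds the command text that could be sent through redis-cli. It assumes that the codis-server build accepts both parameters through ```CONFIG SET``` the way stock Redis does. The temporary values are placeholders chosen for the example, not recommendations, and the helper name ```config_set_commands``` is hypothetical.

```python
MB = 1024 * 1024


def config_set_commands(repl_timeout, hard_bytes, soft_bytes, soft_seconds):
    """Build CONFIG SET command strings for the two replication parameters.

    Hypothetical helper: it only formats text and does not contact a server.
    Sizes are written in bytes.
    """
    return [
        f"CONFIG SET repl-timeout {repl_timeout}",
        "CONFIG SET client-output-buffer-limit "
        f'"slave {hard_bytes} {soft_bytes} {soft_seconds}"',
    ]


# Original values from the configuration file shown earlier.
restore = config_set_commands(60, 256 * MB, 64 * MB, 60)

# Placeholder values for the duration of the full sync (assumption:
# size them to your own dataset and available memory).
temporary = config_set_commands(600, 1024 * MB, 512 * MB, 300)

print("# before re-adding the slave, on the master:")
print("\n".join(temporary))
print("# after the sync has completed:")
print("\n".join(restore))
```

Keeping the restore commands next to the temporary ones makes it harder to forget the final step of putting the original limits back.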