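Business log monitoring reported roughly 250 failed Redis connections per day.
Tracing with strace showed that at the moments of failure a disk write took longer than 10s, usually between 10 and 15s. The Redis client's second retry uses a 10s timeout.
Every operation on this instance is an INCR, and fdatasync blocks the writes.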
strace -ttt -f -p 11302 -T -e trace=fdatasync
11309 10:21:31.153900 fdatasync(116) = 0 <0.034295>
11309 10:21:32.078747 fdatasync(116) = 0 <7.592478>
11309 10:21:39.774959 fdatasync(116) = 0 <10.098802>
11309 10:21:49.990623 fdatasync(116) = 0 <2.026147>
11309 10:21:52.129676 fdatasync(116) = 0 <0.002802>
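When the trace runs for hours, picking out the slow calls by eye gets tedious. Below is a minimal sketch of a filter for it, assuming the line format shown above (thread id, wall-clock timestamp, call, return value, duration in angle brackets). The function name `slow_fdatasync` and the 5s threshold are illustrative choices, not part of the original investigation.

```python
import re

# Matches strace lines such as:
#   11309 10:21:39.774959 fdatasync(116) = 0 <10.098802>
LINE_RE = re.compile(
    r"^(?P<tid>\d+)\s+(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"fdatasync\((?P<fd>\d+)\)\s+=\s+(?P<ret>-?\d+).*<(?P<secs>\d+\.\d+)>\s*$"
)

# Sample lines copied from the trace above.
SAMPLE = [
    "11309 10:21:31.153900 fdatasync(116) = 0 <0.034295>",
    "11309 10:21:32.078747 fdatasync(116) = 0 <7.592478>",
    "11309 10:21:39.774959 fdatasync(116) = 0 <10.098802>",
    "11309 10:21:49.990623 fdatasync(116) = 0 <2.026147>",
    "11309 10:21:52.129676 fdatasync(116) = 0 <0.002802>",
]


def slow_fdatasync(lines, threshold=5.0):
    """Return (timestamp, seconds) for every fdatasync call slower than threshold."""
    slow = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Lines that are not fdatasync results are silently skipped.
        if m and float(m.group("secs")) > threshold:
            slow.append((m.group("ts"), float(m.group("secs"))))
    return slow


if __name__ == "__main__":
    for ts, secs in slow_fdatasync(SAMPLE):
        print(f"{ts} fdatasync took {secs:.3f}s")
```

On the five sample lines, only the 7.59s and 10.10s calls exceed the threshold.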
Short-term fix:
Raise the timeout to 15s.
Root-cause fix:
Use the Redis watchdog to capture stack traces of stalls longer than 5s.
Stack trace:
[11302 | signal handler] (1499754857) --- WATCHDOG TIMER EXPIRED ---
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(logStackTrace+0x3e)[0x445ace]
/lib64/libpthread.so.0(write+0x2d)[0x7f19ef3b06fd]
/lib64/libpthread.so.0(+0xf710)[0x7f19ef3b1710]
/lib64/libpthread.so.0(write+0x2d)[0x7f19ef3b06fd]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(flushAppendOnlyFile+0x4e)[0x44116e]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(serverCron+0x3b7)[0x41bb17]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(aeProcessEvents+0x1e9)[0x416b69]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(aeMain+0x2b)[0x416deb]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(main+0x31d)[0x41e49d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f19ef02cd5d]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699[0x415bd9]
[11302 | signal handler] (1499754857) --------
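At certain moments fdatasync takes longer than 10s.
The stack shows the main thread stuck in write() inside flushAppendOnlyFile, so the stall comes from blocked disk writes. Replacing the mechanical hard disk with an SSD solved the problem.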
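To compare disks before and after a swap like this, one option is to time fdatasync directly on the target filesystem. The sketch below is a rough probe and was not part of the original investigation. It assumes Linux, where `os.fdatasync` is available, and the helper name `time_fdatasync`, the 4 KiB payload and the round count are arbitrary illustrative choices.

```python
import os
import tempfile
import time


def time_fdatasync(directory, payload=b"x" * 4096, rounds=5):
    """Append payload to a temp file in directory and time each fdatasync call.

    Returns a list of durations in seconds, one per round.
    """
    durations = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.monotonic()
            os.fdatasync(fd)  # flush file data to the device, as Redis does for the AOF
            durations.append(time.monotonic() - start)
    finally:
        # Always clean up the probe file, even if a write or sync fails.
        os.close(fd)
        os.unlink(path)
    return durations


if __name__ == "__main__":
    # Point this at the directory that holds the AOF file to probe that disk.
    for i, secs in enumerate(time_fdatasync(tempfile.gettempdir())):
        print(f"round {i}: fdatasync took {secs:.6f}s")
```

An idle probe like this only gives a baseline. The multi-second stalls in the trace appeared under production write load, so the numbers are most useful when collected while the disk is busy.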