一次docker中的nginx进程响应慢问题定位记录

时间:2022-03-12 01:17:20

有个ft测试的环境,其中nginx使用docker发布的。测试用例是curl的时候,没有获得nginx的响应。

docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e39be26bc976 cdn-release-docker.artnj.zte.com.cn/img-for-ottci/python:latest "/bin/bash -c /home/s" 12 minutes ago Up 12 minutes 22/tcp, 9220/tcp RF_THUMBNAIL
a7f8ff2c2fac ft_thumbnail:V5.01.01.01 "/usr/sbin/init" 12 minutes ago Up 12 minutes FT_thumbnail
4881975f8e10 cdn-release-docker.artnj.zte.com.cn/img-for-ottci/python:latest "/bin/bash -c /home/s" 31 minutes ago Up 31 minutes 22/tcp, 9220/tcp RF_filepackager
bfdc3beaf7e4 ft_filepackager:V5.01.01.01 "/usr/sbin/init bash" 31 minutes ago Up 31 minutes

 进入容器,看一下具体信息:

docker exec -it bfdc3beaf7e4 bash

  敲一下top命令,可以发现多个nginx的work进程的cpu很高。

bash-4.2# top
TERM environment variable not set.
bash-4.2#
bash-4.2# echo TERM
TERM
bash-4.2# echo $TERM
dumb
bash-4.2# export M
bash-4.2# export TERM=dumb
bash-4.2# top
top - 09:28:13 up 81 days, 7:01, 0 users, load average: 17.46, 44.24, 34.20
Tasks: 49 total, 2 running, 47 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.2 us, 4.4 sy, 0.0 ni, 93.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 26321361+total, 48650644 free, 24579464 used, 18998350+buff/cache
KiB Swap: 0 total, 0 free, 0 used. 23644137+avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1435 root 20 0 9855.0m 44528 3236 R 93.3 0.0 24:27.53 nginx

bash-4.2# strace -p 1435
bash: strace: command not found

  

由于并没有将strace命令一起打包到这个镜像中,所以需要在容器外和容器内的进程对应起来。然后查看对应的进程信息:

[root@localhost ~]# docker top bfdc3beaf7e4 |grep -i nginx
root 26807 22109 0 17:33 ? 00:00:00 wget -T 10 -t 1 -O /home/zxcdn/agent/plugin/stat/../../crontab/nginxstatus.tmp http://127.0.0.1:6620/NGINXSTATUS
root 30753 16121 0 16:53 ? 00:00:00 nginx: master process /home/zxcdn/ottcache/nginx/sbin/nginx -p /home/zxcdn/ottcache/nginx
root 30754 30753 72 16:53 ? 00:29:28 nginx: worker process

  有很多nginx的work进程,而且进程名相同,只截取部分。nginx开启的work进程是64个,所以有一大片。那怎么对应呢?有个小技巧,就是在容器里运行ps -ef |grep -i nginx |grep -i worker,然后对应的顺序,在容器外和docker top |grep -i nginx |grep -i worker是一致的。

在容器外找到对应的cpu高的进程滞后,先看一下这个进程是用户态cpu高还是内核态cpu高,可以使用

pidstat -p 30754 1 来确定,然后本文刚好是内核态cpu高,则可以使用strace 来看为什么该进程cpu高了。本文是因为对应的lua代码的覆盖率的频率设置太低,而且lua脚本有个地方触发了循环,导致了频繁操作read和write。

open("/home/devops/luacov/luacov.stats.out", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 18
fstat(18, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5178ddc000
write(18, "398:/home/zxcdn/ottcac 0 0 0 0 0"..., 4096) = 4096
write(18, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 4096) = 4096
write(18, "0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 "..., 4096) = 4096
write(18, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 4096) = 4096
write(18, "1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "..., 4096) = 4096
write(18, "0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 "..., 4096) = 4096
write(18, "28 0 0 0 0 0 114 114 114 0 114 0"..., 4096) = 4096
write(18, "0 0 0 0 0 0 0 0 0 0 0 0 889 0 0 "..., 4096) = 4096
write(18, " 0 0 0 0 0 0 0 0 0 0 480 0 0 480"..., 4096) = 4096
write(18, " 0 0 0 0 354 354 0 0 0 0 0 0 0 0"..., 4096) = 4096
write(18, "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "..., 204) = 204
close(18) = 0
munmap(0x7f5178ddc000, 4096) = 0
close(20) = 0
close(22) = 0
open("/home/devops/luacov/luacov.stats.out", O_RDONLY) = 18
fstat(18, {st_mode=S_IFREG|0644, st_size=41164, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5178ddc000
read(18, "398:/home/zxcdn/ottcac 0 0 0 0 0"..., 4096) = 4096
read(18, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 4096) = 4096
read(18, "0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 "..., 4096) = 4096
read(18, " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"..., 4096) = 4096
read(18, "1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "..., 4096) = 4096
read(18, "0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 "..., 4096) = 4096
read(18, "28 0 0 0 0 0 114 114 114 0 114 0"..., 4096) = 4096
read(18, "0 0 0 0 0 0 0 0 0 0 0 0 889 0 0 "..., 4096) = 4096
read(18, " 0 0 0 0 0 0 0 0 0 0 480 0 0 480"..., 4096) = 4096
read(18, " 0 0 0 0 354 354 0 0 0 0 0 0 0 0"..., 4096) = 4096
read(18, "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "..., 4096) = 204