RabbitMQ导致的java服务占用cpu过高从而导致系统负载过高

时间:2022-11-18 10:03:23

生产环境服务器上搭建的 RabbitMQ 在并发量不是很大的情况下 CPU 占用竟然达到了 40% 左右,如下图:

1.通过 top 命令查看到 CPU 和 内存使用情况

RabbitMQ导致的java服务占用cpu过高从而导致系统负载过高

2.error_logger

然后可以通过 RabbitMQ 提供的插件 rabbitmq_top 来查看详细的各个线程内存和 CPU 的使用情况。首先我们通过命令 rabbitmq-plugins list 查看所有插件及开启情况,然后使用命令 rabbitmq-plugins enable rabbitmq_top 开启插件,关闭插件的命令是 rabbitmq-plugins disable rabbitmq_top 。这样开启后就能在 RabbitMQ 的管理后台查看了。如下图:

然后点击 Reductions / sec 排序,这个值越大表示的 CPU 的使用越高。会发现 error_logger 程序的 CPU 使用最高,从描述上看是和日志相关。

RabbitMQ导致的java服务占用cpu过高从而导致系统负载过高

自然地要去看日志输出有没有什么异常,通过查看日志文件发现并没有什么异常,而且日志输出的内容很少。然后改变了日志等级为 debug (rabbitmqctl set_log_level debug)看下详细的日志,发现有start_link 日志。

2020-05-21 10:43:01.177 [debug] <0.21161.3240> Supervisor {<0.21161.3240>,rabbit_channel_sup} started rabbit_writer:start_link(#Port<0.176195>, 1571, 131072, rabbit_framing_amqp_0_9_1, <0.29275.7908>, {<<"10.244.5.77:58214 -> 10.244.3.9:5672">>,1571}, true) at pid <0.20273.3240>
2020-05-21 10:43:01.177 [debug] <0.21161.3240> Supervisor {<0.21161.3240>,rabbit_channel_sup} started rabbit_limiter:start_link({<<"10.244.5.77:58214 -> 10.244.3.9:5672">>,1571}) at pid <0.21143.3240>
2020-05-21 10:43:01.177 [debug] <0.21161.3240> Supervisor {<0.21161.3240>,rabbit_channel_sup} started rabbit_channel:start_link(1571, <0.29275.7908>, <0.20273.3240>, <0.29275.7908>, <<"10.244.5.77:58214 -> 10.244.3.9:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"...">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.29654.7908>, <0.21143.3240>) at pid <0.20766.3240>
2020-05-21 10:43:01.225 [debug] <0.21123.3240> Supervisor {<0.21123.3240>,rabbit_channel_sup} started rabbit_writer:start_link(#Port<0.80537>, 41, 131072, rabbit_framing_amqp_0_9_1, <0.28490.3357>, {<<"10.244.2.171:55900 -> 10.244.3.9:5672">>,41}, true) at pid <0.21071.3240>
2020-05-21 10:43:01.225 [debug] <0.21123.3240> Supervisor {<0.21123.3240>,rabbit_channel_sup} started rabbit_limiter:start_link({<<"10.244.2.171:55900 -> 10.244.3.9:5672">>,41}) at pid <0.20379.3240>
2020-05-21 10:43:01.225 [debug] <0.21123.3240> Supervisor {<0.21123.3240>,rabbit_channel_sup} started rabbit_channel:start_link(41, <0.28490.3357>, <0.21071.3240>, <0.28490.3357>, <<"10.244.2.171:55900 -> 10.244.3.9:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"...">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.28524.3357>, <0.20379.3240>) at pid <0.21072.3240>
2020-05-21 10:43:01.229 [debug] <0.21280.3240> Supervisor {<0.21280.3240>,rabbit_channel_sup} started rabbit_writer:start_link(#Port<0.176195>, 1571, 131072, rabbit_framing_amqp_0_9_1, <0.29275.7908>, {<<"10.244.5.77:58214 -> 10.244.3.9:5672">>,1571}, true) at pid <0.21014.3240>
2020-05-21 10:43:01.229 [debug] <0.21280.3240> Supervisor {<0.21280.3240>,rabbit_channel_sup} started rabbit_limiter:start_link({<<"10.244.5.77:58214 -> 10.244.3.9:5672">>,1571}) at pid <0.20939.3240>
2020-05-21 10:43:01.230 [debug] <0.21280.3240> Supervisor {<0.21280.3240>,rabbit_channel_sup} started rabbit_channel:start_link(1571, <0.29275.7908>, <0.21014.3240>, <0.29275.7908>, <<"10.244.5.77:58214 -> 10.244.3.9:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"...">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.29654.7908>, <0.20939.3240>) at pid <0.21039.3240>
2020-05-21 10:43:01.253 [debug] <0.20564.3240> Supervisor {<0.20564.3240>,rabbit_channel_sup} started rabbit_writer:start_link(#Port<0.80537>, 41, 131072, rabbit_framing_amqp_0_9_1, <0.28490.3357>, {<<"10.244.2.171:55900 -> 10.244.3.9:5672">>,41}, true) at pid <0.20612.3240>
2020-05-21 10:43:01.253 [debug] <0.20564.3240> Supervisor {<0.20564.3240>,rabbit_channel_sup} started rabbit_limiter:start_link({<<"10.244.2.171:55900 -> 10.244.3.9:5672">>,41}) at pid <0.20962.3240>
2020-05-21 10:43:01.254 [debug] <0.20564.3240> Supervisor {<0.20564.3240>,rabbit_channel_sup} started rabbit_channel:start_link(41, <0.28490.3357>, <0.20612.3240>, <0.28490.3357>, <<"10.244.2.171:55900 -> 10.244.3.9:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"....">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.28524.3357>, <0.20962.3240>) at pid <0.20158.3240>
2020-05-21 10:43:01.260 [debug] <0.21185.3240> Supervisor {<0.21185.3240>,rabbit_channel_sup} started rabbit_writer:start_link(#Port<0.176195>, 1571, 131072, rabbit_framing_amqp_0_9_1, <0.29275.7908>, {<<"10.244.5.77:58214 -> 10.244.3.9:5672">>,1571}, true) at pid <0.20278.3240>
2020-05-21 10:43:01.260 [debug] <0.21185.3240> Supervisor {<0.21185.3240>,rabbit_channel_sup} started rabbit_limiter:start_link({<<"10.244.5.77:58214 -> 10.244.3.9:5672">>,1571}) at pid <0.20461.3240>
2020-05-21 10:43:01.261 [debug] <0.21185.3240> Supervisor {<0.21185.3240>,rabbit_channel_sup} started rabbit_channel:start_link(1571, <0.29275.7908>, <0.20278.3240>, <0.29275.7908>, <<"10.244.5.77:58214 -> 10.244.3.9:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"....">>, [{<<"exchange_exchange_bindings">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentica...">>,...},...], <0.29654.7908>, <0.20461.3240>) at pid <0.20990.3240>
2020-05-21 10:43:01.268 [debug] <0.21033.3240> Supervisor {<0.21033.3240>,rabbit_channel_sup} started rabbit_writer:start_link(#Port<0.80537>, 41, 131072, rabbit_framing_amqp_0_9_1, <0.28490.3357>, {<<"10.244.2.171:55900 -> 10.244.3.9:5672">>,41}, true) at pid <0.20840.3240>
2020-05-21 10:43:01.269 [debug] <0.21033.3240> Supervisor {<0.21033.3240>,rabbit_channel_sup} started rabbit_limiter:start_link({<<"10.244.2.171:55900 -> 10.244.3.9:5672">>,41}) at pid <0.21042.3240>

后来经过一系列的查找,在生产者发生完消息后,执行下面语句能够减少日志输出:

RabbitUtils.isPhysicalCloseRequired();

修改完客户端的代码后,发现rabbitMQ 的 CPU 降了几个百分点,但是还是很高。那么就表明 CPU 的占用高不是因为 error_logger 引起的。

3.erlang 调度器问题

通过 top -H -p PID 命令查看 beam.smp 进程里的每个线程的 CPU 使用情况:

RabbitMQ导致的java服务占用cpu过高从而导致系统负载过高

发现主要是 erlang 语言的两个调度器(调度器有关文档:​​https://hamidreza-s.github.io/erlang/scheduling/real-time/preemptive/migration/2016/02/09/erlang-scheduler-details.html)占用​​​ CPU。通过命令 rabbitmq-diagnostics runtime_thread_stats 查看线程的运行状况,发现两个调度器的 sleep 时间在 99 % ,也就是说调度器大部分时间是空闲的。在官方文档中有一节讲到了 CPU 相关的内容:​​https://www.rabbitmq.com/runtime.html#cpu。里面提到了调度器在从​​ sleep 状态转换成工作状态是需要一定的成本的,因此在调度器没有任务调度时默认会延迟一段时间才进入 sleep 状态,调度器的忙等待是会占用 CPU的。因此我们可以通过禁用忙等待来降低 CPU 的消耗:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"

这条配置需要添加到文件 /etc/rabbitmq/rabbitmq-env.conf 中,如果没有则创建该文件。

文档上还给出了,在共享或CPU受限的环境(包括容器化的环境)中,可能需要显式配置调度程序个数。我们也可以通过改变调度器个数来降低 CPU 的消耗:

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 1:1"

上面的两种方式的修改经过压测都可以有效的降低 CPU 的占用,效果上相差不大,在多核 CPU 的情况下多个调度器可能 CPU 的利用率更高。在实际的应用中要根据实际情况进行选择。

4.总结

上面主要介绍了 RabbitMQ 占用 CPU 高的排查过程和解决方案,一种方案是禁止调度器忙等,另一种是通过减少调度器的个数。这两种方案的选择还要根据实际情况进行选择,比如消息的生产/消费并发量,或着服务器的配置等。

官方链接:​https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage​