Why does my Python application stall with 'system' / kernel CPU time?

Time: 2022-02-14 00:01:34

First off, I wasn't sure whether I should post this as an Ubuntu question or here, but I'm guessing it's more of a Python question than an OS one.

My Python application is running on top of Ubuntu on a 64-core AMD server. It pulls images from 5 GigE cameras over the network by calling out to a .so through ctypes and then processes them. I am seeing frequent pauses in my application, which cause frames from the cameras to be dropped by the external camera library.

To debug this I've used the popular psutil Python package, with which I log CPU stats every 0.2 seconds in a separate thread. I sleep for 0.2 seconds in that thread, and when that sleep takes substantially longer I also see camera frames being dropped. I have seen pauses up to 17 seconds long! Most of my processing is either in OpenCV or Numpy (both of which release the GIL) or, in one part of the app, in a multiprocessing.Pool with 59 processes (this is to get around the Python GIL).
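
A minimal sketch of this kind of watchdog thread, not the original logging code. It assumes the standard library's os.times() as a stand-in for psutil's process CPU counters, and the names watch_for_stalls, tolerance and log are made up for illustration. The idea is that a thread which asks to sleep 0.2 seconds and wakes up much later has itself been stalled, so the oversleep is a direct measurement of the pause.

```python
from __future__ import print_function

import os
import threading
import time

# time.monotonic() only exists on Python 3.3+; fall back to time.time() on 2.7.
_clock = getattr(time, "monotonic", time.time)


def watch_for_stalls(stop_event, interval=0.2, tolerance=0.1, log=print):
    """Sleep `interval` seconds in a loop and report every oversleep.

    Returns the list of oversleep durations (seconds) seen before
    `stop_event` was set.
    """
    stalls = []
    last_wall = _clock()
    last_cpu = os.times()  # index 0 = user seconds, index 1 = system seconds
    while not stop_event.is_set():
        time.sleep(interval)
        now_wall = _clock()
        now_cpu = os.times()
        elapsed = max(now_wall - last_wall, 1e-9)
        # CPU seconds per wall-clock second: 1.0 means one core fully busy.
        user = (now_cpu[0] - last_cpu[0]) / elapsed
        system = (now_cpu[1] - last_cpu[1]) / elapsed
        overshoot = elapsed - interval
        if overshoot > tolerance:
            stalls.append(overshoot)
            log("stall: woke %.2fs late (user=%.2f system=%.2f)"
                % (overshoot, user, system))
        last_wall, last_cpu = now_wall, now_cpu
    return stalls


if __name__ == "__main__":
    stop = threading.Event()
    watcher = threading.Thread(target=watch_for_stalls, args=(stop,))
    watcher.daemon = True
    watcher.start()
    time.sleep(1.0)  # the real application would do its work here
    stop.set()
    watcher.join()
```

The user and system figures are normalized by the actual elapsed time rather than by the nominal interval, so a late wake-up does not inflate them.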

My debug logging shows very high 'system' (i.e. kernel) CPU time on many of my process's threads when the pauses happen.

For example, I see CPU times as follows (usually one line every 0.2 seconds) and then suddenly a big jump. The 'Process' numbers are CPU utilization in cores, i.e. 1 CPU fully used would be 1, and Linux top showing 123% would be 1.2:

Process user | Process system | OS system % | OS idle %
19.9         | 10.5           | 6           | 74 
5.6          | 2.3            | 4           | 87
6.8          | 1.7            | 11          | 75
4.6          | 5.5            | 43          | 52
0.5          | 26.4           | 4           | 90

I don't know why the high OS system usage is reported one line before matching high process system usage. The two match up since 26.4 of 64 cores = 41%. At that point my application experienced an approximately 3.5 second pause (as determined by my CPU info logging thread using OpenCV's cv2.getTickCount() and also the jump in time stamps in the Python logging output) causing multiple camera frames to be dropped.
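
The conversion behind that comparison can be written out as a tiny helper. The function name is made up for illustration; the numbers are the ones from the table above.

```python
def process_share_of_machine(cores_busy, total_cores):
    """Convert a per-process utilization in cores to a machine-wide percentage."""
    return 100.0 * cores_busy / total_cores


# 26.4 cores' worth of 'system' time on a 64-core machine.
share = process_share_of_machine(26.4, 64)
print("process system time is %.2f%% of the machine" % share)
```

That works out to 41.25%, which is close to the 43% 'OS system' figure logged one line earlier.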

When this happens I have also logged the CPU info for each thread of my process. For the example above, 25 threads were running at a 'system' CPU utilization of 0.9 and a few more at 0.6, which matches the process total of 26.4 above. At that point there were about 183 threads running.
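
A rough sketch of how per-thread figures like these can be derived, assuming two snapshots of cumulative per-thread CPU times taken a known interval apart. psutil.Process().threads() returns entries of the form (id, user_time, system_time), which is the shape assumed here; the function names and the 0.5 threshold are illustrative, not from the original logging code.

```python
def thread_system_utilization(before, after, elapsed):
    """Return {thread_id: 'system' CPU utilization} between two snapshots.

    `before` and `after` are iterables of (thread_id, user_time, system_time)
    tuples holding cumulative CPU seconds; `elapsed` is the wall-clock time
    between the snapshots in seconds.
    """
    earlier = {tid: system for tid, _user, system in before}
    usage = {}
    for tid, _user, system in after:
        # A thread created between the snapshots starts from zero.
        usage[tid] = (system - earlier.get(tid, 0.0)) / elapsed
    return usage


def busiest(usage, threshold=0.5):
    """Thread ids at or above `threshold` utilization, busiest first."""
    hot = [(value, tid) for tid, value in usage.items() if value >= threshold]
    return [tid for _value, tid in sorted(hot, reverse=True)]
```

Summing the values of the returned dictionary gives the process-wide 'system' utilization, which is the cross-check described above.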

This pause usually seems to happen shortly after the multiprocessing pool is used (it's used for short bursts), but it by no means happens every time the pool is used. Also, if I halve the amount of processing that needs to happen outside the pool, then no camera frames are skipped.

Question: how can I determine why OS 'system' / kernel time suddenly goes through the roof? Why would that happen in a Python app?

And more importantly: any ideas why this is happening and how to avoid it?

Notes:

  • This runs as root (it has to for the camera library, unfortunately) from upstart
  • When the cameras are turned off the app restarts (using respawn in upstart), and this happens multiple times a day, so it's not due to being long running. I have also seen this happen very soon after the process starts
  • It is the same code being run over and over, so it's not due to running a different branch of my code
  • Currently has a nice of -2; I have tried removing the nice with no effect
  • Ubuntu 12.04.5 LTS
  • Python 2.7
  • Machine has 128GB of memory, which I am nowhere near using

1 answer

#1


8  

OK. I have the answer to my own question. Yes, it's taken me over 3 months to get this far.

The massive 'system' CPU spikes and associated pauses appear to be caused by GIL thrashing in Python. Here is a good explanation of where the thrashing comes from. That presentation also pointed me in the right direction.

Python 3.2 introduced a new GIL implementation to avoid this thrashing. The result can be shown with a simple threaded example (taken from the presentation above):

from threading import Thread
import psutil

def countdown():
    n = 100000000
    while n > 0:
        n -= 1

t1 = Thread(target=countdown)
t2 = Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()

print(psutil.Process().cpu_times())    

On my MacBook Pro with Python 2.7.9 this uses 14.7s of 'user' CPU and 13.2s of 'system' CPU.

Python 3.4 uses 15.0s of 'user' (slightly more) but only 0.2s of 'system'.

So the GIL is still in place, and the code still only runs as fast as it would single threaded, but Python 3 avoids all the GIL contention of Python 2 that manifests as kernel ('system') CPU time. This contention, I believe, is what was causing the issues in the original question.
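
To see the split on a given interpreter, the same countdown can be run once sequentially and once in two threads while recording the 'user' and 'system' CPU seconds each run costs. This is a sketch under assumptions, not the code from the presentation: it uses the standard library's os.times() instead of psutil, a smaller n so that it finishes quickly, and helper names (cpu_cost, run_sequential, run_threaded) made up for illustration. It is written to run unchanged on both Python 2.7 and Python 3.

```python
import os
import threading


def countdown(n):
    # Pure-Python, CPU-bound loop that holds the GIL while it runs.
    while n > 0:
        n -= 1


def cpu_cost(work):
    """Run work() and return the (user, system) CPU seconds it consumed."""
    start = os.times()  # index 0 = user seconds, index 1 = system seconds
    work()
    end = os.times()
    return end[0] - start[0], end[1] - start[1]


def run_sequential(n):
    countdown(n)
    countdown(n)


def run_threaded(n):
    threads = [threading.Thread(target=countdown, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    n = 5000000  # assumption: much smaller than the 100000000 used above
    for name, fn in (("sequential", run_sequential), ("threaded", run_threaded)):
        user, system = cpu_cost(lambda: fn(n))
        print("%-10s user=%.2fs system=%.2fs" % (name, user, system))
```

If GIL contention is the culprit, the threaded run should show a much larger 'system' figure than the sequential run on Python 2, and a far smaller gap on Python 3.2 or later. The absolute numbers depend on the machine and on n.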

Update

An additional cause of the CPU problem was found to be OpenCV/TBB. It is fully documented in this SO question.

