In my program, whose RSS is 65 GB, calling fork takes more than 2 seconds in sys_clone->dup_mm->copy_page_range. While the fork runs, one CPU sits at 100% sys time, and at the same time one of my threads cannot get any CPU time until the fork finishes. The machine has 16 CPUs; the other CPUs are idle.
So my question is: one CPU was busy with the fork, so why doesn't the scheduler migrate the process waiting on that CPU to another, idle CPU? In general, when and how does the scheduler migrate processes between CPUs?
I searched this site, and the existing threads do not answer my question:
- How Linux scheduler schedules processes on multi-core processors?
- Can a multi-core processor run multiple processes at the same time?
1 Solution
#1
rss is 65G, when call fork, sys_clone->dup_mm->copy_page_range will consume more than 2 seconds
While doing fork (or clone), the vmas of the existing process have to be copied into the vmas of the new process. The dup_mm function (kernel/fork.c) creates the new mm and does the actual copy. There are no direct calls to copy_page_range, but I think the static function dup_mmap may be inlined into dup_mm, and it has calls to copy_page_range.
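If you want to see this outside of your real program, here is a minimal sketch (a hypothetical test, not your code): fault in a large anonymous mapping and then time a single fork(). The fork time should grow roughly with the amount of resident memory, because dup_mmap walks every vma and copy_page_range copies the page tables on one CPU:

/* Minimal sketch (hypothetical test): time one fork() after faulting
 * in a large anonymous mapping.  Pass the size in GiB as argv[1]. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t gib = argc > 1 ? strtoul(argv[1], NULL, 10) : 4;   /* GiB to map */
    size_t len = gib << 30;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 1, len);                       /* fault every page in: RSS ~= len */

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    pid_t pid = fork();                      /* page tables are copied here */
    gettimeofday(&t1, NULL);

    if (pid == 0) _exit(0);                  /* child exits immediately */
    waitpid(pid, NULL, 0);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
    printf("fork() with ~%zu GiB resident took %.1f ms\n", gib, ms);
    return 0;
}

On my understanding, with tens of GiB resident the time reaches the seconds range you report, no matter how many other CPUs are idle, because this copy runs on a single CPU.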
In dup_mmap several locks are taken, in both the new mm and the old oldmm:
356 down_write(&oldmm->mmap_sem);
After taking the mmap_sem reader/writer semaphore, there is a loop over all the mmaps to copy their meta-information:
381 for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next)
Only after the loop (which is long in your case) is mmap_sem unlocked:
465 out:
468 up_write(&oldmm->mmap_sem);
While the rwlock mmap_sem is held by a writer, no other reader or writer can do anything with the mmaps in oldmm.
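You can see this contention from user space with a small sketch (again a hypothetical test, compile with -pthread): one thread fork()s a process with a big address space while another thread tries to mmap() a single page; the mmap() call stalls until the fork releases mmap_sem:

/* Sketch of the contention: thread B's one-page mmap() waits for the
 * fork() of a big address space done by the main thread. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec / 1e3;
}

static void *mapper(void *arg)
{
    usleep(100 * 1000);                       /* land inside the fork */
    double t0 = now_ms();
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("mmap of one page took %.1f ms\n", now_ms() - t0);
    if (p != MAP_FAILED) munmap(p, 4096);
    return NULL;
}

int main(void)
{
    size_t len = (size_t)8 << 30;             /* pick a size your machine can hold */
    char *big = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (big == MAP_FAILED) { perror("mmap"); return 1; }
    memset(big, 1, len);                      /* make the fork slow */

    pthread_t tid;
    pthread_create(&tid, NULL, mapper, NULL);

    pid_t pid = fork();                       /* holds mmap_sem for writing */
    if (pid == 0) _exit(0);
    waitpid(pid, NULL, 0);

    pthread_join(tid, NULL);
    return 0;
}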
one thread cannot get any CPU time until the fork finishes. So my question is: one CPU was busy with the fork, why doesn't the scheduler migrate the process waiting on this CPU to another idle CPU?
Are you sure that the other thread is ready to run and is not trying to do anything with the mmaps, such as:
- mmaping something new or unmapping something it no longer needs,
- growing or shrinking its heap (brk),
- growing its stack,
- pagefaulting (see the sketch after this list),
- or many other activities...?
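Note that "pagefaulting" above means the thread does not have to make any mm-related system call at all. A small variation of the previous sketch (still hypothetical): the second thread only writes to a page that has not been faulted in yet, and the page fault handler still has to take mmap_sem for reading, so it waits for the fork:

/* Variation: no syscall in the second thread, only a first touch of a
 * lazily mapped page, which faults and waits on mmap_sem. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static char *lazy;                              /* mapped but never touched */

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec / 1e3;
}

static void *faulter(void *arg)
{
    usleep(100 * 1000);                         /* land inside the fork */
    double t0 = now_ms();
    lazy[0] = 1;                                /* first touch => page fault */
    printf("one page fault took %.1f ms\n", now_ms() - t0);
    return NULL;
}

int main(void)
{
    size_t len = (size_t)8 << 30;               /* pick a size your machine can hold */
    char *big = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (big == MAP_FAILED) { perror("mmap"); return 1; }
    memset(big, 1, len);                        /* make the fork slow */

    lazy = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (lazy == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t tid;
    pthread_create(&tid, NULL, faulter, NULL);

    pid_t pid = fork();                         /* writer on mmap_sem */
    if (pid == 0) _exit(0);
    waitpid(pid, NULL, 0);
    pthread_join(tid, NULL);
    return 0;
}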
Actually, the wait-cpu thread is my IO thread, which sends/receives packets to/from clients. From my observation, the packets are always there, but the IO thread cannot receive them.
You should check the stack of your wait-cpu thread (there is even a SysRq for this), and what kind of I/O it does. mmaping a file is the variant of I/O that will be blocked on mmap_sem by the fork.
Also you can check the "last used CPU" of the wait-cpu thread, e.g. in the top monitoring utility, by enabling the thread view (H key) and adding the "Last used CPU" column to the output (f then j in older versions; f, scroll to P, then Enter in newer ones). I think it is possible that your wait-cpu thread was already on another CPU, just not allowed (not ready) to run.
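If top is not convenient, the same information can be read directly from /proc: the "processor" field of /proc/<pid>/task/<tid>/stat (field 39, see proc(5)) is the CPU the thread last ran on. A minimal reader sketch (pid and tid are passed on the command line):

/* Print the CPU a thread last ran on, from the "processor" field of
 * /proc/<pid>/task/<tid>/stat.  Field 2 (comm) may contain spaces, so
 * parsing starts from the last ')'. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <tid>\n", argv[0]);
        return 1;
    }

    char path[128], buf[4096];
    snprintf(path, sizeof(path), "/proc/%s/task/%s/stat", argv[1], argv[2]);

    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof(buf), f)) { perror(path); return 1; }
    fclose(f);

    char *p = strrchr(buf, ')');        /* skip pid and (comm) */
    if (!p) return 1;
    p += 2;                             /* now at field 3 ("state") */

    /* "processor" is field 39 overall, i.e. the 37th field after comm */
    char *tok = strtok(p, " ");
    for (int field = 3; tok && field < 39; field++)
        tok = strtok(NULL, " ");

    if (tok)
        printf("thread %s last ran on CPU %s\n", argv[2], tok);
    return 0;
}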
If you are using fork only to do an exec, it can be useful to:
- either switch to vfork+exec (or just to posix_spawn; a minimal sketch follows this list). vfork will suspend your process (but may not suspend your other threads, which is dangerous) until the new process does exec or exit, but execing may be faster than waiting for 65 GB of mmaps to be copied;
- or not do the fork from the multithreaded process with several active threads and multi-GB of virtual memory. You can create a small (without multi-GB of mmaps) helper process, communicate with it using IPC, sockets, or pipes, and ask it to fork and do everything you want.
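A minimal sketch of the posix_spawn route mentioned in the first item (the spawned command is only a placeholder): posix_spawn is typically implemented on top of vfork()/CLONE_VM, so the parent's huge address space does not have to be duplicated just to exec a small program.

/* Spawn an external command without duplicating the parent's mappings. */
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *args[] = { "/bin/echo", "hello from the spawned child", NULL };

    int err = posix_spawn(&pid, "/bin/echo", NULL, NULL, args, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn: %s\n", strerror(err));
        return 1;
    }

    int status;
    waitpid(pid, &status, 0);
    printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}

The helper-process variant is the same idea turned around: do the fork early, while the address space is still small, and keep that small process around to fork and exec things on request.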