Who "killed" my process, and why?

Date: 2022-08-26 20:25:06

My application runs as a background process on Linux. It is currently started at the command line in a Terminal window.

Recently a user was executing the application for a while and it died mysteriously. The text:

Killed

was on the terminal. This happened two times. I asked if someone at a different Terminal had used the kill command to kill the process. No.

Under what conditions would Linux decide to kill my process? I believe the shell displayed "Killed" because the process died after receiving the kill(9) signal. If Linux sent the kill signal, should there be a message in a system log somewhere that explains why it was killed?

12 Answers

#1


296  

If the user or sysadmin did not kill the program, the kernel may have. The kernel would only kill a process under exceptional circumstances such as extreme resource starvation (think mem+swap exhaustion).

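One quick way to confirm that the process really died from SIGKILL (a sketch, assuming it is started from bash; the application name is a placeholder) is the shell's exit status, which is 128 plus the signal number:

./myapp &        # hypothetical application, started as a background process
wait $!          # wait for it and collect its exit status
echo $?          # 137 = 128 + 9, i.e. the process was killed by SIGKILL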

#2


146  

Try:

dmesg -T | grep -E -i -B100 'killed process'

Where -B100 signifies the number of lines to show before each 'killed process' match.

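On systemd-based systems the same kernel messages can also be searched through the journal (a sketch; adjust the pattern to taste):

journalctl -k | grep -i -E -B100 'killed process'    # -k restricts the journal to kernel messages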

#3


144  

This looks like a good article on the subject: Taming the OOM killer.

The gist is that Linux overcommits memory. When a process asks for more space, Linux will give it that space, even if it is claimed by another process, under the assumption that nobody actually uses all of the memory they ask for. The process will get exclusive use of the memory it has allocated when it actually uses it, not when it asks for it. This makes allocation quick, and might allow you to "cheat" and allocate more memory than you really have. However, once processes start using this memory, Linux might realize that it has been too generous in allocating memory it doesn't have, and will have to kill off a process to free some up. The process to be killed is based on a score taking into account runtime (long-running processes are safer), memory usage (greedy processes are less safe), and a few other factors, including a value you can adjust to make a process less likely to be killed. It's all described in the article in a lot more detail.

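To check which overcommit policy a machine is using, you can inspect the vm.overcommit_memory sysctl (a quick sketch; 0 is the default heuristic overcommit, 1 always overcommits, 2 disables overcommit subject to vm.overcommit_ratio):

cat /proc/sys/vm/overcommit_memory
sysctl vm.overcommit_memory vm.overcommit_ratio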

Edit: And here is another article that explains pretty well how a process is chosen (annotated with some kernel code examples). The great thing about this is that it includes some commentary on the reasoning behind the various badness() rules.

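To see how the kernel currently scores a given process, its badness values can be read from procfs (a sketch; myapp is a placeholder process name):

pid=$(pgrep -o myapp)          # -o picks the oldest matching process; myapp is hypothetical
cat /proc/$pid/oom_score       # current badness score (higher means more likely to be killed)
cat /proc/$pid/oom_score_adj   # user-tunable adjustment, from -1000 to 1000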

#4


23  

Let me first explain when and why the OOMKiller gets invoked.

Say you have 512MB of RAM + 1GB of swap memory. So in theory, your CPU has access to a total of 1.5GB of virtual memory.

Now, for some time everything runs fine within the 1.5GB of total memory. But all of a sudden (or gradually) your system starts consuming more and more memory, and it reaches a point where around 95% of total memory is used.

Now say some process requests a large chunk of memory from the kernel. The kernel checks the available memory and finds that there is no way it can allocate your process more memory. So it will try to free some memory by calling/invoking the OOMKiller (http://linux-mm.org/OOM).

The OOMKiller has its own algorithm to score and rank every process. Typically, the process using the most memory becomes the victim to be killed.

Where can I find logs of OOMKiller?

Typically in the /var/log directory: either /var/log/kern.log or /var/log/dmesg.

Hope this will help you.

Some typical solutions:

  1. Increase memory (not swap)
  2. Find the memory leaks in your program and fix them
  3. Restrict the memory any process can consume (for example, JVM memory can be restricted using JAVA_OPTS; see the sketch after this list)
  4. See the logs and google :)
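
As an illustration of option 3, a JVM's heap can be capped through JAVA_OPTS, and an arbitrary process can be capped with the shell's virtual-memory limit (a minimal sketch; the values and the application name are placeholders):

export JAVA_OPTS="-Xms256m -Xmx512m"   # cap the JVM heap at 512MB
ulimit -v 1048576                      # cap children of this shell at ~1GB of virtual memory (KB units)
./myapp &                              # hypothetical application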

#5


11  

As dwc and Adam Jaskiewicz have stated, the culprit is likely the OOM Killer. However, the next question that follows is: How do I prevent this?

There are several ways:

  1. Give your system more RAM if you can (easy if it's a VM)
  2. Make sure the OOM killer chooses a different process (see the sketch below).
  3. Disable the OOM Killer
  4. Choose a Linux distro which ships with the OOM Killer disabled.

I found (2) to be especially easy to implement, thanks to this article.

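One common way to implement (2) is to lower the critical process's oom_score_adj so the OOM killer prefers other victims (a sketch, run as root; myapp is a placeholder, and -1000 exempts a process entirely):

echo -900 > /proc/$(pgrep -o myapp)/oom_score_adj    # make myapp a much less attractive victim
echo -1000 > /proc/$(pgrep -o myapp)/oom_score_adj   # or exempt it from the OOM killer entirely (use with care)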

#6


9  

This is the Linux out-of-memory (OOM) manager. Your process was selected due to 'badness' - a combination of recentness, resident size (memory in use, rather than just allocated), and other factors.

sudo journalctl -xb

You'll see a message like:

Jul 20 11:05:00 someapp kernel: Mem-Info:
Jul 20 11:05:00 someapp kernel: Node 0 DMA per-cpu:
Jul 20 11:05:00 someapp kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jul 20 11:05:00 someapp kernel: Node 0 DMA32 per-cpu:
Jul 20 11:05:00 someapp kernel: CPU    0: hi:  186, btch:  31 usd:  30
Jul 20 11:05:00 someapp kernel: active_anon:206043 inactive_anon:6347 isolated_anon:0
                                    active_file:722 inactive_file:4126 isolated_file:0
                                    unevictable:0 dirty:5 writeback:0 unstable:0
                                    free:12202 slab_reclaimable:3849 slab_unreclaimable:14574
                                    mapped:792 shmem:12802 pagetables:1651 bounce:0
                                    free_cma:0
Jul 20 11:05:00 someapp kernel: Node 0 DMA free:4576kB min:708kB low:884kB high:1060kB active_anon:10012kB inactive_anon:488kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present
Jul 20 11:05:00 someapp kernel: lowmem_reserve[]: 0 968 968 968
Jul 20 11:05:00 someapp kernel: Node 0 DMA32 free:44232kB min:44344kB low:55428kB high:66516kB active_anon:814160kB inactive_anon:24900kB active_file:2884kB inactive_file:16500kB unevictable:0kB isolated(anon):0kB isolated
Jul 20 11:05:00 someapp kernel: lowmem_reserve[]: 0 0 0 0
Jul 20 11:05:00 someapp kernel: Node 0 DMA: 17*4kB (UEM) 22*8kB (UEM) 15*16kB (UEM) 12*32kB (UEM) 8*64kB (E) 9*128kB (UEM) 2*256kB (UE) 3*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 4580kB
Jul 20 11:05:00 someapp kernel: Node 0 DMA32: 216*4kB (UE) 601*8kB (UE) 448*16kB (UE) 311*32kB (UEM) 135*64kB (UEM) 74*128kB (UEM) 5*256kB (EM) 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB = 44232kB
Jul 20 11:05:00 someapp kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jul 20 11:05:00 someapp kernel: 17656 total pagecache pages
Jul 20 11:05:00 someapp kernel: 0 pages in swap cache
Jul 20 11:05:00 someapp kernel: Swap cache stats: add 0, delete 0, find 0/0
Jul 20 11:05:00 someapp kernel: Free swap  = 0kB
Jul 20 11:05:00 someapp kernel: Total swap = 0kB
Jul 20 11:05:00 someapp kernel: 262141 pages RAM
Jul 20 11:05:00 someapp kernel: 7645 pages reserved
Jul 20 11:05:00 someapp kernel: 264073 pages shared
Jul 20 11:05:00 someapp kernel: 240240 pages non-shared
Jul 20 11:05:00 someapp kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jul 20 11:05:00 someapp kernel: [  241]     0   241    13581     1610      26        0             0 systemd-journal
Jul 20 11:05:00 someapp kernel: [  246]     0   246    10494      133      22        0         -1000 systemd-udevd
Jul 20 11:05:00 someapp kernel: [  264]     0   264    29174      121      26        0         -1000 auditd
Jul 20 11:05:00 someapp kernel: [  342]     0   342    94449      466      67        0             0 NetworkManager
Jul 20 11:05:00 someapp kernel: [  346]     0   346   137495     3125      88        0             0 tuned
Jul 20 11:05:00 someapp kernel: [  348]     0   348    79595      726      60        0             0 rsyslogd
Jul 20 11:05:00 someapp kernel: [  353]    70   353     6986       72      19        0             0 avahi-daemon
Jul 20 11:05:00 someapp kernel: [  362]    70   362     6986       58      18        0             0 avahi-daemon
Jul 20 11:05:00 someapp kernel: [  378]     0   378     1621       25       8        0             0 iprinit
Jul 20 11:05:00 someapp kernel: [  380]     0   380     1621       26       9        0             0 iprupdate
Jul 20 11:05:00 someapp kernel: [  384]    81   384     6676      142      18        0          -900 dbus-daemon
Jul 20 11:05:00 someapp kernel: [  385]     0   385     8671       83      21        0             0 systemd-logind
Jul 20 11:05:00 someapp kernel: [  386]     0   386    31573      153      15        0             0 crond
Jul 20 11:05:00 someapp kernel: [  391]   999   391   128531     2440      48        0             0 polkitd
Jul 20 11:05:00 someapp kernel: [  400]     0   400     9781       23       8        0             0 iprdump
Jul 20 11:05:00 someapp kernel: [  419]     0   419    27501       32      10        0             0 agetty
Jul 20 11:05:00 someapp kernel: [  855]     0   855    22883      258      43        0             0 master
Jul 20 11:05:00 someapp kernel: [  862]    89   862    22926      254      44        0             0 qmgr
Jul 20 11:05:00 someapp kernel: [23631]     0 23631    20698      211      43        0         -1000 sshd
Jul 20 11:05:00 someapp kernel: [12884]     0 12884    81885     3754      80        0             0 firewalld
Jul 20 11:05:00 someapp kernel: [18130]     0 18130    33359      291      65        0             0 sshd
Jul 20 11:05:00 someapp kernel: [18132]  1000 18132    33791      748      64        0             0 sshd
Jul 20 11:05:00 someapp kernel: [18133]  1000 18133    28867      122      13        0             0 bash
Jul 20 11:05:00 someapp kernel: [18428]    99 18428   208627    42909     151        0             0 node
Jul 20 11:05:00 someapp kernel: [18486]    89 18486    22909      250      46        0             0 pickup
Jul 20 11:05:00 someapp kernel: [18515]  1000 18515   352905   141851     470        0             0 npm
Jul 20 11:05:00 someapp kernel: [18520]     0 18520    33359      291      66        0             0 sshd
Jul 20 11:05:00 someapp kernel: [18522]  1000 18522    33359      294      64        0             0 sshd
Jul 20 11:05:00 someapp kernel: [18523]  1000 18523    28866      115      12        0             0 bash
Jul 20 11:05:00 someapp kernel: Out of memory: Kill process 18515 (npm) score 559 or sacrifice child
Jul 20 11:05:00 someapp kernel: Killed process 18515 (npm) total-vm:1411620kB, anon-rss:567404kB, file-rss:0kB

#7


7  

A tool like systemtap (or a tracer) can monitor kernel signal-transmission logic and report on it, e.g., https://sourceware.org/systemtap/examples/process/sigmon.stp

# stap .../sigmon.stp -x 31994 SIGKILL
   SPID     SNAME            RPID  RNAME            SIGNUM SIGNAME
   5609     bash             31994 find             9      SIGKILL

The filtering if block in that script can be adjusted to taste, or eliminated to trace systemwide signal traffic. Causes can be further isolated by collecting backtraces (add a print_backtrace() and/or print_ubacktrace() to the probe, for kernel- and userspace- respectively).

#8


6  

The PAM module to limit resources caused exactly the results you described: my process died mysteriously with the text Killed on the console window. There was no log output, neither in syslog nor in kern.log. The top program helped me discover that exactly after one minute of CPU usage my process got killed.

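Such limits typically come from pam_limits reading /etc/security/limits.conf; a hard CPU limit there would reproduce this symptom, since exceeding the hard RLIMIT_CPU gets the process killed with SIGKILL (a hedged sketch; the username is hypothetical):

# /etc/security/limits.conf (excerpt)
someuser  hard  cpu  1    # hard cap of 1 minute of CPU time per process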

#9


4  

In an LSF environment (interactive or otherwise), if the application's memory utilization exceeds a preset threshold set by the admins on the queue, or exceeds the resource request in the submission to the queue, the processes will be killed so other users don't fall victim to a potential runaway. It doesn't always send an email when it does so, depending on how it's set up.

One solution in this case is to find a queue with larger resources or define larger resource requirements in the submission.

You may also want to review man ulimit.

Although I don't remember ulimit resulting in Killed, it's been a while since I needed that.

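For reference, ulimit can impose per-shell resource caps that apply to any process started from that shell (a sketch; the values are illustrative):

ulimit -t 60        # limit CPU time to 60 seconds
ulimit -v 2097152   # limit virtual memory to ~2GB (units are KB)
ulimit -a           # show all current limits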

#10


2  

We have had recurring problems under Linux at a customer site (Red Hat, I think), with the OOMKiller (out-of-memory killer) killing both our principal application (i.e. the reason the server exists) and its database processes.

In each case the OOMKiller simply decided that the processes were using too many resources... the machine wasn't even about to fail for lack of resources. Neither the application nor its database has problems with memory leaks (or any other resource leak).

I am not a Linux expert, but I rather gathered that its algorithm for deciding when to kill something and what to kill is complex. Also, I was told (I can't speak to the accuracy of this) that the OOMKiller is baked into the kernel and you can't simply not run it.

#11


0  

The user has the ability to kill his own programs, using kill or Control+C, but I get the impression that's not what happened, and that the user complained to you.

root has the ability to kill programs of course, but if someone has root on your machine and is killing stuff, you have bigger problems.

If you are not the sysadmin, the sysadmin may have set up quotas on CPU, RAM, or disk usage and auto-kills processes that exceed them.

Other than those guesses, I'm not sure without more info about the program.

#12


0  

I encountered this problem lately. Eventually I found my processes were killed just after the openSUSE zypper update was called automatically. Disabling the zypper update solved my problem.
