I have an intermittent problem on a build server where a Java process in the build somehow fails to terminate and seems to continue running (using 100% of the CPU) forever (I've seen it run for 2+ days over the weekend where it normally takes about 10 minutes). kill -9 pid seems to be the only way to stop the process.

I have tried calling kill -QUIT pid on the process, but it does not seem to produce any stack trace to STDOUT (maybe it's not responding to the signal?). jstack without the -F force option appears to be unable to connect to the running JVM, but with the force option it does produce the output included below.

Unfortunately, even with that stack trace I can't see any obvious path for further investigation.


As far as I can tell it shows two 'BLOCKED' threads which have run Object.wait (their stacks appear to contain only core Java code, nothing of ours) and a third which is 'IN_VM' with no stack output.


What steps should I take to gather more information about the cause of the problem (or better yet, how can I solve it)?


$ /opt/jdk1.6.0_29/bin/jstack -l -F 5546
Attaching to process ID 5546, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.4-b02
Deadlock Detection:

No deadlocks found.

Finding object size using Printezis bits and skipping over...
Thread 5555: (state = BLOCKED)

Locked ownable synchronizers:
    - None

Thread 5554: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=134 (Interpreted frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame)

Locked ownable synchronizers:
    - None

Thread 5553: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Interpreted frame)

Locked ownable synchronizers:
    - None

Thread 5548: (state = IN_VM)

Locked ownable synchronizers:
    - None

(Java version 1.6.0 update 29, running on Scientific Linux release 6.0)



Running strace -f -p 894 produces a seemingly endless stream of...

[pid   900] sched_yield()               = 0
[pid   900] sched_yield()               = 0

and then, when interrupted with Ctrl-C:


Process 894 detached
Process 900 detached
Process 909 detached

jmap -histo 894 does not connect, but jmap -F -histo 894 returns...

Attaching to process ID 894, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.4-b02
Iterating over heap. This may take a while...
Finding object size using Printezis bits and skipping over...
Finding object size using Printezis bits and skipping over...
Object Histogram:

num   #instances    #bytes  Class description
1:         11356   1551744  * MethodKlass
2:         11356   1435944  * ConstMethodKlass
3:           914    973488  * ConstantPoolKlass
4:          6717    849032  char[]
5:         16987    820072  * SymbolKlass
6:          2305    686048  byte[]
7:           914    672792  * InstanceKlassKlass
8:           857    650312  * ConstantPoolCacheKlass
9:          5243    167776  java.lang.String
10:         1046    108784  java.lang.Class
11:         1400     87576  short[]
12:         1556     84040  * System ObjArray
13:         1037     64584  int[]
14:          103     60152  * ObjArrayKlassKlass
15:          622     54736  java.lang.reflect.Method
16:         1102     49760  java.lang.Object[]
17:          937     37480  java.util.TreeMap$Entry
18:          332     27960  java.util.HashMap$Entry[]
19:          579     27792  java.nio.HeapByteBuffer
20:          578     27744  java.nio.HeapCharBuffer
21:         1021     24504  java.lang.StringBuilder
22:         1158     24176  java.lang.Class[]
23:          721     23072  java.util.HashMap$Entry
24:          434     20832  java.util.TreeMap
25:          689     18936  java.lang.String[]
26:          238     17440  java.lang.reflect.Method[]
27:           29     16800  * MethodDataKlass
28:          204     14688  java.lang.reflect.Field
29:          330     13200  java.util.LinkedHashMap$Entry
30:          264     12672  java.util.HashMap
585:           1        16  java.util.LinkedHashSet
586:           1        16  sun.rmi.runtime.NewThreadAction$2
587:           1        16  java.util.Hashtable$EmptyIterator
588:           1        16  java.util.Collections$EmptySet
Total :    79700   8894800
Heap traversal took 1.288 seconds.

7 Solutions



You can always run strace -f -p pid to see what the Java process is doing. From the look of it (you cannot get a jstack without -F, and thread 5548 shows no call stack and is IN_VM), thread 5548 is either taking too long to do something or is stuck in an infinite loop.



This might be caused by an out-of-memory condition too. I would try two things:


  • Enable an automatic heap dump on OutOfMemoryError by adding these JVM parameters:


    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

  • Try to connect to your JVM with JConsole and see if there is any unusual pattern




I would suspect a memory issue. You may want to watch the process using jstat and take a heap dump using jmap around the time you need to kill the process. See if jstat indicates continuous GC. Also, you may want to check the system's health in general (open file descriptors, network etc). Memory would be the easiest, so I would strongly recommend starting with it.
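One way to check for continuous GC is to sample jstat -gcutil at an interval and watch whether the full-GC count (the FGC column) keeps climbing. A rough sketch, assuming jstat can still read the statistics of the hung JVM; the helper name fgc_counts is invented here for illustration:

```shell
# Read `jstat -gcutil` output on stdin and print the full-GC count (FGC column)
# of each sample. The column is located by its header name, because the set of
# columns differs between Java versions.
fgc_counts() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FGC") col = i; next }
       col     { print $col }'
}

# Usage (894 is the PID from the question; 12 samples, 5 seconds apart):
#   jstat -gcutil 894 5000 12 | fgc_counts
```

A count that rises on every sample while old-generation occupancy (the O column) stays near 100% would suggest the JVM is spending its time in back-to-back full collections.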




Take a snapshot while the process is running normally via jstack -F (-F has to be present, because it produces a different snapshot than plain jstack). The thread numbers are not Thread.id values but system thread IDs. 5548 seems to have been created before the Finalizer and Reference Handler threads (which are not the source of the issue), so it should be either a GC thread or a compiler thread.
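To match a system thread ID such as 5548 against a regular thread dump (kill -QUIT, or jstack without -F), note that HotSpot prints each thread's native ID there as a hexadecimal nid= value, so the decimal ID has to be converted first. A minimal sketch; the function name lwp_to_nid is made up for illustration:

```shell
# Convert a decimal Linux thread ID, as printed by jstack -F or strace -f,
# to the hexadecimal form used in the nid= field of a regular thread dump.
lwp_to_nid() {
  printf '0x%x\n' "$1"
}

lwp_to_nid 5548   # prints 0x15ac
```

Searching a healthy run's thread dump for nid=0x... with the converted ID then shows the name of the thread in question, provided the IDs are taken from the same process.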

100% CPU probably means some bug in a monitor. Java (HotSpot) monitors use a very simple spin-locking mechanism to ensure ownership.

And of course, attach a debugger (GDB) to check exactly where the process is stuck.



Thread 5554 might indicate that you have a lot of Objects with finalize methods, and/or some problem with a finalize method. It might be worthwhile to look at that.


I wasn't familiar with jstack, but it looks like it outputs less information than the thread dumps I am more familiar with. It might be useful to try to get a thread dump: kill -QUIT java_pid. Note that the dump goes to stdout, which might be the console or a log file depending on your setup.

If it's hard to figure out where stdout is being directed to, and assuming it is going to a file, you could use find by recent modification time to identify candidate files. This is suggested in a comment to this blog post:

you can run the find command at your root directory and find out what changed in the last x seconds. I've usually used find to get at all the logs that changed in the last 10 minutes, e.g.: find /var/tomcat -mmin -3 -print (prints out all the files modified under /var/tomcat in the last 3 minutes).
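The find invocation quoted above can be wrapped in a small helper so the directory and time window are easy to vary. This is only a sketch: recently_modified is a hypothetical name, and -mmin is an extension supported by GNU and BSD find rather than part of POSIX.

```shell
# List regular files under a directory that were modified within the last
# N minutes (candidates for wherever the JVM's stdout is being written).
recently_modified() {
  dir=$1
  minutes=$2
  find "$dir" -type f -mmin "-$minutes" -print
}
```

For example, recently_modified /var/tomcat 3 lists the files changed in the last three minutes, so running it right after kill -QUIT should narrow down where the thread dump went.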

Note that if you are running your JVM with -Xrs, this means that the SIGQUIT signal handler will not be installed and you will not be able to use that means of requesting a thread dump.
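To rule out -Xrs on Linux, the JVM's arguments can be read from /proc/<pid>/cmdline, where they are stored NUL-separated. A minimal sketch; started_with_xrs is a hypothetical helper, and it takes the file path as its argument so it can be pointed at any process:

```shell
# Succeed if the given NUL-separated command-line file contains the -Xrs flag.
# On Linux, pass /proc/<pid>/cmdline of the JVM in question.
started_with_xrs() {
  tr '\0' '\n' < "$1" | grep -q -x -e '-Xrs'
}

# Usage:
#   started_with_xrs /proc/5546/cmdline && echo "SIGQUIT thread dumps are disabled"
```

This only inspects the command line; options supplied some other way, for instance through an environment variable such as JAVA_TOOL_OPTIONS, would not show up here.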




I am encountering a similar issue: my JBoss JVM gets into an infinite loop and eventually hits OutOfMemory, and I can't kill the process except with kill -9. I suspect a memory issue in most cases.



Here are some tools which you can use to localize the part of the process consuming CPU:


  • perf / oprofile, especially opannotate: great for seeing exactly which code is consuming cycles
  • strace, gstack / gdb (as mentioned by others)
  • systemtap is enormously powerful, but limited in some of the same ways as the ptrace based tools (if your problem doesn't involve a syscall, it's much less effective).



