I have an intermittent problem on a build server where a Java process in the build somehow fails to terminate and seems to continue running (using 100% of the CPU) forever. I've seen it run for 2+ days over the weekend, where it normally takes about 10 minutes. kill -9 pid seems to be the only way to stop the process.
I have tried calling kill -QUIT pid on the process, but it does not seem to produce any stack trace to STDOUT (maybe it's not responding to the signal?). jstack without the -F force option appears to be unable to connect to the running JVM, but with the force option it does produce the output included below.
Unfortunately, even with that stack trace I can't see any obvious path for further investigation.
As far as I can tell it shows two 'BLOCKED' threads which have run Object.wait (their stacks appear to contain only core Java code, nothing of ours) and a third which is 'IN_VM' with no stack output.
What steps should I take to gather more information about the cause of the problem (or better yet, how can I solve it)?
$ /opt/jdk1.6.0_29/bin/jstack -l -F 5546
Attaching to process ID 5546, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.4-b02
Deadlock Detection:
No deadlocks found.
Finding object size using Printezis bits and skipping over...

Thread 5555: (state = BLOCKED)

Locked ownable synchronizers:
    - None

Thread 5554: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=118 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=134 (Interpreted frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 (Interpreted frame)

Locked ownable synchronizers:
    - None

Thread 5553: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 (Interpreted frame)

Locked ownable synchronizers:
    - None

Thread 5548: (state = IN_VM)

Locked ownable synchronizers:
    - None
(Java version 1.6.0 update 29, running on Scientific Linux release 6.0)
Update:
Running strace -f -p 894 produces a seemingly endless stream of...
[pid 900] sched_yield() = 0
[pid 900] sched_yield() = 0
...
and then, when Ctrl-C'd:
Process 894 detached
...
Process 900 detached
...
Process 909 detached
jmap -histo 894 does not connect, but jmap -F -histo 894 returns...
Attaching to process ID 894, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.4-b02
Iterating over heap. This may take a while...
Finding object size using Printezis bits and skipping over...
Finding object size using Printezis bits and skipping over...
Object Histogram:

num     #instances   #bytes   Class description
--------------------------------------------------------------------------
1:        11356      1551744  * MethodKlass
2:        11356      1435944  * ConstMethodKlass
3:          914       973488  * ConstantPoolKlass
4:         6717       849032  char[]
5:        16987       820072  * SymbolKlass
6:         2305       686048  byte[]
7:          914       672792  * InstanceKlassKlass
8:          857       650312  * ConstantPoolCacheKlass
9:         5243       167776  java.lang.String
10:        1046       108784  java.lang.Class
11:        1400        87576  short[]
12:        1556        84040  * System ObjArray
13:        1037        64584  int[]
14:         103        60152  * ObjArrayKlassKlass
15:         622        54736  java.lang.reflect.Method
16:        1102        49760  java.lang.Object[]
17:         937        37480  java.util.TreeMap$Entry
18:         332        27960  java.util.HashMap$Entry[]
19:         579        27792  java.nio.HeapByteBuffer
20:         578        27744  java.nio.HeapCharBuffer
21:        1021        24504  java.lang.StringBuilder
22:        1158        24176  java.lang.Class[]
23:         721        23072  java.util.HashMap$Entry
24:         434        20832  java.util.TreeMap
25:         689        18936  java.lang.String[]
26:         238        17440  java.lang.reflect.Method[]
27:          29        16800  * MethodDataKlass
28:         204        14688  java.lang.reflect.Field
29:         330        13200  java.util.LinkedHashMap$Entry
30:         264        12672  java.util.HashMap
...
585:          1           16  java.util.LinkedHashSet
586:          1           16  sun.rmi.runtime.NewThreadAction$2
587:          1           16  java.util.Hashtable$EmptyIterator
588:          1           16  java.util.Collections$EmptySet
Total :   79700      8894800
Heap traversal took 1.288 seconds.
7 Answers
#1
3
You can always do a strace -f -p pid to see what the Java process is doing. From the look of it (you cannot get a jstack without -F, and Thread 5548 shows no call stack and is IN_VM), it looks like thread 5548 is taking too long to do something, or is possibly stuck in an infinite loop.
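To narrow that down, it can help to find which native thread (LWP) is actually burning the CPU and match it back to the dump. A minimal sketch, using the pid 894 and LWP 900 from the strace output in the question (substitute your own IDs):

    # show per-thread CPU usage for the stuck process
    top -H -p 894
    # the busy LWP (e.g. 900) is the same number jstack -F prints as "Thread 900";
    # for a normal jstack/kill -QUIT dump, convert it to hex to match the nid= field
    printf '0x%x\n' 900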
#2
2
This might be caused by an Out Of Memory condition too. I would try two things:
- Enable an automatic heap dump on OutOfMemoryError by adding the JVM parameters -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp (see the sketch after this list)
- Try to connect to your JVM with JConsole and see if there is any unusual pattern
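A minimal sketch of a launch command combining both suggestions: the heap-dump flags plus remote JMX so JConsole can attach. The jar name and port are placeholders, and the disabled authentication/SSL settings are only reasonable on a trusted network:

    java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp \
         -Dcom.sun.management.jmxremote.port=9010 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         -jar your-build-task.jar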
#3
2
I would suspect a memory issue. You may want to watch the process using jstat and take a heap dump using jmap around the time you need to kill the process. See if jstat indicates continuous GC. Also, you may want to check the system's health in general (open file descriptors, network etc). Memory would be the easiest, so I would strongly recommend starting with it.
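For example (a sketch only; the sampling interval and dump path are arbitrary, and -F may be needed here just as it was for jstack):

    # print GC utilisation every 1000 ms; steadily climbing FGC/FGCT with O stuck
    # near 100 would suggest the collector is thrashing
    jstat -gcutil 894 1000
    # write a binary heap dump for offline analysis (e.g. in MAT or jhat)
    jmap -F -dump:format=b,file=/tmp/hung-build.hprof 894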
#4
2
Take a snapshot while the process is running normally via jstack -F (-F has to be present, since it produces a different snapshot than plain jstack). The thread numbers are not Thread.getId() values but system (native) thread IDs. Thread 5548 seems to have been created prior to the Finalizer and Reference Handler threads (they are not the source of the issue), so it should be either a GC thread or a compiler thread.
100% CPU probably means a bug in a monitor. Java (HotSpot) monitors use a very simple spin-locking mechanism to ensure ownership.
And of course, attach a debugger (GDB) to check where exactly the process is stuck.
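A minimal GDB session along those lines (thread numbers are placeholders; note that attaching pauses the JVM while you inspect it):

    gdb -p 894
    (gdb) info threads     # list native threads; find the one matching the spinning LWP
    (gdb) thread 2         # switch to the suspect thread (placeholder number)
    (gdb) bt               # native backtrace, e.g. into GC or JIT compiler code
    (gdb) detach
    (gdb) quit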
#5
1
Thread 5554 might indicate that you have a lot of Objects with finalize methods, and/or some problem with a finalize method. It might be worthwhile to look at that.
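One quick, rough check is to look for pending Finalizer references in the histogram you already have (class names can vary slightly between JVM versions):

    # a large and growing count here suggests finalization is falling behind
    jmap -F -histo 894 | grep -i 'java.lang.ref.Finalizer'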
I wasn't familiar with jstack, but it looks like it outputs less information than the thread dumps I am more familiar with. It might be useful to try to get a thread dump: kill -QUIT java_pid. Note that the dump goes to stdout, which might be the console or a log file depending on your setup.
If it's hard to figure out where stdout is being directed, and assuming it is going to a file, you could use find by recent modification time to identify candidate files. This is suggested in a comment on this blog post:
you can run the find command at your root directory and find out what changed in the last x seconds. I've usually used find to help me access all the logs that changed in the last 10 minutes, e.g.: find /var/tomcat -mmin -3 -print (prints out all the files modified under /var/tomcat in the last 3 minutes).
Note that if you are running your JVM with -Xrs, this means that the SIGQUIT signal handler will not be installed and you will not be able to use that means of requesting a thread dump.
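A quick way to check whether -Xrs was passed to the running process (a sketch; 894 is the pid from the question):

    # /proc/<pid>/cmdline is NUL-separated, so split it before matching
    tr '\0' '\n' < /proc/894/cmdline | grep -x -- -Xrs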
#6
1
I am encountering a similar issue: my JBoss JVM gets into an infinite loop and eventually hits OutOfMemory, and I can't kill the process except with kill -9. I suspect a memory issue in most cases.
#7
0
Here are some tools which you can use to localize the part of the process consuming CPU:
- perf/oprofile, especially opannotate: great for seeing what the hell code is consuming cycles (see the perf sketch after this list)
- strace, gstack/gdb (as mentioned by others)
- systemtap is enormously powerful, but limited in some of the same ways as the ptrace-based tools (if your problem doesn't involve a syscall, it's much less effective)
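A minimal perf sketch for the first option (assumes perf is installed and the kernel permits profiling; the 30-second window is arbitrary, and 894 is the pid from the question):

    # sample on-CPU call stacks of the stuck process for ~30 seconds, then browse the report
    perf record -g -p 894 -- sleep 30
    perf report
    # or get a live view of the hottest symbols
    perf top -p 894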