背景:使用countDownLatch,10个线程计算结果,最后通过主线程来汇总结果;
有个spring scheduler的程序,运行一段时间后,不运行了,查看日志发现CountDownLatch无法归零,导致整个主线程Hang在那里,此时如果调用其外部接口还能响应:
怀疑是不是死锁了? 查一下jstack发现没有出现死锁:
"Attach Listener" daemon prio=10 tid=0x00007fc1b4001000 nid=0x162e4 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Abandoned connection cleanup thread" daemon prio=10 tid=0x00007fc1801e5800 nid=0x1e632 in Object.wait() [0x00007fc1ec79a000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked <0x0000000703f97058> (a java.lang.ref.ReferenceQueue$Lock) at com.mysql.jdbc.AbandonedConnectionCleanupThread.run(AbandonedConnectionCleanupThread.java:43) "Tomcat JDBC Pool Cleaner[1485406391:1509516002697]" daemon prio=10 tid=0x00007fc1801b8000 nid=0x1e631 in Object.wait() [0x00007fc1ec89b000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.util.TimerThread.mainLoop(Timer.java:552) - locked <0x000000070406b8d8> (a java.util.TaskQueue) at java.util.TimerThread.run(Timer.java:505) "http-nio-8119-Acceptor-0" daemon prio=10 tid=0x00007fc19ec27800 nid=0x2d11 runnable [0x00007fc1ba16b000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) - locked <0x000000070408f7d8> (a java.lang.Object) at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:686) at java.lang.Thread.run(Thread.java:745) "http-nio-8119-ClientPoller-1" daemon prio=10 tid=0x00007fc19f9a2800 nid=0x2d10 runnable [0x00007fc1cc1af000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked <0x000000070408fbe0> (a sun.nio.ch.Util$2) - locked <0x000000070408fbf0> (a java.util.Collections$UnmodifiableSet) - locked <0x000000070408fb98> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1049) at java.lang.Thread.run(Thread.java:745) "http-nio-8119-ClientPoller-0" daemon prio=10 tid=0x00007fc19d143000 nid=0x2d0f runnable [0x00007fc1cc2b0000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked <0x000000070416a268> (a sun.nio.ch.Util$2) - locked <0x000000070416a278> (a java.util.Collections$UnmodifiableSet) - locked <0x000000070416a220> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1049) at java.lang.Thread.run(Thread.java:745) "NioBlockingSelector.BlockPoller-1" daemon prio=10 tid=0x00007fc19e89b000 nid=0x2d0e runnable [0x00007fc1cc3b1000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) - locked <0x0000000703d3c3a0> (a sun.nio.ch.Util$2) - locked <0x0000000703d3c390> (a java.util.Collections$UnmodifiableSet) - locked <0x0000000703d3c258> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98) at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:342) "pool-1-thread-1" prio=10 tid=0x00007fc1a0749800 nid=0x2d0b waiting on condition [0x00007fc1cc4b1000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007047385d8> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236) at com.didi.bigdataqa.tools.HiveBaseTool.doHealthyScore(HiveBaseTool.java:139) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65) at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:81) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) "pool-2-thread-1" daemon prio=10 tid=0x00007fc1a0381000 nid=0x2ceb waiting on condition [0x00007fc1cc5b3000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007037d9338> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) "cluster-ClusterId{value='59f823ec4d62e12bc9d4a556', description='null'}-localhost:27017" daemon prio=10 tid=0x00007fc1a1734800 nid=0x2cea waiting on condition [0x00007fc1ec497000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007037c1d30> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForSignalOrTimeout(DefaultServerMonitor.java:224) at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:205) at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:153) - locked <0x00000007037c19e8> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable) at java.lang.Thread.run(Thread.java:745) "commons-pool-EvictionTimer" daemon prio=10 tid=0x00007fc19ecc5800 nid=0x2cbb in Object.wait() [0x00007fc1ec396000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.util.TimerThread.mainLoop(Timer.java:552) - locked <0x0000000703b48458> (a java.util.TaskQueue) at java.util.TimerThread.run(Timer.java:505) "container-0" prio=10 tid=0x00007fc19dfd7800 nid=0x2c95 waiting on condition [0x00007fc1ec598000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.catalina.core.StandardServer.await(StandardServer.java:407) at org.springframework.boot.context.embedded.tomcat.TomcatEmbeddedServletContainer$1.run(TomcatEmbeddedServletContainer.java:139) "ContainerBackgroundProcessor[StandardEngine[Tomcat].StandardHost[localhost].StandardContext[/healthy_score]]" daemon prio=10 tid=0x00007fc199247800 nid=0x2c94 waiting on condition [0x00007fc1ec699000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1345) at java.lang.Thread.run(Thread.java:745) "DestroyJavaVM" prio=10 tid=0x00007fc208009000 nid=0x2bca waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Service Thread" daemon prio=10 tid=0x00007fc208244800 nid=0x2bdc runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread1" daemon prio=10 tid=0x00007fc208242000 nid=0x2bdb waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "C2 CompilerThread0" daemon prio=10 tid=0x00007fc20823f000 nid=0x2bda waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x00007fc20823d000 nid=0x2bd9 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x00007fc208210000 nid=0x2bd7 in Object.wait() [0x00007fc1edb29000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) - locked <0x000000070001e8f0> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209) "Reference Handler" daemon prio=10 tid=0x00007fc20820e000 nid=0x2bd6 in Object.wait() [0x00007fc1edc2a000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133) - locked <0x00000006fff96908> (a java.lang.ref.Reference$Lock) "VM Thread" prio=10 tid=0x00007fc208209800 nid=0x2bd5 runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fc20801f000 nid=0x2bcb runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fc208021000 nid=0x2bcc runnable "GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fc208023000 nid=0x2bcd runnable "GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fc208025000 nid=0x2bce runnable "GC task thread#4 (ParallelGC)" prio=10 tid=0x00007fc208026800 nid=0x2bcf runnable "GC task thread#5 (ParallelGC)" prio=10 tid=0x00007fc208028800 nid=0x2bd0 runnable "GC task thread#6 (ParallelGC)" prio=10 tid=0x00007fc20802a800 nid=0x2bd1 runnable "GC task thread#7 (ParallelGC)" prio=10 tid=0x00007fc20802c800 nid=0x2bd2 runnable "GC task thread#8 (ParallelGC)" prio=10 tid=0x00007fc20802e800 nid=0x2bd3 runnable "GC task thread#9 (ParallelGC)" prio=10 tid=0x00007fc208030800 nid=0x2bd4 runnable "VM Periodic Task Thread" prio=10 tid=0x00007fc20824f800 nid=0x2bdd waiting on condition JNI global references: 151
多次输入jstack观察:
jstack 输出 nid=0x2d0f 查看这个线程ID,转换为十进制为11535
查看进程下的线程,下面这几个都是没有释放的线程
ps -T -p 11209
11209 11534 ? 00:00:15 java
11209 11535 ? 00:00:17 java
11209 11536 ? 00:00:17 java
11209 11537 ? 00:00:00 java
在使用countDownLatch的await时,加入一个超时等待时间,这样就可以避免这个问题,但是问题的根源是什么呢?
cnt.await(1, TimeUnit.HOURS);
继续定位,怀疑是哪个地方没有设置超时机制,导致一直等待,或许哪里出现了死循环? jstack打印任务的线程都已经看不到了,只能看到进程
代码里面没看到死循环的地方,怀疑是GC的问题,导致停机了
最后再定位GC,发现有个List的超级大,最大的时候有30多w,堆设置的有点小。。。
果断加大堆内存空间得到解决