Java thread executing a remainder operation in a loop blocks all other threads

The following code snippet runs two threads: one is a simple timer that logs every second, and the other is an infinite loop that executes a remainder operation:

public class TestBlockingThread {
    private static final Logger LOGGER = LoggerFactory.getLogger(TestBlockingThread.class);

    public static final void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            int i = 0;
            while (true) {
                i++;
                if (i != 0) {
                    boolean b = 1 % i == 0;
                }
            }
        };

        new Thread(new LogTimer()).start();
        Thread.sleep(2000);
        new Thread(task).start();
    }

    public static class LogTimer implements Runnable {
        @Override
        public void run() {
            while (true) {
                long start = System.currentTimeMillis();
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    // do nothing
                }
                LOGGER.info("timeElapsed={}", System.currentTimeMillis() - start);
            }
        }
    }
}

This gives the following result:

[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=1003
[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=13331
[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=1006
[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=1003
[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
[Thread-0] INFO  c.m.c.concurrent.TestBlockingThread - timeElapsed=1004

I don't understand why the infinite task blocks all other threads for 13.3 seconds. I tried changing thread priorities and other settings, but nothing worked.

If you have any suggestions to fix this (including tweaking OS context switching settings) please let me know.

4 Solutions

#1

After all the explanations here (thanks to Peter Lawrey), we found that the main cause of this pause is that the safepoint inside the loop is reached very rarely, so it takes a long time to stop all threads for JIT-compiled code replacement.

But I decided to go deeper and find out why the safepoint is reached so rarely. I found it a bit confusing that the back jump of the while loop is not "safe" in this case.

So I summoned -XX:+PrintAssembly in all its glory to help:

-XX:+UnlockDiagnosticVMOptions \
-XX:+TraceClassLoading \
-XX:+DebugNonSafepoints \
-XX:+PrintCompilation \
-XX:+PrintGCDetails \
-XX:+PrintStubCode \
-XX:+PrintAssembly \
-XX:PrintAssemblyOptions=-Mintel

After some investigation, I found that after the third recompilation of the lambda, the C2 compiler threw away the safepoint polls inside the loop completely.

UPDATE

During the profiling stage variable i was never seen equal to 0. That's why C2 speculatively optimized this branch away, so that the loop was transformed to something like

for (int i = OSR_value; i != 0; i++) {
    if (1 % i == 0) {
        uncommon_trap();
    }
}
uncommon_trap();

Note that the originally infinite loop was reshaped into a regular finite loop with a counter! Because the JIT eliminates safepoint polls in finite counted loops, there was no safepoint poll in this loop either.

After some time, i wrapped back to 0, and the uncommon trap was taken. The method was deoptimized and continued execution in the interpreter. During recompilation with this new knowledge, C2 recognized the infinite loop and gave up compilation. The rest of the method proceeded in the interpreter with proper safepoints.
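
To make the wrap-around concrete, here is a minimal sketch (the class and method names are hypothetical, not from the original post). It counts the increments a 16-bit counter needs to get back to 0; an int counter behaves the same way but needs 2^32 increments, which is why the loop above ran for so long before the uncommon trap was reached.

```java
// Illustrative sketch only: CounterWrapDemo is a hypothetical name.
class CounterWrapDemo {
    // Counts how many increments a 16-bit counter needs to return to 0.
    static int incrementsUntilShortWraps() {
        short s = 0;
        int steps = 0;
        do {
            s++;      // silently overflows from 32767 to -32768
            steps++;
        } while (s != 0);
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(incrementsUntilShortWraps());                 // 65536
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE);  // true
        System.out.println(1L << 32);                                    // 4294967296
    }
}
```

The short counter is used only so that the demo finishes instantly; the int case from the question follows the same two's-complement arithmetic.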

There is a great must-read blog post "Safepoints: Meaning, Side Effects and Overheads" by Nitsan Wakart covering safepoints and this particular issue.

Safepoint elimination in very long counted loops is known to be a problem. The bug JDK-5014723 (thanks to Vladimir Ivanov) addresses this problem.

Workarounds are available until the bug is finally fixed.

You can try using -XX:+UseCountedLoopSafepoints (it causes an overall performance penalty and may lead to a JVM crash, see JDK-8161147). With this flag, the C2 compiler keeps safepoint polls at the back jumps, and the original pause disappears completely.

You can explicitly disable compilation of the problematic method by using
-XX:CompileCommand='exclude,binary/class/Name,methodName'

Or you can rewrite your code by adding a safepoint manually. For example, a Thread.yield() call at the end of each iteration, or even changing int i to long i (thanks, Nitsan Wakart), will also fix the pause.
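
The two code-level workarounds could look roughly like this. This is a sketch under assumptions: the class name, the method names, and the iteration limit are hypothetical and were added only so that the demo terminates; the loop body is the one from the question.

```java
// Illustrative sketch only: names and the iteration limit are hypothetical.
class SafepointFriendlyLoop {
    // Variant 1: long counter. Per the answer above, C2 then no longer
    // treats the loop as an int counted loop and keeps the safepoint poll.
    static long countWithLongCounter(long limit) {
        long hits = 0;
        for (long i = 1; i <= limit; i++) {
            if (1 % i == 0) {
                hits++;
            }
        }
        return hits;
    }

    // Variant 2: int counter, but with Thread.yield() at the end of
    // each iteration as a manually added safepoint.
    static int countWithYield(int limit) {
        int hits = 0;
        for (int i = 1; i <= limit; i++) {
            if (1 % i == 0) {
                hits++;
            }
            Thread.yield();
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countWithLongCounter(1_000_000L));
        System.out.println(countWithYield(10_000));
    }
}
```

Both variants keep the arithmetic of the original loop; only the counter type or the extra call differs. Calling Thread.yield() on every iteration is likely to be much slower than the long counter, so the latter is probably the less intrusive change.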

#2

In short, the loop you have has no safe point inside it except when i == 0 is reached. When this method is compiled and triggers the code to be replaced it needs to bring all the threads to a safe point, but this takes a very long time, locking up not just the thread running the code but all threads in the JVM.

I added the following command line options.

-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintCompilation

I also modified the code to use floating point which appears to take longer.

boolean b = 1.0 / i == 0;

And what I see in the output is

timeElapsed=100
Application time: 0.9560686 seconds
  41423  280 %     4       TestBlockingThread::lambda$main$0 @ -2 (27 bytes)   made not entrant
Total time for which application threads were stopped: 40.3971116 seconds, Stopping threads took: 40.3967755 seconds
Application time: 0.0000219 seconds
Total time for which application threads were stopped: 0.0005840 seconds, Stopping threads took: 0.0000383 seconds
  41424  281 %     3       TestBlockingThread::lambda$main$0 @ 2 (27 bytes)
timeElapsed=40473
  41425  282 %     4       TestBlockingThread::lambda$main$0 @ 2 (27 bytes)
  41426  281 %     3       TestBlockingThread::lambda$main$0 @ -2 (27 bytes)   made not entrant
timeElapsed=100
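
When the log is long, the interesting lines are those where "Stopping threads took" is large, because that is the time spent waiting for all threads to reach a safepoint. A small parser along these lines can pick them out. This is only a sketch: the class name, method names, and threshold are hypothetical, and the regular expression assumes the exact line format shown in the output above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: StoppedTimeParser is a hypothetical helper.
class StoppedTimeParser {
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds, "
        + "Stopping threads took: ([0-9.]+) seconds");

    // Returns {totalStoppedSeconds, stoppingThreadsSeconds}, or empty if the line has another format.
    static Optional<double[]> parse(String line) {
        Matcher m = STOPPED.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2))
        });
    }

    // True if reaching the safepoint took longer than the given threshold.
    static boolean isSlowToSafepoint(String line, double thresholdSeconds) {
        return parse(line).map(t -> t[1] > thresholdSeconds).orElse(false);
    }

    public static void main(String[] args) {
        String line = "Total time for which application threads were stopped: 40.3971116 seconds, "
            + "Stopping threads took: 40.3967755 seconds";
        System.out.println(isSlowToSafepoint(line, 1.0));
    }
}
```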

Note: for code to be replaced, threads have to be stopped at a safe point. However, it appears here that such a safe point is reached very rarely (possibly only when i == 0). Changing the task to

Runnable task = () -> {
    for (int i = 1; i != 0 ; i++) {
        boolean b = 1.0 / i == 0;
    }
};

I see a similar delay.

timeElapsed=100
Application time: 0.9587419 seconds
  39044  280 %     4       TestBlockingThread::lambda$main$0 @ -2 (28 bytes)   made not entrant
Total time for which application threads were stopped: 38.0227039 seconds, Stopping threads took: 38.0225761 seconds
Application time: 0.0000087 seconds
Total time for which application threads were stopped: 0.0003102 seconds, Stopping threads took: 0.0000105 seconds
timeElapsed=38100
timeElapsed=100

By carefully adding code to the loop, you get a longer delay.

for (int i = 1; i != 0 ; i++) {
    boolean b = 1.0 / i / i == 0;
}

gets

 Total time for which application threads were stopped: 59.6034546 seconds, Stopping threads took: 59.6030773 seconds

However, change the code to use a native method, which always has a safe point (if it is not an intrinsic), and the pause disappears:

for (int i = 1; i != 0 ; i++) {
    boolean b = Math.cos(1.0 / i) == 0;
}

prints

Total time for which application threads were stopped: 0.0001444 seconds, Stopping threads took: 0.0000615 seconds

Note: adding if (Thread.currentThread().isInterrupted()) { ... } to a loop adds a safe point.
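
A loop built around that check might look like the sketch below. The class and method names are hypothetical, and whether the check really results in a safepoint poll depends on the JVM, as the note above suggests; what the sketch does show is that the worker can at least be stopped cooperatively through an interrupt.

```java
// Illustrative sketch only: InterruptibleLoop is a hypothetical name.
class InterruptibleLoop {
    // Starts the busy loop from the question, but exits once the thread is interrupted.
    static Thread startWorker() {
        Thread worker = new Thread(() -> {
            int i = 0;
            while (!Thread.currentThread().isInterrupted()) {
                i++;
                if (i != 0) {
                    boolean b = 1 % i == 0;
                }
            }
        }, "interruptible-worker");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    // Runs the worker for runMillis, interrupts it, and reports whether it stopped in time.
    static boolean stopsAfterInterrupt(long runMillis, long joinMillis) {
        Thread worker = startWorker();
        try {
            Thread.sleep(runMillis);
            worker.interrupt();
            worker.join(joinMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + stopsAfterInterrupt(100, 5000));
    }
}
```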

Note: This happened on a 16 core machine so there is no lack of CPU resources.

#3

I found out why. The mechanism is called safepoints, best known from the stop-the-world pauses that happen because of GC.

See this article: Logging stop-the-world pauses in JVM

Different events can cause the JVM to pause all the application threads. Such pauses are called Stop-The-World (STW) pauses. The most common cause for an STW pause to be triggered is garbage collection (example in github) , but different JIT actions (example), biased lock revocation (example), certain JVMTI operations , and many more also require the application to be stopped.

The points at which the application threads may be safely stopped are called, surprise, safepoints. This term is also often used to refer to all the STW pauses.

It is more or less common that GC logs are enabled. However, this does not capture information on all the safepoints. To get it all, use these JVM options:

-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
If you are wondering about the naming explicitly referring to GC, don’t be alarmed – turning on these options logs all of the safepoints, not just garbage collection pauses. If you run a following example (source in github) with the flags specified above.

Reading the HotSpot Glossary of Terms, it defines this:

safepoint

A point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. Both compiled Java code and C/C++ code be optimized between safepoints, but less so across safepoints. The JIT compiler emits a GC map at each safepoint. C/C++ code in the VM uses stylized macro-based conventions (e.g., TRAPS) to mark potential safepoints.

Running with the above mentioned flags, I get this output:

Application time: 0.9668750 seconds
Total time for which application threads were stopped: 0.0000747 seconds, Stopping threads took: 0.0000291 seconds
timeElapsed=1015
Application time: 1.0148568 seconds
Total time for which application threads were stopped: 0.0000556 seconds, Stopping threads took: 0.0000168 seconds
timeElapsed=1015
timeElapsed=1014
Application time: 2.0453971 seconds
Total time for which application threads were stopped: 10.7951187 seconds, Stopping threads took: 10.7950774 seconds
timeElapsed=11732
Application time: 1.0149263 seconds
Total time for which application threads were stopped: 0.0000644 seconds, Stopping threads took: 0.0000368 seconds
timeElapsed=1015

Notice the third STW event:
Total time stopped: 10.7951187 seconds
Stopping threads took: 10.7950774 seconds

JIT itself took virtually no time. However, once the JVM had decided to perform a JIT compilation, it entered STW mode, and since the code to be compiled (the infinite loop) doesn't have a call site, no safepoint was ever reached.

The STW ends when JIT eventually gives up waiting and concludes the code is in an infinite loop.

#4

After following the comment threads and some testing on my own, I believe that the pause is caused by the JIT compiler. Why the JIT compiler is taking such a long time is beyond my ability to debug.

However, since you only asked for how to prevent this, I have a solution:

Pull your infinite loop into a method, where it can be excluded from the JIT compiler:

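import java.util.logging.Logger;

public class TestBlockingThread {
    private static final Logger LOGGER = Logger.getLogger(TestBlockingThread.class.getName());

    public static final void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            infLoop();
        };
        new Thread(new LogTimer()).start();
        Thread.sleep(2000);
        new Thread(task).start();
    }

    // The infinite loop lives in its own method so that it can be excluded by name.
    private static void infLoop() {
        int i = 0;
        while (true) {
            i++;
            if (i != 0) {
                boolean b = 1 % i == 0;
            }
        }
    }

    // LogTimer is the same Runnable as in the question
    // (adapt its logging call to java.util.logging).
}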

Run your program with this VM argument:

-XX:CompileCommand=exclude,PACKAGE.TestBlockingThread::infLoop (replace PACKAGE with your package information)

You should get a message like this to indicate when the method would have been JIT-compiled:
### Excluding compile: static blocking.TestBlockingThread::infLoop
You may notice that I put the class into a package called blocking.
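
If the exclusion message never appears, one thing worth checking is whether the flag actually reached the JVM. The sketch below reads the JVM input arguments through the standard RuntimeMXBean and looks for the exact option string; the class and method names are hypothetical, and the check is a plain string comparison, so it assumes the flag was written exactly as shown above.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative sketch only: CompileCommandCheck is a hypothetical helper.
class CompileCommandCheck {
    // True if the argument list contains an exclude command for the given method pattern.
    static boolean excludes(List<String> jvmArgs, String methodPattern) {
        String expected = "-XX:CompileCommand=exclude," + methodPattern;
        return jvmArgs.contains(expected);
    }

    public static void main(String[] args) {
        // Arguments passed to the running JVM (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(excludes(jvmArgs, "blocking.TestBlockingThread::infLoop"));
    }
}
```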
