为什么JIT顺序会影响性能?

Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:

为什么在。net 4.0中c#方法的顺序是即时编译的，影响了它们执行的速度?例如，考虑两种等效方法:

public static void SingleLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        count += i % 16 == 0 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}

public static void MultiLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}

The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.

惟一的区别是引入了一个局部变量，它影响生成的汇编代码和循环性能。为什么会出现这种情况本身就是一个问题。

Possibly even stranger is that on x86 (but not x64), the order that the methods are invoked has around a 20% impact on performance. Invoke the methods like this...

可能更奇怪的是，在x86(但不是x64)上，调用方法的顺序对性能有大约20%的影响。调用这样的方法…

static void Main()
{
    SingleLineTest();
    MultiLineTest();
}

...and SingleLineTest is faster. (Compile using the x86 Release configuration, ensuring that "Optimize code" setting is enabled, and run the test from outside VS2010.) But reverse the order...

…和SingleLineTest更快。(使用x86发布配置，确保启用“优化代码”设置，并在VS2010之外运行测试。)但反向顺序…

static void Main()
{
    MultiLineTest();
    SingleLineTest();
}

...and both methods take the same time (almost, but not quite, as long as MultiLineTest before). (When running this test, it's useful to add some additional calls to SingleLineTest and MultiLineTest to get additional samples. How many and what order doesn't matter, except for which method is called first.)

…这两种方法都是同时使用的(几乎是，但不完全是，只要是多linetest之前)。(在运行这个测试时，需要向SingleLineTest和MultiLineTest添加一些额外的调用以获得更多的示例。有多少和什么顺序无关紧要，除非先调用哪个方法。)

Finally, to demonstrate that JIT order is important, leave MultiLineTest first, but force SingleLineTest to be JITed first...

最后，要证明JIT的顺序是重要的，首先要离开多行，但是要先让它先退出。

static void Main()
{
    RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);
    MultiLineTest();
    SingleLineTest();
}

Now, SingleLineTest is faster again.

现在，SingleLineTest又快了。

If you turn off "Suppress JIT optimization on module load" in VS2010, you can put a breakpoint in SingleLineTest and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.

如果在VS2010中关闭“抑制模块负载的JIT优化”，您可以在SingleLineTest中放置一个断点，并且可以看到，无论JIT顺序如何，循环中的汇编代码都是相同的;但是，在方法开始时的汇编代码是不同的。但是，当大部分时间都花在循环上的时候，这又有什么关系呢?

A sample project demonstrating this behavior is on github.

演示此行为的示例项目位于github上。

It's not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.

目前还不清楚这种行为如何影响实际应用。一个问题是，它可以使性能调优变得不稳定，这取决于第一次调用的顺序方法。这类问题很难用分析器来检测。一旦你找到了热点并优化了他们的算法，你就很难知道没有很多猜测，并检查是否有可能通过早期的JITing方法进行额外的加速。

Update: See also the Microsoft Connect entry for this issue.

更新:请参阅Microsoft Connect条目以解决此问题。

3 个解决方案

#1

Please note that I do not trust the "Suppress JIT optimization on module load" option, I spawn the process without debugging and attach my debugger after the JIT has run.

请注意，我不相信“在模块加载上抑制JIT优化”选项，在JIT运行后，我不需要调试和附加调试器就可以生成进程。

In the version where single-line runs faster, this is Main:

在单行运行更快的版本中，这是主要的:

        SingleLineTest();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000009  call        dword ptr ds:[00193818h] 
            SingleLineTest();
0000000f  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000015  call        dword ptr ds:[00193818h] 
            SingleLineTest();
0000001b  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000021  call        dword ptr ds:[00193818h] 
00000027  pop         ebp 
        }
00000028  ret

Note that MultiLineTest has been placed on an 8 byte boundary, and SingleLineTest on a 4 byte boundary.

注意，MultiLineTest被放置在一个8字节的边界上，而SingleLineTest位于一个4字节的边界上。

Here's Main for the version where both run at the same speed:

这是主要的版本，两者都以相同的速度运行:

            MultiLineTest();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  call        dword ptr ds:[00153818h] 

            SingleLineTest();
00000009  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
0000000f  call        dword ptr ds:[00153818h] 
            SingleLineTest();
00000015  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
0000001b  call        dword ptr ds:[00153818h] 
            SingleLineTest();
00000021  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
00000027  call        dword ptr ds:[00153818h] 
0000002d  pop         ebp 
        }
0000002e  ret

Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.

令人惊讶的是，JIT所选择的地址在最后4位数字中是相同的，尽管据称它以相反的顺序处理它们。我不太相信。

More digging is necessary. I think it was mentioned that the code before the loop wasn't exactly the same in both versions? Going to investigate.

更多的挖掘是必要的。我认为在这两个版本中循环之前的代码并不完全相同?去调查。

Here's the "slow" version of SingleLineTest (and I checked, the last digits of the function address haven't changed).

这里是SingleLineTest的“慢”版本(我检查过，函数地址的最后一个数字没有改变)。

            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,7A5A2C68h 
0000000b  call        FFF91EA0 
00000010  mov         esi,eax 
00000012  mov         dword ptr [esi+4],0 
00000019  mov         dword ptr [esi+8],0 
00000020  mov         byte ptr [esi+14h],0 
00000024  mov         dword ptr [esi+0Ch],0 
0000002b  mov         dword ptr [esi+10h],0 
            stopwatch.Start();
00000032  cmp         byte ptr [esi+14h],0 
00000036  jne         00000047 
00000038  call        7A22B314 
0000003d  mov         dword ptr [esi+0Ch],eax 
00000040  mov         dword ptr [esi+10h],edx 
00000043  mov         byte ptr [esi+14h],1 
            int count = 0;
00000047  xor         edi,edi 
            for (uint i = 0; i < 1000000000; ++i) {
00000049  xor         edx,edx 
                count += i % 16 == 0 ? 1 : 0;
0000004b  mov         eax,edx 
0000004d  and         eax,0Fh 
00000050  test        eax,eax 
00000052  je          00000058 
00000054  xor         eax,eax 
00000056  jmp         0000005D 
00000058  mov         eax,1 
0000005d  add         edi,eax 
            for (uint i = 0; i < 1000000000; ++i) {
0000005f  inc         edx 
00000060  cmp         edx,3B9ACA00h 
00000066  jb          0000004B 
            }
            stopwatch.Stop();
00000068  mov         ecx,esi 
0000006a  call        7A23F2C0 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000006f  mov         ecx,797C29B4h 
00000074  call        FFF91EA0 
00000079  mov         ecx,eax 
0000007b  mov         dword ptr [ecx+4],edi 
0000007e  mov         ebx,ecx 
00000080  mov         ecx,797BA240h 
00000085  call        FFF91EA0 
0000008a  mov         edi,eax 
0000008c  mov         ecx,esi 
0000008e  call        7A23ABE8 
00000093  push        edx 
00000094  push        eax 
00000095  push        0 
00000097  push        2710h 
0000009c  call        783247EC 
000000a1  mov         dword ptr [edi+4],eax 
000000a4  mov         dword ptr [edi+8],edx 
000000a7  mov         esi,edi 
000000a9  call        793C6F40 
000000ae  push        ebx 
000000af  push        esi 
000000b0  mov         ecx,eax 
000000b2  mov         edx,dword ptr ds:[03392034h] 
000000b8  mov         eax,dword ptr [ecx] 
000000ba  mov         eax,dword ptr [eax+3Ch] 
000000bd  call        dword ptr [eax+1Ch] 
000000c0  pop         ebx 
        }
000000c1  pop         esi 
000000c2  pop         edi 
000000c3  pop         ebp 
000000c4  ret

And the "fast" version:

和“快速”的版本:

            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,7A5A2C68h 
0000000b  call        FFE11F70 
00000010  mov         esi,eax 
00000012  mov         ecx,esi 
00000014  call        7A1068BC 
            stopwatch.Start();
00000019  cmp         byte ptr [esi+14h],0 
0000001d  jne         0000002E 
0000001f  call        7A12B3E4 
00000024  mov         dword ptr [esi+0Ch],eax 
00000027  mov         dword ptr [esi+10h],edx 
0000002a  mov         byte ptr [esi+14h],1 
            int count = 0;
0000002e  xor         edi,edi 
            for (uint i = 0; i < 1000000000; ++i) {
00000030  xor         edx,edx 
                count += i % 16 == 0 ? 1 : 0;
00000032  mov         eax,edx 
00000034  and         eax,0Fh 
00000037  test        eax,eax 
00000039  je          0000003F 
0000003b  xor         eax,eax 
0000003d  jmp         00000044 
0000003f  mov         eax,1 
00000044  add         edi,eax 
            for (uint i = 0; i < 1000000000; ++i) {
00000046  inc         edx 
00000047  cmp         edx,3B9ACA00h 
0000004d  jb          00000032 
            }
            stopwatch.Stop();
0000004f  mov         ecx,esi 
00000051  call        7A13F390 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000056  mov         ecx,797C29B4h 
0000005b  call        FFE11F70 
00000060  mov         ecx,eax 
00000062  mov         dword ptr [ecx+4],edi 
00000065  mov         ebx,ecx 
00000067  mov         ecx,797BA240h 
0000006c  call        FFE11F70 
00000071  mov         edi,eax 
00000073  mov         ecx,esi 
00000075  call        7A13ACB8 
0000007a  push        edx 
0000007b  push        eax 
0000007c  push        0 
0000007e  push        2710h 
00000083  call        782248BC 
00000088  mov         dword ptr [edi+4],eax 
0000008b  mov         dword ptr [edi+8],edx 
0000008e  mov         esi,edi 
00000090  call        792C7010 
00000095  push        ebx 
00000096  push        esi 
00000097  mov         ecx,eax 
00000099  mov         edx,dword ptr ds:[03562030h] 
0000009f  mov         eax,dword ptr [ecx] 
000000a1  mov         eax,dword ptr [eax+3Ch] 
000000a4  call        dword ptr [eax+1Ch] 
000000a7  pop         ebx 
        }
000000a8  pop         esi 
000000a9  pop         edi 
000000aa  pop         ebp 
000000ab  ret

Just the loops, fast on the left, slow on the right:

只在左边快速的循环，在右边缓慢:

00000030  xor         edx,edx                 00000049  xor         edx,edx 
00000032  mov         eax,edx                 0000004b  mov         eax,edx 
00000034  and         eax,0Fh                 0000004d  and         eax,0Fh 
00000037  test        eax,eax                 00000050  test        eax,eax 
00000039  je          0000003F                00000052  je          00000058 
0000003b  xor         eax,eax                 00000054  xor         eax,eax 
0000003d  jmp         00000044                00000056  jmp         0000005D 
0000003f  mov         eax,1                   00000058  mov         eax,1 
00000044  add         edi,eax                 0000005d  add         edi,eax 
00000046  inc         edx                     0000005f  inc         edx 
00000047  cmp         edx,3B9ACA00h           00000060  cmp         edx,3B9ACA00h 
0000004d  jb          00000032                00000066  jb          0000004B

The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. the je loading a constant 1 is aligned in the slow version and not in the fast version, but it hardly matters, since that jump is only taken 1/16 of the time. The other two jumps ( jmp after loading a constant zero, and jb repeating the entire loop) are taken millions more times, and are aligned in the "fast" version.

指令是相同的(相对跳转，即使拆卸显示不同的地址，机器代码也是相同的)，但是对齐方式是不同的。有三个跳跃。我加载一个常数1是在慢版本中，而不是在快速的版本中，但是这几乎不重要，因为这个跳跃只占用了1/16的时间。另外两个跳转(在加载一个常量0之后jmp, jb重复整个循环)被花费了数百万次，并在“快速”版本中对齐。

I think this is the smoking gun.

我想这是确凿的证据。

#2

So for a definitive answer... I suspect we would need to dig into the dis-assembly.

因此，为了得到一个明确的答案……我想我们需要深入剖析一下这个问题。

However, I have a guess. The compiler for the SingleLineTest() stores each result of the equation on the stack and pops each value as needed. However, the MultiLineTest() may be storing values and having to access them from there. This could cause a few clock cycles to be missed. Where as grabbing the values off the stack will keep it in a register.

然而，我有一个猜想。SingleLineTest()的编译器将每个结果存储在堆栈上，并在需要时弹出每个值。但是，MultiLineTest()可能存储值，并且必须从那里访问它们。这可能导致几个时钟周期被忽略。从堆栈中获取值将保存在寄存器中。

Interestingly, changing the order of the function compilation may be adjusting the garbage collector's actions. Because isMultipleOf16 is defined within the loop, it may be be handled funny. You may want to move the definition outside of the loop and see what that changes...

有趣的是，改变函数编译的顺序可能会调整垃圾收集器的操作。因为isMultipleOf16是在循环中定义的，所以它可能被处理得很有趣。您可能想要将定义移到循环之外，看看会发生什么变化……

#3

My time is 2400 and 2600 on i5-2410M 2,3Ghz 4GB ram 64bit Win 7.

我的时间是2400和2600，在i5-2410M 2,3Ghz 4GB ram 64位赢7。

Here is my output: Single first

这是我的输出:单头。

After starting the process and then attaching the debugger

启动进程后，然后附加调试器。

            SingleLineTest();
            MultiLineTest();
            SingleLineTest();
            MultiLineTest();
            SingleLineTest();
            MultiLineTest();
--------------------------------
SingleLineTest()
           Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,685D2C68h 
0000000b  call        FFD91F70 
00000010  mov         esi,eax 
00000012  mov         ecx,esi 
00000014  call        681D68BC 
            stopwatch.Start();
00000019  cmp         byte ptr [esi+14h],0 
0000001d  jne         0000002E 
0000001f  call        681FB3E4 
00000024  mov         dword ptr [esi+0Ch],eax 
00000027  mov         dword ptr [esi+10h],edx 
0000002a  mov         byte ptr [esi+14h],1 
            int count = 0;
0000002e  xor         edi,edi 
            for (int i = 0; i < 1000000000; ++i)
00000030  xor         edx,edx 
            {
                count += i % 16 == 0 ? 1 : 0;
00000032  mov         eax,edx 
00000034  and         eax,8000000Fh 
00000039  jns         00000040 
0000003b  dec         eax 
0000003c  or          eax,0FFFFFFF0h 
0000003f  inc         eax 
00000040  test        eax,eax 
00000042  je          00000048 
00000044  xor         eax,eax 
00000046  jmp         0000004D 
00000048  mov         eax,1 
0000004d  add         edi,eax 
            for (int i = 0; i < 1000000000; ++i)
0000004f  inc         edx 
00000050  cmp         edx,3B9ACA00h 
00000056  jl          00000032 
            }
            stopwatch.Stop();
00000058  mov         ecx,esi 
0000005a  call        6820F390 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000005f  mov         ecx,6A8B29B4h 
00000064  call        FFD91F70 
00000069  mov         ecx,eax 
0000006b  mov         dword ptr [ecx+4],edi 
0000006e  mov         ebx,ecx 
00000070  mov         ecx,6A8AA240h 
00000075  call        FFD91F70 
0000007a  mov         edi,eax 
0000007c  mov         ecx,esi 
0000007e  call        6820ACB8 
00000083  push        edx 
00000084  push        eax 
00000085  push        0 
00000087  push        2710h 
0000008c  call        6AFF48BC 
00000091  mov         dword ptr [edi+4],eax 
00000094  mov         dword ptr [edi+8],edx 
00000097  mov         esi,edi 
00000099  call        6A457010 
0000009e  push        ebx 
0000009f  push        esi 
000000a0  mov         ecx,eax 
000000a2  mov         edx,dword ptr ds:[039F2030h] 
000000a8  mov         eax,dword ptr [ecx] 
000000aa  mov         eax,dword ptr [eax+3Ch] 
000000ad  call        dword ptr [eax+1Ch] 
000000b0  pop         ebx 
        }
000000b1  pop         esi 
000000b2  pop         edi 
000000b3  pop         ebp 
000000b4  ret

Multi first:

多:

            MultiLineTest();

            SingleLineTest();
            MultiLineTest();
            SingleLineTest();
            MultiLineTest();
            SingleLineTest();
            MultiLineTest();
--------------------------------
SingleLineTest()
            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,685D2C68h 
0000000b  call        FFF31EA0 
00000010  mov         esi,eax 
00000012  mov         dword ptr [esi+4],0 
00000019  mov         dword ptr [esi+8],0 
00000020  mov         byte ptr [esi+14h],0 
00000024  mov         dword ptr [esi+0Ch],0 
0000002b  mov         dword ptr [esi+10h],0 
            stopwatch.Start();
00000032  cmp         byte ptr [esi+14h],0 
00000036  jne         00000047 
00000038  call        682AB314 
0000003d  mov         dword ptr [esi+0Ch],eax 
00000040  mov         dword ptr [esi+10h],edx 
00000043  mov         byte ptr [esi+14h],1 
            int count = 0;
00000047  xor         edi,edi 
            for (int i = 0; i < 1000000000; ++i)
00000049  xor         edx,edx 
            {
                count += i % 16 == 0 ? 1 : 0;
0000004b  mov         eax,edx 
0000004d  and         eax,8000000Fh 
00000052  jns         00000059 
00000054  dec         eax 
00000055  or          eax,0FFFFFFF0h 
00000058  inc         eax 
00000059  test        eax,eax 
0000005b  je          00000061 
0000005d  xor         eax,eax 
0000005f  jmp         00000066 
00000061  mov         eax,1 
00000066  add         edi,eax 
            for (int i = 0; i < 1000000000; ++i)
00000068  inc         edx 
00000069  cmp         edx,3B9ACA00h 
0000006f  jl          0000004B 
            }
            stopwatch.Stop();
00000071  mov         ecx,esi 
00000073  call        682BF2C0 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000078  mov         ecx,6A8B29B4h 
0000007d  call        FFF31EA0 
00000082  mov         ecx,eax 
00000084  mov         dword ptr [ecx+4],edi 
00000087  mov         ebx,ecx 
00000089  mov         ecx,6A8AA240h 
0000008e  call        FFF31EA0 
00000093  mov         edi,eax 
00000095  mov         ecx,esi 
00000097  call        682BABE8 
0000009c  push        edx 
0000009d  push        eax 
0000009e  push        0 
000000a0  push        2710h 
000000a5  call        6B0A47EC 
000000aa  mov         dword ptr [edi+4],eax 
000000ad  mov         dword ptr [edi+8],edx 
000000b0  mov         esi,edi 
000000b2  call        6A506F40 
000000b7  push        ebx 
000000b8  push        esi 
000000b9  mov         ecx,eax 
000000bb  mov         edx,dword ptr ds:[038E2034h] 
000000c1  mov         eax,dword ptr [ecx] 
000000c3  mov         eax,dword ptr [eax+3Ch] 
000000c6  call        dword ptr [eax+1Ch] 
000000c9  pop         ebx 
        }
000000ca  pop         esi 
000000cb  pop         edi 
000000cc  pop         ebp 
000000cd  ret

#1