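When writing an if test, you can use GCC's __builtin_expect to help the compiler optimize the code and make it run faster. In the GCC manual, the built-in itself is described under "Other Built-in Functions Provided by GCC", and the optimization levels used below are described under "3.10 Options That Control Optimization". The principle is this: a CPU executes instructions in a pipeline, and each instruction roughly passes through "fetch --> decode --> execute". If the execute stage finds that a jump is needed, the pipeline is flushed and "fetch --> decode --> execute" starts over from the new address, which lowers execution efficiency. So the less often the code jumps (that is, the less often the pipeline is flushed), the more efficiently it runs. Let's analyze this with a simple program.

    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void func1(int a)
    {
        int b;

        if (unlikely(a >= 0)) {
            b = a + 1;
            printf("b = %d\n", b);
        } else {
            b = a + 2;
            printf("b = %d\n", b);
        }
    }

    void func2(int a)
    {
        int b;

        if (likely(a >= 0)) {
            b = a + 1;
            printf("b = %d\n", b);
        } else {
            b = a + 2;
            printf("b = %d\n", b);
        }
    }

    int main(int argc, const char *argv[])
    {
        int a = 0;

        /* expects input of the form "a = 5" */
        scanf("a = %d", &a);

        func1(a);
        func2(a);

        return 0;
    }

likely(x) is for cases where x is more likely to be true, and unlikely(x) is for cases where x is more likely to be false. The ultimate aim of both macros is to keep jumps to a minimum, because every jump flushes the pipeline and reduces efficiency.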
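A common place for these hints is an error check inside a hot loop. The sketch below is only an illustration: sum_non_negative is a hypothetical helper that is not part of the example program, and it assumes that negative elements are rare in the caller's data. It also assumes a compiler that provides __builtin_expect, such as GCC or Clang.

```c
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sums the elements of data.
 * Returns -1 as soon as a negative element is found.
 * Assumption: negative elements are rare, so the error path is
 * marked unlikely and the addition stays on the expected path. */
long sum_non_negative(const int *data, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (unlikely(data[i] < 0))
            return -1;      /* rare error path */
        sum += data[i];     /* expected path */
    }
    return sum;
}
```

The hint only describes which outcome is expected. The function returns the same result whether or not the hint matches the data.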
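For the optimization above to take effect, an optimization level has to be specified, because the default is -O0, which performs no optimization. Here is the disassembly at -O0: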
    00000000004005bc <func1>:
      4005bc:  a9bd7bfd  stp x29, x30, [sp, #-48]!
      4005c0:  910003fd  mov x29, sp
      4005c4:  b9001fa0  str w0, [x29, #28]
      4005c8:  b9401fa0  ldr w0, [x29, #28]
      4005cc:  2a2003e0  mvn w0, w0
      4005d0:  531f7c00  lsr w0, w0, #31
      4005d4:  12001c00  and w0, w0, #0xff
      4005d8:  92401c00  and x0, x0, #0xff
      4005dc:  f100001f  cmp x0, #0x0
      4005e0:  54000120  b.eq 400604 <func1+0x48> // b.none
      4005e4:  b9401fa0  ldr w0, [x29, #28]
      4005e8:  11000400  add w0, w0, #0x1
      4005ec:  b9002fa0  str w0, [x29, #44]
      4005f0:  90000000  adrp x0, 400000 <_init-0x430>
      4005f4:  911e4000  add x0, x0, #0x790
      4005f8:  b9402fa1  ldr w1, [x29, #44]
      4005fc:  97ffffad  bl 4004b0 <printf@plt>
      400600:  14000008  b 400620 <func1+0x64>
      400604:  b9401fa0  ldr w0, [x29, #28]
      400608:  11000800  add w0, w0, #0x2
      40060c:  b9002fa0  str w0, [x29, #44]
      400610:  90000000  adrp x0, 400000 <_init-0x430>
      400614:  911e4000  add x0, x0, #0x790
      400618:  b9402fa1  ldr w1, [x29, #44]
      40061c:  97ffffa5  bl 4004b0 <printf@plt>
      400620:  d503201f  nop
      400624:  a8c37bfd  ldp x29, x30, [sp], #48
      400628:  d65f03c0  ret

    000000000040062c <func2>:
      40062c:  a9bd7bfd  stp x29, x30, [sp, #-48]!
      400630:  910003fd  mov x29, sp
      400634:  b9001fa0  str w0, [x29, #28]
      400638:  b9401fa0  ldr w0, [x29, #28]
      40063c:  2a2003e0  mvn w0, w0
      400640:  531f7c00  lsr w0, w0, #31
      400644:  12001c00  and w0, w0, #0xff
      400648:  92401c00  and x0, x0, #0xff
      40064c:  f100001f  cmp x0, #0x0
      400650:  54000120  b.eq 400674 <func2+0x48> // b.none
      400654:  b9401fa0  ldr w0, [x29, #28]
      400658:  11000400  add w0, w0, #0x1
      40065c:  b9002fa0  str w0, [x29, #44]
      400660:  90000000  adrp x0, 400000 <_init-0x430>
      400664:  911e4000  add x0, x0, #0x790
      400668:  b9402fa1  ldr w1, [x29, #44]
      40066c:  97ffff91  bl 4004b0 <printf@plt>
      400670:  14000008  b 400690 <func2+0x64>
      400674:  b9401fa0  ldr w0, [x29, #28]
      400678:  11000800  add w0, w0, #0x2
      40067c:  b9002fa0  str w0, [x29, #44]
      400680:  90000000  adrp x0, 400000 <_init-0x430>
      400684:  911e4000  add x0, x0, #0x790
      400688:  b9402fa1  ldr w1, [x29, #44]
      40068c:  97ffff89  bl 4004b0 <printf@plt>
      400690:  d503201f  nop
      400694:  a8c37bfd  ldp x29, x30, [sp], #48
      400698:  d65f03c0  ret
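As you can see, the disassembly follows the C logic exactly, step by step, and func1 and func2 are laid out identically. The optimization macros above have had no effect at all.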
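Because the hints do nothing in an unoptimized build, it can be handy to check at compile time whether optimization is on. A minimal sketch, assuming GCC or Clang, both of which predefine the __OPTIMIZE__ macro in optimizing compilations; hints_can_take_effect is a hypothetical name used only for this illustration.

```c
/* Hypothetical helper: reports whether this file was compiled with
 * optimization enabled.
 * Assumption: the compiler predefines __OPTIMIZE__ when optimizing
 * (GCC and Clang do); at -O0 the macro is not defined. */
int hints_can_take_effect(void)
{
#ifdef __OPTIMIZE__
    return 1;   /* built with an optimizing level such as -O1 or -O2 */
#else
    return 0;   /* built at -O0: likely()/unlikely() change nothing */
#endif
}
```

This only tells you that the optimizer is running. Whether a particular hint changes the generated code still has to be checked in the disassembly.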
Let's first see what -O1 does. GCC describes -O and -O1 as follows: the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
aarch64-linux-gnu-gcc predict.c -o predict -O1
aarch64-linux-gnu-objdump -D predict > predict.S
Here is the resulting disassembly of func1 and func2:
    00000000004005bc <func1>:
      4005bc:  a9bf7bfd  stp x29, x30, [sp, #-16]!
      4005c0:  910003fd  mov x29, sp
      4005c4:  36f800e0  tbz w0, #31, 4005e0 <func1+0x24>
      4005c8:  11000801  add w1, w0, #0x2
      4005cc:  90000000  adrp x0, 400000 <_init-0x430>
      4005d0:  911c6000  add x0, x0, #0x718
      4005d4:  97ffffb7  bl 4004b0 <printf@plt>
      4005d8:  a8c17bfd  ldp x29, x30, [sp], #16
      4005dc:  d65f03c0  ret
      4005e0:  11000401  add w1, w0, #0x1
      4005e4:  90000000  adrp x0, 400000 <_init-0x430>
      4005e8:  911c6000  add x0, x0, #0x718
      4005ec:  97ffffb1  bl 4004b0 <printf@plt>
      4005f0:  17fffffa  b 4005d8 <func1+0x1c>

    00000000004005f4 <func2>:
      4005f4:  a9bf7bfd  stp x29, x30, [sp, #-16]!
      4005f8:  910003fd  mov x29, sp
      4005fc:  37f800e0  tbnz w0, #31, 400618 <func2+0x24>
      400600:  11000401  add w1, w0, #0x1
      400604:  90000000  adrp x0, 400000 <_init-0x430>
      400608:  911c6000  add x0, x0, #0x718
      40060c:  97ffffa9  bl 4004b0 <printf@plt>
      400610:  a8c17bfd  ldp x29, x30, [sp], #16
      400614:  d65f03c0  ret
      400618:  11000801  add w1, w0, #0x2
      40061c:  90000000  adrp x0, 400000 <_init-0x430>
      400620:  911c6000  add x0, x0, #0x718
      400624:  97ffffa3  bl 4004b0 <printf@plt>
      400628:  17fffffa  b 400610 <func2+0x1c>
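Now the two functions differ. In func1, where a >= 0 is marked unlikely, tbz tests the sign bit (bit 31) and jumps to the a + 1 block at the end of the function only when a >= 0, so the expected a < 0 case runs straight through without a jump. In func2, where a >= 0 is marked likely, tbnz jumps away only when a < 0, and the a + 1 block is the straight-line path. Of course, if likely and unlikely do not match what actually happens at run time, execution efficiency gets worse instead.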
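A wrong hint costs speed, not correctness, because __builtin_expect simply returns the value of its first argument. The sketch below illustrates this with three hypothetical functions (step_plain, step_likely and step_unlikely are names made up for this example) that compute the same result as the branches in func1 and func2, with and without a hint.

```c
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* No hint: the compiler picks the code layout on its own. */
int step_plain(int a)
{
    if (a >= 0)
        return a + 1;
    return a + 2;
}

/* Hint says a >= 0 is the common case. */
int step_likely(int a)
{
    if (likely(a >= 0))
        return a + 1;
    return a + 2;
}

/* Hint says a >= 0 is the rare case. */
int step_unlikely(int a)
{
    if (unlikely(a >= 0))
        return a + 1;
    return a + 2;
}
```

All three functions return the same value for any input. Only the placement of the two branches in the generated code may differ, and that depends on the compiler and the optimization level.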
Next, let's look at how the other optimization levels affect the generated machine code:
-O2: Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to ‘-O’, this option increases both compilation time and the performance of the generated code.

    00000000004005f8 <func1>:
      4005f8:  90000002  adrp x2, 400000 <_init-0x430>
      4005fc:  36f80080  tbz w0, #31, 40060c <func1+0x14>
      400600:  11000801  add w1, w0, #0x2
      400604:  911ba040  add x0, x2, #0x6e8
      400608:  17ffffaa  b 4004b0 <printf@plt>
      40060c:  11000401  add w1, w0, #0x1
      400610:  911ba040  add x0, x2, #0x6e8
      400614:  17ffffa7  b 4004b0 <printf@plt>

    0000000000400618 <func2>:
      400618:  90000002  adrp x2, 400000 <_init-0x430>
      40061c:  37f80080  tbnz w0, #31, 40062c <func2+0x14>
      400620:  11000401  add w1, w0, #0x1
      400624:  911ba040  add x0, x2, #0x6e8
      400628:  17ffffa2  b 4004b0 <printf@plt>
      40062c:  11000801  add w1, w0, #0x2
      400630:  911ba040  add x0, x2, #0x6e8
      400634:  17ffff9f  b 4004b0 <printf@plt>
-O3: Optimize yet more. ‘-O3’ turns on all optimizations specified by ‘-O2’ and also turns on more optimization flags

    00000000004005f8 <func1>:
      4005f8:  90000002  adrp x2, 400000 <_init-0x430>
      4005fc:  36f80080  tbz w0, #31, 40060c <func1+0x14>
      400600:  11000801  add w1, w0, #0x2
      400604:  911ba040  add x0, x2, #0x6e8
      400608:  17ffffaa  b 4004b0 <printf@plt>
      40060c:  11000401  add w1, w0, #0x1
      400610:  911ba040  add x0, x2, #0x6e8
      400614:  17ffffa7  b 4004b0 <printf@plt>

    0000000000400618 <func2>:
      400618:  90000002  adrp x2, 400000 <_init-0x430>
      40061c:  37f80080  tbnz w0, #31, 40062c <func2+0x14>
      400620:  11000401  add w1, w0, #0x1
      400624:  911ba040  add x0, x2, #0x6e8
      400628:  17ffffa2  b 4004b0 <printf@plt>
      40062c:  11000801  add w1, w0, #0x2
      400630:  911ba040  add x0, x2, #0x6e8
      400634:  17ffff9f  b 4004b0 <printf@plt>
-Os: Optimize for size. ‘-Os’ enables all ‘-O2’ optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

    00000000004005f4 <func1>:
      4005f4:  90000002  adrp x2, 400000 <_init-0x430>
      4005f8:  37f80080  tbnz w0, #31, 400608 <func1+0x14>
      4005fc:  11000401  add w1, w0, #0x1
      400600:  911b8040  add x0, x2, #0x6e0
      400604:  17ffffab  b 4004b0 <printf@plt>
      400608:  11000801  add w1, w0, #0x2
      40060c:  17fffffd  b 400600 <func1+0xc>

    0000000000400610 <func2>:
      400610:  90000002  adrp x2, 400000 <_init-0x430>
      400614:  37f80080  tbnz w0, #31, 400624 <func2+0x14>
      400618:  11000401  add w1, w0, #0x1
      40061c:  911b8040  add x0, x2, #0x6e0
      400620:  17ffffa4  b 4004b0 <printf@plt>
      400624:  11000801  add w1, w0, #0x2
      400628:  17fffffd  b 40061c <func2+0xc>
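-Os mainly optimizes for code size (the two functions disassemble to the fewest instructions here, seven each), but it falls a little short on execution efficiency. Both functions use the same tbnz test and the same layout, so likely and unlikely have no optimizing effect on the code at this level.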
The end.