I am rephrasing this question based on the comments received.
I have a loop that runs 30 billion times (10 passes over a 3,000,000,000-byte buffer) and writes values into a block of memory allocated with malloc().
When the loop body contains a condition, it runs much slower than when the condition is absent. Compare the two scenarios below:
Scenario A: the condition is present and the program is slow (43 sec)
Scenario B: the condition is absent and the program is much faster (4 sec)
// gcc -O3 -c block.c && gcc -o block block.o
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   // time()

#define LEN 3000000000

int main (int argc, char** argv){
    long i, j;
    unsigned char *n = NULL;
    unsigned char *m = NULL;

    m = (unsigned char *) malloc (sizeof(char) * LEN);
    if (m == NULL){   // a 3 GB allocation can fail
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    n = m;
    srand ((unsigned) time(NULL));
    time_t t = time(NULL);   // start time in seconds

    for (j = 0; j < 10; j++){
        n = m;
        for (i = 0; i < LEN; i++){
            //////////// A: THIS IS SLOW
            /*
            if (i % 2){
                *n = 1;
            } else {
                *n = 0;
            }
            */
            /////////// END OF A
            /////////// B: THIS IS FAST
            *n = 0;
            i % 2;   // no effect: the result is discarded
            *n = 1;
            /////////// END OF B
            n += 1;
        }
    }
    printf("Done. %ld sec \n", (long) (time(NULL) - t));
    free(m);
    return 0;
}
Regards, KD
1 answer
#1
You can use gcc -S -O3 to look at the generated assembly. Here is an example from an Intel machine:
Fast version:
movl %eax, %r12d
.p2align 4,,10
.p2align 3
.L2:
movl $3000000000, %edx
movl $1, %esi
movq %rbp, %rdi
call memset
subq $1, %rbx
jne .L2
Slow version:
movl $10, %edi
movl %eax, %ebp
movl $3000000000, %esi
.p2align 4,,10
.p2align 3
.L2:
xorl %edx, %edx
.p2align 4,,10
.p2align 3
.L5:
movq %rdx, %rcx
andl $1, %ecx
movb %cl, (%rbx,%rdx)
addq $1, %rdx
cmpq %rsi, %rdx
jne .L5
subq $1, %rdi
jne .L2
Conclusion: the compiler is smarter than you think. In scenario B every byte ends up as 1, so the compiler can replace the whole inner loop with a single memset call, which is faster because memset uses SSE/AVX or REP instructions on Intel. With the condition kept, each byte depends on the parity of i, so the buffer no longer holds one repeated value and the memset optimization cannot kick in. The compiler does remove the branch (andl $1, %ecx), but the loop still stores one byte per iteration.
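To make the difference concrete, here is a minimal sketch in C of what the two loop bodies compute. The function names are hypothetical and not part of the original program. fill_loop_b mirrors scenario B and produces the same buffer as a plain memset, while fill_loop_a mirrors scenario A and produces an alternating 0, 1 pattern that no single memset can express. fill_pattern_a is an assumed alternative that writes the pattern eight bytes at a time; whether it is actually faster on a given machine would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scenario B: two unconditional stores; only the last one survives. */
void fill_loop_b(unsigned char *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        buf[i] = 0;   /* dead store, overwritten on the next line */
        buf[i] = 1;
    }
}

/* The form an optimizer can reduce fill_loop_b to. */
void fill_memset_b(unsigned char *buf, size_t len) {
    assert(buf != NULL);
    memset(buf, 1, len);
}

/* Scenario A: the stored value depends on the parity of the index. */
void fill_loop_a(unsigned char *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        buf[i] = (unsigned char)(i % 2);
    }
}

/* Hypothetical helper: same result as fill_loop_a, written in 8-byte chunks. */
void fill_pattern_a(unsigned char *buf, size_t len) {
    static const unsigned char pattern[8] = {0, 1, 0, 1, 0, 1, 0, 1};
    size_t i = 0;
    assert(buf != NULL);
    /* Each chunk starts at an even index, so pattern[0] == 0 lines up. */
    for (; i + 8 <= len; i += 8) {
        memcpy(buf + i, pattern, 8);
    }
    /* Remaining tail of fewer than 8 bytes. */
    for (; i < len; i++) {
        buf[i] = (unsigned char)(i % 2);
    }
}
```

Calling fill_loop_b and fill_memset_b on two buffers of the same length and comparing them with memcmp should show identical contents, and the same holds for fill_loop_a and fill_pattern_a.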