I have heard a lot of people here say that C++ is as fast as or faster than C at everything, but cleaner and nicer.
While I do not dispute that C++ is very elegant and quite fast, I have not found a replacement for C in applications where memory access is critical or that are processor-bound.
Question: is there an equivalent in C++ for C-style arrays in terms of performance?
The example below is contrived, but I am interested in a solution for real-life problems: I develop image processing apps, and the amount of pixel processing there is huge.
double t;

// C++
std::vector<int> v;
v.resize(1000000,1);
int i, j, count = 0, size = v.size();
t = (double)getTickCount();
for(j=0;j<1000;j++)
{
    count = 0;
    for(i=0;i<size;i++)
        count += v[i];
}
t = ((double)getTickCount() - t)/getTickFrequency();
std::cout << "(C++) For loop time [s]: " << t/1.0 << std::endl;
std::cout << count << std::endl;

// C-style
#define ARR_SIZE 1000000
int* arr = (int*)malloc( ARR_SIZE * sizeof(int) );
int ci, cj, ccount = 0, csize = ARR_SIZE;
for(ci=0;ci<csize;ci++)
    arr[ci] = 1;
t = (double)getTickCount();
for(cj=0;cj<1000;cj++)
{
    ccount = 0;
    for(ci=0;ci<csize;ci++)
        ccount += arr[ci];
}
free(arr);
t = ((double)getTickCount() - t)/getTickFrequency();
std::cout << "(C) For loop time [s]: " << t/1.0 << std::endl;
std::cout << ccount << std::endl;
Here is the result:
(C++) For loop time [s]: 0.329069
(C) For loop time [s]: 0.229961
Note: getTickCount() comes from a third-party lib. If you want to test, just replace it with your favourite clock measurement.
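For readers who want to reproduce the measurement without the third-party library, one possible stand-in is a small timer built on std::chrono::steady_clock. This is only a sketch: secondsTaken is a made-up name, and it assumes a standard library that provides <chrono> (C++11 or later).

```cpp
#include <chrono>

// Hypothetical helper (not part of the original post): runs `work` once
// and returns the elapsed wall-clock time in seconds.
template <typename Work>
double secondsTaken(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Each of the two loops could be wrapped in a lambda and passed to secondsTaken in place of the getTickCount()/getTickFrequency() pair.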
Update:
I am using VS 2010 in Release mode, with everything else at its default.
6 answers
#1
11
Question: is there an equivalent in C++ for C-style arrays in terms of performance?
Answer: Write C++ code! Know your language, know your standard library and use it. Standard algorithms are correct, readable and fast (their implementers know best how to make them fast on the current compiler).
void testC()
{
    // unchanged
}

void testCpp()
{
    // unchanged initialization
    for(j=0;j<1000;j++)
    {
        // how a C++ programmer accumulates:
        count = std::accumulate(begin(v), end(v), 0);
    }
    // unchanged output
}

int main()
{
    testC();
    testCpp();
}
Output:
(C) For loop time [ms]: 434.373
1000000
(C++) For loop time [ms]: 419.79
1000000
Compiled with g++ -O3 -std=c++0x, version 4.6.3, on Ubuntu.
For your code, my output is similar to yours. user1202136 gives a good answer about the differences...
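To make the comparison concrete, here is a minimal sketch of the two ways of summing the vector side by side. The function names are made up for illustration; the point is only that std::accumulate computes the same reduction as the hand-written loop, leaving the iteration strategy to the library.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hand-written indexed loop, as in the question.
int sumWithLoop(const std::vector<int>& v)
{
    int count = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        count += v[i];
    return count;
}

// The same reduction expressed with the standard algorithm.
int sumWithAccumulate(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0);
}
```

The third argument of std::accumulate is the initial value and also fixes the result type, so 0 here means the sum is accumulated as an int.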
#2
12
Simple answer: Your benchmark is flawed.
Longer answer: You need to turn on full optimization to get the C++ performance advantage. Yet your benchmark is still flawed.
Some observations:
- If you turn on full optimization, a very large chunk of the for loop will be removed. This makes your benchmark meaningless.
- std::vector has overhead for dynamic reallocation; try std::array. To be specific, Microsoft's STL has checked iterators enabled by default.
- You don't have any barrier to prevent reordering between the C code, the C++ code and the benchmark code.
- (Not really related) cout << ccount is locale aware, printf is not; std::endl flushes the output, printf("\n") does not.
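One common way to reduce the first problem is to publish each pass's result through a volatile object, so the compiler cannot simply discard the work. The sketch below assumes that approach; g_sink and repeatedSum are hypothetical names, and this does not rule out every transformation (the compiler may still be able to hoist or simplify the inner loop).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sink: stores to a volatile object are observable behaviour,
// so the compiler has to perform every one of them.
volatile int g_sink = 0;

// Sums `v` `repeats` times and stores the result of each pass in g_sink.
int repeatedSum(const std::vector<int>& v, int repeats)
{
    int count = 0;
    for (int j = 0; j < repeats; ++j) {
        count = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            count += v[i];
        g_sink = count;  // this store cannot be eliminated
    }
    return count;
}
```

Timing calls placed around repeatedSum then measure a loop whose results the program visibly depends on.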
The "traditional" code for showing the C++ advantage is C qsort() vs C++ std::sort(). This is where code inlining shines.
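A minimal sketch of that comparison follows. The function names are invented for illustration. qsort() calls its comparator through a function pointer for every comparison, whereas std::sort() is a template, so the comparison on int is visible at the call site and can be inlined.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Comparator for qsort: invoked through a function pointer, which the
// compiler usually cannot inline into the sorting loop.
int compareInts(const void* a, const void* b)
{
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow risk of x - y
}

void sortWithQsort(std::vector<int>& v)
{
    if (!v.empty())
        std::qsort(&v[0], v.size(), sizeof(int), compareInts);
}

// std::sort is instantiated for int, so operator< can be inlined.
void sortWithStdSort(std::vector<int>& v)
{
    std::sort(v.begin(), v.end());
}
```

Both functions produce the same ordering; any speed difference has to be measured on the compiler and data at hand.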
If you want a "real-life" application example, search for a raytracer or some matrix multiplication code, and pick a compiler that does auto-vectorization.
Update: Using the LLVM online demo, we can see that the whole loop is reordered. The benchmark code is moved to the start, and it jumps to the loop ending point in the first loop for better branch prediction:
(this is C++ code)
######### jump to the loop end
jg .LBB0_11
.LBB0_3: # %..split_crit_edge
.Ltmp2:
# print the benchmark result
movl $0, 12(%esp)
movl $25, 8(%esp)
movl $.L.str, 4(%esp)
movl std::cout, (%esp)
calll std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
.Ltmp3:
# BB#4: # %_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc.exit
.Ltmp4:
movl std::cout, (%esp)
calll std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<double>(double)
.Ltmp5:
# BB#5: # %_ZNSolsEd.exit
movl %eax, %ecx
movl %ecx, 28(%esp) # 4-byte Spill
movl (%ecx), %eax
movl -24(%eax), %eax
movl 240(%eax,%ecx), %ebp
testl %ebp, %ebp
jne .LBB0_7
# BB#6:
.Ltmp52:
calll std::__throw_bad_cast()
.Ltmp53:
.LBB0_7: # %.noexc41
cmpb $0, 28(%ebp)
je .LBB0_15
# BB#8:
movb 39(%ebp), %al
jmp .LBB0_21
.align 16, 0x90
.LBB0_9: # Parent Loop BB0_11 Depth=1
# => This Inner Loop Header: Depth=2
addl (%edi,%edx,4), %ebx
addl $1, %edx
adcl $0, %esi
cmpl %ecx, %edx
jne .LBB0_9
# BB#10: # in Loop: Header=BB0_11 Depth=1
incl %eax
cmpl $1000, %eax # imm = 0x3E8
######### jump back to the print benchmark code
je .LBB0_3
My test code:
std::vector<int> v;
v.resize(1000000,1);
int i, j, count = 0, size = v.size();
for(j=0;j<1000;j++)
{
    count = 0;
    for(i=0;i<size;i++)
        count += v[i];
}
std::cout << "(C++) For loop time [s]: " << t/1.0 << std::endl;
std::cout << count << std::endl;
#3
8
It seems to be a compiler problem. For C arrays, the compiler detects the pattern, uses auto-vectorization and emits SSE instructions. For the vector, it seems to lack the necessary intelligence.
If I force the compiler not to use SSE, the results are very similar (tested with g++ -mno-mmx -mno-sse -msoft-float -O3):
(C++) For loop time [us]: 604610
1000000
(C) For loop time [us]: 601493
1000000
Here is the code that generated this output. It is basically the code in your question, but without any floating point.
#include <iostream>
#include <vector>
#include <sys/time.h>

using namespace std;

// Returns the current time in microseconds.
long getTickCount()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000 + tv.tv_usec;
}

int main() {
    long t;

    // C++
    std::vector<int> v;
    v.resize(1000000,1);
    int i, j, count = 0, size = v.size();
    t = getTickCount();
    for(j=0;j<1000;j++)
    {
        count = 0;
        for(i=0;i<size;i++)
            count += v[i];
    }
    t = getTickCount() - t;
    std::cout << "(C++) For loop time [us]: " << t << std::endl;
    std::cout << count << std::endl;

    // C-style
    #define ARR_SIZE 1000000
    int* arr = new int[ARR_SIZE];
    int ci, cj, ccount = 0, csize = ARR_SIZE;
    for(ci=0;ci<csize;ci++)
        arr[ci] = 1;
    t = getTickCount();
    for(cj=0;cj<1000;cj++)
    {
        ccount = 0;
        for(ci=0;ci<csize;ci++)
            ccount += arr[ci];
    }
    delete[] arr;  // memory from new[] must be released with delete[]
    t = getTickCount() - t;
    std::cout << "(C) For loop time [us]: " << t << std::endl;
    std::cout << ccount << std::endl;
}
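If a given compiler vectorizes the pointer loop but not the vector loop, one workaround that is sometimes tried is to take a raw pointer to the vector's contiguous storage once and iterate over that. The sketch below only shows the shape of that idea; sumRange and sumVector are made-up names, and whether it changes the generated code depends on the compiler and its version.

```cpp
#include <cstddef>
#include <vector>

// Sums a contiguous range through a plain pointer, the same shape
// as the C-array loop in the question.
int sumRange(const int* data, std::size_t size)
{
    int count = 0;
    for (std::size_t i = 0; i < size; ++i)
        count += data[i];
    return count;
}

// Hypothetical wrapper: hands the vector's contiguous storage to sumRange.
int sumVector(const std::vector<int>& v)
{
    return v.empty() ? 0 : sumRange(&v[0], v.size());
}
```

The vector keeps ownership of the memory, so there is no manual new[]/delete[] pairing to get wrong.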
#4
4
The C++ equivalent of a dynamically sized array would be std::vector. The C++ equivalent of a fixed-size array would be std::array, or std::tr1::array pre-C++11.
If your vector code does no resizing, it is hard to see how it could be significantly slower than using a dynamically allocated C array, provided you compile with some optimization enabled.
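As a rough illustration of the two containers, the sketch below sums a std::array whose size is fixed at compile time and a std::vector that is constructed at its final size so it never reallocates. The names and the size 8 are arbitrary choices for the example.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

// Made-up compile-time size for the fixed-size case.
const std::size_t kFixedSize = 8;

// std::array stores its elements inline, with no heap allocation.
int sumFixed()
{
    std::array<int, kFixedSize> a;
    a.fill(1);
    return std::accumulate(a.begin(), a.end(), 0);
}

// Size known only at run time: the vector allocates once at construction
// and is never resized afterwards.
int sumDynamic(std::size_t n)
{
    std::vector<int> v(n, 1);
    return std::accumulate(v.begin(), v.end(), 0);
}
```

In both cases the elements are contiguous in memory, just like a C array.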
Note: running the code as posted, compiled with gcc 4.4.3 on x86 using the compiler options
g++ -Wall -Wextra -pedantic-errors -O2 -std=c++0x
the results are repeatably close to
(C++) For loop time [us]: 507888
1000000
(C) For loop time [us]: 496659
1000000
So the std::vector variant is seemingly ~2% slower after a small number of trials. I'd consider this comparable performance.
#5
0
What you point out is the fact that accessing objects will always come with a little overhead, so accessing a vector won't be faster than accessing a good old array.
But even if using an array is "C-style", it is still C++, so it won't be a problem.
Then, as @juanchopanza said, there is std::array in C++11, which may be more efficient than std::vector, but is specialized for fixed-size arrays.
#6
0
Usually the compiler does all the optimization... You just have to choose a good compiler.