I'm performing matrix multiplication with this simple algorithm. To be more flexible, I used objects for the matrices, which contain dynamically created arrays.
Comparing this solution to my first one with static arrays, it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);
    for (int i = 0; i < a.dim(); i++)
        for (int j = 0; j < a.dim(); j++) {
            int sum = 0;
            for (int k = 0; k < a.dim(); k++)
                sum += a(i,k) * b(k,j);
            c(i,j) = sum;
        }
    return c;
}
EDIT
I corrected my question above! I added the full source code below and tried some of your advice:
- swapped the k and j loop iterations -> performance improvement
- declared dim() and operator()() as inline -> performance improvement
- passing arguments by const reference -> performance loss! Why? So I don't use it.
The performance is now nearly the same as it was in the old program. Maybe there should be a bit more improvement.
But I have another problem: I get a memory error in the function mult_strassen(...). Why?

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW
c99 main.c -o matrix -O3
NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr
g++ main.cpp matrix.cpp -o matrix -O3
EDIT
Here are some results. Comparison between the standard algorithm (std), swapped order of the j and k loops (swap) and the blocked algorithm with block size 13 (block).
7 Answers
#1
1
Pass the parameters by const reference to start with:
matrix mult_std(matrix const& a, matrix const& b) {
To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.
The problem is undoubtedly yours as this problem has been solved a million times before.
Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.
Without the code we are just guessing.
Edit
After fixing the main bug in the original C code (a buffer over-run), I have updated the code to run the tests side by side in a fair comparison:
// INCLUDES -------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

// DEFINES -------------------------------------------------------------------
// The original problem was here. MAXDIM was 500, but we were using arrays
// that had a size of 512 in each dimension. This caused a buffer overrun that
// clobbered the dim variable and reset it to 0. As a result, the
// multiplication loop fell out before it had finished (as the loop was
// controlled by this global variable).
//
// Everything now uses the MAXDIM variable directly.
// This of course gives the C code an advantage, as the compiler can optimize
// the loops explicitly for the fixed-size arrays and thus unroll them more
// efficiently.
#define MAXDIM 512
#define RUNS   10

// MATRIX FUNCTIONS ----------------------------------------------------------
class matrix
{
    public:
        matrix(int dim)
            : dim_(dim)
        {
            data_ = new int[dim_ * dim_];
        }
        ~matrix()
        {
            delete[] data_;   // free the storage (matrices are never copied here)
        }
        inline int dim() const {
            return dim_;
        }
        inline int& operator()(unsigned row, unsigned col) {
            return data_[dim_ * row + col];
        }
        inline int operator()(unsigned row, unsigned col) const {
            return data_[dim_ * row + col];
        }
    private:
        int  dim_;
        int* data_;
};

// ---------------------------------------------------
void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++)
        for (int c = 0; c < MAXDIM; c++)
            matrix[r][c] = rand() % 100;
}

void random_matrix_class(matrix& matrix) {
    for (int r = 0; r < matrix.dim(); r++)
        for (int c = 0; c < matrix.dim(); c++)
            matrix(r, c) = rand() % 100;
}

template<typename T, typename M>
float run(T f, M const& a, M const& b, M& c)
{
    float time = 0;
    for (int i = 0; i < RUNS; i++) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        f(a, b, c);
        gettimeofday(&end, NULL);
        long s = start.tv_sec * 1000 + start.tv_usec / 1000;
        long e = end.tv_sec   * 1000 + end.tv_usec   / 1000;
        time += e - s;
    }
    return time / RUNS;
}

// SEQ MULTIPLICATION ----------------------------------------------------------
void mult_seq(int const (&a)[MAXDIM][MAXDIM], int const (&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
    for (int r = 0; r < MAXDIM; r++) {
        for (int c = 0; c < MAXDIM; c++) {
            z[r][c] = 0;
            for (int i = 0; i < MAXDIM; i++)
                z[r][c] += a[r][i] * b[i][c];
        }
    }
}

void mult_std(matrix const& a, matrix const& b, matrix& z) {
    for (int r = 0; r < a.dim(); r++) {
        for (int c = 0; c < a.dim(); c++) {
            z(r,c) = 0;
            for (int i = 0; i < a.dim(); i++)
                z(r,c) += a(r,i) * b(i,c);
        }
    }
}

// MAIN ------------------------------------------------------------------------
using namespace std;

int main(int argc, char* argv[]) {
    srand(time(NULL));
    int matrix_a[MAXDIM][MAXDIM];
    int matrix_b[MAXDIM][MAXDIM];
    int matrix_c[MAXDIM][MAXDIM];
    random_matrix(matrix_a);
    random_matrix(matrix_b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

    matrix a(MAXDIM);
    matrix b(MAXDIM);
    matrix c(MAXDIM);
    random_matrix_class(a);
    random_matrix_class(b);
    printf("%d ", MAXDIM);
    printf("%f \n", run(mult_std, a, b, c));
    return 0;
}
The results now:
$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000
$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000
From this we see that the C code is about twice as fast as the C++ code when fully optimized. I cannot see the reason for this in the code.
#2
23
Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);    // c must start out zero-filled for the += below
    for (int i = 0; i < a.dim(); i++)
        for (int k = 0; k < a.dim(); k++)
            for (int j = 0; j < a.dim(); j++)   // swapped order
                c(i,j) += a(i,k) * b(k,j);
    return c;
}
That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.
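To make the access-pattern difference concrete, here is a minimal, self-contained sketch (flat std::vector storage and function names are illustrative, not the question's class) of both orderings; the swapped version walks b and c contiguously while reusing one element of a:

```cpp
#include <vector>
#include <cassert>

// Classic i-j-k order: the inner k loop strides down a column of b,
// missing the cache on every step for large n.
std::vector<int> mult_ijk(const std::vector<int>& a, const std::vector<int>& b, int n) {
    std::vector<int> c(n * n, 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
    return c;
}

// Swapped i-k-j order: the inner j loop walks rows of b and c
// contiguously, and a(i,k) is loaded once per inner loop.
std::vector<int> mult_ikj(const std::vector<int>& a, const std::vector<int>& b, int n) {
    std::vector<int> c(n * n, 0);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const int aik = a[i*n + k];          // reused across the whole inner loop
            for (int j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];  // contiguous access to b and c
        }
    return c;
}
```

Both orderings compute the same product; only the memory traffic differs.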
#3
4
You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.
#4
3
Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).
How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple of cache lines, you'd be better off with a block access pattern instead of a full row at a time.
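As a sketch of what such a block access pattern could look like (the flat-vector storage, function names, and default block size are illustrative, not the question's code):

```cpp
#include <vector>
#include <algorithm>

// Reference triple loop, used here to check the blocked version.
std::vector<int> mult_naive(const std::vector<int>& a, const std::vector<int>& b, int n) {
    std::vector<int> c(n * n, 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
    return c;
}

// Cache-blocked (tiled) multiply of n x n row-major matrices.
// bs is the tile edge; tune it so three bs x bs tiles fit in L1 cache.
std::vector<int> mult_block(const std::vector<int>& a, const std::vector<int>& b,
                            int n, int bs = 13) {
    std::vector<int> c(n * n, 0);
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                // accumulate the (ii,kk) tile of a times the (kk,jj) tile of b
                for (int i = ii; i < std::min(ii + bs, n); i++)
                    for (int k = kk; k < std::min(kk + bs, n); k++) {
                        const int aik = a[i*n + k];
                        for (int j = jj; j < std::min(jj + bs, n); j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
    return c;
}
```

The result is identical to the naive loop; the std::min guards handle dimensions that aren't a multiple of the block size.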
#5
3
- Use SIMD if you can. You absolutely have to use something like VMX registers if you do extensive vector math, assuming you are using a platform that is capable of doing so; otherwise you will incur a huge performance hit.
- Don't pass complex types like matrix by value - use a const reference.
- Don't call a function in each iteration - cache dim() outside your loops.
- Although compilers typically optimize this efficiently, it's often a good idea to have the caller provide a matrix reference for your function to fill out rather than returning a matrix by type. In some cases, this may result in an expensive copy operation.
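The last two points can be sketched together; the tiny matrix struct below is a stand-in with an interface assumed to match the question's class, and mult_into is an illustrative name:

```cpp
#include <vector>

// Minimal stand-in for the question's matrix class (assumed interface).
struct matrix {
    int n;
    std::vector<int> d;                  // zero-initialized, contiguous storage
    explicit matrix(int dim) : n(dim), d(dim * dim, 0) {}
    int dim() const { return n; }
    int& operator()(int r, int c) { return d[r*n + c]; }
    int operator()(int r, int c) const { return d[r*n + c]; }
};

// Fill a caller-provided result instead of returning by value,
// and hoist dim() out of the loops.
void mult_into(const matrix& a, const matrix& b, matrix& c) {
    const int n = a.dim();               // called once, not once per iteration
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const int aik = a(i, k);
            for (int j = 0; j < n; j++)
                c(i, j) += aik * b(k, j);   // c starts zeroed by its constructor
        }
}
```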
#6
1
Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock's code since it spares some increments.
#include <cstring>   // for memset

static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
    memset(dest, 0x0, dim * dim * sizeof(float));
    for (int i = 0; i < dim; i++) {
        for (int k = 0; k < dim; k++)
        {
            const float* a = srcA + i * dim + k;    // fixed element a(i,k)
            const float* b = srcB + k * dim;        // row k of b
            float* c    = dest + i * dim;           // row i of dest
            float* cMax = c + dim;
            while (c < cMax)
            {
                *c++ += (*a) * (*b++);
            }
        }
    }
}
#7
0
I'm taking a wild guess here, but if dynamically allocating the matrices makes such a huge difference, maybe the problem is fragmentation. Again, I've no idea how the underlying matrix is implemented.
Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself?
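One common way to do that is a single contiguous block for the elements plus a separate row-pointer table; the struct and names below are only a sketch of the idea (copying is deliberately not handled here):

```cpp
// One contiguous allocation for all elements, plus a row-pointer table,
// so m[r][c] syntax still works while the data stays cache-friendly.
struct grid {
    int   n;
    int*  storage;   // n*n elements in one block
    int** rows;      // rows[r] points into storage, no per-row new[]

    explicit grid(int dim) : n(dim) {
        storage = new int[n * n]();      // zero-initialized, single allocation
        rows = new int*[n];
        for (int r = 0; r < n; r++)
            rows[r] = storage + r * n;   // adjacent rows, no gaps
    }
    ~grid() {
        delete[] rows;
        delete[] storage;
    }
    int* operator[](int r) { return rows[r]; }
};
```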
Also, does the dim() method have any extra complexity? I would declare it inline, too.