How can I make an inline function more efficient?

Time: 2021-11-07 03:52:29

I profiled my code and found that one inline function accounts for about 8% of the samples. The function converts matrix subscripts to a linear index, much like the MATLAB function sub2ind.

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

I suspect the compiler is not performing inline expansion, but I don't know how to check that.

Is there any way to improve this, or to explicitly make the compiler perform inline expansion?

2 Solutions

#1


Did you remember to compile with optimizations? Some compilers have an attribute to force inlining, even when the compiler doesn't want to: see this question.
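
As a rough sketch of what such an attribute looks like: GCC and Clang accept __attribute__((always_inline)) and MSVC accepts __forceinline. The macro name FORCE_INLINE below is made up for illustration, and whether forcing the issue helps at all is something you would have to measure.

```cpp
// FORCE_INLINE is a hypothetical macro name, not something from the question.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // fallback: an ordinary inline hint only
#endif

// Same body as the function in the question, with the forcing attribute applied.
FORCE_INLINE int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}
```

With an optimizing build this usually changes nothing, because a function this small tends to be inlined anyway.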

But the function has probably been inlined already; you can have your compiler output the assembly code and check for sure that way.

It is not implausible that index calculations take a significant fraction of your time. For example, if your algorithm reads from a matrix, does a little bit of calculation, and then writes the result back, then index calculations really are a significant fraction of your compute time.

Or, you've written your code in a way that the compiler can't prove that width remains constant throughout your loops*, and so it has to reread it from memory every time, just to be sure. Try copying width to a local variable and using that in your inner loops.
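
A minimal sketch of that idea, assuming a matrix type along these lines (the Matrix struct and increment_all are invented for illustration; I don't know what your actual data structure looks like):

```cpp
// Hypothetical matrix type; not taken from the question.
struct Matrix {
    int height;
    int width;
    int* data;  // row-major storage, height * width entries
};

// Adds 1 to every entry. height, width and data are copied into locals first,
// so the compiler does not have to assume that a store through data could
// have modified m.width or m.height between iterations.
inline void increment_all(Matrix& m) {
    const int height = m.height;
    const int width = m.width;
    int* const data = m.data;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            data[r * width + c] += 1;
        }
    }
}
```

If the loop used m.width directly, every write to an int through data might, as far as the compiler can tell, have changed m.width, since both are ints reachable from the same function.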

Now, you've said that this takes 8% of your time. That means you are unlikely to get more than an 8% improvement to your runtime from speeding it up, and probably much less. If that's really worth it, then the thing to do is probably to fundamentally change how you iterate through the array.

e.g.

  • if you tend to access the matrix in a linear fashion, you could write some sort of two-dimensional iterator class that you can advance up, down, left, or right, and it will use additions everywhere instead of multiplication

  • same thing, but writing an "index" class that just holds the numbers rather than pretending to be a pointer

  • if width is a compile-time constant, you could make it explicitly so, e.g. as a template parameter, and your compiler might be able to do more clever things with the multiplication
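
To make the second and third suggestions a bit more concrete, here is a rough sketch of an "index" class whose width is a template parameter. The names Cursor and sum_column are invented for illustration, and it assumes row-major storage with a width known at compile time:

```cpp
// Hypothetical index class: holds a linear index into a row-major matrix
// whose width is a compile-time constant.
template <int Width>
struct Cursor {
    int index;  // linear index of the current element

    // The only multiplication happens once, when the cursor is created.
    static Cursor at(int row, int col) { return Cursor{row * Width + col}; }

    // Moving around uses additions and subtractions only.
    void left()  { index -= 1; }
    void right() { index += 1; }
    void up()    { index -= Width; }
    void down()  { index += Width; }
};

// Sums one column by stepping the cursor down one row at a time.
template <int Width>
int sum_column(const int* data, int height, int col) {
    int total = 0;
    Cursor<Width> cur = Cursor<Width>::at(0, col);
    for (int r = 0; r < height; ++r, cur.down()) {
        total += data[cur.index];
    }
    return total;
}
```

For a 2x3 matrix stored as {0, 1, 2, 3, 4, 5}, sum_column<3>(data, 2, 1) visits indices 1 and 4. Whether this beats what the optimizer already does with the plain multiplication is something to measure rather than assume.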

*: You could have done something silly, like putting the data structure for your matrix in the very memory where you're storing the matrix entries! So when you update the matrix, you might change the width. The compiler has to guard against these loopholes, so it can't do optimizations it 'obviously should' be able to do. And sometimes, the sort of thing that is a loophole in one context can well be the programmer's obvious intent in another context. Generally speaking, these sorts of loopholes tend to be all over the place, and the compiler is better at finding them than humans are at noticing them.

#2


As @user3528438 mentioned, you can look at the assembly output. Consider the following example:

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

int main() {
    volatile int n[] = {1, 2, 3};
    return sub2ind(n[0], n[1], n[2]);
}

Compiling it without optimization (g++ -S test.cc) results in the following code with sub2ind not inlined:

main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $32, %rsp
    movl    $1, -16(%rbp)
    movl    $2, -12(%rbp)
    movl    $3, -8(%rbp)
    movq    -16(%rbp), %rax
    movq    %rax, -32(%rbp)
    movl    -8(%rbp), %eax
    movl    %eax, -24(%rbp)
    movl    -24(%rbp), %edx
    movl    -28(%rbp), %ecx
    movl    -32(%rbp), %eax
    movl    %ecx, %esi
    movl    %eax, %edi
    call    _Z7sub2indiii # call to sub2ind
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

while compiling with optimization (g++ -S -O3 test.cc) results in sub2ind being inlined and mostly optimized away:

main:
.LFB1:
    .cfi_startproc
    movl    $1, -24(%rsp)
    movl    $2, -20(%rsp)
    movq    -24(%rsp), %rax
    movl    $3, -16(%rsp)
    movq    %rax, -40(%rsp)
    movl    $3, -32(%rsp)
    movl    -32(%rsp), %eax
    movl    -36(%rsp), %edx
    movl    -40(%rsp), %ecx
    imull   %ecx, %eax
    addl    %edx, %eax
    ret
    .cfi_endproc

So if you suspect that your function is not being inlined, first make sure that you enable optimization in the compiler options.
