在c++中对vtable查找的性能影响。

I'm evaluating to rewrite a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question parts of the code are absolutely necessary to do in assembly).

我正在评估将一个实时软件从C/汇编语言重写为c++ /汇编语言(原因与代码的问题部分无关，在汇编语言中是绝对必要的)。

An interrupt comes with a 3 kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:

一个中断有3千赫兹的频率，每一个中断大约有200件不同的事情要按顺序进行。处理器以300mhz的速度运行，给我们10万个周期来完成这项工作。用C语言用函数指针数组解决:

// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);

// Array of pointers to structs containing each function's parameters.
void *paramlist[200];

void realtime(void)
{
  int i;
  for (i = 0; i < 200; i++)
    (*todolist[i])(paramlist[i]);
}

Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions, however I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.

速度是非常重要的。上面的200次迭代每秒钟做3000次，所以实际上我们每秒钟做60万次迭代。上面的for循环每次迭代编译5个周期，产生每秒3,000,000个周期的总成本，即1%的CPU负载。汇编程序优化可能会将其减少到4条指令，但是我担心由于内存访问彼此很近等原因，我们可能会得到一些额外的延迟。总之，我认为这5个周期是非常理想的。

Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:

现在来重写一下c++。我们做的这200件事是相互关联的。它们都需要和使用一个参数子集，并且在各自的结构中都有。因此，在c++实现中，它们可以被巧妙地视为继承自公共基类:

class Base
{
  virtual void Execute();
  int something_all_things_need;
}
class Derived1 : Base
{
  void Execute() { /* Do something */ }
  int own_parameter;
  // Other own parameters
}
class Derived2 : Base { /* Etc. */ }

Base *todolist[200];

void realtime(void)
{
  for (int i = 0; i < 200; i++)
    todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}

My problem is the vtable lookup. I cannot do 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover the todolist never changes during run-time, it is only set up once at start-up, so the effort of looking up what function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible", I look at the assembler code given by the C solution, and refind an array of function pointers...

我的问题是vtable查找。我不能每秒查找60万次;这将占浪费CPU负载的4%以上。此外，todolist在运行时不会更改，它只在启动时设置一次，因此查找要调用的函数的工作是真正浪费的。当我问自己“什么是最优的最终结果”时，我查看了C解决方案给出的汇编代码，然后重新找到一个函数指针数组……

What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.

在c++中，最干净、最合适的方法是什么?创建一个好的基类、派生类等等感觉很没有意义，因为最终还是有人出于性能原因选择函数指针。

Update (including correction of where the loop starts):

更新(包括修正回路开始的位置):

The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop is 9 cycles. Assembly-optimizing it in my mind yields 4 cycles, however I didn't test it so it's not guaranteed. The C++ loop is 14 cycles. I'm aware of the compiler could do a better job, however I did not want to dismiss this as a compiler issue - getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here the resulting assembly:

处理器是ADSP-214xx，编译器是VisualDSP+ 5.0。当启用#pragma optimize_for_speed时，C循环是9个循环。在我的头脑中对它进行汇编优化会产生4个循环，但是我没有对它进行测试，所以它没有保证。c++循环是14个循环。我知道编译器可以做得更好，但是我不想把它作为一个编译器问题而忽略——在嵌入式上下文中，不使用多态性仍然是更好的选择，设计选择仍然让我感兴趣。作为参考，这里产生的程序集:

i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;

Here's the actual loop:

这是实际的循环:

r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);

C++ :

c++:

i5=0xb279a;
r15=0xc8;

Here's the actual loop:

这是实际的循环:

i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);

In the meanwhile, I think I have found sort of an answer. The lowest amount of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.

同时，我想我找到了一个答案。最小的周期是通过尽可能不可能实现的。我必须获取一个数据指针，获取一个函数指针，并以数据指针作为参数调用函数。当获取一个指针时，索引寄存器会被一个常量自动修改，我们也可以让这个常量等于1。因此，我们再一次找到了一个函数指针数组和一个数据指针数组。

Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes natural to one to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...

当然，限制是在组装中可以做的事情，这一点现在已经得到了探索。考虑到这一点，我现在明白了，尽管引入基类是很自然的事情，但它实际上并不符合要求。所以我猜答案是，如果你想要一个函数指针的数组，你应该给自己一个函数指针的数组……

6 个解决方案

#1

What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.

什么使您认为vtable查找开销是20个周期?如果这是真的，你需要一个更好的c++编译器。

I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirect.

我在一个英特尔机顶盒上尝试了这个操作，对您正在使用的处理器一无所知，正如预期的那样，C despatch代码和c++ vtable despatch之间的差异是一条指令，与vtable涉及一个额外的间接的事实有关。

C code (based on OP):

C代码(基于OP):

void (*todolist[200])(void *parameters);                                  
void *paramlist[200];
void realtime(void)
{       
  int i;
  for (i = 0; i < 200; i++)                                               
    (*todolist[i])(paramlist[i]);                                         
}

C++ code:

c++代码:

class Base {
  public:
    Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
    virtual void operator()() = 0;
  protected:
    void* unsafe_pointer_;
};

Base* todolist[200];
void realtime() {
  for (int i = 0; i < 200; ++i)
    (*todolist[i])();
}

Both compiled with gcc 4.8, -O3:

两者都使用gcc 4.8编译，-O3:

realtime:                             |_Z8realtimev:
.LFB0:                                |.LFB3:
        .cfi_startproc                |        .cfi_startproc
        pushq   %rbx                  |        pushq   %rbx
        .cfi_def_cfa_offset 16        |        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16            |        .cfi_offset 3, -16
        xorl    %ebx, %ebx            |        movl    $todolist, %ebx
        .p2align 4,,10                |        .p2align 4,,10
        .p2align 3                    |        .p2align 3
.L3:                                  |.L3:
        movq    paramlist(%rbx), %rdi |        movq    (%rbx), %rdi
        call    *todolist(%rbx)       |        addq    $8, %rbx
        addq    $8, %rbx              |        movq    (%rdi), %rax
                                      |        call    *(%rax)
        cmpq    $1600, %rbx           |        cmpq    $todolist+1600, %rbx
        jne     .L3                   |        jne     .L3
        popq    %rbx                  |        popq    %rbx
        .cfi_def_cfa_offset 8         |        .cfi_def_cfa_offset 8
        ret                           |        ret

In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.

在c++代码中，第一个movq获得了vtable的地址，然后通过这个调用索引。这是一个指令开销。

According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).

根据OP, DSP的c++编译器生成以下代码。我插入了基于我对所发生的事情的理解的评论(这可能是错误的)。注意(IMO)循环比OP显示的时间要早一个位置;否则，(对我来说)就没有意义了。

# Initialization.
# i3=todolist; i5=paramlist           | # i5=todolist holds paramlist
i3=0xb27ba;                           | # No paramlist in C++
i5=0xb28e6;                           | i5=0xb279a;
# r15=count
r15=0xc8;                             | r15=0xc8;

# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++;    | i5=modify(i5,m6); # i4 = *todolist++
                                      | i4=dm(m7,i5);     # ..
# Note 2:                            
                                      | r2=i4;            # r2 = obj
                                      | i4=dm(m6,i4);     # vtable = *(obj + 1)
                                      | r1=dm(0x3,i4);    # r1 = vtable[3]
                                      | r4=r2+r1;         # param = obj + r1

i12=dm(i3,m6); # i12 = *todolist++;   | i12=dm(0x5,i4);   # i12 = vtable[5]

# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6;                                | r2=i6;
i6=i7;                                | i6=i7;
jump (m13,i12) (db);                  | jump (m13,i12) (db);
dm(i7,m7)=r2;                         | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;                   | dm(i7,m7)=0x1279e2;

# if (count--) loop
r15=r15-1;                            | r15=r15-1;
if ne jump (pc, 0xfffffff2);          | if ne jump (pc, 0xffffffe7);

Notes:

注:

In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.

在c++版本中，似乎编译器已经决定在两个步骤中执行后增量，大概是因为它希望结果在i寄存器而不是r4中。这无疑与下面的问题有关。
The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.

编译器决定使用对象的vtable计算对象真实类的基地址。这占用了三个指令，并且可能还需要在步骤1中使用i4作为临时操作。vtable查找本身占用一条指令。

So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?

所以:问题不是vtable查找，它可以在一个额外的指令中完成(但实际上需要两个)。问题是编译器认为需要“查找”对象。但是为什么gcc/i86不需要这样做呢?

The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.

答案是:它曾经是，但现在不再是了。在许多情况下(例如，没有多重继承)，将一个指向派生类的指针转换为基类的指针不需要修改指针。因此，当我们调用派生类的方法时，我们可以给它一个基类指针作为这个参数。但在其他情况下，这不起作用，我们必须在执行强制转换时调整指针，然后在执行调用时调整指针。

There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)

(至少)有两种方法来执行第二次调整。一个是生成的DSP代码所显示的方式，其中的调整存储在vtable中(即使是0)，然后在调用期间应用。另一种方法(称为vtable-thunks)是创建一个thunk(一种可执行代码)，它调整这个指针，然后跳转到方法的入口点，并将指向这个thunk的指针放入vtable中。(这一切都可以在编译时完成。)thunk解决方案的好处是，在通常的情况下，如果不需要做任何调整，我们可以优化thunk并且没有调整代码。(缺点是，如果我们确实需要调整，就会产生额外的分支。)

As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)

据我所知，VisualDSP++是基于gcc的，它可能有-fvtable-thunks和-fno-vtable-thunks选项。因此，您可以使用-fvtable-thunks进行编译。但是如果这样做，就需要编译使用该选项的所有c++库，因为不能混合使用这两种调用样式。另外，在gcc的vtable-thunks实现中也有(15年前)各种各样的bug，所以如果VisualDSP++ +使用的gcc版本足够旧，您可能也会遇到这些问题(IIRC，它们都涉及到多重继承，因此它们可能不适用于您的用例)。

(Original test, before update):

(原始的测试,之前更新):

I tried the following simple case (no multiple inheritance, which can slow things down):

我尝试了以下简单的情况(没有多重继承，这会让事情变慢):

class Base {
  public:
    Base(int val) : val_(val) {}
    virtual int binary(int a, int b) = 0;
    virtual int unary(int a) = 0;
    virtual int nullary() = 0;
  protected:
    int val_;
};

int binary(Base* begin, Base* end, int a, int b) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->binary(a, b); }
  return accum;
}

int unary(Base* begin, Base* end, int a) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->unary(a); }
  return accum;
}

int nullary(Base* begin, Base* end) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->nullary(); }
  return accum;
}

And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:

并使用-O3使用gcc(4.8)编译。如我所料，它生成的汇编代码与您的C despatch所做的完全相同。这里是一元函数的for循环，例如:

.L9:
        movq    (%rbx), %rax
        movq    %rbx, %rdi
        addq    $16, %rbx
        movl    %r13d, %esi
        call    *8(%rax)
        addl    %eax, %ebp
        cmpq    %rbx, %r12
        jne     .L9

#2

I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the v-table search. This is closest to your C language implementation.

我建议在派生类中使用静态方法，并将这些函数放到数组中。这将消除v-table搜索的开销。这是最接近C语言实现的。

You may end up sacrificing the polymorphism for speed.
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to Object Oriented.

您可能会为了速度而牺牲多态性。继承是必要的吗?仅仅因为你切换到c++并不意味着你必须切换到面向对象。

Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.

另外，您是否尝试过在ISR中展开循环?例如，在返回到循环的顶部之前执行两个或多个执行调用。

Also, can you move any of the functionality out of the ISR? Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.

此外，您能将任何功能从ISR中移除吗?功能的任何部分可以由“后台循环”而不是ISR来执行吗?这将减少你的ISR的时间。

#3

As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:

正如已经提到的，您可以使用模板来消除动态调度。这里有一个这样的例子:

template <typename FirstCb, typename ... RestCb>
struct InterruptHandler {
    void execute() {
        // I construct temporary objects here since I could not figure out how you
        // construct your objects. You can change these signatures to allow for 
        // passing arbitrary params to these handlers.
        FirstCb().execute();
        InterruptHandler<RestCb...>().execute();
    }
}

InterruptHandler</* Base, Derived1, and so on */> handler;

void realtime(void) {
    handler.execute();
}

This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.

这将完全消除vtable查找，同时为编译器优化提供更多的机会，因为执行的代码可以内联。

Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same. Also, this requires that you have a C++11 compliant compiler.

但是，请注意，您将需要更改某些部分，这取决于如何初始化处理程序。基本框架应该保持不变。而且，这要求您有一个兼容c++ 11的编译器。

#4

You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array to function pointers. This yould help with casting and compatible to your code:

可以在模板中隐藏void*类型擦除和类型恢复。结果(希望)与函数指针的数组相同。这将有助于您的代码的转换和兼容:

#include <iostream>

template<class ParamType,class F>
void fun(void* param) {
  F f;
  f(*static_cast<ParamType*>(param));
}

struct my_function {
  void operator()(int& i) {
    std::cout << "got it " << i << std::endl;
  }
};


int main() {
  void (*func)(void*) = fun<int, my_function>;

  int j=4;
  func(&j);

  return 0;
}

In this case you can create new functions as a function object with more type safty. The "normal" OOP aproach with virtual functions doesn't help here.

在这种情况下，可以将新函数创建为具有更多类型安全性的函数对象。使用虚拟函数的“正常”OOP aproach在这里不起作用。

In case of A C++11 environment you could create the array with help of variadic templates at compile time (but with an complicated syntax).

对于c++ 11环境，您可以在编译时(但使用复杂的语法)使用可变类型模板创建数组。

#5

This is unrelated to your question, but if you are that keen on performance you could use templates to do a loop unroll for the todolist:

这与您的问题无关，但如果您对性能很感兴趣，您可以使用模板为todolist执行循环展开:

void (*todo[3])(void *);
void *param[3];

void f1(void*) {std::cout<<"1" << std::endl;}
void f2(void*) {std::cout<<"2" << std::endl;}
void f3(void*) {std::cout<<"3" << std::endl;}

template<int N>
struct Obj {
    static void apply()
    {
        todo[N-1](param[N-1]);
        Obj<N-1>::apply();
    }
};

template<> struct Obj<0> { static void apply() {} };

todo[0] = f1;
todo[1] = f2;
todo[2] = f3;

Obj<sizeof todo / sizeof *todo>::apply();

#6

Find out where your compiler puts the vtable and access it directly to get the function pointers and store them for usage. That way you will have pretty much the same approach like in C with an array of function pointers.

找出编译器将vtable放在哪里，并直接访问它以获取函数指针并存储它们以供使用。这样你就会得到和C一样的方法，用一系列函数指针。

#1