优化基于arm的设备的C代码

时间:2021-09-20 03:14:32

Recently, I've stumbled upon an interview question where you need to write a code that's optimized for ARM, especially for iphone:

最近,我偶然发现了一个面试问题,你需要为ARM(尤其是iphone)编写一个优化过的代码:

Write a function which takes an array of char (ASCII symbols) and find the most frequent character.

编写一个函数,该函数接受字符数组(ASCII符号)并找到最常见的字符。

char mostFrequentCharacter(char* str, int size)

char most frequency字符(char* str, int size)

The function should be optimized to run on dual-core ARM-based processors, and an infinity amount of memory.

这个函数应该被优化为可以在基于arm的双核处理器上运行,并且有无限的内存。

On the face of it, the problem itself looks pretty simple and here is the simple implementation of the function, that came out in my head:

从表面上看,问题本身看起来很简单,这是我脑海中出现的函数的简单实现:

#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}

char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char current_char;
    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(size_t i = 0; i<size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }

    return frequent_char;
}

Firstly, I did some basic code optimization; I moved the code, that calculates the most frequent char every iteration, to an additional for loop and got a significant increase in speed, instead of evaluating the following block of code size times

首先,我做了一些基本的代码优化;我将每次迭代计算最频繁字符的代码移动到一个额外的for循环中,并获得了显著的速度提升,而不是计算后面的代码块大小乘以多少

if(current_char_frequency >= char_frequency)
{
    char_frequency = current_char_frequency;
    frequent_char = current_char;
}

we can find a most frequent char in O(RESULT_SIZE) where RESULT_SIZE == 127.

我们可以找到一个最频繁的char (RESULT_SIZE),其中RESULT_SIZE == 127。

char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(int i = 0; i<size; i++)
    {
        set_char(str[i], result);
    }

    for(int i = 0; i<RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    return frequent_char;
}

Benchmarks: iPhone 5s

基准:iPhone 5 s

size = 1000000
iterations = 500

// seconds = 7.842381
char mostFrequentChar(char str[], int size)

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

In average, the mostFrequentCharOpt1 works in ~24% faster than basic implementation.

平均而言,最常见的charopt1比基本实现快24%。

Type optimization

类型的优化

The ARM cores registers are 32-bits long. Therefore, changing all local variables that has a type char to type int prevents the processor from doing additional instructions to account for the size of the local variable after each assignment.

ARM内核寄存器为32位长。因此,将具有类型char的所有局部变量更改为int类型,可以防止处理器在每次分配后执行额外的指令来计算局部变量的大小。

Note: The ARM64 provides 31 registers (x0-x30) where each register is 64 bits wide and also has a 32-bit form (w0-w30). Hence, there is no need to do something special to operate on int data type. infocenter.arm.com - ARMv8 Registers

注意:ARM64提供了31个寄存器(x0-x30),其中每个寄存器的宽度为64位,并且还有一个32位的表单(w0-w30)。因此,不需要对int数据类型执行特殊操作。infocenter.arm.com ARMv8寄存器

While comparing functions in assembly language version, I've noticed a difference between how the ARM works with int type and char type. The ARM uses LDRB instruction to load byte and STRB instruction to store byte into individual bytes in memory. Thereby, from my point of view, LDRB is a bit slower than LDR, because LDRB do zero-extending every time when accessing a memory and load to register. In other words, we can't just load a byte into the 32-bit registers, we should cast byte to word.

在比较汇编语言版本中的函数时,我注意到ARM与int类型和char类型之间的区别。ARM使用LDRB指令将字节和STRB指令加载到内存中的各个字节中。因此,在我看来,LDRB比LDR慢一点,因为LDRB在每次访问内存和加载寄存器时都做零扩展。换句话说,我们不能仅仅将一个字节装入32位寄存器,我们应该将字节转换为字。

Benchmarks: iPhone 5s

基准:iPhone 5 s

size = 1000000
iterations = 500

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)

Changing char type to int didn't give me a significant increase of speed on iPhone 5s, by way of contrast, running the same code on iPhone 4 gave a different result:

在iPhone 5s上,将char类型改为int并没有显著提高速度,相比之下,在iPhone 4上运行相同的代码会得到不同的结果:

Benchmarks: iPhone 4

基准:iPhone 4

size = 1000000
iterations = 500

// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)

// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)

Loop optimization

循环优化

Next, I did a loop optimization, where, instead of incrementing i value, I decremented it.

接下来,我做了一个循环优化,我没有递增I值,而是递减它。

before    
for(int i = 0; i<size; i++) { ... }

after
for(int i = size; i--) { ... }

Again, comparing assembly code, gave me a clear distinction between the two approaches.

再次,通过比较汇编代码,我清楚地区分了这两种方法。

mostFrequentCharOpt2                                              |      mostFrequentCharOpt3
0x10001250c <+88>:  ldr    w8, [sp, #28] ; w8 = i                 |      0x100012694 <+92>:  ldr    w8, [sp, #28]                                             ; w8 = i
0x100012510 <+92>:  ldr    w9, [sp, #44] ; w9 = size              |      0x100012698 <+96>:  sub    w9, w8, #1 ; w9 = i - 1                                           
0x100012514 <+96>:  cmp    w8, w9 ; if i<size                     |      0x10001269c <+100>: str    w9, [sp, #28] ; save w9 to memmory
0x100012518 <+100>: b.ge   0x100012548 ; if true => end loop      |      0x1000126a0 <+104>: cbz    w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine                    |      0x1000126a4 <+108>: ... set_char start routine
...                                                               |      ...
0x100012534 <+128>: ... set_char end routine                      |      0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr    w8, [sp, #28] ; w8 = i                 |      0x1000126c0 <+136>: b      0x100012694 ; back to the first line
0x10001253c <+136>: add    w8, w8, #1 ; i++                       |      0x1000126c4 <+140>: ...
0x100012540 <+140>: str    w8, [sp, #28] ; save i to $sp+28       |      
0x100012544 <+144>: b      0x10001250c ; back to the first line   |      
0x100012548 <+148>: str    ...                                    |      

Here, in place of accessing size from the memory and comparing it with the i variable, where the i variable, was incrementing, we just decremented i by 0x1 and compared the register, where the i is stored, with 0.

这里,从内存中访问大小并将其与i变量进行比较,其中i变量是递增的,我们只是将i递减到0x1,然后比较寄存器,其中i被存储为0。

Benchmarks: iPhone 5s

基准:iPhone 5 s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization

// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop otimization

Threading optimization

线程优化

Reading the question accurately gives us at least one more optimization. This line ..optimized to run on dual-core ARM-based processors ... especially, dropped a hint to optimize the code using pthread or gcd.

准确地阅读这个问题至少能让我们再优化一次。这条线. .优化后可以在基于arm的双核处理器上运行……特别是,使用pthread或gcd优化代码的提示。

int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; //by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));

    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;

    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;

        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
        tinfo[tnum].str = str;

        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;

    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }

    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;

        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    free(tinfo);

    return frequent_char;
}

Benchmarks: iPhone 5s

基准:iPhone 5 s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization

Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.

注意:大多数频率charentcharthreadopt的工作速度比iPhone 4上的大多数频率charopt2要慢。

Benchmarks: iPhone 4

基准:iPhone 4

size = 1000000
iterations = 500

// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization

Question

问题

How well optimized is the mostFrequentCharOpt3 and mostFrequentCharThreadOpt, in other words: are there any other methods to optimize both methods?

最常charopt3和最常charthreadopt的优化程度如何,换句话说:是否有其他方法来优化这两种方法?

Source code

源代码

2 个解决方案

#1


1  

Alright, the following things you can try, I can't 100% say what will be effective in your situation, but from experience, if you put all possible optimizations off, and looking at the fact that even loop optimization worked for you: your compiler is pretty numb.

好吧,你可以试试下面的方法,我不能100%地说什么在你的情况下是有效的,但是从经验来看,如果你把所有可能的优化都去掉,并且考虑到即使循环优化对你也有效:你的编译器是相当麻木的。

It slightly depends a bit on your THREAD_COUNT, you say its 2 at default, but you might be able to spare some time if you are 100% its 2. You know the platform you work on, don't make anything dynamic without a reason if speed is your priority.

它有点依赖于您的THREAD_COUNT,默认情况下您说它是2,但是如果您是100%的its 2,您可能可以抽出一些时间。你知道你所工作的平台,如果你的首要任务是速度,那就不要毫无理由地让任何东西充满活力。

If THREAD == 2, num_threads is a unnecessary variable and can be removed.

如果线程== 2,num_threads是一个不必要的变量,可以被删除。

int minCharCountPerThread = size/num_threads;

And the olden way to many discussed topic about bit-shifting, try it:

关于位转换的老话题,不妨试试:

int minCharCountPerThread = size >> 1; //divide by 2

The next thing you can try is unroll your loops: multiple loops are only used 2 times, if size isn't a problem, why not remove the loop aspect? This is really something you should try, look what happens, and if it useful too you. I've seen cases loop unrolling works great, I've seen cases loop unrolling slows down my code.

你可以尝试的下一件事是展开你的循环:多个循环只被使用2次,如果大小不是问题,为什么不删除循环方面?这是你应该尝试的,看看发生了什么,如果它对你也有用的话。我见过案例循环展开工作很好,我见过案例循环展开会减慢我的代码。

Last thing: try using unsigned numbers instead if signed/int (unless you really need signed). It is known that some tricks/instruction are only available for unsigned variables.

最后一件事:尝试使用未签名的数字,而不是签名/int(除非你真的需要签名)。众所周知,有些技巧/指令只适用于无符号变量。

#2


1  

There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different than the newer 64 bit devices. Totally different hardware arch and diff instruction set. Older 32 bit arm hardware contained some real "tricks" that could make things a lot faster like multiple register read/write operation. One example optimization, instead of loading bytes you load while 32 bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, then another approach can be to break up the memory access so that 1 memory page is processed by 1 thread and then the second thread operates on the 2nd memory page and so on. That way different registers in the different processors can do maximum crunching without reading or writing to the same memory page (and memory access is the slow part typically). I would also suggest that you start with a good timing framework, I built a timing framework for ARM+iOS that you might find useful for that purpose.

您可以做很多事情,但是结果将取决于代码运行的具体ARM硬件。例如,旧的iPhone硬件与新的64位设备完全不同。完全不同的硬件arch和diff指令集。较老的32位arm硬件包含一些真正的“诀窍”,可以使事情更快,比如多寄存器读写操作。一个例子优化,不是加载字节,而是加载32位字,然后使用位移位对寄存器中的每个字节进行操作。如果您使用两个线程,那么另一种方法可以是中断内存访问,使一个线程处理一个内存页,然后第二个线程在第二个内存页上操作,以此类推。这样,不同处理器中的不同寄存器可以在不读取或写入相同内存页的情况下进行最大的处理(内存访问通常是比较慢的部分)。我还建议你从一个好的计时框架开始,我为ARM+iOS构建了一个计时框架,你可能会发现它在这方面很有用。

#1


1  

Alright, the following things you can try, I can't 100% say what will be effective in your situation, but from experience, if you put all possible optimizations off, and looking at the fact that even loop optimization worked for you: your compiler is pretty numb.

好吧,你可以试试下面的方法,我不能100%地说什么在你的情况下是有效的,但是从经验来看,如果你把所有可能的优化都去掉,并且考虑到即使循环优化对你也有效:你的编译器是相当麻木的。

It slightly depends a bit on your THREAD_COUNT, you say its 2 at default, but you might be able to spare some time if you are 100% its 2. You know the platform you work on, don't make anything dynamic without a reason if speed is your priority.

它有点依赖于您的THREAD_COUNT,默认情况下您说它是2,但是如果您是100%的its 2,您可能可以抽出一些时间。你知道你所工作的平台,如果你的首要任务是速度,那就不要毫无理由地让任何东西充满活力。

If THREAD == 2, num_threads is a unnecessary variable and can be removed.

如果线程== 2,num_threads是一个不必要的变量,可以被删除。

int minCharCountPerThread = size/num_threads;

And the olden way to many discussed topic about bit-shifting, try it:

关于位转换的老话题,不妨试试:

int minCharCountPerThread = size >> 1; //divide by 2

The next thing you can try is unroll your loops: multiple loops are only used 2 times, if size isn't a problem, why not remove the loop aspect? This is really something you should try, look what happens, and if it useful too you. I've seen cases loop unrolling works great, I've seen cases loop unrolling slows down my code.

你可以尝试的下一件事是展开你的循环:多个循环只被使用2次,如果大小不是问题,为什么不删除循环方面?这是你应该尝试的,看看发生了什么,如果它对你也有用的话。我见过案例循环展开工作很好,我见过案例循环展开会减慢我的代码。

Last thing: try using unsigned numbers instead if signed/int (unless you really need signed). It is known that some tricks/instruction are only available for unsigned variables.

最后一件事:尝试使用未签名的数字,而不是签名/int(除非你真的需要签名)。众所周知,有些技巧/指令只适用于无符号变量。

#2


1  

There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different than the newer 64 bit devices. Totally different hardware arch and diff instruction set. Older 32 bit arm hardware contained some real "tricks" that could make things a lot faster like multiple register read/write operation. One example optimization, instead of loading bytes you load while 32 bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, then another approach can be to break up the memory access so that 1 memory page is processed by 1 thread and then the second thread operates on the 2nd memory page and so on. That way different registers in the different processors can do maximum crunching without reading or writing to the same memory page (and memory access is the slow part typically). I would also suggest that you start with a good timing framework, I built a timing framework for ARM+iOS that you might find useful for that purpose.

您可以做很多事情,但是结果将取决于代码运行的具体ARM硬件。例如,旧的iPhone硬件与新的64位设备完全不同。完全不同的硬件arch和diff指令集。较老的32位arm硬件包含一些真正的“诀窍”,可以使事情更快,比如多寄存器读写操作。一个例子优化,不是加载字节,而是加载32位字,然后使用位移位对寄存器中的每个字节进行操作。如果您使用两个线程,那么另一种方法可以是中断内存访问,使一个线程处理一个内存页,然后第二个线程在第二个内存页上操作,以此类推。这样,不同处理器中的不同寄存器可以在不读取或写入相同内存页的情况下进行最大的处理(内存访问通常是比较慢的部分)。我还建议你从一个好的计时框架开始,我为ARM+iOS构建了一个计时框架,你可能会发现它在这方面很有用。