ARM performance counters vs Linux clock_gettime

I am using a Zynq chip on a development board (ZC702), which has a dual-core Cortex-A9 MPCore at 667 MHz and comes with Linux kernel 3.3. I wanted to compare the execution time of a program, so I first used clock_gettime and then used the cycle counter provided by the ARM co-processor. The counter increments once every processor cycle. (Based on this question on * and this.)

I compile the program with the -O0 flag (since I don't want any reordering or optimization).

The time I measure with the performance counter is 583833498 cycles / 666.666687 MHz = 875750.221 microseconds.

When using clock_gettime() (with CLOCK_REALTIME, CLOCK_MONOTONIC, or CLOCK_MONOTONIC_RAW), the measured time is 731627.126 microseconds, which is roughly 150000 microseconds less.

Can anybody explain why this is happening? Why is there a difference? The processor does not scale its clock, so how is it possible for clock_gettime to report a shorter execution time? My sample code is below:

#define RUNS 50000000
#define BENCHMARK(val) \
__asm__  __volatile__("mov r4, %1\n\t" \
                 "mov r5, #0\n\t" \
                 "1:\n\t"\
                 "add r5,r5,r4\n\t"\
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "mov r4 ,r4  \n\t" \
                 "sub r4,r4,#1\n\t" \
                 "cmp r4, #0\n\t" \
                 "bne 1b\n\t" \
                 "mov %0 ,r5  \n\t" \
                 :"=r" (val) \
                 : "r" (RUNS) \
                 : "r4","r5","cc" \
        );
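/* Wall-clock timestamp before the benchmark. */
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
/* Read the cycle counter (PMCCNTR) through coprocessor 15. */
__asm__ __volatile__ ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(start_cycles));
for (index = 0; index < 5; index++)
{
    BENCHMARK(i);
}
/* Read the cycle counter again, then take the closing timestamp. */
__asm__ __volatile__ ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(end_cycles));
clock_gettime(CLOCK_MONOTONIC_RAW, &stop);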

2 Solutions

#1

I found the solution. I upgraded the platform from Linux kernel 3.3.0 to 3.5, and the value is now similar to that of the performance counters. Apparently, in 3.3.0 the frequency of the clock counter is assumed to be higher than it really is (around 400 MHz, instead of half the CPU frequency). This is probably a porting error in the old version.

#2

The POSIX clocks operate within a certain precision, which you can query with clock_getres. Check whether that 150,000 microsecond difference is inside or outside the error margin.

In any case, it shouldn't matter: you should repeat your benchmark many times, not 5 but 1000 or more. You can then get the timing of a single benchmark run as

((end + e1) - (start + e0)) / 1000, or

(end - start) / 1000 + (e1 - e0) / 1000.

If e1 and e0 are the error terms, which are bounded by a small constant, your maximum measurement error will be abs(e1 - e0) / 1000, which becomes negligible as the number of loops increases.
