Improving performance when reading volatile memory

Date: 2021-08-20 03:54:03

I have a function that reads from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance critical, and I noticed that execution time improves by approximately 20% if I do not declare the memory as volatile. Within the scope of my function the memory is non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.

The memory is two two-dimensional arrays:

volatile uint16_t memoryBuffer[2][10][20] = {0};

The DMA operates on the opposite "matrix" from the one the function reads:

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      //Do some stuff with memory (readings only):
      foo(memoryBuffer[indexOppositeOfDMA][n][m]);
    }
  }
}

Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask() but may be changed the next time I call myTask(), so I can obtain the 20% performance improvement?

Platform Cortex-M4

5 Answers

#1


6  

The problem without volatile

Let's assume that volatile is omitted from the data array. Then the C compiler and the CPU do not know that its elements change outside the program-flow. Some things that could happen then:

  • The whole array might be loaded into the cache when myTask() is called for the first time. The array might stay in the cache forever and is never updated from the "main" memory again. This issue is more pressing on multi-core CPUs if myTask() is bound to a single core, for example.

  • If myTask() is inlined into the parent function, the compiler might decide to hoist loads outside of the loop even to a point where the DMA transfer has not been completed.

  • The compiler might even be able to determine that no write happens to memoryBuffer and assume that the array elements stay at 0 all the time (which would again trigger a lot of optimizations). This could happen if the program was rather small and all the code is visible to the compiler at once (or LTO is used). Remember: After all the compiler does not know anything about the DMA peripheral and that it is writing "unexpectedly and wildly into memory" (from a compiler perspective).

If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...

The problem with volatile

Making the whole array volatile is often a pessimisation. For speed reasons you probably want to unroll the loop. So instead of alternately loading from the array and incrementing the index, such as

load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;

it can be faster to load multiple elements at once and increment the index in larger steps such as

load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;

This is especially true if the loads can be fused together (e.g. performing one 32-bit load instead of two 16-bit loads). Further, you want the compiler to use SIMD instructions to process multiple array elements with a single instruction.

These optimizations are often prevented if the load happens from volatile memory because compilers are usually very conservative with load/store reordering around volatile memory accesses. Again the behavior differs between compiler vendors (e.g. MSVC vs GCC).

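As an illustration of the fusion described above, here is a hand-written sketch of what the compiler can do once volatile is gone. The names `load_pair` and `sum_row` are made up for this example; `memcpy` is the well-defined way to express the fused 32-bit load (compilers typically lower it to a single load instruction at -O2):

```c
#include <stdint.h>
#include <string.h>

/* Fuse two adjacent 16-bit loads into one 32-bit load via memcpy
   (well-defined, unlike a pointer cast). */
static uint32_t load_pair(const uint16_t *p)
{
    uint32_t fused;
    memcpy(&fused, p, sizeof fused);
    return fused;
}

uint32_t sum_row(const uint16_t *row, unsigned len)
{
    uint32_t sum = 0;
    unsigned m = 0;
    for (; m + 2 <= len; m += 2)        /* two elements per iteration */
    {
        uint32_t pair = load_pair(&row[m]);
        sum += (uint16_t)pair;          /* low half (little-endian) */
        sum += (uint16_t)(pair >> 16);  /* high half */
    }
    for (; m < len; ++m)                /* odd remainder */
        sum += row[m];
    return sum;
}
```

If `row` were volatile-qualified, this transformation would not be valid, because each 16-bit element must then be read with its own access.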

Possible solution 1: fences

So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.

Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize() (doc).

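A minimal host-testable sketch of the fence approach, assuming a C11 compiler with `<stdatomic.h>`. Hypothetical stand-ins: `foo()` is replaced by a summing function and `myTask()` returns the sum so the effect is observable. Note that, strictly, C11 fences only order atomic accesses, but in practice compilers also treat them as a compiler barrier and emit `dmb` on ARM:

```c
#include <stdatomic.h>
#include <stdint.h>

#define ROWS 10
#define COLS 20

/* Plain (non-volatile) buffer; the DMA writes the half not passed to myTask(). */
uint16_t memoryBuffer[2][ROWS][COLS];

static uint32_t sum;                      /* stand-in for the real processing */
static void foo(uint16_t value) { sum += value; }

uint32_t myTask(uint8_t indexOppositeOfDMA)
{
    /* Loads below must not be hoisted above this point, so data the DMA
       wrote before the fence is actually re-read from memory. */
    atomic_thread_fence(memory_order_acquire);

    sum = 0;
    for (uint8_t n = 0; n < ROWS; n++)
        for (uint8_t m = 0; m < COLS; m++)
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
    return sum;
}
```

Inside the loop there is no volatile access, so the compiler is free to unroll and vectorize it.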

Possible solution 2: atomic variable holding the buffer state

We use the following pattern a lot in our codebase (e.g. when reading data from SPI via DMA and calling a function to analyze it): The buffer is declared as plain array (no volatile) and an atomic flag is added to each buffer, which is set when the DMA transfer has finished. The code looks something like this:

typedef struct Buffer
{
    uint16_t data[10][20];
    // Flag indicating if the buffer has been filled. Only use atomic instructions on it!
    int filled;
    // C11: atomic_int filled;
    // C++: std::atomic_bool filled{false};
} Buffer_t;

Buffer_t buffers[2];

Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy

void setupDMA(void)
{
    for (int i = 0; i < 2; ++i)
    {
        int bufferFilled;
        // Atomically load the flag.
        bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
        // C11: bufferFilled = atomic_load(&buffers[i].filled);
        // C++: bufferFilled = buffers[i].filled;

        if (!bufferFilled)
        {
            currentDmaBuffer = &buffers[i];
            ... configure DMA to write to buffers[i].data and start it
        }
    }

    // If you end up here, there is no free buffer available because the
    // data processing takes too long.
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    // Atomically set the flag indicating that the buffer has been filled.
    __sync_fetch_and_or(&currentDmaBuffer->filled, 1);
    // C11: atomic_store(&currentDmaBuffer->filled, 1);
    // C++: currentDmaBuffer->filled = true;

    currentDmaBuffer = 0;
    // ... possibly start another DMA transfer ...
}

void myTask(Buffer_t* buffer)
{
    for (uint8_t n=0; n<10; n++)
        for (uint8_t m=0; m<20; m++)
            foo(buffer->data[n][m]);

    // Reset the flag atomically.
    __sync_fetch_and_and(&buffer->filled, 0);
    // C11: atomic_store(&buffer->filled, 0);
    // C++: buffer->filled = false;
}

void waitForData(void)
{
    // ... see setupDMA(void) ...
}

The advantage of pairing the buffers with an atomic is that you are able to detect when the processing is too slow meaning that you have to buffer more, make the incoming data slower or the processing code faster or whatever is sufficient in your case.

Possible solution 3: OS support

If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on the queue until it is non-empty. The pattern looks a bit like this:

MemoryPool pool;              // A pool to acquire DMA buffers.
Queue bufferQueue;            // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.

void setupDMA(void)
{
    currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
    // ... make the DMA write to currentBuffer
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    Queue_Post(&bufferQueue, currentBuffer);
    currentBuffer = 0;
}

void myTask(void)
{
    void* buffer = Queue_Wait(&bufferQueue);
    [... work with buffer ...]
    MemoryPool_Deallocate(&pool, buffer);
}

This is probably the easiest approach to implement but only if you have an OS and if portability is not an issue.

#2


2  

Here you say that the buffer is non-volatile:

"memoryBuffer is non-volatile inside the scope of myTask"

But here you say that it must be volatile:

"but may be changed next time i call myTask"

These two statements contradict each other. Clearly the memory area must be volatile, or the compiler can't know that it may be updated by the DMA.

However, I rather suspect that the actual performance loss comes from accessing this memory region repeatedly through your algorithm, forcing the compiler to read it back over and over again.

What you should do is to take a local, non-volatile copy of the part of the memory you are interested in:

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      volatile uint16_t* data = &memoryBuffer[indexOppositeOfDMA][n][m];
      uint16_t local_copy = *data; // this access is volatile and won't get optimized away

      foo(&local_copy); // optimizations possible here

      // if needed, write back again:
      *data = local_copy; // optional
    }
  }
}

You'll have to benchmark it, but I'm pretty sure this should improve performance.

Alternatively, you could first copy the whole part of the array you are interested in, then work on that, before writing it back. That should help performance even more.

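That alternative can be sketched as follows (again with a hypothetical summing stand-in for foo(), and myTask() returning the sum so the effect is observable): copy the half-buffer once through the volatile-qualified lvalues, then let the compiler optimize freely on the non-volatile local copy.

```c
#include <stdint.h>

#define ROWS 10
#define COLS 20

volatile uint16_t memoryBuffer[2][ROWS][COLS];

static uint32_t sum;                      /* stand-in for the real processing */
static void foo(uint16_t value) { sum += value; }

uint32_t myTask(uint8_t indexOppositeOfDMA)
{
    uint16_t local[ROWS][COLS];

    /* One pass of volatile reads; each element is read exactly once. */
    for (uint8_t n = 0; n < ROWS; n++)
        for (uint8_t m = 0; m < COLS; m++)
            local[n][m] = memoryBuffer[indexOppositeOfDMA][n][m];

    /* No volatile from here on: unrolling, load fusion and SIMD are
       all available to the compiler again. */
    sum = 0;
    for (uint8_t n = 0; n < ROWS; n++)
        for (uint8_t m = 0; m < COLS; m++)
            foo(local[n][m]);
    return sum;
}
```

The cost is 400 extra bytes of stack for the copy, which you would have to weigh against the gain in the processing loop.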

#3


1  

You're not allowed to cast away the volatile qualifier.¹

If the array must be defined with volatile elements, then the only two options "that let the compiler know that the memory has changed" are to keep the volatile qualifier, or to use a temporary array which is defined without volatile and is copied to the proper array after the function call. Pick whichever is faster.

¹ Quoted from ISO/IEC 9899:201x, 6.7.3 Type qualifiers, ¶6:
If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.

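A minimal sketch of the temporary-array option (the names `process()` and `runTask()` are hypothetical, and the placeholder transformation just increments every element): the processing function only ever sees a non-volatile scratch array, and the copy back to the proper array happens after the call.

```c
#include <stdint.h>

#define ROWS 10
#define COLS 20

volatile uint16_t memoryBuffer[2][ROWS][COLS];

/* Fully optimizable: no volatile in sight here. */
static void process(uint16_t buf[ROWS][COLS])
{
    for (uint8_t n = 0; n < ROWS; n++)
        for (uint8_t m = 0; m < COLS; m++)
            buf[n][m] += 1;              /* placeholder transformation */
}

void runTask(uint8_t index)
{
    uint16_t tmp[ROWS][COLS];

    /* Copy in through the volatile-qualified lvalues... */
    for (uint8_t n = 0; n < ROWS; n++)
        for (uint8_t m = 0; m < COLS; m++)
            tmp[n][m] = memoryBuffer[index][n][m];

    process(tmp);

    /* ...and copy the result back to the proper array after the call. */
    for (uint8_t n = 0; n < ROWS; n++)
        for (uint8_t m = 0; m < COLS; m++)
            memoryBuffer[index][n][m] = tmp[n][m];
}
```

Since every access to memoryBuffer goes through a volatile-qualified lvalue, this stays within the rule quoted above.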

#4


0  

It seems to me that you are passing half of the buffer to myTask and each half does not need to be volatile. So I wonder if you could solve your issue by defining the buffer as such, and then passing a pointer to one of the half-buffers to myTask. I'm not sure whether this will work, but maybe something like this...

typedef struct memory_buffer {
    uint16_t buffer[10][20];
} memory_buffer;

volatile memory_buffer double_buffer[2];

void myTask(memory_buffer *mem_buf)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      //Do some stuff with memory:
      foo(mem_buf->buffer[n][m]);
    }
  }
}

#5


0  

I don't know your platform/MCU/SoC, but usually DMA controllers have interrupts that trigger on a programmable threshold.

What I can imagine is to remove the volatile keyword and use the interrupt as a semaphore for the task.

In other words:

  • The DMA is programmed to interrupt when the last byte of the buffer is written

  • The task blocks on a semaphore/flag, waiting for the flag to be released

  • When the DMA interrupt routine runs, switch the buffer pointed to by the DMA for the next transfer and set the flag that unblocks the task so it can process the data

Something like:

uint16_t memoryBuffer[2][10][20];

volatile uint8_t PingPong = 0;

void interrupt(void)
{
    // Switch the buffer currently pointed to by the DMA

    PingPong ^= 1;
}

void myTask(void)
{
    static uint8_t lastPingPong = 0;

    if (lastPingPong != PingPong)
    {
        for (uint8_t n = 0; n < 10; n++)
        {
            for (uint8_t m = 0; m < 20; m++)
            {
                //Do some stuff with memory:
                foo(memoryBuffer[PingPong][n][m]);
            }
        }

        lastPingPong = PingPong;
    }
}
