Is it safe in C++ (Boost) for multiple threads to modify different cells of an array?

Time: 2020-12-04 21:01:26

I have an array of size n and n threads; thread i reads and writes only cell i of the array. I do not use any locks. Is this safe with C++ Boost threads? How does this relate to processor caches, which store chunks of memory rather than single values? My guess is that the cores of a processor share the cache and that no chunk of data is duplicated within it, so when many modifications hit the same chunk (at different positions) there is no conflict between versions.


3 answers



On any modern processor, writing to separate memory locations (even if adjacent) will pose no hazard. Otherwise, threading would be much, much harder.


Indeed, it is a relatively common idiom to have threads "fill out" the elements of an array: this is precisely what typical threaded implementations of linear algebra programs do, for example.
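As a rough illustration of this "fill out" idiom, here is a minimal sketch in which thread i writes only element i and no lock is taken. It uses `std::thread` so that it needs no external library; the same pattern should carry over to `boost::thread`. The name `fill_squares` and the choice of computing squares are assumptions made purely for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: thread i writes only results[i], with no locking.
std::vector<int> fill_squares(std::size_t n) {
    assert(n <= 46340);  // keeps i * i within the range of a 32-bit int

    // Size the vector up front; it must not be resized while threads run.
    std::vector<int> results(n, 0);
    std::vector<std::thread> workers;
    workers.reserve(n);

    for (std::size_t i = 0; i < n; ++i) {
        workers.emplace_back([&results, i] {
            results[i] = static_cast<int>(i * i);  // a distinct element per thread
        });
    }

    // Joining synchronizes with each thread, so the caller sees every write.
    for (auto& worker : workers) {
        worker.join();
    }
    return results;
}
```

Calling `fill_squares(8)` should return the squares of 0 through 7. What would not be safe is two threads touching the same element, or any thread resizing the vector while the others are running.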




Writing to separate memory locations will work correctly. However, "false sharing" may cause performance problems, depending on the data access patterns and the specific architecture.


Oracle's OpenMP API docs have a good description of false sharing:


6.2.1 What Is False Sharing?


Most high performance processors, such as UltraSPARC processors, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory.


However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.


This situation is called false sharing. If this occurs frequently, performance and scalability of an OpenMP application will suffer significantly.


False sharing degrades performance when all of the following conditions occur.


  • Shared data is modified by multiple processors.
  • Multiple processors update data within the same cache line.
  • This updating occurs very frequently (for example, in a tight loop).

Note that shared data that is read-only in a loop does not lead to false sharing.
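One way to remove the third condition (very frequent updates) is to let each thread accumulate into a private local variable and write to the shared array only once. The sketch below assumes a simple strided sum; the function name `partial_sums` and the work split are hypothetical, chosen only to illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: thread t sums elements t, t + num_threads, ... of data.
std::vector<long long> partial_sums(const std::vector<int>& data,
                                    std::size_t num_threads) {
    assert(num_threads > 0);

    std::vector<long long> sums(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &sums, t, num_threads] {
            long long local = 0;  // private to this thread, not shared
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                local += data[i];  // data is only read here
            }
            sums[t] = local;  // a single write to the shared array
        });
    }

    for (auto& worker : workers) {
        worker.join();
    }
    return sums;
}
```

The elements of `sums` may still sit in one cache line, but each thread now writes to that line once instead of on every loop iteration, so the coherency traffic is likely to be negligible. The input `data` is only read inside the loop, which matches the note above about read-only shared data.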




Before C++11, the Standard didn't address threading at all. Now it does. This rule is found in section 1.7 of the C++11 Standard:


A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. — end note ] Two or more threads of execution (1.10) can update and access separate memory locations without interfering with each other.


An array is not a scalar, but its elements are. So each element is a distinct memory location, and therefore distinct elements can be used by different threads simultaneously with no need for locking or synchronization (as long as at most one thread accesses any given element).
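The scalar distinction can be checked at compile time with `std::is_scalar`. The following is only a sketch: the helper `element_is_scalar` and the struct `Flags` are hypothetical names, and the helper answers just the narrow question of whether the element type of an array is itself a scalar.

```cpp
#include <type_traits>

// Hypothetical helper: true when the element type of Array is a scalar,
// meaning each element is exactly one memory location.
template <typename Array>
constexpr bool element_is_scalar() {
    return std::is_scalar<typename std::remove_extent<Array>::type>::value;
}

// The array type is not a scalar, but its int elements are.
static_assert(!std::is_scalar<int[4]>::value, "an array is not a scalar");
static_assert(element_is_scalar<int[4]>(), "int elements are scalars");

// Counter-example: adjacent bit-fields of non-zero width form ONE memory
// location, so two threads must not write a and b concurrently without
// synchronization.
struct Flags {
    unsigned a : 4;
    unsigned b : 4;
};
static_assert(!element_is_scalar<Flags[4]>(), "Flags is a class type");
```

A class-type element such as `Flags` is not a single memory location; whether its members can be written concurrently depends on how the members themselves break down into memory locations.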


However, you will cause a great deal of extra work for the cache coherency protocol if data stored in the same cache line are written by different threads. Consider adding padding, or changing the data layout so that all variables used by one thread are stored adjacently (an array of structures instead of a structure of arrays).
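The padding suggestion could look roughly like the sketch below. It assumes a 64-byte cache line, which is common but not universal, and the names `kAssumedCacheLine`, `PaddedCounter`, and `count_padded` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. Check the target hardware before relying on it.
constexpr std::size_t kAssumedCacheLine = 64;

// alignas rounds the size of the struct up to a full (assumed) cache line,
// so neighbouring counters are 64 bytes apart.
struct alignas(kAssumedCacheLine) PaddedCounter {
    long long value = 0;
};
static_assert(sizeof(PaddedCounter) == kAssumedCacheLine,
              "each counter should occupy one assumed cache line");

// Hypothetical helper: thread t increments only counters[t].
std::vector<PaddedCounter> count_padded(std::size_t num_threads,
                                        long long iterations) {
    assert(iterations >= 0);

    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (long long i = 0; i < iterations; ++i) {
                counters[t].value += 1;  // frequent writes, but to this thread's own line
            }
        });
    }

    for (auto& worker : workers) {
        worker.join();
    }
    return counters;
}
```

The trade-off is memory: each 8-byte counter now takes 64 bytes. Note also that `std::vector` is only guaranteed to honor the 64-byte alignment of its elements from C++17 onward; with older standards the counters are still spaced a full assumed line apart.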



