Is it safe in C++ (Boost) for multiple threads to modify different cells of an array?

Time: 2020-12-04 21:01:26

I have an array of size n and n threads; the ith thread reads and writes only the ith cell of the array. I do not use any memory locks. Is this safe for C++ Boost threads? How does this relate to the caches in the processor, which store chunks of memory rather than single values? I guess that the processor's cores share a cache with no duplication of data chunks inside it, so when many modifications of the same chunk occur (at different positions) there is no conflict between versions.

3 Answers

#1


2  

On any modern processor, writing to separate memory locations (even if adjacent) will pose no hazard. Otherwise, threading would be much, much harder.

Indeed, it is a relatively common idiom to have threads "fill out" the elements of an array: this is precisely what typical threaded implementations of linear algebra programs do, for example.

#2


2  

Writing to separate memory locations will work correctly; however, 'false sharing' may cause performance problems depending on the data-access patterns and the specific architecture.

Oracle's OpenMP API docs have a good description of false sharing:

6.2.1 What Is False Sharing?

Most high performance processors, such as UltraSPARC processors, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory.

However, simultaneous updates of individual elements in the same cache line coming from different processors invalidates entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.

This situation is called false sharing. If this occurs frequently, performance and scalability of an OpenMP application will suffer significantly.

False sharing degrades performance when all of the following conditions occur.

  • Shared data is modified by multiple processors.
  • Multiple processors update data within the same cache line.
  • This updating occurs very frequently (for example, in a tight loop).

Note that shared data that is read-only in a loop does not lead to false sharing.

#3


1  

Before C++11, the Standard didn't address threading at all. Now it does. This rule is found in section 1.7:

A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having non-zero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. — end note ] Two or more threads of execution (1.10) can update and access separate memory locations without interfering with each other.

An array is not a scalar, but its elements are. So each element is a distinct memory location, and therefore distinct elements are eligible for being used by different threads simultaneously with no need for locking or synchronization (as long as at most one thread accesses any given element).

However, you will cause a great deal of extra work for the cache coherency protocol if data stored in the same cache line are written by different threads. Consider adding padding, or rearranging the data layout so that all variables used by a single thread are stored adjacently (an array of structures instead of a structure of arrays).
