When is CUDA __shared__ memory useful?

Time: 2021-06-01 13:23:54

Can someone please help me with a very simple example on how to use shared memory? The example included in the Cuda C programming guide seems cluttered by irrelevant details.

For example, if I copy a large array to the device global memory and want to square each element, how can shared memory be used to speed this up? Or is it not useful in this case?

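For reference, element-wise squaring touches each element exactly once, so it can work directly on global memory; the kernel name and launch sizes in the sketch below are illustrative:

  __global__ void square_each(float *data, int n)
  {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
        data[i] = data[i] * data[i];   // one global read, one global write per element
  }

  // launched, for example, as:
  // square_each<<<(n + 255) / 256, 256>>>(d_data, n);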

2 Answers

#1


24  

In the specific case you mention, shared memory is not useful, for the following reason: each data element is used only once. For shared memory to pay off, the data you stage in it must be reused several times, with good access patterns. The reason is simple: reading an element straight from global memory costs 1 global memory read and no shared memory reads; staging it in shared memory first costs 1 global memory read plus 1 shared memory read, which takes longer.

Here's a simple example, where each thread in the block computes the corresponding value, squared, plus the average of both its left and right neighbors, squared:

  __global__ void compute_it(float *data)
  {
     int tid = threadIdx.x;
     __shared__ float myblock[1024];
     float tmp;

     // load the thread's data element into shared memory
     myblock[tid] = data[tid];

     // ensure that all threads have loaded their values into
     // shared memory; otherwise, one thread might be computing
     // on uninitialized data.
     __syncthreads();

     // compute the average of this thread's left and right neighbors
     tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;
     // square the previous result and add my value, squared
     tmp = tmp*tmp + myblock[tid] * myblock[tid];

     // write the result back to global memory
     data[tid] = tmp;
  }

Note that this is envisioned to work using only one block. The extension to more blocks should be straightforward. Assumes block dimension (1024, 1, 1) and grid dimension (1, 1, 1).

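One way that extension might look is sketched below; this is an illustrative variant, not part of the answer above. It assumes the array length is an exact multiple of the block size (1024) and keeps the wrap-around at the edges of each block's tile, so boundary elements are handled per tile rather than across the whole array:

  __global__ void compute_it_multiblock(float *data)
  {
     __shared__ float myblock[1024];
     int tid = threadIdx.x;                    // index within this block's tile
     int gid = blockIdx.x * blockDim.x + tid;  // index into the full array
     float tmp;

     // each block stages its own 1024-element tile in shared memory
     myblock[tid] = data[gid];
     __syncthreads();

     // same stencil as above, with wrap-around at the tile edges
     tmp = (myblock[tid > 0 ? tid - 1 : 1023] + myblock[tid < 1023 ? tid + 1 : 0]) * 0.5f;
     tmp = tmp*tmp + myblock[tid] * myblock[tid];

     data[gid] = tmp;
  }

  // launched, for example, with one block per 1024 elements:
  // compute_it_multiblock<<<n / 1024, 1024>>>(d_data);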

#2


8  

Think of shared memory as an explicitly managed cache - it's only useful if you need to access data more than once, either within the same thread or from different threads within the same block. If you're only accessing data once then shared memory isn't going to help you.

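A classic case of that kind of reuse is a block-wide sum reduction, where every staged value is read repeatedly by different threads in the block. The sketch below is illustrative and assumes a block size of 1024:

  __global__ void block_sum(const float *in, float *block_sums)
  {
     __shared__ float sdata[1024];              // one slot per thread in the block
     int tid = threadIdx.x;
     int gid = blockIdx.x * blockDim.x + tid;

     // each thread stages one element in shared memory
     sdata[tid] = in[gid];
     __syncthreads();

     // tree reduction: each surviving thread repeatedly reads values
     // written by other threads, which is where shared memory pays off
     for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
           sdata[tid] += sdata[tid + s];
        __syncthreads();
     }

     // thread 0 writes this block's partial sum back to global memory
     if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];
  }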
