Are there any advantages to using the CUDA vector types?

CUDA provides built-in vector data types like uint2, uint4 and so on. Are there any advantages to using these data types?

Let's assume that I have a tuple which consists of two values, A and B. One way to store such tuples in memory is to allocate two arrays: the first array stores all the A values, and the second array stores all the B values at the indexes that correspond to the A values. Another way is to allocate one array of type uint2. Which one should I use? Which way is recommended? Do the members of uint3 (i.e. x, y, z) reside side by side in memory?

3 Answers

#1

This is going to be a bit speculative but may add to @ArchaeaSoftware's answer.

I'm mainly familiar with Compute Capability 2.0 (Fermi). For this architecture, I don't think that there is any performance advantage to using the vectorized types, except maybe for 8- and 16-bit types.

Looking at the declaration for char4:

struct __device_builtin__ __align__(4) char4
{
    signed char x, y, z, w;
};

The type is aligned to 4 bytes. I don't know what __device_builtin__ does. Maybe it triggers some magic in the compiler...

Things look a bit strange for the declarations of float1, float2, float3 and float4:

struct __device_builtin__ float1
{
    float x;
};

__cuda_builtin_vector_align8(float2, float x; float y;);

struct __device_builtin__ float3
{
    float x, y, z;
};

struct __device_builtin__ __builtin_align__(16) float4
{
    float x, y, z, w;
};

float2 gets some form of special treatment. float3 is a struct without any alignment and float4 gets aligned to 16 bytes. I'm not sure what to make of that.
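
For what it's worth, the sizes and alignments implied by those declarations can be checked at compile time. The following is only a sketch, assuming a toolkit whose headers match the declarations quoted above; it also shows that the members of uint3 sit side by side in declaration order.

```cuda
#include <cstddef>         // offsetof
#include <cuda_runtime.h>  // pulls in the built-in vector types

// Members of uint3 are contiguous, in declaration order.
static_assert(offsetof(uint3, x) == 0, "x comes first");
static_assert(offsetof(uint3, y) == sizeof(unsigned int), "y follows x");
static_assert(offsetof(uint3, z) == 2 * sizeof(unsigned int), "z follows y");

// Size and alignment of the types discussed above.
static_assert(sizeof(char4) == 4 && alignof(char4) == 4, "char4: 4-byte aligned");
static_assert(sizeof(float2) == 8 && alignof(float2) == 8, "float2: 8-byte aligned");
static_assert(sizeof(float3) == 12 && alignof(float3) == 4, "float3: only 4-byte aligned");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4: 16-byte aligned");
```

So an array of float3 is tightly packed at 12 bytes per element, while float2 and float4 elements always start on 8- and 16-byte boundaries respectively.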

Global memory transactions are 128 bytes wide and aligned to 128 bytes. Transactions are always performed for a full warp at a time. When a warp reaches an instruction that accesses memory, say a 32-bit load from global memory, the chip performs as many transactions as are necessary to service all 32 threads in the warp. So, if all the accessed 32-bit values lie within a single 128-byte line, only one transaction is necessary. If the values come from different 128-byte lines, multiple 128-byte transactions are performed. For each transaction, the warp is put on hold for around 600 cycles while the data is fetched from memory (unless it's in the L1 or L2 caches).

So, I think the key to finding out what type of approach gives the best performance, is to consider which approach causes the fewest 128-byte memory transactions.
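
To make that counting concrete, here is a rough host-side sketch. It assumes the simplified model described above (32 threads per warp, 128-byte aligned lines) and a hypothetical helper name, lines_touched; it is a back-of-the-envelope model, not a description of what any particular chip does.

```cuda
#include <cstddef>

constexpr int kWarpSize = 32;            // threads per warp
constexpr std::size_t kLineBytes = 128;  // transaction size assumed in the text

// Counts the distinct 128-byte lines touched when lane i of a warp reads a
// small, naturally aligned value at byte address base + i * stride_bytes.
// Addresses never decrease with the lane index, so a new line is detected
// by comparing against the previous lane's line.
inline int lines_touched(std::size_t base, std::size_t stride_bytes) {
    int count = 0;
    std::size_t prev_line = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        std::size_t line = (base + lane * stride_bytes) / kLineBytes;
        if (lane == 0 || line != prev_line) {
            ++count;
            prev_line = line;
        }
    }
    return count;
}
```

Under this model, a warp reading consecutive 32-bit values (stride 4) from an aligned base touches one line, while reading only the x member of a uint2 array (stride 8) touches two, and only the x member of a float4 array (stride 16) touches four.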

Assuming that the built-in vector types are just structs, some of which have special alignment, using the vector types causes the values to be stored interleaved in memory (array of structs). So, if the warp is loading only the x values at some point, the other values (y, z, w) will be pulled into L1 as well because of the 128-byte transactions. When the warp later tries to access those, they may no longer be in L1, and so new global memory transactions must be issued. Also, if the compiler is able to issue wider instructions that read several values at once for future use, it has to keep them in registers between the point of the load and the point of use, perhaps increasing the register usage of the kernel.

On the other hand, if the values are packed into a struct of arrays, the load can be serviced with as few transactions as possible. So, when reading from the x array, only x values are loaded in the 128-byte transactions. This could cause fewer transactions, less reliance on the caches and a more even distribution between compute and memory operations.
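
The two layouts can be sketched side by side as follows. The kernel and helper names (sum_aos, sum_soa, layouts_agree) are made up for illustration, and the code only shows the access patterns; it says nothing about which one is faster on a given GPU.

```cuda
#include <cuda_runtime.h>

// AoS: one array of uint2; thread i reads both members of element i.
__global__ void sum_aos(const uint2* pairs, unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint2 p = pairs[i];  // x and y are adjacent in memory
        out[i] = p.x + p.y;
    }
}

// SoA: two separate arrays; thread i reads a[i] and b[i].
__global__ void sum_soa(const unsigned int* a, const unsigned int* b,
                        unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Hypothetical host helper: fills both layouts with the same data, runs both
// kernels, and returns true if they produce identical results.
bool layouts_agree(int n) {
    uint2* pairs = nullptr;
    unsigned int *a = nullptr, *b = nullptr, *out_aos = nullptr, *out_soa = nullptr;
    bool ok = cudaMallocManaged(&pairs, n * sizeof(uint2)) == cudaSuccess &&
              cudaMallocManaged(&a, n * sizeof(unsigned int)) == cudaSuccess &&
              cudaMallocManaged(&b, n * sizeof(unsigned int)) == cudaSuccess &&
              cudaMallocManaged(&out_aos, n * sizeof(unsigned int)) == cudaSuccess &&
              cudaMallocManaged(&out_soa, n * sizeof(unsigned int)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) {
            a[i] = static_cast<unsigned int>(i);
            b[i] = 2u * static_cast<unsigned int>(i);
            pairs[i] = make_uint2(a[i], b[i]);
        }
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        sum_aos<<<blocks, threads>>>(pairs, out_aos, n);
        sum_soa<<<blocks, threads>>>(a, b, out_soa, n);
        ok = cudaDeviceSynchronize() == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (out_aos[i] == out_soa[i]);
    }
    // cudaFree accepts null pointers, so this is safe after a failed allocation.
    cudaFree(pairs); cudaFree(a); cudaFree(b); cudaFree(out_aos); cudaFree(out_soa);
    return ok;
}
```

Calling layouts_agree(1 << 20) on a machine with a CUDA device should return true; timing the two kernels separately in a profiler is the way to see which layout suits a given access pattern.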

#2

I don't believe the built-in tuples in CUDA ([u]int[2|4], float[2|4], double[2]) have any intrinsic advantages; they exist mostly for convenience. You could define your own C++ classes with the same layout and the compiler would operate on them efficiently. The hardware does have native 64-bit and 128-bit loads, so you'd want to check the generated microcode to know for sure.
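
As a sketch of what "your own class with the same layout" could look like, here is a hypothetical Pair type (not part of CUDA) with the size and alignment of uint2, plus a trivial kernel that copies it. Whether the compiler really emits a single 64-bit load and store for it is exactly the thing to confirm in the generated code.

```cuda
#include <cuda_runtime.h>

// Hypothetical user-defined pair with the same size and alignment as uint2.
struct alignas(8) Pair {
    unsigned int a;
    unsigned int b;
};

static_assert(sizeof(Pair) == sizeof(uint2), "same size as uint2");
static_assert(alignof(Pair) == alignof(uint2), "same alignment as uint2");

// Copies one Pair per thread. The 8-byte alignment allows, but does not
// force, the compiler to use one 64-bit load and one 64-bit store here.
__global__ void copy_pairs(const Pair* in, Pair* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
```

One way to check is to compile the file and disassemble it with cuobjdump -sass, then look at the width of the global load and store instructions in copy_pairs.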

As for whether you should use an array of uint2 (array of structures, or AoS) or two arrays of uint (structure of arrays, or SoA), there are no easy answers; it depends on the application. For built-in types of convenient size (2x32-bit or 4x32-bit), AoS has the advantage that you only need one pointer to load/store each data element. SoA requires multiple base pointers, or at least multiple offsets, and separate load/store operations per element; but it may be faster for workloads that sometimes operate on only a subset of the elements.

As an example of a workload that uses AoS to good effect, look at the nbody sample (which uses float4 to hold XYZ+mass of each particle). The Black-Scholes sample uses SoA, presumably because float3 is an inconvenient element size.

#3

There's some good info in another thread that contradicts many of the major conclusions stated here.
