
Date: 2021-09-28 16:20:46

I'm getting a CUDA_EXCEPTION_5, Warp Out-of-range Address error and I'm trying to figure out the various scenarios that can cause that.


I'm porting a C project (written by somebody else) to CUDA. The C code is very register-heavy and instantiates many arrays on the stack. I assume register spilling is very likely occurring, and that it may be triggering the warp out-of-range error.


Note that I want to get it working first; I will optimize the code afterwards.


I'm using Compute Capability 3.0 hardware, which according to Wikipedia has 512KB of "local memory per thread". I read elsewhere that it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?


I'm currently executing my kernel as follows (yes I know it's ultra-slow):


dim3 grid(28800,1);

Some details (I don't know how helpful this will be):


My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?


I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?


1 Answer



512KB of "local memory per thread". I read elsewhere it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?


See the Compute Capabilities table in the CUDA C Programming Guide. Devices of compute capability 2.x and above support a maximum of 512KB of local memory per thread. The function cudaDeviceSetLimit(cudaLimitStackSize, bytesPerThread) can be used to set the per-thread stack size, which is carved out of local memory. I believe the default is 2 KB per thread.
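Since the default is only quoted from memory above, it may be safer to query it than to rely on it. The following is a minimal sketch, assuming device 0 is the current device; the helper name requestStackSize and the 16 KB value are illustrative assumptions, not something from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: request a per-thread stack size and return the value
// the runtime reports afterwards, or 0 if either call fails.
size_t requestStackSize(size_t bytesPerThread)
{
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, bytesPerThread);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 0;
    }

    size_t actual = 0;
    err = cudaDeviceGetLimit(&actual, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 0;
    }
    return actual;
}

int main()
{
    // Read the current limit before changing anything.
    size_t before = 0;
    if (cudaDeviceGetLimit(&before, cudaLimitStackSize) != cudaSuccess) {
        fprintf(stderr, "could not query the stack size limit\n");
        return 1;
    }
    printf("stack size before: %zu bytes per thread\n", before);

    // 16 KB is an arbitrary illustrative value, not a recommendation.
    size_t after = requestStackSize(16 * 1024);
    printf("stack size after:  %zu bytes per thread\n", after);
    return after == 0 ? 1 : 0;
}
```

The limit has to be set before the kernel that needs it is launched. If the kernel keeps large arrays on the stack, a stack that is too small is one plausible source of out-of-range local accesses.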


My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?


Compute capability 3.x devices can have at most 16 resident blocks per multiprocessor. This assumes that your registers/thread, threads/block, or shared memory/block do not limit the kernel to fewer than the device maximum. The Visual Profiler and the Nsight VSE CUDA Profiler report the configuration used by your kernel.
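The resident-block count can also be estimated from code. The sketch below assumes a toolkit that provides cudaOccupancyMaxActiveBlocksPerMultiprocessor (CUDA 6.5 or later, which is newer than some compute capability 3.0 setups); placeholderKernel, residentBlocksOnDevice, and the block size of 128 are hypothetical stand-ins.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; it is only inspected, never launched.
__global__ void placeholderKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

// Blocks that can be resident at once on the whole device.
int residentBlocksOnDevice(int smCount, int blocksPerSm)
{
    return smCount * blocksPerSm;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    const int threadsPerBlock = 128;  // assumed block size
    int blocksPerSm = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, placeholderKernel, threadsPerBlock, 0 /* dynamic shared memory */);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("SMs: %d, resident blocks per SM: %d, resident blocks on device: %d\n",
           prop.multiProcessorCount, blocksPerSm,
           residentBlocksOnDevice(prop.multiProcessorCount, blocksPerSm));
    return 0;
}
```

With the figures from the question, 7 SMs at the 16-block maximum give the 112 resident blocks mentioned there. Those blocks share each SM's register file among their threads; none of them receives a fixed 1/16th slice.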


Currently, you are launching only 1 thread per block. You should launch a multiple of WARP_SIZE (32) threads per block.
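A launch that follows this advice might look like the sketch below. It assumes the 28800 blocks in the question each handle one independent work item; processItems, blocksFor, and the block size of 128 are hypothetical and only illustrate the index mapping.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the ported per-item work.
__global__ void processItems(float *out, int numItems)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numItems) {  // guard threads past the end of the data
        out[i] = 2.0f * i;
    }
}

// Round up so that every work item gets a thread.
int blocksFor(int numItems, int threadsPerBlock)
{
    return (numItems + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int numItems = 28800;       // assumed: one item per original block
    const int threadsPerBlock = 128;  // 4 warps; an assumed starting point

    float *dOut = nullptr;
    if (cudaMalloc(&dOut, numItems * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    processItems<<<blocksFor(numItems, threadsPerBlock), threadsPerBlock>>>(dOut, numItems);

    // Launch-configuration errors show up here, execution errors after the sync.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(dOut);
    return err == cudaSuccess ? 0 : 1;
}
```

The bounds check in the kernel matters whenever the item count is not a multiple of the block size; without it, the extra threads in the last block would write out of range.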


I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?


At compile or JIT time the compiler performs register allocation. If there are insufficient registers per thread, the compiler spills to local memory. This decision is deterministic and is made at compile time, not at runtime. Local memory is private to each thread, so spills from concurrent threads do not land in the same addresses.
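Because the allocation is fixed at compile time, it can be inspected before the kernel ever runs. Compiling with nvcc -Xptxas -v prints register and spill counts per kernel; the same information is available at runtime through cudaFuncGetAttributes, as in this sketch. heavyKernel and usesLocalMemory are hypothetical names, and the 256-element scratch array is only an assumed example of a stack array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with a per-thread array. Because the array is indexed
// dynamically, the compiler will likely place it in local memory.
__global__ void heavyKernel(float *out, int n)
{
    float scratch[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    for (int k = 0; k < 256; ++k) {
        scratch[k] = static_cast<float>(i + k);
    }
    out[i] = scratch[(i * 7) % 256];
}

// True if the compiled kernel reserves any local memory per thread
// (register spills and stack arrays both count).
bool usesLocalMemory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, heavyKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    printf("uses local memory:       %s\n", usesLocalMemory(attr) ? "yes" : "no");
    return 0;
}
```

A non-zero local memory size is not an error in itself. It only tells you how much per-thread local storage the kernel needs, which is useful when deciding whether the stack size limit has to be raised.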


Compute capability 3.0 devices are limited to 63 registers/thread. Compute capability 3.5 devices are limited to 255 registers per thread.



