原子函数真的会使变量在CUDA中变得不稳定吗?

时间:2022-11-30 07:01:04

I've written a very simple code ask thread 0 to update a global variable while other threads keep reading that variable.But I found other threads don't really get the value.

我编写了一个非常简单的代码,请求线程0更新全局变量,而其他线程继续读取该变量。但我发现其他线程并没有真正获得该值。

Code is here, it is quite simple. Could anyone give me any suggestion how to fix it? Thanks a lot

代码在这里,很简单。任何人都可以给我任何建议如何解决它?非常感谢

__global__ void addKernel(int *c)
{
int i = threadIdx.x;
int j = 0;
if (i == 0)
{
    while(*c < 2000){
        int temp = *c;
        printf("*c = %d\n",*c);
        atomicCAS(c,temp, temp+1);
    }       
}else{
    while(*c < 1000)
    {
        j++;
    }
}

}

}

1 个解决方案

#1


2  

I'd like to make an analogy: imagine for a second that atomic operations are mutexes: for a program to be well-defined, two threads accessing a shared resource must both agree to use the mutex to access the resource exclusively. If one of the threads accesses the resource without first holding the mutex, the result is undefined.

我想做一个类比:想象一下,原子操作是互斥体:对于一个定义良好的程序,访问共享资源的两个线程必须同意使用互斥锁来独占访问资源。如果其中一个线程在没有首先持有互斥锁的情况下访问资源,则结果是未定义的。

The same thing is true for atomics: if you decide to treat a particular location in memory as an atomic variable, then all threads accessing that location should agree and treat it as such for your program to have meaning. You should only be manipulating it through atomic loads and stores, not a combination of non-atomic and atomic operations.

对于原子来说同样如此:如果你决定将内存中的特定位置视为原子变量,那么访问该位置的所有线程都应该同意并对其进行处理,以使程序具有意义。您应该只通过原子载荷和存储来操纵它,而不是非原子操作和原子操作的组合。

In other words, this:

换句话说,这个:

atomicCAS(c,temp, temp+1);

Contains an atomic load-compare-store. The resulting instruction will go all the way down to global memory to load c, do the comparison, and go all the way down to global memory to store the new value.

包含原子加载比较存储。生成的指令将一直向下到全局内存以加载c,进行比较,并一直向下到全局内存以存储新值。

But this:

但是这个:

while(*c < 2000)

Is not atomic by any means. The compiler (and the hardware) has no idea that c may have been modified by another thread. So instead of going all the way down to global memory, it will simply read from the fastest available cache. Possibly the compiler will even put the variable in a register, because it doesn't see anyone else modifying it in the current thread.

无论如何都不是原子的。编译器(和硬件)不知道c可能已被另一个线程修改过。因此,它不会一直向下到全局内存,而只是从最快的可用缓存中读取。可能编译器甚至会将变量放在寄存器中,因为它没有看到其他人在当前线程中修改它。

What you would want is something like (imaginary):

你想要的是像(想象的):

while (atomicLoad(c) < 2000)

But to the best of my knowledge there is no such construct in CUDA at the time of writing.

但据我所知,在撰写本文时,CUDA中没有这样的结构。

In this regard, the volatile qualifier may help: it tells the compiler to not optimize the variable, and consider it as "modifiable from external sources". This will trigger a load for every read of the variable, although I am not sure this load bypasses all the caches. In practice, it may work, but in theory I don't think you should rely on it. Besides, this will also disable any optimizations on that variable (such as constant propagation or promoting the variable to a register for better performance).

在这方面,volatile限定符可能有所帮助:它告诉编译器不优化变量,并将其视为“可从外部源修改”。这将触发每次读取变量的负载,但我不确定此负载是否会绕过所有缓存。在实践中,它可能有效,但理论上我认为你不应该依赖它。此外,这还将禁用对该变量的任何优化(例如,不断传播或将变量提升到寄存器以获得更好的性能)。

You may want to try the following hack (I haven't tried it):

您可能想尝试以下hack(我还没有尝试过):

while(atomicAdd(c, 0) < 2000)

This will emit an atomic instruction that does load from global memory, and therefore should see the most recent value of c. However, it also introduces an (useless in this case) atomic store.

这将发出一个从全局内存加载的原子指令,因此应该看到c的最新值。但是,它还引入了(在这种情况下无用)原子存储。

#1


2  

I'd like to make an analogy: imagine for a second that atomic operations are mutexes: for a program to be well-defined, two threads accessing a shared resource must both agree to use the mutex to access the resource exclusively. If one of the threads accesses the resource without first holding the mutex, the result is undefined.

我想做一个类比:想象一下,原子操作是互斥体:对于一个定义良好的程序,访问共享资源的两个线程必须同意使用互斥锁来独占访问资源。如果其中一个线程在没有首先持有互斥锁的情况下访问资源,则结果是未定义的。

The same thing is true for atomics: if you decide to treat a particular location in memory as an atomic variable, then all threads accessing that location should agree and treat it as such for your program to have meaning. You should only be manipulating it through atomic loads and stores, not a combination of non-atomic and atomic operations.

对于原子来说同样如此:如果你决定将内存中的特定位置视为原子变量,那么访问该位置的所有线程都应该同意并对其进行处理,以使程序具有意义。您应该只通过原子载荷和存储来操纵它,而不是非原子操作和原子操作的组合。

In other words, this:

换句话说,这个:

atomicCAS(c,temp, temp+1);

Contains an atomic load-compare-store. The resulting instruction will go all the way down to global memory to load c, do the comparison, and go all the way down to global memory to store the new value.

包含原子加载比较存储。生成的指令将一直向下到全局内存以加载c,进行比较,并一直向下到全局内存以存储新值。

But this:

但是这个:

while(*c < 2000)

Is not atomic by any means. The compiler (and the hardware) has no idea that c may have been modified by another thread. So instead of going all the way down to global memory, it will simply read from the fastest available cache. Possibly the compiler will even put the variable in a register, because it doesn't see anyone else modifying it in the current thread.

无论如何都不是原子的。编译器(和硬件)不知道c可能已被另一个线程修改过。因此,它不会一直向下到全局内存,而只是从最快的可用缓存中读取。可能编译器甚至会将变量放在寄存器中,因为它没有看到其他人在当前线程中修改它。

What you would want is something like (imaginary):

你想要的是像(想象的):

while (atomicLoad(c) < 2000)

But to the best of my knowledge there is no such construct in CUDA at the time of writing.

但据我所知,在撰写本文时,CUDA中没有这样的结构。

In this regard, the volatile qualifier may help: it tells the compiler to not optimize the variable, and consider it as "modifiable from external sources". This will trigger a load for every read of the variable, although I am not sure this load bypasses all the caches. In practice, it may work, but in theory I don't think you should rely on it. Besides, this will also disable any optimizations on that variable (such as constant propagation or promoting the variable to a register for better performance).

在这方面,volatile限定符可能有所帮助:它告诉编译器不优化变量,并将其视为“可从外部源修改”。这将触发每次读取变量的负载,但我不确定此负载是否会绕过所有缓存。在实践中,它可能有效,但理论上我认为你不应该依赖它。此外,这还将禁用对该变量的任何优化(例如,不断传播或将变量提升到寄存器以获得更好的性能)。

You may want to try the following hack (I haven't tried it):

您可能想尝试以下hack(我还没有尝试过):

while(atomicAdd(c, 0) < 2000)

This will emit an atomic instruction that does load from global memory, and therefore should see the most recent value of c. However, it also introduces an (useless in this case) atomic store.

这将发出一个从全局内存加载的原子指令,因此应该看到c的最新值。但是,它还引入了(在这种情况下无用)原子存储。