CUDA __syncthreads() not working; breakpoints hit in reversed order

Time: 2022-09-13 20:43:13

I have a problem that I think is related to __syncthreads(). I have code like this:

__device__ void prefixSumJoin(const bool *g_idata, int *g_odata, int n)
{
    __shared__ int temp[Config::bfr*Config::bfr];  // allocated on invocation  
    int thid = threadIdx.y*blockDim.x + threadIdx.x;  
    if(thid<(n>>1))
    {
        int offset = 1;
        temp[2*thid] = (g_idata[2*thid]?1:0); // load input into shared memory  
        temp[2*thid+1] = (g_idata[2*thid+1]?1:0); 
        for (int d = n>>1; d > 0; d >>= 1)                    // build sum in place up the tree  
        {   
            __syncthreads();  
            if (thid < d)  
            { 
                int ai = offset*(2*thid+1)-1; // <-- breakpoint B 
                int bi = offset*(2*thid+2)-1;
                temp[bi] += temp[ai];  
            }  
            offset *= 2; 
        } 
        if (thid == 0) { temp[n - 1] = 0; } // clear the last element

        for (int d = 1; d < n; d *= 2) // traverse down tree & build scan  
        {  
            offset >>= 1;  
            __syncthreads();  
            if (thid < d)                       
            {
                int ai = offset*(2*thid+1)-1;  
                int bi = offset*(2*thid+2)-1;
                int t = temp[ai];  
                temp[ai] = temp[bi];  
                temp[bi] += t;   
            }  
        }  
        __syncthreads();
        g_odata[2*thid] = temp[2*thid]; // write results to device memory  
        g_odata[2*thid+1] = temp[2*thid+1]; 
    }
}


__global__ void selectKernel3(...)
{
    int tidx = threadIdx.x;
    int tidy = threadIdx.y;
    int bidx = blockIdx.x;
    int bidy = blockIdx.y;
    int tid = tidy*blockDim.x + tidx;
    int bid = bidy*gridDim.x+bidx;
    int noOfRows1 = ...;
    int noOfRows2 = ...;

    __shared__ bool isRecordSelected[Config::bfr*Config::bfr];
    __shared__ int selectedRecordsOffset[Config::bfr*Config::bfr];

    isRecordSelected[tid] = false;
    selectedRecordsOffset[tid] = 0;


    __syncthreads();
    if(tidx<noOfRows1 && tidy<noOfRows2)
        if(... == ...)
            isRecordSelected[tid] = true;

    __syncthreads();
    prefixSumJoin(isRecordSelected,selectedRecordsOffset,Config::bfr*Config::bfr); // <-- breakpoint A
    __syncthreads();

    if(isRecordSelected[tid]==true){
        {
            some_instruction;// <-- breakpoint C
        ...
        }
    }
}
...
f(){
   dim3 dimGrid(13, 5);
   dim3 dimBlock(Config::bfr, Config::bfr);

   selectKernel3<<<dimGrid, dimBlock>>>(...);

}
//other file

class Config
{
public:
    static const int bfr = 16; // blocking factor = number of rows per block
public:
    Config(void);
    ~Config(void);
};

The prefixSum is from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html, with minor changes.

OK, now I set three breakpoints: A, B, and C. They should be hit in the order A, B, C. The problem is that they are hit in the order A, B*x, C, B. So at point C, selectedRecordsOffset is not ready, which causes errors. After A, B is hit a few times, but not for all iterations; then C is hit and execution continues further in the code, and then B is hit again for the rest of the loop. x differs depending on the input (for some inputs there is no reversal in the breakpoint order, so C is the last one hit).

Moreover, if I look at the thread indices that trigger the hits, A and C are hit with threadIdx.y = 0 and B with threadIdx.y = 10. How is this possible when it is the same block? Why do some threads skip the sync? There is no conditional sync. Does anyone have an idea where to look for the bug?

If you need some more clarification, just ask.
Thanks in advance for any advice how to work this out.
Adam

1 solution

#1


4  

Thou shalt not use __syncthreads() in conditional code if the condition does not evaluate uniformly across all threads of each block.
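
Applied to the code in the question: every __syncthreads() in prefixSumJoin sits inside if(thid<(n>>1)). With a 16x16 block and n = 256, only threads 0..127 reach those barriers, so the condition is not uniform across the block. A minimal sketch of one possible restructuring follows, assuming the scan is always called by all threads of the block. The names kBfr, kN, scanDemoKernel and runScanDemo are hypothetical stand-ins for illustration (kBfr mirrors Config::bfr), and the selection pattern is arbitrary test data, not the original predicate.

```cuda
#include <cuda_runtime.h>

constexpr int kBfr = 16;           // hypothetical stand-in for Config::bfr
constexpr int kN   = kBfr * kBfr;  // elements scanned per block (power of two)

// Exclusive scan of n flags. Every thread of the block must call this,
// because each __syncthreads() below is now outside any divergent branch.
__device__ void prefixSumJoin(const bool *g_idata, int *g_odata, int n)
{
    __shared__ int temp[kN];
    int thid = threadIdx.y * blockDim.x + threadIdx.x;
    bool active = thid < (n >> 1);       // only half the threads load/store

    if (active) {                        // guard the work, not the barrier
        temp[2 * thid]     = g_idata[2 * thid]     ? 1 : 0;
        temp[2 * thid + 1] = g_idata[2 * thid + 1] ? 1 : 0;
    }

    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {   // up-sweep; bounds depend only on n
        __syncthreads();                     // reached by all threads
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (thid == 0) { temp[n - 1] = 0; }      // clear the last element

    for (int d = 1; d < n; d *= 2) {         // down-sweep
        offset >>= 1;
        __syncthreads();                     // reached by all threads
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    if (active) {
        g_odata[2 * thid]     = temp[2 * thid];
        g_odata[2 * thid + 1] = temp[2 * thid + 1];
    }
}

// Hypothetical driver kernel: marks every third record as selected.
__global__ void scanDemoKernel(int *out)
{
    __shared__ bool isSelected[kN];
    __shared__ int  offsets[kN];
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    isSelected[tid] = (tid % 3 == 0);        // arbitrary test pattern
    offsets[tid] = 0;
    __syncthreads();
    prefixSumJoin(isSelected, offsets, kN);  // called by the whole block
    __syncthreads();
    out[tid] = offsets[tid];
}

// Launches one block and compares against a CPU exclusive scan.
// Returns the number of mismatches, or -1 on a CUDA error.
int runScanDemo()
{
    int *d_out = nullptr;
    int h_out[kN];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;
    scanDemoKernel<<<1, dim3(kBfr, kBfr)>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0, running = 0;
    for (int i = 0; i < kN; ++i) {
        if (h_out[i] != running) ++mismatches;
        running += (i % 3 == 0) ? 1 : 0;
    }
    return mismatches;
}
```

The loop bounds depend only on n, so all threads execute the same number of iterations and arrive at the same barriers; the per-thread conditions now guard only the shared-memory reads and writes. Calling runScanDemo() from a host main() gives a quick consistency check against the CPU result.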
