前言

之前对于CUDA的学习基本上就是不会就查，拿来就用的状况，对一些基础算法的了解不是特别深，之前在面试的时候还有被问到scan扫描算法计算数组的前缀和，表示还没有详细的了解以致只能尴尬地说不清楚，是真的贼尴尬啊，后来去学了些视频课，才逐渐有了一些些基础。（说起来之前还不知道有step complexity 和 work complexity这两种复杂度呢）

1、step and work complexity

step complexity 指可以将一个操作并行成为几个步骤，work complexity 指最终有多少步骤的工作要做；
如下图中所示：
CUDA 基础算法之reduce、scan、histogram
求一数组所有元素之和（所有元素个数为8），其step complexity为3；而work complexity为7.

2、reduce（规约）

规约的大致过程同样如下图所示：
CUDA 基础算法之reduce、scan、histogram
图中所示为规约求和，但是同样求最大最小值之类的同样可以应用规约的思想。如图中所示，如果为串行规约（数组有n个元素），那么step complexity 和work complexity的大小都是O(n-1)；利用并行规约的思想，可以达到如下效果，step complexity：O(logn)；work complexity：O(n-1)；
下面的代码样例是reduce求和的一个基本版本：

__global__ void reduce(int* dev_i, int* dev_o)
{
	extern __shared__ int sdata[];
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	sdata[tid] = dev_i[i];
	__syncthreads();
	for (unsigned int s = 1; s < blockDim.x; s *= 2)
	{
		int index = 2 * s * tid;
		if (index < blockDim.x)
		{
			sdata[index] += sdata[index + s];
		}
		__syncthreads();
	}
	if (tid == 0)
		dev_o[blockIdx.x] = sdata[0];
}

上面的代码只是规约的一个最基础的版本，我们可以利用减少内存冲突，循环展开和模板展开等操作逐渐优化整个核函数，直到达到我们能达到的最有性能。具体可以参考mark Harris 的pdf讲解，不再详细分析

2、scan（扫描）

扫描算法说白了就是求数组的前缀和（包括inclusive scan 和exclusive scan两种方式，略有差别但是整体算法过程相同）。假设输入数组为input，输出数组为output，那么应该有output[i] = output[i-1] + in[i]；对于串行算法，时间复杂度为O(n^2)，对于并行算法，又分为以下两种并行方式：

(1) Hillis and Steele scan

CUDA 基础算法之reduce、scan、histogram
此算法step complexity ：O(logN)；work complexity：O(NlogN)；有以下核函数代码：

__global__ void scan(float* d_odata, float* d_idata, int n)
{
	extern __shared__ float temp[];
	int tid = threadIdx.x;
	int pout = 0, pin = 1;
	temp[pout * n + tid] = (tid > 0) ? d_idata[tid] : 0;
	__syncthreads();
	for (int offset = 1; offset < n; offset *= 2)
	{
		pout = 1 - pout;
		pin = 1 - pout;
		if (tid >= offset)
			temp[pout * n + tid] += temp[pin * n + tid - offset];
		else
			temp[pout * n + tid] = temp[pin * n + tid];
		__syncthreads();
	}
	d_odata[tid] = temp[pout * n + tid];
}  //代码来则GPU Gems，逻辑上没什么问题，但是代码未经过本地测试有待考究

(2) Blelloch scan

此算法是更为有效的一种计算方式，包括规约和下扫两个步骤，具体示意图如下图所示：
CUDA 基础算法之reduce、scan、histogram
上述过程step complexity：O(logN)；work complexity：O(N)
详细kernel代码如下所示：

__global__ void prescan(float* d_odata, float *d_idata, int n)
{
	extern __shared__ float temp[];
	int tid = threadIdx.x;
	int offset = 1;
	//转移到共享内存
	temp[2 * tid] = d_idata[2 * tid];
	temp[2 * tid + 1] = d_idata[2 * tid + 1];
	//第一步先向上扫描
	for (int d = n >> 1; d > 0; d >>= 1)
	{
		__syncthreads();
		if (tid < d)
		{
			int ai = offset * (2 * tid + 1) - 1;
			int bi = offset * (2 * tid + 2) - 1;
			temp[bi] += temp[ai];
		}
		offset *= 2;
	}
	if (tid == 0)
		temp[n - 1] = 0;
	//向下扫描
	for (int d = 1; d < n; d *= 2)
	{
		offset >>= 1;
		__syncthreads();
		if (tid < d)
		{
			int ai = offset * (2 * tid + 1) - 1;
			int bi = offset * (2 * tid + 2) - 1;
			float t = temp[ai];
			temp[ai] = temp[bi];
			temp[bi] += t;
		}
	}
	__syncthreads();
	//写出到global内存
	d_odata[2 * tid] = temp[2 * tid];
	d_odata[2 * tid + 1] = temp[2 * tid + 1];
}//代码同样源自GPU Gems，已通过实际本机测试

3、Histogram

直方图统计是一个比较明确的工作，但是在实际并行的过程中存在着读写冲突的问题，所以为解决这一问题引入原子操作，但是原子操作会增加运行时间。下面是利用原子加的Histogram统计核函数代码。

__global__ void histo_kernel(char* buffer, long size, int* histo)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	while (i < size)
	{
		atomicAdd(&(histo[buffer[i]]), 1);
		i += stride;
	}
}

以上内容目前看来基本不能算作一篇博客，由于时间问题，现在只能将之前自己实际经历过的内容大致复述下来，并没有太多思考，后续如果有新想法和内容便再另行修改。也望各路神仙多提意见

4、Reference

1、https://blog.****.net/abcjennifer/article/details/43528407
2、https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html

秒客网

CUDA 基础算法之reduce、scan、histogram

前言