Outputting a variable-length array

Time: 2021-12-19 16:31:33

We are working on an assignment for a GPGPU course. We picked an algorithm, implemented it on the CPU and are now converting it to OpenCL.

The algorithm we've chosen loads a model as a set of triangles and rasterizes them to voxels. The voxels are defined as a VBO of point data. We then use a geometry shader to convert these points to voxels as triangles.

So our OpenCL program needs to take a list of triangles and output a variable-length list of points.

And outputting a variable length array seems to be a problem.

The solution we found is to atomically increment a counter and use that counter both as an index into the output array and as the final size of the array. Except... neither of our GPUs supports the extension for atomic operations.
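
For reference, the counter idea described above can be sketched in plain C11 as follows. This is a CPU stand-in rather than OpenCL: `reserve_slot` and `append_point` are hypothetical names, and `atomic_fetch_add` plays the role that `atomic_inc` would play in a kernel on a device that supports it. Both return the previous value of the counter, which is what makes each reserved slot unique.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Reserve one output slot. The fetch-and-add returns the counter's old
   value, so every caller gets a distinct index even under contention. */
unsigned reserve_slot(atomic_uint *counter)
{
    return atomic_fetch_add(counter, 1u);
}

/* Append one xyz position to `out`, which holds room for `capacity` points.
   After all appends, the counter's value is the number of points written. */
void append_point(float *out, size_t capacity, atomic_uint *counter,
                  const float pos[3])
{
    unsigned slot = reserve_slot(counter);
    assert(slot < capacity);          /* the buffer must be sized up front */
    out[slot * 3 + 0] = pos[0];
    out[slot * 3 + 1] = pos[1];
    out[slot * 3 + 2] = pos[2];
}
```

The point of the pattern is that the index is taken from the return value of the atomic operation, never from a separate read of the counter.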

This is what we have so far:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable

#define POS1      i0 * 3 + 0
#define POS2      i0 * 3 + 1
#define POS3      i0 * 3 + 2

void WritePosition( __global float* OutBuffer, uint inIndex, __global float* inPosition )
{
    OutBuffer[ inIndex * 3 ] = inPosition[0];
    OutBuffer[ inIndex * 3 + 1] = inPosition[1];
    OutBuffer[ inIndex * 3 + 2] = inPosition[2];
}

__kernel void Voxelize( 
    __global float* outPointcloudBuffer, 
    __global float* inTriangleBuffer, 
    __global uint* inoutIndex
)
{
    size_t i0 = get_global_id(0);
    size_t i1 = get_local_id(0);

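    // The read of inoutIndex[0] here and the update below are not atomic,
    // so work items race on both the output slot and the counter.
    WritePosition( outPointcloudBuffer, inoutIndex[0], &inTriangleBuffer[ i0 ] );

    //atomic_inc(&inoutIndex[0]);
    inoutIndex[0] = max(inoutIndex[0], (uint)i0);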
}

And the output of this is very odd. We're testing a very small model (12 triangles, 36 positions, 108 floats) and the result we get is either 31, 63 or 95: always a multiple of 32, minus 1.

How can we get the length of our variable length output array?

Thanks in advance.

1 solution

#1


4  

I would guess that this is normally tackled as follows:

  • First pass: Calculate the required size of the array on the GPU using a scan (parallel prefix sum) primitive. Apple provides an example implementation of this primitive.
  • Allocate the required resources on the host side using the result of the scan. Note that the scan result can often also be used as the output index for each individual work item.
  • Second pass (optional): Compact the array to those elements that need to be considered in the third pass.
  • Third pass: Rerun the algorithm, passing it the destination indices and the allocated array.
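
The passes above can be sketched on the CPU as follows. This is a sequential C stand-in, not OpenCL: each loop iteration plays the role of one work item, `count_points` is a hypothetical placeholder for the real per-triangle voxel count, and the scan is a plain loop rather than a parallel primitive. The optional compaction pass is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical placeholder: how many points triangle t would emit.
   Here it is simply t % 3, so some triangles emit nothing. */
uint32_t count_points(size_t t) { return (uint32_t)(t % 3); }

/* First pass: every work item only records its output count. */
void count_pass(uint32_t *counts, size_t n)
{
    for (size_t t = 0; t < n; ++t)
        counts[t] = count_points(t);
}

/* Exclusive prefix sum: offsets[t] = counts[0] + ... + counts[t-1].
   The return value is the total output size the host must allocate. */
uint32_t exclusive_scan(const uint32_t *counts, uint32_t *offsets, size_t n)
{
    uint32_t running = 0;
    for (size_t t = 0; t < n; ++t) {
        offsets[t] = running;
        running += counts[t];
    }
    return running;
}

/* Final pass: each work item writes into its own disjoint range, so no
   atomics are needed. The triangle id stands in for a real point. */
void scatter_pass(const uint32_t *counts, const uint32_t *offsets, size_t n,
                  uint32_t *out, uint32_t total)
{
    for (size_t t = 0; t < n; ++t)
        for (uint32_t k = 0; k < counts[t]; ++k) {
            assert(offsets[t] + k < total);
            out[offsets[t] + k] = (uint32_t)t;
        }
}

/* Runs count, scan, allocate, scatter. The caller frees *out. */
uint32_t voxelize_sketch(size_t n, uint32_t **out)
{
    uint32_t *counts = malloc((n ? n : 1) * sizeof *counts);
    uint32_t *offsets = malloc((n ? n : 1) * sizeof *offsets);
    assert(counts && offsets);

    count_pass(counts, n);
    uint32_t total = exclusive_scan(counts, offsets, n);

    *out = malloc((total ? total : 1) * sizeof **out);
    assert(*out);
    scatter_pass(counts, offsets, n, *out, total);

    free(counts);
    free(offsets);
    return total;
}
```

Because the total is known before the output buffer is allocated, the length of the variable-length array falls out of the scan instead of a shared counter.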

You might want to have a look at NVIDIA's OpenCL marching cubes implementation where all three passes mentioned above are implemented.

Best, Christoph
