I apply a 5-point stencil operation to a 2D array until a convergence criterion computed on that array is met. So I have multiple iterations (until convergence), and in each iteration I call the clEnqueueNDRangeKernel function to compute the new values of the 2D input array.
In practice I work with a flattened 1D array, since kernel code doesn't support 2D arrays (at least, I believe so).
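As a small illustration of the flattening mentioned above, here is a minimal sketch of a 2D-to-1D index mapping. The helper name flat_index and the row-major layout with a stride of size_x + 2 (one halo cell on each side) are assumptions chosen to match the (size_x+2) * (size_y+2) buffer size used in this post, not something the original code confirms.

```c
#include <stddef.h>

/* Hypothetical helper: map row i, column j of a grid with size_x interior
 * columns plus one halo column on each side onto a flat row-major array.
 * The stride of size_x + 2 is an assumption about the layout. */
size_t flat_index(size_t i, size_t j, size_t size_x)
{
    return i * (size_x + 2) + j;
}
```

With size_x = 4, each row holds 6 values, so the first element of row 1 sits at flat position 6.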
My issue is that I don't know how to handle the assignment between the output and input arrays. After one computing iteration (stencil operation), I want to assign the output to the input for the next iteration.
But I am confused about how to achieve this.
Below is the main loop, which calls the function:
while(!convergence)
{
    step = step + 1;
    Compute_Stencil(command_queue, global_item_size, local_item_size, kernel, x0_mem_obj, x_mem_obj, r_mem_obj, x_input, r, size_x, size_y, &error);
    convergence = sqrt(error);
    if ((convergence<epsilon) || (step>maxStep)) break;
}
where x0_mem_obj is the buffer associated with the x_input array and x_mem_obj is associated with the x_output array.
And here is the Compute_Stencil function that interests me:
void Compute_Stencil(cl_command_queue command_queue, size_t* global_item_size, size_t* local_item_size, cl_kernel kernel, cl_mem x0_mem_obj, cl_mem x_mem_obj, cl_mem r_mem_obj, double* x, double* r, int size_x, int size_y, double* error)
{
    // Launch the stencil kernel over the 2D index space
    status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
                                    global_item_size, local_item_size, 0, NULL, NULL);
    // Read the output buffer back to the host array (blocking read)
    if(clEnqueueReadBuffer(command_queue, x_mem_obj, CL_TRUE, 0,
                           (size_x+2) * (size_y+2) * sizeof(double), x, 0, NULL, NULL) != CL_SUCCESS)
        fprintf(stderr,"Error in clEnqueueReadBuffer with x_mem_obj\n");
    // Read the difference buffer back to the host array (blocking read)
    if(clEnqueueReadBuffer(command_queue, r_mem_obj, CL_TRUE, 0,
                           (size_x+2) * (size_y+2) * sizeof(double), r, 0, NULL, NULL) != CL_SUCCESS)
        fprintf(stderr,"Error in clEnqueueReadBuffer with r_mem_obj\n");
    status = clFlush(command_queue);
    if(status)
    {
        fprintf(stderr,"Failed to flush command Queue\n");
        exit(-1);
    }
    // Copy the freshly computed values from the host into the input buffer
    if(clEnqueueWriteBuffer(command_queue, x0_mem_obj, CL_TRUE, 0,
                            (size_x+2) * (size_y+2) * sizeof(cl_double), x, 0, NULL, NULL) != CL_SUCCESS)
        fprintf(stderr,"Error in clEnqueueWriteBuffer with x0_mem_obj\n");
    // Set new Argument - Outputs become Inputs
    status = clSetKernelArg(
        kernel,
        5,
        sizeof(cl_mem),
        (void*)&x0_mem_obj);
    ...
I think this is not the best method because, in each iteration, I have to read the output buffer x_mem_obj into x_input (with clEnqueueReadBuffer), write x_input to the x0_mem_obj buffer (with clEnqueueWriteBuffer), and finally set the x0_mem_obj buffer as the kernel argument at index 5 (the 6th argument). This is the same input argument that is set initially in main:
ret = clSetKernelArg(kernel, 5, sizeof(cl_mem), (void *)&x0_mem_obj);
I don't think this is a good method because performance is very bad (I suspect the read and write buffer operations cost a lot of time).
I tried not using ReadBuffer and WriteBuffer in the Compute_Stencil function and instead passing the output buffer x_mem_obj directly as the argument at index 5 for the next call:
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
                                global_item_size, local_item_size, 0, NULL, NULL);
status = clFlush(command_queue);
// Set new Argument - Outputs become Inputs
status = clSetKernelArg(
    kernel,
    5,
    sizeof(cl_mem),
    (void*)&x_mem_obj);
But the results are not valid.
Could anyone tell me how to simply pass the output array, after an NDRangeKernel call, as the input array for the next NDRangeKernel call?
Thanks
UPDATE1 :
@doqtor, thanks for your answer, but I should specify that, after computing the new values (i.e. after the NDRangeKernel call), I need to assign the newly calculated values to the input. I don't think I need to replace the output array with the input one, though: the output buffer will always be overwritten by the new values calculated from the input buffer.
In my kernel code, I have the following arguments :
__kernel void kernelHeat2D(const double diagx, const double diagy,
const double weightx, const double weighty,
const int size_x,
__global double* tab_current,
__global double* tab_new,
__global double* r)
where tab_new is the output array and tab_current is the input one. tab_current is the 6th argument (so it has index 5 in clSetKernelArg).
That's why, after the NDRangeKernel call, I think I only have to use:
// Set new Argument - Outputs become Inputs
status = clSetKernelArg(
    kernel,
    5,
    sizeof(cl_mem),
    (void*)&x_mem_obj);
UPDATE2 :
The method above in UPDATE1 doesn't work: at execution I get random difference values in the array "r" (whose buffer is r_mem_obj in my code). This array is used to compute the convergence, so I get a different number of steps on each run.
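To make the role of "r" concrete, here is a minimal host-side sketch of how a convergence value could be derived from it. It assumes that r holds the per-cell difference between two successive iterates and that the error passed to sqrt in the main loop is the sum of their squares; the function name sum_of_squares is hypothetical and the post does not show the actual computation.

```c
#include <stddef.h>

/* Assumption: r[k] is the difference between the new and old value of
 * cell k. The returned sum plays the role of `error` in the main loop,
 * where `convergence = sqrt(error)`. */
double sum_of_squares(const double *r, size_t n)
{
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k)
        sum += r[k] * r[k];
    return sum;
}
```

If the kernel reads stale or partially updated input values, the entries of r change from run to run, and so does the step at which this value drops below epsilon.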
To make it work, I have to explicitly put the following in the main loop:
while (!convergence) {
    clEnqueueNDRangeKernel();
    // Read output buffer and put it into xOutput
    clEnqueueReadBuffer( x_mem_obj, xOutput);
    // Read error buffer and put it into r
    clEnqueueReadBuffer( r_mem_obj, r);
    // Write output array to input buffer
    clEnqueueWriteBuffer( x0_mem_obj, xOutput);
    // put input buffer into input argument for next call of NDRangeKernel
    status = clSetKernelArg(
        kernel,
        5,
        sizeof(cl_mem),
        (void*)&x0_mem_obj);
}
I would like to avoid using ReadBuffer and WriteBuffer (to force xOutput into the input buffer x0_mem_obj) because it gives poor performance in terms of execution time.
Any help is welcome
1 Answer
The problem seems to be that you only set the output buffer as the input, so the kernel ends up with the same buffer as both input and output. You need to swap the buffers:
cl_mem buffer1 = clCreateBuffer(...);
cl_mem buffer2 = clCreateBuffer(...);
// Upload the initial data to both buffers
clEnqueueWriteBuffer(..., buffer1, ...);
clEnqueueWriteBuffer(..., buffer2, ...);
cl_mem *ptrInput = &buffer1;
cl_mem *ptrOutput = &buffer2;
for(..)
{
    // Re-bind both arguments every iteration, since the roles alternate
    clSetKernelArg(kernel, inputIdx, sizeof(cl_mem), ptrInput);
    clSetKernelArg(kernel, outputIdx, sizeof(cl_mem), ptrOutput);
    clEnqueueNDRangeKernel(...);
    // swap buffers
    cl_mem *ptrTmp = ptrInput;
    ptrInput = ptrOutput;
    ptrOutput = ptrTmp;
}
// ...
// Read results data back
clEnqueueReadBuffer(..., *ptrInput, ...); // read from ptrInput because we did extra swap
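The same ping-pong idea can be tried on the CPU without any OpenCL setup. The sketch below is only an illustration under stated assumptions: sweep is a hypothetical stand-in for the kernel (a plain 4-neighbour average, not the weights used in kernelHeat2D), and iterate plays the role of the host loop. In the OpenCL version the swapped items are the cl_mem handles, and both kernel arguments have to be set again before each launch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel: one sweep of a 4-neighbour average
 * over the interior of an nx-by-ny row-major grid. It reads only from `in`
 * and writes only to `out`, so no cell sees a half-updated neighbour. */
void sweep(const double *in, double *out, size_t nx, size_t ny)
{
    for (size_t i = 1; i + 1 < ny; ++i)
        for (size_t j = 1; j + 1 < nx; ++j)
            out[i * nx + j] = 0.25 * (in[(i - 1) * nx + j] + in[(i + 1) * nx + j]
                                    + in[i * nx + j - 1] + in[i * nx + j + 1]);
}

/* Run `steps` sweeps, swapping the roles of the two buffers after each one.
 * Both buffers must start with the same boundary values, because the sweep
 * never writes the boundary. Returns the buffer holding the latest result. */
double *iterate(double *a, double *b, size_t nx, size_t ny, int steps)
{
    double *in = a;
    double *out = b;
    for (int s = 0; s < steps; ++s) {
        sweep(in, out, nx, ny);
        double *tmp = in;   /* swap: the output becomes the next input */
        in = out;
        out = tmp;
    }
    return in;  /* after the final swap, `in` points at the newest data */
}
```

No data is copied between iterations; only the two pointers change places, which is why this approach avoids the ReadBuffer/WriteBuffer round trip through the host.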