
Date: 2021-04-28 19:29:17

I have a dynamically declared 2D array in my C program, the contents of which I want to transfer to a CUDA kernel for further processing. Once processed, I want to populate the dynamically declared 2D array in my C code with the CUDA processed data. I am able to do this with static 2D C arrays but not with dynamically declared C arrays. Any inputs would be welcome!


I mean the dynamic array of dynamic arrays. The test code that I have written is as below.


#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>

const int nItt = 10;
const int nP = 5;

__device__ int d_nItt = 10;
__device__ int d_nP = 5;

__global__ void arr_chk(float *d_x_k, float *d_w_k, int row_num)
{
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    int index1 = (row_num * d_nP) + index;
    if ( (index1 >= row_num * d_nP) && (index1 < ((row_num +1)*d_nP)))              //Modifying only one row data pertaining to one particular iteration
    {
        d_x_k[index1] = row_num * d_nP;
        d_w_k[index1] = index;
    }
}

float **mat_create2(int r, int c)
{
    float **dynamicArray;
    // The outer block holds r row pointers, so its element size is sizeof(float *)
    dynamicArray = (float **) malloc (sizeof (float *)*r);
    for(int i=0; i<r; i++)
    {
        dynamicArray[i] = (float *) malloc (sizeof (float)*c);
        for(int j= 0; j<c;j++)
        {
            dynamicArray[i][j] = 0;
        }
    }
    return dynamicArray;
}

/* Freeing memory - here only number of rows are passed*/
void cleanup2d(float **mat_arr, int x)
{
    int i;
    for(i=0; i<x; i++)
    {
        free(mat_arr[i]);   // release each row
    }
    free(mat_arr);          // release the array of row pointers
}

int main()
{
    //float w_k[nItt][nP]; //Static array declaration - works!
    //float x_k[nItt][nP];
    // if I uncomment this dynamic declaration and comment the static one, it does not work.....
    float **w_k = mat_create2(nItt,nP);
    float **x_k = mat_create2(nItt,nP);
    float *d_w_k, *d_x_k;       // Device variables for w_k and x_k
    int nblocks, blocksize, nthreads;
    for(int i=0;i<nItt;i++)
    {
        for(int j=0;j<nP;j++)
        {
            x_k[i][j] = (nP*i);
            w_k[i][j] = j;
        }
    }

    for(int i=0;i<nItt;i++)
    {
        for(int j=0;j<nP;j++)
        {
            printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]);
            printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]);
        }
    }
    int size1 = nItt * nP * sizeof(float);
    printf("\nThe array size in memory bytes is: %d\n",size1);
    cudaMalloc( (void**)&d_x_k, size1 );
    cudaMalloc( (void**)&d_w_k, size1 );

    blocksize = nP*nItt;
    nblocks = 1;
    blocksize = 32;     // Defines the number of threads running per block. Taken equal to warp size
    nthreads = blocksize;
    nblocks =  ceil(float(nP*nItt) / nthreads);     // Calculated total number of blocks thus required

    for(int i = 0; i< nItt; i++)
    {
        cudaMemcpy( d_x_k, x_k, size1,cudaMemcpyHostToDevice ); //copy of x_k to device
        cudaMemcpy( d_w_k, w_k, size1,cudaMemcpyHostToDevice ); //copy of w_k to device
        arr_chk<<<nblocks, blocksize>>>(d_x_k,d_w_k,i);
        cudaMemcpy( x_k, d_x_k, size1, cudaMemcpyDeviceToHost );
        cudaMemcpy( w_k, d_w_k, size1, cudaMemcpyDeviceToHost );
    }
    printf("\nVerification after return from gpu\n");
    for(int i = 0; i<nItt; i++)
    {
        for(int j=0;j<nP;j++)
        {
            printf("x_k[%d][%d] = %f\t",i,j,x_k[i][j]);
            printf("w_k[%d][%d] = %f\n",i,j,w_k[i][j]);
        }
    }
    cudaFree( d_x_k );
    cudaFree( d_w_k );
    return 0;
}

1 Answer



I mean the dynamic array of dynamic arrays.


Well, that's exactly where the problem lies. A dynamic array of dynamic arrays consists of a whole bunch of disjoint memory blocks, one for each row of the array (as is clearly seen from the malloc inside your for loop in mat_create2). So you can't copy such a data structure to device memory with just one call to cudaMemcpy*. Instead, you have to do one of the following:


  • Also use dynamic arrays of dynamic arrays on CUDA. To do this, you basically have to recreate your mat_create2 function using cudaMalloc instead of malloc, then copy each row separately.


  • Use a "tight" 2d array on CUDA, like you do now (which is a good thing, at least performance-wise!). But if you keep using dyn-dyn arrays in host memory, you still have to copy each row separately, like


    for(int i=0; i<r; ++i){
      cudaMemcpy(d_x_k + i*c, x_k[i], c*sizeof(float), cudaMemcpyHostToDevice);
    }
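
A variation on the second option, if you would rather not issue one transfer per row: pack the rows into a contiguous staging buffer on the host first. This is a minimal host-only sketch of that alternative, not something the original code does; pack_rows and unpack_rows are hypothetical names introduced here for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pack r separately allocated rows of c floats
   into one contiguous buffer. Returns NULL if the allocation fails;
   the caller frees the result. */
float *pack_rows(float **rows, int r, int c)
{
    float *flat = (float *)malloc((size_t)r * c * sizeof(float));
    if (flat == NULL)
        return NULL;
    for (int i = 0; i < r; ++i)
        memcpy(flat + (size_t)i * c, rows[i], (size_t)c * sizeof(float));
    return flat;
}

/* Hypothetical helper: scatter a contiguous r*c buffer back into the
   separately allocated rows. */
void unpack_rows(float **rows, const float *flat, int r, int c)
{
    for (int i = 0; i < r; ++i)
        memcpy(rows[i], flat + (size_t)i * c, (size_t)c * sizeof(float));
}
```

The packed buffer can then go to the device in one call, e.g. cudaMemcpy(d_x_k, flat, r*c*sizeof(float), cudaMemcpyHostToDevice), and unpack_rows restores the dyn-dyn layout after the copy back. Whether this beats the per-row loop depends on the array sizes, so treat it as an option to measure rather than a rule.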

You may wonder "why did it work with a static 2d array, then"? Well, static 2d arrays in C are proper, tight arrays that can be copied in one go. It's a bit confusing that these are indexed with exactly the same syntax as dyn-dyn arrays (arr[x][y]), because under the hood the two work completely differently.
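
To make that difference concrete, here is a small host-only sketch. The function names and the N_ROWS/N_COLS constants are made up for illustration (the sizes mirror nItt and nP from the question).

```c
#include <stddef.h>

#define N_ROWS 10
#define N_COLS 5

/* Byte distance between the starts of row 0 and row 1 of a static
   2D array. Because the rows sit back to back in one block, this is
   always N_COLS * sizeof(float). */
size_t static_row_stride(void)
{
    static float a[N_ROWS][N_COLS];
    return (size_t)((char *)&a[1][0] - (char *)&a[0][0]);
}

/* Static layout: a[i][j] is address arithmetic inside a single block,
   roughly *(base + i*N_COLS + j). */
float read_static(float a[][N_COLS], int i, int j)
{
    return a[i][j];
}

/* Dyn-dyn layout: a[i][j] first loads the row pointer a[i] from the
   pointer array, then reads element j of whatever block it points to.
   Nothing guarantees that consecutive rows are adjacent in memory. */
float read_dyn(float **a, int i, int j)
{
    return a[i][j];
}
```

Both accessors are written as a[i][j], but only the static array can be handed to cudaMemcpy as one block of N_ROWS * N_COLS * sizeof(float) bytes.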


But you should consider using tight arrays on host memory, too, perhaps with an object-oriented wrapper like


typedef struct {
  float* data;
  int n_rows, n_cols;
} tight2dFloatArray;

#define INDEX_TIGHT2DARRAY(arr, y, x)\
  (arr).data[(y)*(arr).n_cols + (x)]
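
For illustration, the struct and macro above could be paired with a few helpers along these lines. This is only a sketch: tight2d_create, tight2d_bytes and tight2d_destroy are hypothetical names, not part of any library.

```c
#include <stdlib.h>

typedef struct {
  float* data;
  int n_rows, n_cols;
} tight2dFloatArray;

#define INDEX_TIGHT2DARRAY(arr, y, x)\
  (arr).data[(y)*(arr).n_cols + (x)]

/* Hypothetical helper: allocate all rows in one block. calloc fills it
   with zero bytes, which read as 0.0f on IEEE 754 platforms.
   On failure, the returned struct has data == NULL. */
tight2dFloatArray tight2d_create(int n_rows, int n_cols)
{
    tight2dFloatArray arr;
    arr.n_rows = n_rows;
    arr.n_cols = n_cols;
    arr.data = (float *)calloc((size_t)n_rows * n_cols, sizeof(float));
    return arr;
}

/* Hypothetical helper: total payload size in bytes. */
size_t tight2d_bytes(const tight2dFloatArray *arr)
{
    return (size_t)arr->n_rows * arr->n_cols * sizeof(float);
}

/* Hypothetical helper: release the single block. */
void tight2d_destroy(tight2dFloatArray *arr)
{
    free(arr->data);
    arr->data = NULL;
}
```

With this layout, arr.data and tight2d_bytes(&arr) are exactly the source pointer and byte count that a single cudaMemcpy to a device buffer of the same size would take.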

Such an approach can of course be implemented much more safely as a C++ class.


*You also can't copy it inside main memory with just one memcpy: that only copies the array of pointers, not the actual data.
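
The footnote can be checked with a few lines of host code. In this sketch, shallow_copy_rows is a hypothetical name for the "one memcpy" attempt on a dyn-dyn array.

```c
#include <string.h>

/* Copies only the r row pointers from src to dst. No float data is
   duplicated, so afterwards dst[i] and src[i] point to the same row. */
void shallow_copy_rows(float **dst, float **src, int r)
{
    memcpy(dst, src, (size_t)r * sizeof(float *));
}
```

After the call, writing through dst[i][j] also changes src[i][j], because both pointer arrays refer to the same row blocks. The same thing happens when the pointer array is passed to cudaMemcpy: host addresses get transferred, not the floats.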




