There are many ways to get your 2D array shipped from CPU to GPU in CUDA, but I think this particular method is a great lesson in arrays and memory in general.

The basic concept is easy: flatten out the 2D array into a single dimensional array.

We have to flatten the array so that we have contiguous memory when using cudaMemcpy. There isn’t really a good way to copy over the array in 2 dimensions. We’ll have to send it to the GPU as a single-dimensional array, and then figure out the 2D structure of the data on the other side.

creating a dynamic array the “wrong” way

The way I typically define a 2D dynamic array is as an array of pointers:

int rows = 2, cols = 3; // dimensions used throughout this example
int** A = new int*[rows];
for (int i = 0; i < rows; ++i) {
 A[i] = new int[cols];
 for (int j = 0; j < cols; ++j) {
  A[i][j] = i*cols+j;
 }
}

This creates the following matrix:

0 1 2
3 4 5

Of course, what this actually creates is an array of 2 row pointers, each pointing to its own separately allocated array of ints. So what we really have is:

A[0] -----> * * *
            | | |
            | | |
            v v v
            0 1 2

A[1] -----> * * *
            | | |
            | | |
            v v v
            3 4 5

why is this wrong?

It gets tricky when we try using cudaMalloc and cudaMemcpy, which are very similar to malloc and memcpy, respectively. To make this easier to read, we’ll just use the latter functions for now.
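In case you haven't seen the CUDA calls before, here's a rough sketch of how the two pairs line up (n is just some element count and src some existing host buffer; both are placeholders):

// host only: malloc gives you a buffer, memcpy fills it
int* h = (int *)malloc(sizeof(int) * n);
memcpy(h, src, sizeof(int) * n);

// CUDA: cudaMalloc gives you a device buffer, cudaMemcpy fills it
// (note the extra argument saying which direction the copy goes)
int* d;
cudaMalloc((void **)&d, sizeof(int) * n);
cudaMemcpy(d, src, sizeof(int) * n, cudaMemcpyHostToDevice);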

Let’s say we try to malloc and memcpy our way into a flattened 1D clone of our 2D array A.

bad copy #1

int size = rows * cols;

int* B = (int *)malloc(sizeof(int) * size);
memcpy(B, A, sizeof(int) * size); // copies from the array of row pointers, not from the ints

This doesn’t work. That’s because we’re copying the 2 contiguous row pointers of A into B (plus whatever happens to sit after them in memory):

B:    *     *     * * * *
      |     |     |
      |     |     |
      v     v     v
      A[0]  A[1]  ???
      |     |     
      |     +----> * * *    
      +--> * * *   | | |
           | | |   | | |
           | | |   v v v
           v v v   3 4 5
           0 1 2  

But the actual values in each row live somewhere else entirely, in non-contiguous space.

So what if we try copying A[0] instead of A?

bad copy #2

memcpy(B, A[0], sizeof(int) * size);

This is closer…

B:    *        *        *        * * *
      |        |        |        |  
      |        |        |        | 
      v        v        v        v   
      A[0][0]  A[0][1]  A[0][2]  ? ? ?

But the problem now is that A[0][2] and A[1][0] are not guaranteed to be contiguous in memory in a dynamic array of pointers, so this also fails. That word “guaranteed” is doing a lot of work here. If this were a static array, int A[2][3], we’d be fine, but we have a dynamic array where each row came from its own new[], so we can’t be certain that A[1][0] sits right after A[0][2].

This isn’t a problem on the host, but it is definitely a problem when we’re trying to copy memory out to the GPU. We need the memory to be contiguous in order to hand it to cudaMemcpy in a single call.
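If you want to see that non-guarantee for yourself, one quick (purely illustrative) check is to print the address one past the end of row 0 next to the address of row 1; with the row-by-row allocation above, nothing forces them to match:

// needs <cstdio> for printf; A is the row-by-row array from earlier
printf("%p\n", (void *)(A[0] + cols)); // one past the last element of row 0
printf("%p\n", (void *)A[1]);          // first element of row 1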

using contiguous memory

We could just copy the original array into a new single-dimensional array on the host, and then ship it from there. That would get us contiguous memory on the host, but it costs an extra O(rows * cols) pass over the data and rows * cols ints of extra space. That’s pretty wasteful, since it isn’t even necessary!
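For comparison, that brute-force flatten would look something like this (just a sketch of the approach we’re trying to avoid):

// copy every element of A into a brand new, contiguous host buffer
int* flat = new int[rows * cols];
for (int i = 0; i < rows; ++i) {
 for (int j = 0; j < cols; ++j) {
  flat[i*cols + j] = A[i][j];
 }
}
// flat could now be handed to cudaMemcpy, but we paid for an extra
// pass over the data and an extra rows * cols ints of storage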

In order to get this to work, we’ll have to refine how we’ve created the original matrix.

static array? (still wrong)

As mentioned above, we could simply change this to a static array:

// in standard C++, rows and cols need to be compile-time constants here
int A[rows][cols];
for (int i = 0; i < rows; ++i) {
 for (int j = 0; j < cols; ++j) {
  A[i][j] = i*cols+j;
 }
}

Now our array is neatly packed into contiguous slots in memory, so the bad copy #2 solution above would work nicely.

why is this still wrong?

It’s not necessarily wrong… It works, but we started the problem by saying we wanted a dynamic array, not a static one. A static array’s dimensions have to be known at compile time, so it can’t handle sizes that only show up at runtime, and it doesn’t fit the constraints of our problem.

the right way

Obviously this isn’t the only “right” way to solve this problem. Of course there are many solutions, but for now just consider this the “right” one.

int** A = new int*[rows];
A[0] = new int[rows * cols];
for (int i = 1; i < rows; ++i) A[i] = A[i-1] + cols;

for (int i = 0; i < rows; ++i) {
 for (int j = 0; j < cols; ++j) {
  A[i][j] = i*cols+j;
 }
}

why is this “right”?

I consider this “right” because it accomplishes two things:

  1. The matrix is a 2D dynamic matrix on the host
  2. The memory is contiguous on the host

The magic here happens in the first 3 lines of this solution.

int** A = new int*[rows];

Here A becomes an array of rows pointers (one per row), all of them currently uninitialized.

A: * *

Next…

A[0] = new int[rows * cols];

We allocate the entire array at A[0].

A: * *
   |
   |
   +-> * * * * * *

This means that all of the memory used for this array is contiguous, starting at A[0]. At this point we could have just flattened the array to begin with, since we’ve essentially created a 1D version of the array in memory. But don’t forget about that first line, where we created rows pointers of type int*. Now we can tie that back into the picture:

for (int i = 1; i < rows; ++i) A[i] = A[i-1] + cols;

A[0] is already pointing to the beginning of our full array, so we start at A[1]. A[1] is set to A[0] (which is a pointer) plus an offset of cols ints, which is plain pointer arithmetic.

The result after the first iteration is as follows:

A: *         * 
   |         |
   |         v
   +-> * * * * * *

This would repeat for however many rows we have. In this example we’ve only had 2, but a 3 x 2 array would end up looking more like this:

A: *       *     *
   |       |     |
   |       v     v
   +-> * * * * * *

Now we have a dynamic array that is contiguous in memory! We can use memcpy and/or cudaMemcpy by referencing A[0], and we can still use normal 2D array syntax to index into it on the host.
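As a small aside, the stitching loop can just as well be written directly off of A[0], which some people find easier to read. And since this layout only has two allocations, cleaning up is only two deletes:

// equivalent stitching: every row pointer is an offset from A[0]
for (int i = 1; i < rows; ++i) A[i] = A[0] + i*cols;

// when we're done with the array: two allocations, two deletes
delete[] A[0]; // the contiguous block of ints
delete[] A;    // the array of row pointers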

The last step is to just copy this with malloc-like and memcpy-like syntax:

// get pointer to allocated device (GPU) memory
int *dA;
cudaMalloc((void **)&dA, sizeof(int) * rows * cols);

// copy host memory to device (pointing at A[0])
cudaMemcpy(dA, A[0], sizeof(int) * rows * cols, cudaMemcpyHostToDevice);

Don’t mind the weird void ** syntax here… That’s another discussion. For consistency with the rest of the example, here’s what it looks like with plain memcpy and malloc:

int* B = (int *)malloc(sizeof(int) * rows * cols);
memcpy(B, A[0], sizeof(int) * rows * cols);
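And if you ever want to sanity-check what actually landed on the device, the trip back with cudaMemcpy is symmetric (this assumes dA was allocated and filled as in the CUDA snippet above):

// copy the device buffer back into a fresh host buffer
int* C = (int *)malloc(sizeof(int) * rows * cols);
cudaMemcpy(C, dA, sizeof(int) * rows * cols, cudaMemcpyDeviceToHost);
// C[i*cols + j] should now match A[i][j]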

flat index to [row][col]

The last thing you’ll need is the translation of the flat index to [row][col]:

index to row / col

// need to pass `rows` and `cols` to GPU, too
for (int i = 0; i < rows * cols; ++i) {
 int row = i / cols;
 int col = i % cols;
 // row and col now address the same element as A[row][col] did on the host
}
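On the device side, a kernel typically recovers row and col from its thread index the same way. Here's a minimal sketch (the kernel name and the per-element work are made up; dA, rows, and cols are the values we copied and passed over above):

__global__ void exampleKernel(int* dA, int rows, int cols) {
 int i = blockIdx.x * blockDim.x + threadIdx.x; // flat index into dA
 if (i < rows * cols) {
  int row = i / cols; // which row this element belongs to
  int col = i % cols; // which column
  dA[i] = row + col;  // placeholder per-element work using the 2D coordinates
 }
}

// hypothetical launch: one thread per element
// exampleKernel<<<(rows*cols + 255) / 256, 256>>>(dA, rows, cols);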