There are many ways to ship a 2D array from the CPU to the GPU in CUDA, but I think this particular method is a great lesson in arrays and memory in general.

The basic concept is easy: flatten out the 2D array into a single dimensional array.

We have to flatten the array so that we have contiguous memory when using `cudaMemcpy`. There isn’t really a good way to copy over the array in 2 dimensions. We’ll have to send it to the GPU as a single-dimensional array, and then figure out the 2D structure of the data on the other side.

### creating a dynamic array the “wrong” way

The way I typically define a 2D dynamic array is as an array of pointers:
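A minimal sketch of that array-of-pointers allocation (the helper name and the `rows`/`cols` parameters are my own):

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix as an array of row pointers,
 * where each row is its own separately malloc'd block. */
int **alloc_matrix_naive(int rows, int cols) {
    int **A = malloc(rows * sizeof(int *));
    for (int i = 0; i < rows; i++) {
        A[i] = malloc(cols * sizeof(int));
    }
    return A;
}
```

Filling a `2 x 3` matrix allocated this way with the values `0` through `5` gives the picture below.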

This creates the following matrix:

```
0 1 2
3 4 5
```

Of course we know that this actually creates 2 pointers, `A[0]` and `A[1]`, to separately allocated arrays (and `A` itself is a pointer to those pointers…). So what we really have is:

```
A[0] ---> * * *
          | | |
          | | |
          v v v
          0 1 2

A[1] ---> * * *
          | | |
          | | |
          v v v
          3 4 5
```

#### why is this wrong?

It gets tricky when we try using `cudaMalloc` and `cudaMemcpy`, which are very similar to `malloc` and `memcpy`, respectively. To make this easier to read, we’ll just use the latter functions for now.

Let’s say we try to `malloc` and `memcpy` our way into a flattened 1D clone of our 2D array `A`.

#### bad copy #1
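The attempt itself isn’t shown; a sketch of the idea is below. The helper name is mine, and the copy size is capped at `rows * sizeof(int *)` purely so the demo stays in bounds (copying the full `rows * cols * sizeof(int)` from `A` would read past `A`’s allocation):

```c
#include <stdlib.h>
#include <string.h>

/* Demonstrates bad copy #1: memcpy from A copies the row POINTERS,
 * not the row data. Returns 1 if B ended up holding pointer bytes
 * instead of the matrix values 0..5. */
int bad_copy_demo(void) {
    int rows = 2, cols = 3;
    int **A = malloc(rows * sizeof(int *));
    for (int i = 0; i < rows; i++) {
        A[i] = malloc(cols * sizeof(int));
        for (int j = 0; j < cols; j++)
            A[i][j] = i * cols + j;
    }

    int *B = malloc(rows * cols * sizeof(int));
    /* The bad copy: the source is A itself, so the bytes copied are
     * the two row pointers, not the ints they point at. */
    memcpy(B, A, rows * sizeof(int *));

    return memcmp(B, A, rows * sizeof(int *)) == 0;
}
```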

This doesn’t work. That’s because here we’re copying the 2 contiguous row pointers of `A` into `B`:

```
B:  *     *     ? ? ? ?
    |     |
    |     |
    v     v
   A[0]  A[1]
    |     |
    |     +----> * * *
    +--> * * *   | | |
         | | |   | | |
         | | |   v v v
         v v v   3 4 5
         0 1 2
```

But the actual row data lives somewhere else, in non-contiguous space.

So what if we try copying from `A[0]` instead of `A`?

#### bad copy #2
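Again a sketch of the attempt (helper name mine; the copy is capped at one row here because the intended `memcpy(B, A[0], rows * cols * sizeof(int))` would read past `A[0]`’s separately allocated block):

```c
#include <stdlib.h>
#include <string.h>

/* Demonstrates bad copy #2: memcpy from A[0] copies real data, but
 * only A[0]'s own row is guaranteed to live there. Returns 1 if the
 * first row copied correctly; the rest of B stays garbage. */
int bad_copy_2_demo(void) {
    int rows = 2, cols = 3;
    int **A = malloc(rows * sizeof(int *));
    for (int i = 0; i < rows; i++) {
        A[i] = malloc(cols * sizeof(int));
        for (int j = 0; j < cols; j++)
            A[i][j] = i * cols + j;
    }

    int *B = malloc(rows * cols * sizeof(int));
    /* Only A[0]'s row is safely reachable from A[0]; A[1] is a
     * separate allocation somewhere else on the heap. */
    memcpy(B, A[0], cols * sizeof(int));

    return B[0] == 0 && B[1] == 1 && B[2] == 2;
}
```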

This is closer…

```
B:  *        *        *        * * *
    |        |        |        |
    |        |        |        |
    v        v        v        v
  A[0][0]  A[0][1]  A[0][2]  ? ? ?
```

But the problem now is that `A[0]` and `A[1]` are not contiguous in memory in a *dynamic* array of pointers, so this also fails. Note the italics here. If this were a static array, `int A[2][3]`, we’d be fine, but we have a dynamic array… We can’t be certain that `A[0]`’s block is contiguous with `A[1]`’s.

This isn’t a problem on the host device, but it is definitely a problem when we’re trying to copy memory out to the GPU. We need the memory to be contiguous in order to use the CUDA API.

### using contiguous memory

We could just copy the original array into a new single-dimensional array on the host, and then ship it from there. That would get us contiguous memory on the host, but it costs `rows * cols` extra copy operations and `rows * cols` extra space. That’s pretty bad, since it isn’t even necessary!

In order to get this to work, we’ll have to refine how we’ve created the original matrix.

#### static array? (still wrong)

As mentioned above, we could simply change this to a static array:
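That change might look like this (a `2 x 3` matrix assumed):

```c
/* A static 2 x 3 array: all six ints sit in one contiguous block,
 * so a single memcpy starting at A[0] would copy everything. */
int A[2][3] = {
    {0, 1, 2},
    {3, 4, 5},
};
```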

Now our array is neatly packed into contiguous slots in memory, so the bad copy #2 solution above would work nicely.

#### why is this still wrong?

It’s not necessarily wrong… It works, but we started the problem by saying we wanted to use a dynamic array, not a static array. So this solution doesn’t fit the constraints of our problem.

## the right way

Obviously this isn’t the only “right” way to solve this problem. Of course there are many solutions, but for now just consider this the “right” one.
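A sketch of the idea (the helper names are mine; `rows` and `cols` are assumed parameters):

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix that is BOTH a 2D array of row
 * pointers on the host AND one contiguous block of rows * cols ints. */
int **alloc_matrix_contiguous(int rows, int cols) {
    int **A = malloc(rows * sizeof(int *));   /* 1: rows row pointers      */
    A[0] = malloc(rows * cols * sizeof(int)); /* 2: one contiguous block   */
    for (int i = 1; i < rows; i++)            /* 3: aim the other pointers */
        A[i] = A[0] + i * cols;               /*    into that same block   */
    return A;
}

/* Self-check: returns 1 if every row pointer lands exactly cols ints
 * past the previous one, i.e. the rows are packed contiguously. */
int rows_are_contiguous(int **A, int rows, int cols) {
    for (int i = 1; i < rows; i++)
        if (A[i] != A[0] + i * cols)
            return 0;
    return 1;
}
```

The first three statements are the interesting part: one allocation for the row pointers, one allocation for all of the data, and a loop that points each row into the single block.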

#### why is this “right?”

I consider this “right” because it accomplishes two things:

1. The matrix is a 2D dynamic matrix on the host
2. The memory is contiguous on the host

The magic here happens in the first 3 lines of this solution.

Here we point `A` at `rows` pointers to arrays, which are currently uninitialized.

```
A:  *  *
```

Next…

We allocate the entire array at `A[0]`.

```
A:  *   *
    |
    |
    +-> * * * * * *
```

This means that all of the memory to be used for this array is contiguous, starting at `A[0]`. At this point we could have just flattened the array to begin with, since we’ve essentially created a 1D version of the array in memory. But don’t forget about that first line, where we created `rows` pointers of type `int *`. Now we can tie those back into the picture:
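The tie-back is a loop over the remaining row pointers; a sketch (helper names are mine):

```c
#include <stdlib.h>

/* Given A[0] already pointing at a contiguous rows * cols block,
 * aim the remaining row pointers into that same block. */
void link_rows(int **A, int rows, int cols) {
    for (int i = 1; i < rows; i++)
        A[i] = A[0] + i * cols; /* pointer arithmetic: jumps cols ints */
}

/* Tiny self-check: A[1] should land exactly cols ints past A[0]. */
int link_rows_ok(void) {
    int rows = 2, cols = 3;
    int **A = malloc(rows * sizeof(int *));
    A[0] = malloc(rows * cols * sizeof(int));
    link_rows(A, rows, cols);
    return A[1] == A[0] + cols;
}
```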

`A[0]` is already pointing to the beginning of our full array, so we start at `A[1]`. `A[1]` is set to `A[0]` (which is a pointer) plus an offset of `cols` spaces in memory (which means pointer arithmetic).

The result after the first iteration is as follows:

```
A:  *         *
    |         |
    |         v
    +-> * * * * * *
```

This would repeat for however many rows we have. In this example we’ve only had 2, but a `3 x 2` array would end up looking more like this:

```
A:  *       *   *
    |       |   |
    |       v   v
    +-> * * * * * *
```

Now we have a dynamic array that is contiguous in memory! We can use `memcpy` and/or `cudaMemcpy` by referencing `A[0]`, and we can also use 2D array syntax to find elements in the array on the host.

The last step is to just copy this with `malloc`-like and `memcpy`-like syntax:
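The copy itself isn’t reproduced here; a sketch using the CUDA runtime API (the device pointer name `d_A` and the helper are my own):

```cuda
#include <cuda_runtime.h>

// Allocate a flat rows * cols block on the GPU and copy the
// contiguous host matrix into it, starting at A[0].
int *copy_matrix_to_device(int **A, int rows, int cols) {
    int *d_A;
    cudaMalloc((void **)&d_A, rows * cols * sizeof(int));
    cudaMemcpy(d_A, A[0], rows * cols * sizeof(int),
               cudaMemcpyHostToDevice);
    return d_A;
}
```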

Don’t mind the weird `void **` syntax here… That’s another discussion. For consistency with the rest of the example, here’s what it looks like with plain `memcpy` and `malloc`:
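A host-only analogue as a sketch (helper names mine):

```c
#include <stdlib.h>
#include <string.h>

/* Host-only analogue of the device copy: clone the contiguous
 * block behind A into a flat array B with one memcpy. */
int *flatten(int **A, int rows, int cols) {
    int *B = malloc(rows * cols * sizeof(int));
    memcpy(B, A[0], rows * cols * sizeof(int));
    return B;
}

/* Self-check: build the contiguous matrix from earlier, fill it
 * with 0..5, flatten, and verify the flat order. */
int flatten_ok(void) {
    int rows = 2, cols = 3;
    int **A = malloc(rows * sizeof(int *));
    A[0] = malloc(rows * cols * sizeof(int));
    for (int i = 1; i < rows; i++)
        A[i] = A[0] + i * cols;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            A[i][j] = i * cols + j;
    int *B = flatten(A, rows, cols);
    return B[0] == 0 && B[4] == 4 && B[5] == 5;
}
```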

## flat index to [row][col]

The last thing you’ll need is the translation from a flat index back to `[row][col]`:
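A sketch of that translation for a row-major layout (function names are mine):

```c
/* For a row-major rows x cols matrix flattened into one array:
 * recover [row][col] from a flat index, and vice versa. */
int row_of(int idx, int cols) { return idx / cols; }
int col_of(int idx, int cols) { return idx % cols; }
int flat_of(int row, int col, int cols) { return row * cols + col; }
```

For the `2 x 3` example, the value `5` sits at flat index `5`, which maps back to `A[1][2]`.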