Adding Two Integers
Overview
Teaching: 10 min
Exercises: 15 min
Questions
How does GPU memory work?
Objectives
To send data from the CPU to the GPU and back
In this section, we will write some code that has the GPU add two numbers. Trivial, right? Not as trivial as you might think, because a GPU card has completely separate memory from the CPU. Things stored in one are not accessible from the other; they have to be copied back and forth.
So to add two numbers on the GPU that start on our keyboard, we need to first store them in the CPU (or “host”) memory, then move them from there to the GPU (or “device”) memory, and finally move the result back from the device memory to the host memory.
This language will come up again and again in CUDA: The CPU and its associated memory is the host, while the GPU and its associated memory is the device.
Here’s a kernel function that will do the addition on the GPU:
__global__ void add(int *da, int *db, int *dc) {
*dc = *da + *db;
}
Those asterisks may be unfamiliar to you if you haven’t done much C programming. They mean that the parameters being supplied are not the integers to be added, but instead pointers to where those integers are in memory. This is because the kernel function is called from the host CPU, but executes on the GPU and hence needs to point to memory locations within the GPU memory.

The line *dc = *da + *db says “take the values at addresses da and db, add them together, and store the result at the address dc.” So da, db and dc are locations in memory, and *da, *db and *dc are the values stored at those locations. I’ve prefixed all the names with “d” for “device” to remind us that they’re locations in the GPU memory.
We’ll also need to determine the address (that is, the storage location) of a few variables. The C operator to do that is the ampersand, &. &x returns the address of x, which is to say, a pointer to x.
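If pointers are new to you, here is a small plain-C refresher (nothing CUDA-specific; the variable names are just for illustration):

#include <stdio.h>

int main(void) {
    int x = 7;
    int *px = &x;     /* px holds the address of x             */
    *px = 42;         /* writing through the pointer changes x */
    printf("x = %d, *px = %d\n", x, *px);   /* prints: x = 42, *px = 42 */
    return 0;
}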
Memory Allocation
In C programs, there are two ways that memory for data can be allocated. The first is to define it statically when a variable is declared.

int a;

…declares an integer a and the space to hold it.

The second way is to allocate it dynamically and keep a pointer to that area of memory.
int *a;
a = (int *)malloc(sizeof(int));

…declares a pointer to an integer, and then malloc finds space to put it and we put the address of that space into a. This is what is almost always done for data arrays, since it allows you to choose the size of the array at run time rather than compile time.

If you use malloc() or any of its cousins to allocate memory, you are responsible for giving it back again when you are done, with the free() function:

free(a);
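For example, here is a sketch of dynamically allocating an array whose size is only known at run time (the name and size are just illustrative):

#include <stdlib.h>

int main(void) {
    int n = 1000;                                /* size chosen at run time      */
    int *data = (int *)malloc(n * sizeof(int));  /* reserve space for n integers */
    if (data == NULL)
        return 1;                                /* allocation failed            */
    for (int i = 0; i < n; i++)
        data[i] = i;
    free(data);                                  /* give the memory back         */
    return 0;
}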
There are CUDA variants of malloc and free that handle allocation of memory on the GPU. These functions deal with pointers to pointers(!), so it looks like the following:
int *d_a;
cudaMalloc((void **)&d_a, sizeof(int));
You then need to copy data from the CPU memory to the GPU memory with another function from the CUDA library. This looks like:
cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
The order of arguments here is
- destination address,
- source address,
- number of bytes, and
- cudaMemcpyKind.
That last is a symbolic constant defined by the CUDA library; here it will be either cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.
To copy results back to the host memory you use the same function, with the device address as the source, the host address as the destination, and cudaMemcpyDeviceToHost in the last position.
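For our running example, where the result lives at d_c on the device and in c on the host, the copy back looks like this:

cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);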
Exercise: Complete the code
Using the template below and the bits and pieces just shown, build and test the code to have the GPU card add two integers.
You’ll need these pieces:
- The kernel function add() at the top of this page
- Patterns for cudaMalloc() and cudaMemcpy()
- You’ll need to allocate GPU memory for two inputs and one output
- Call the kernel function with add<<<1,1>>>(...)
- Print the result with printf("%d plus %d equals %d\n", a, b, c);
- Release the allocated GPU memory with cudaFree()
/* TEMPLATE CODE */
#include <stdio.h>
#include <stdlib.h>

// ... define function 'add' ...

int main(int argc, char **argv) {
    int a, b, c;            // We've chosen static allocation here for host storage...
    int *d_a, *d_b, *d_c;   // ...but device storage must be dynamically allocated

    a = atoi(argv[1]);      // Read the addends from the command line args
    b = atoi(argv[2]);

    // ... manage memory ...
    // ... move data ...

    add<<<1,1>>>(d_a, d_b, d_c);   // call kernel function on GPU

    // ... move data ...

    printf("%d + %d -> %d\n", a, b, c);

    // ... manage memory ...
}
Solution
#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *da, int *db, int *dc) {
    *dc = *da + *db;
}

int main(int argc, char **argv) {
    int a, b, c;            // We've chosen static allocation here for host storage...
    int *d_a, *d_b, *d_c;   // ...but device storage must be dynamically allocated

    a = atoi(argv[1]);      // Read the addends from the command line args
    b = atoi(argv[2]);

    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));

    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    add<<<1,1>>>(d_a, d_b, d_c);

    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d + %d -> %d\n", a, b, c);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
Compile with nvcc add.cu -o add, then test with srun --gres=gpu:1 ./add 1 2.
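If everything works, you should see:

1 + 2 -> 3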
Oh, and one more thing: We should add cudaDeviceSynchronize()
just before we
copy back the result. The CPU code is not required to wait for the GPU code to
complete before it continues. This call will force it to wait. This is another
one of those errors that will probably not occur until long after you’ve
stopped expecting it, and then when it does, you’ll have no clue what’s going
on.
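In the solution above, that means inserting one line between the kernel launch and the copy back:

add<<<1,1>>>(d_a, d_b, d_c);

cudaDeviceSynchronize();   // make the CPU wait until the kernel has finished

cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);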
Result is zero?
If you get:
$ srun --gres=gpu:1 ./add 1 2
1 + 2 -> 0
…you may have forgotten to load the CUDA module (module load cuda/11.4).
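Silent failures like this are easier to spot if you check the return codes of the CUDA calls. A minimal sketch, wrapping just the first allocation (in real code you would check every call):

cudaError_t err = cudaMalloc((void **)&d_a, sizeof(int));
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}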
We still haven’t done anything in parallel. That’s next.
Key Points
The CPU (the ‘host’) and the GPU (the ‘device’) have separate memory banks
This requires explicit copying of data to and from the GPU