Adding Two Integers

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • How does GPU memory work?

Objectives
  • To send data from CPU to GPU and back

In this section, we will write some code that will have the GPU add two numbers. Trivial, right? Not as trivial as you might think, because a GPU card has completely separate memory from the CPU. Things stored in one are not accessible from the other, they have to be copied back and forth.

So to add two numbers on the GPU that start on our keyboard, we need to first store them in the CPU (or “host”) memory, then move them from there to the GPU (or “device”) memory, and finally move the result back from the device memory to the host memory.

This language will come up again and again in CUDA: The CPU and its associated memory is the host, while the GPU and its associated memory is the device.

Here’s a kernel function that will do the addition on the GPU:

__global__ void add(int *da, int *db, int *dc) {
   *dc = *da + *db;
}

Those asterisks may be unfamiliar to you if you haven’t done much C programming. They mean that the parameters being supplied are not the integers to be added, but instead pointers to where those integers are in memory. This is because the kernel function is called from the host CPU, but executes on the GPU and hence needs to point to memory locations within the GPU memory. The line *dc = *da + *db says “take the values at addresses da and db, add them together, and store the result at the address dc.” So da, db and dc are locations in memory, and *da, *db and *dc are the values stored at those locations. I’ve prefixed all the names with “d” for “device” to remind us that they’re locations in the GPU memory.

We’ll also need to determine the address (that is, the storage location) of a few variables. The C operator to do that is the ampersand, &. &x returns the address of x, which is to say, a pointer to x.

Memory Allocation

In C programs, there are two ways that memory for data can be allocated. The first is define it statically when a variable is declared.

int a;

…declares an integer a and the space to hold it.

The second way is to allocate it dynamically and keep a pointer to that area of memory.

int *a;
a = (int *)malloc(sizeof(int));

…declares a pointer to an integer, and then malloc finds space to put it and we put the address of that space into a. This is what is almost always done for data arrays, since it allows you to choose the size of the array at run time rather than compile time.

If you use malloc() or any of its cousins to allocate memory, you are responsible for giving it back again when you are done with the free() function.

free(a);

There are CUDA variants for malloc and free that handle allocation of memory on the GPU. These functions deal with pointers to pointers(!), so it looks like the following:

int *d_a;
cudaMalloc((void **)&d_a, sizeof(int));

You then need to copy data from the CPU memory to the GPU memory with another function from the CUDA library. This looks like:

cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);

The order of arguments here is

That last is a symbolic constant defined by the CUDA library: Either cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost. To copy results back to the host memory you use the same function, with the destination and source addresses in the correct order and the correct constant in the last position.

Here’s web documentation for:

Exercise: Complete the code

Using the template below and the bits and pieces just shown, build and test the code to have the GPU card to add two integers.

You’ll need these pieces:

  • The kernel function add() at the top of this page
  • Patterns for cudaMalloc() and cudaMemcpy()
    • You’ll need to allocate GPU memory for two inputs and one output
  • Call the kernel function with add<<<1,1>>>(...)
  • Print the result with printf("%d plus %d equals %d\n", a, b, c);
  • Release the allocated GPU memory with cudaFree()
/* TEMPLATE CODE */
#include <stdio.h>
#include <stdlib.h>

// ... define function 'add' ...

int main(int argc, char **argv) {
   int a, b, c;        // We've chosen static allocation here for host storage..
   int *d_a, *d_b, *d_c;  // ...but device storage must be dynamically allocated
   a = atoi(argv[1]);  // Read the addends from the command line args
   b = atoi(argv[2]);

   // ... manage memory ...

   // ... move data ...

   add<<<1,1>>>(d_a, d_b, d_c); // call kernel function on GPU

   // ... move data ...

   printf("%d + %d -> %d\n", a, b, c);

   // ... manage memory ...
}

Solution

#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *da, int *db, int *dc) {
   *dc = *da + *db;
}

int main(int argc, char **argv) {
  int a, b, c;        // We've chosen static allocation here for host storage..
  int *d_a, *d_b, *d_c;  // ...but device storage must be dynamically allocated
  a = atoi(argv[1]);  // Read the addends from the command line args
  b = atoi(argv[2]);

  cudaMalloc((void **)&d_a, sizeof(int));
  cudaMalloc((void **)&d_b, sizeof(int));
  cudaMalloc((void **)&d_c, sizeof(int));
  cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

  add<<<1,1>>>(d_a, d_b, d_c);
  cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);

  printf("%d + %d -> %d\n", a, b, c);
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}

Compile with nvcc add.cu -o add, test with srun --gres=gpu:1 add 1 2

Oh, and one more thing: We should add cudaDeviceSynchronize() just before we copy back the result. The CPU code is not required to wait for the GPU code to complete before it continues. This call will force it to wait. This is another one of those errors that will probably not occur until long after you’ve stopped expecting it, and then when it does, you’ll have no clue what’s going on.

Result is zero?

If you get:

$ srun --gres=gpu:1 ./add 1 2
1 + 2 -> 0

…you may have forgotten to load the CUDA module module load cuda/11.4.

We still haven’t done anything in parallel. That’s next.

Key Points

  • The CPU (the ‘host’) and the GPU (the ‘device’) have separate memory banks

  • This requires explicit copying of data to and from the GPU