ACENET Summer School - GPGPU

Introduction

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • What is a GPU, and why do we use them?

  • What is CUDA?

Objectives
  • To know the names of some GPU programming tools

  • To get access to a GPU node with Slurm

What is a GPU, and why?

A Graphics Processing Unit is a specialized piece of computer circuitry designed to accelerate the creation of images. GPUs are key to the performance of many current computer games; a machine with only CPUs cannot update the picture on the screen fast enough to make the game playable.

A GPU is effectively a small, highly specialized, parallel computer. “The GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations.” (CUDA C Programming Guide)

What is CUDA?

There are a number of packages you can use to program GPUs. Three prominent ones are CUDA, OpenCL, and OpenACC.

CUDA is an NVIDIA product. It once stood for Compute Unified Device Architecture, but even NVIDIA no longer uses that expansion. CUDA has two components: a driver that handles communication with the card, and a set of libraries that let your code interact with the driver. CUDA only works with NVIDIA graphics cards.

Monitoring your graphics card

Along with drivers, library files, and compilers, CUDA comes with several utilities that you can use to manage your work. One very useful tool is ‘nvidia-smi’, which lets you check the basic status of your GPU card, answering questions like “Do I have one?” and “Is it working?”. There are also tools for profiling and debugging CUDA code.

nvidia-smi

Running on a GPU node

If you run nvidia-smi on a login node or on a regular compute node, it will complain that it can’t communicate with the NVIDIA driver. This should be no surprise: No NVIDIA driver needs to be running on a node that has no GPU.

To get access to a GPU we have to go through Slurm. The “normal” way is to use sbatch, which will queue up a job for execution later:

$ cat testjob
#!/bin/bash
#SBATCH --gres=gpu:1    # THIS IS THE KEY LINE
#SBATCH --cpus-per-task=10
#SBATCH --mem=40G
#SBATCH --time=0:5:0
nvidia-smi
$ sbatch testjob

That asks for one GPU card and 10 CPU cores. This would be perfect for the national Béluga cluster, since the GPU nodes there have 4 GPUs, 40 CPU cores, and more than 160GB of RAM.
Five minutes of run time is foolishly short for a production job, but we’re testing, so this should be okay.
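
Once the job has run, Slurm leaves the program’s output in the directory you submitted from, in a file named after the job ID (this assumes the default output file name; the job ID shown here is made up):

$ sbatch testjob
Submitted batch job 123456
$ cat slurm-123456.out

The file will contain the same kind of nvidia-smi report shown below.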

Alternatively we could use salloc to request a GPU node, or part of one, and start an interactive shell there. Because we don’t have enough nodes on our virtual cluster to provide a GPU for each person in the course, we’ll use yet a third form that you already saw in a previous week, srun:

$ srun --gres=gpu:1  nvidia-smi
Fri Jun 19 14:40:32 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.56       Driver Version: 440.56       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100D-8C       On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |    560MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
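
If there are enough GPUs to go around, you could instead use salloc to get an interactive shell with a GPU attached; a minimal sketch, where the option values are only examples to adjust for your cluster:

$ salloc --gres=gpu:1 --cpus-per-task=1 --mem=4G --time=0:30:0
$ nvidia-smi
$ exit

Remember to exit when you’re done so the GPU goes back into the pool.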

Key Points

  • A GPU (Graphics Processing Unit) is best at data-parallel, arithmetic-intense calculations

  • CUDA is one of several programming interfaces for general-purpose computing on GPUs

  • Alliance clusters have special GPU-equipped nodes, which must be requested from the scheduler


Hello World

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How to compile and run Hello World

Objectives
  • Compile and run some CUDA code

The traditional C-language Hello World program is:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
   printf("Hello World\n");
}

The CUDA compiler is a C compiler that can generate binary code that will run on CPUs as well as code that will run on GPUs. The compiler is called nvcc. Use it to compile this Hello World with the following command:

nvcc -o hello_world hello_world.c

Command not found?

If you get:

$ nvcc -o hello_world hello_world.c
-bash: nvcc: command not found

…you may have forgotten to run module load nvhpc/22.7 cuda/11.4.

To get this to run with a GPU, you need to add some code to create and launch a kernel. A kernel is a portion of code that can be transferred to the GPU and run there. A simple example would look like the following.

#include <stdio.h>
#include <stdlib.h>

__global__ void mykernel(void) {
}

int main(int argc, char **argv) {
   mykernel<<<1,1>>>();
   printf("Hello world\n");
}

You can compile this code with the same command as before, except the file extension must be ‘.cu’ instead of ‘.c’. We added two extra parts to the program. The first, __global__, tells the compiler that this function is something that runs on the GPU but is called from the main program. The second is the triple angle brackets added to the function call: extra syntax telling the compiler that this function is a kernel, to be launched on the GPU. In this case, our kernel function doesn’t do anything. We will add more to it in the next section.

The point of using a GPU is to run massively parallel programs. The two numbers within the angle brackets define how the work is divided up among the available threads. The first number defines the number of blocks to break the work up across; these blocks run in parallel. The second number defines the number of threads that will work within each of the blocks. So, in general, if you have the following,

mykernel<<<M,N>>>();

this sets up your work so that everything is divided into M blocks, with each block having N threads. Since both parameters provide ways of parallelizing your work, you may ask why are there two? Threads not only handle parallelism, but also provide ways of communicating and synchronizing between the threads. This gives you the tools you need to handle more complex algorithms.
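
If you’d like to see M and N in action before we do anything useful with them, a kernel can call printf on the device. This is just a sketch, not part of the exercises; cudaDeviceSynchronize(), which we’ll meet properly in a later episode, simply makes the CPU wait for the GPU’s output before the program exits:

#include <stdio.h>

__global__ void mykernel(void) {
   // Each GPU thread reports which block and thread it is.
   printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(int argc, char **argv) {
   mykernel<<<2,4>>>();       // 2 blocks x 4 threads = 8 lines of output
   cudaDeviceSynchronize();   // wait for the GPU to finish before exiting
}

Compile it with nvcc as before (remember the .cu extension) and run it on a GPU node.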

Running on a GPU node

You could execute hello_world on the login node simply by naming it:

$ ./hello_world
Hello World

But the login nodes on the cluster we’re using don’t have GPUs, so how does that work? You could run it on a GPU node using

$ srun --gres=gpu:1 hello_world
Hello World

as shown in the previous episode, but the result will be the same because this code doesn’t actually do anything with the GPU yet. Let’s fix that next.

Key Points

  • Use nvcc to compile

  • CUDA source files are suffixed .cu

  • Use salloc to get an interactive session on a GPU node for testing


Adding Two Integers

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • How does GPU memory work?

Objectives
  • To send data from CPU to GPU and back

In this section, we will write some code that will have the GPU add two numbers. Trivial, right? Not as trivial as you might think, because a GPU card has completely separate memory from the CPU. Things stored in one are not accessible from the other; they have to be copied back and forth.

So to add two numbers on the GPU that start on our keyboard, we need to first store them in the CPU (or “host”) memory, then move them from there to the GPU (or “device”) memory, and finally move the result back from the device memory to the host memory.

This language will come up again and again in CUDA: The CPU and its associated memory is the host, while the GPU and its associated memory is the device.

Here’s a kernel function that will do the addition on the GPU:

__global__ void add(int *da, int *db, int *dc) {
   *dc = *da + *db;
}

Those asterisks may be unfamiliar to you if you haven’t done much C programming. They mean that the parameters being supplied are not the integers to be added, but instead pointers to where those integers are in memory. This is because the kernel function is called from the host CPU, but executes on the GPU and hence needs to point to memory locations within the GPU memory. The line *dc = *da + *db says “take the values at addresses da and db, add them together, and store the result at the address dc.” So da, db and dc are locations in memory, and *da, *db and *dc are the values stored at those locations. I’ve prefixed all the names with “d” for “device” to remind us that they’re locations in the GPU memory.

We’ll also need to determine the address (that is, the storage location) of a few variables. The C operator to do that is the ampersand, &. &x returns the address of x, which is to say, a pointer to x.
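
If that’s new to you, here is a tiny plain-C refresher (nothing CUDA-specific):

int x = 5;
int *px = &x;                // px holds the address of x
printf("%d %d\n", x, *px);   // prints "5 5": *px reads the value stored at that address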

Memory Allocation

In C programs, there are two ways that memory for data can be allocated. The first is to define it statically when a variable is declared.

int a;

…declares an integer a and the space to hold it.

The second way is to allocate it dynamically and keep a pointer to that area of memory.

int *a;
a = (int *)malloc(sizeof(int));

…declares a pointer to an integer; malloc then finds space to hold an integer, and we store the address of that space in a. This is what is almost always done for data arrays, since it allows you to choose the size of the array at run time rather than compile time.

If you use malloc() or any of its cousins to allocate memory, you are responsible for giving it back again when you are done with the free() function.

free(a);

There are CUDA variants for malloc and free that handle allocation of memory on the GPU. These functions deal with pointers to pointers(!), so it looks like the following:

int *d_a;
cudaMalloc((void **)&d_a, sizeof(int));

You then need to copy data from the CPU memory to the GPU memory with another function from the CUDA library. This looks like:

cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);

The order of arguments is: destination, source, number of bytes, and the direction of the copy.

That last argument is a symbolic constant defined by the CUDA library: either cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost. To copy results back to the host memory you use the same function, with the destination and source addresses swapped and the other constant in the last position.
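
For example, copying a single-integer result back might look like this, where c is a host int and d_c a device pointer analogous to the d_a above (these names are just for illustration):

cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);   // host a    -> device d_a
cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);   // device d_c -> host c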

Full web documentation for cudaMalloc, cudaMemcpy, and cudaFree is available in the CUDA Runtime API reference.

Exercise: Complete the code

Using the template below and the bits and pieces just shown, build and test code that has the GPU card add two integers.

You’ll need these pieces:

  • The kernel function add() at the top of this page
  • Patterns for cudaMalloc() and cudaMemcpy()
    • You’ll need to allocate GPU memory for two inputs and one output
  • Call the kernel function with add<<<1,1>>>(...)
  • Print the result with printf("%d plus %d equals %d\n", a, b, c);
  • Release the allocated GPU memory with cudaFree()

/* TEMPLATE CODE */
#include <stdio.h>
#include <stdlib.h>

// ... define function 'add' ...

int main(int argc, char **argv) {
   int a, b, c;        // We've chosen static allocation here for host storage..
   int *da, *db, *dc;  // ...but device storage must be dynamically allocated
   a = atoi(argv[1]);  // Read the addends from the command line args
   b = atoi(argv[2]);

   // ... manage memory ...

   // ... move data ...

   add<<<1,1>>>(da, db, dc); // call kernel function on GPU

   // ... move data ...

   printf("%d + %d -> %d\n", a, b, c);

   // ... manage memory ...
}

Solution

#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *da, int *db, int *dc) {
   *dc = *da + *db;
}

int main(int argc, char **argv) {
  int a, b, c;        // We've chosen static allocation here for host storage..
  int *da, *db, *dc;  // ...but device storage must be dynamically allocated
  a = atoi(argv[1]);  // Read the addends from the command line args
  b = atoi(argv[2]);
  cudaMalloc((void **)&da, sizeof(int));
  cudaMalloc((void **)&db, sizeof(int));
  cudaMalloc((void **)&dc, sizeof(int));
  cudaMemcpy(da, &a, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(db, &b, sizeof(int), cudaMemcpyHostToDevice);
  add<<<1,1>>>(da, db, dc);
  cudaMemcpy(&c, dc, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d + %d -> %d\n", a, b, c);
  cudaFree(da); cudaFree(db); cudaFree(dc);
}

Compile with nvcc add.cu -o add, and test with srun --gres=gpu:1 ./add 1 2

Oh, and one more thing: We should add cudaDeviceSynchronize() just before we copy back the result. The CPU code is not required to wait for the GPU code to complete before it continues. This call will force it to wait. This is another one of those errors that will probably not occur until long after you’ve stopped expecting it, and then when it does, you’ll have no clue what’s going on.
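
In the solution above, that means slipping the call in between the kernel launch and the copy back:

add<<<1,1>>>(da, db, dc);
cudaDeviceSynchronize();   // make the CPU wait until the kernel has finished
cudaMemcpy(&c, dc, sizeof(int), cudaMemcpyDeviceToHost);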

Result is zero?

If you get:

$ srun --gres=gpu:1 ./add 1 2
1 + 2 -> 0

…you may have forgotten to load the CUDA module module load cuda/11.4.

We still haven’t done anything in parallel. That’s next.

Key Points

  • The CPU (the ‘host’) and the GPU (the ‘device’) have separate memory banks

  • This requires explicit copying of data to and from the GPU


Adding vectors with GPU threads

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How can we parallelize code on a GPU?

Objectives
  • Use CUDA threads

A GPU is not meant for doing one thing at a time; it’s meant for doing arithmetic on massive amounts of data all at the same time. It’s time we scaled up what we’re doing.

Let’s generalize the code we just wrote to add two vectors of integers, instead of two integers. Instead of having statically defined variables to hold single integers in the host (CPU) memory, we’ll declare pointers to integers and then call malloc to create space for the vectors:

int *a; 
int bytes = N * sizeof(int);
a = (int *)malloc(bytes); 

Because of the relation between pointers and arrays in C, we can store and retrieve data from an allocated block of memory using array syntax, expressions like a[i].
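
For example, filling the freshly allocated block looks just like filling an ordinary array:

for (int i = 0; i < N; ++i)
   a[i] = 42;   // a came from malloc, but indexes like an array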

We’ll set the number of GPU threads to something larger than one by changing the second argument in the <<<M,N>>> when we call the kernel function. Back in Episode 2 we mentioned that the first number, M, is the block count and the second, N, is the thread count. For the moment let’s use 512 threads, leave the block count at 1, and we’ll come back and discuss those numbers later.
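
Concretely, the launch in main() changes to something like this (d_a, d_b, and d_c being the device copies of the two input vectors and the result):

int numThreads = 512;
add<<<1, numThreads>>>(d_a, d_b, d_c);   // one block, 512 threads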

We also need to change the kernel function to match. The CUDA library provides several variables for indexing. We’ll talk more about those later too, but for now use threadIdx.x. Change the kernel function definition to the following:

__global__ void add(int *da, int *db, int *dc) {
   dc[threadIdx.x] = da[threadIdx.x] + db[threadIdx.x];
}

Notice that this kernel function just adds one pair of integers and stores the sum in one matching location in array dc. But because we’ve used the CUDA-provided index variable, each thread executes that kernel function on the thread’s own little piece of the data. In other words, the kernel function gets called potentially thousands of times in parallel, which is why there’s no loop in it. The looping is implicit in the <<<M,N>>> decorator when we call the kernel.
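
For comparison, the serial CPU version of the same operation would need an explicit loop, something like:

// Serial CPU equivalent of add<<<1,N>>>(da, db, dc)
for (int i = 0; i < N; ++i)
   c[i] = a[i] + b[i];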

Exercise: From single integers to vectors

Combine your code from the previous exercise and the pieces described above to add two vectors on the GPU. You can just print the first and last results. While it might be more realistic to populate the input vectors with random numbers or something, for simplicity you can just populate each one with N copies of the same number.

If you’re not accustomed to C programming or confused by malloc, double up with another student who is familiar with C. They’ll help you.

Solution

#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *da, int *db, int *dc) {
   dc[threadIdx.x] = da[threadIdx.x] + db[threadIdx.x];
}

int main(int argc, char **argv) {
   int a_in = atoi(argv[1]);  // first addend
   int b_in = atoi(argv[2]);  // second addend
   int N = atoi(argv[3]);     // length of arrays
   int numThreads = 512;
   int *a, *b, *c;
   int *d_a, *d_b, *d_c;
   int size = N * sizeof(int);
   a = (int *)malloc(size);
   b = (int *)malloc(size);
   c = (int *)malloc(size);
   // Initialize the input vectors
   for (int i=0; i<N; ++i) {
      a[i] = a_in; b[i] = b_in; c[i] = 0; }
   cudaMalloc((void **)&d_a, size);
   cudaMalloc((void **)&d_b, size);
   cudaMalloc((void **)&d_c, size);
   cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
   add<<<1,numThreads>>>(d_a, d_b, d_c);
   cudaDeviceSynchronize();
   cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
   cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
   printf("%d + %d = %d\n", a[0],   b[0],   c[0]);
   printf("...\n");
   printf("%d + %d = %d\n", a[N-1], b[N-1], c[N-1]);
   free(a); free(b); free(c);
}

Don’t forget: To get to a GPU you may need to do something like srun --gres=gpu:1 ./addvec 1 2 512.

Bonus exercise: (Move fast and) Break things

Once you’ve got the code working on 512 elements, vary the number of elements and see what happens. Do you have any idea why it does that?

For even more fun, vary the number of threads.

Key Points

  • Threads are the lowest level of parallelization on a GPU

  • A kernel function replaces the code inside the loop to be parallelized

  • The CUDA <<<M,N>>> notation replaces the loop itself


Using blocks instead of threads

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • What is the difference between blocks and threads?

  • Is one faster than the other?

Objectives
  • To use blocks to parallelize work

  • Examine CUDA performance

In the Hello World example we saw the <<<M,N>>> syntax used in CUDA to call a kernel function, and we told you that it creates M blocks and N threads per block. In the Adding Vectors example we just finished, we created threads with <<<1,N>>>. Now we will use the first parameter to create blocks instead.

What’s a block? A GPU typically has several streaming multiprocessors (SMs). A block is handled by one SM, and each SM may handle many blocks in succession. A block can contain up to 1024 threads, which the SM runs in groups of 32 called “warps”. Threads within a block can quickly access and share data. The pictures below from the CUDA C Programming Guide should help:

(Figures: “Blocks and SMs” and “Thread Blocks”)

The P100 GPUs available at Graham and Cedar have 56 SMs each, with each SM supporting 64 single-precision threads or 32 double-precision threads. So if you are doing double-precision calculations, each GPU effectively has 56 * 32 = 1792 cores. Béluga has newer V100 GPUs with 80 SMs, which again support 64 single-precision or 32 double-precision threads each, for 2560 effective cores.

But to take advantage of all these “CUDA cores” you need to use both blocks and threads.

We’ll start by doing something a little silly, and just switch from threads to blocks. Change the kernel function to use CUDA’s block index, blockIdx.x, like so:

__global__ void add(int *a, int *b, int *c) {
   c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

Exercise: Change threads to blocks

Copy the Adding Vectors example you just finished, and change the copy to use blocks instead of threads. Verify that it still produces correct results.

Solution

#include <stdio.h>
#include <stdlib.h>

__global__ void add(int *da, int *db, int *dc) {
   dc[blockIdx.x] = da[blockIdx.x] + db[blockIdx.x];
}

int main(int argc, char **argv) {
   int a_in = atoi(argv[1]);  // first addend
   int b_in = atoi(argv[2]);  // second addend
   int N = atoi(argv[3]);     // length of arrays
   int numBlocks = 512;
   int *a, *b, *c;
   int *d_a, *d_b, *d_c;
   int size = N * sizeof(int);
   a = (int *)malloc(size);
   b = (int *)malloc(size);
   c = (int *)malloc(size);
   // Initialize the input vectors
   for (int i=0; i<N; ++i) {
      a[i] = a_in; b[i] = b_in; c[i] = 0; }
   cudaMalloc((void **)&d_a, size);
   cudaMalloc((void **)&d_b, size);
   cudaMalloc((void **)&d_c, size);
   cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
   add<<<numBlocks,1>>>(d_a, d_b, d_c);
   cudaDeviceSynchronize();
   cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
   cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
   printf("%d + %d = %d\n", a[0],   b[0],   c[0]);
   printf("...\n");
   printf("%d + %d = %d\n", a[N-1], b[N-1], c[N-1]);
   free(a); free(b); free(c);
}

Measuring speed

The point of using a GPU is speed, so how do we measure the speed of our kernel code and the various CUDA calls like cudaMemcpy? CUDA provides the utility nvprof for this. Here’s some sample output:

$ nvprof ./addvec_blocks 1 2 512
==6473== NVPROF is profiling process 6473, command: ./addvec_blocks
==6473== Profiling application: ./addvec_blocks
1 + 2 = 3
 ...
1 + 2 = 3
==6473== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   39.36%  3.1360us         2  1.5680us  1.3440us  1.7920us  [CUDA memcpy HtoD]
                   38.96%  3.1040us         1  3.1040us  3.1040us  3.1040us  add(int*, int*, int*)
                   21.69%  1.7280us         1  1.7280us  1.7280us  1.7280us  [CUDA memcpy DtoH]
      API calls:   99.20%  191.29ms         3  63.765ms  9.4530us  191.27ms  cudaMalloc
                    0.35%  681.62us         1  681.62us  681.62us  681.62us  cuDeviceTotalMem
                    0.21%  413.84us        96  4.3100us     108ns  167.14us  cuDeviceGetAttribute
                    0.08%  158.83us         3  52.942us  4.9910us  139.28us  cudaFree
                    0.07%  136.64us         1  136.64us  136.64us  136.64us  cudaLaunchKernel
                    0.05%  90.485us         3  30.161us  14.610us  58.564us  cudaMemcpy
                    0.03%  52.872us         1  52.872us  52.872us  52.872us  cuDeviceGetName
                    0.01%  11.735us         1  11.735us  11.735us  11.735us  cudaDeviceSynchronize
                    0.00%  3.5890us         1  3.5890us  3.5890us  3.5890us  cuDeviceGetPCIBusId
                    0.00%  1.4540us         3     484ns     120ns     923ns  cuDeviceGetCount
                    0.00%     651ns         2     325ns     189ns     462ns  cuDeviceGet
                    0.00%     204ns         1     204ns     204ns     204ns  cuDeviceGetUuid

This tells us the GPU spent 3.1040 microseconds in our add() kernel, but spent slightly longer (3.1360us) just copying the data from the host to the device, and then half that again copying the results back. You might also note that, among the “API calls”, cudaMalloc takes 99% of the time!

This should drive home what we meant when we said the GPU is for arithmetically intense operations. All the examples we’ve done in this workshop are too small. They could have been done just as quickly and even more simply on the CPU— but that is often the nature of example problems!

Getting your data onto and off of the GPU is the price you pay for access to the GPU’s massive parallelism. To make that worthwhile, you have to need to do a lot of arithmetic! Your program should also be designed so that data movement between main memory and the GPU is minimized. Ideally the data moves onto the GPU once, and moves off once.

Nonetheless, as long as you bear in mind which parts of these problems are artificial, you can still use nvprof to answer some interesting questions, like, “Is it faster to use threads or blocks?”

Speed trial: Threads versus blocks

Compare the performance of the add() kernel in the two codes you just produced. Which is faster, using threads or using blocks? How much faster?

What happens if you vary the size of the array? Make sure you still get correct answers!
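
One way to collect the timings, assuming you’ve named the two executables addvec_threads and addvec_blocks (substitute whatever names you used):

$ srun --gres=gpu:1 nvprof ./addvec_threads 1 2 512
$ srun --gres=gpu:1 nvprof ./addvec_blocks 1 2 512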

Key Points

  • Use nvprof to profile CUDA functions

  • Blocks are the batches in which a GPU handles data

  • Blocks are handled by streaming multiprocessors (SMs)

  • Each block can have up to 1024 threads (on our current GPU cards)


Putting It All Together

Overview

Teaching: 10 min
Exercises: 20 min
Questions
  • How do blocks and threads work together?

  • What other GPU programming resources are there?

Objectives
  • To show how blocks and threads typically work together

As we said at the end of the previous episode, to take advantage of all these “CUDA cores” you need to use both blocks and threads.

The family of CUDA variables defining blocks and threads can be explained by referring to this image from “An Even Easier Introduction to CUDA”:

(Figure: CUDA indexing)

The number of blocks is in gridDim.x (we’ve been calling that numBlocks in our CPU-side code), and the number of threads in a block is blockDim.x (which we’ve been calling numThreads). CUDA also provides the index of the current block in blockIdx.x and the index of the thread within that block in threadIdx.x, both of which we saw earlier.

All the indexing is zero-based, like in C.

If you’re wondering about the .x on all of these, it’s there because you have the option of specifying 2- or 3-dimensional arrays of blocks, and of threads within blocks, for natural indexing of 2- and 3-dimensional data structures like matrices or volumes. See the CUDA Programming Guide if you want to learn more about this.
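
For the curious, a multi-dimensional launch is set up with CUDA’s dim3 type. Here is a minimal sketch, reusing the empty mykernel from the Hello World episode; we won’t need this in the exercises:

dim3 numBlocks(16, 16);          // a 16 x 16 grid of blocks
dim3 threadsPerBlock(8, 8);      // 8 x 8 = 64 threads in each block
mykernel<<<numBlocks, threadsPerBlock>>>();
// Inside the kernel, blockIdx.y and threadIdx.y index the second dimension.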

We’ll change our kernel function one more time, to escape the limitation of the thread count and get (we hope) maximal performance. We’ll also put a loop back in, in case the size of our data is greater than the number of (blocks times threads) we have:

__global__ void add(int n, int *a, int *b, int *c) {
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (int i = index; i < n; i += stride)
      c[i] = a[i] + b[i];
}

Now the size of the dataset we can handle correctly (i.e. the length of the vectors) no longer depends on the number of blocks and threads.

We still need to choose a number of blocks and a thread count.

We’ve seen that threads-per-block can’t go higher than 1024, and 256 is a fairly conventional choice. We could choose a number of blocks that exactly covers the size of our arrays by dividing N by numThreads and adding an extra block to handle any remainder from the division. But because we introduced a loop into the kernel we just wrote, it can handle any choice of block and thread counts that we give it. So now we’re free to play with numBlocks and numThreads all we like, checking performance with nvprof to see how much difference various choices make.
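
The “exactly covering” calculation described above looks like this; the same line appears as a comment in the solution below:

int numThreads = 256;
int numBlocks = (N + numThreads - 1) / numThreads;   // integer division, rounded up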

Putting it all together

Introduce the add() kernel above into your code from the previous episodes, and adjust the rest of the program to make it work.

  • Verify that it still produces correct results.
  • Use nvprof to compare the performance to that of the two previous solutions.

Solution

Here’s a version of the code that includes a few more “bells and whistles”. Most importantly, you need to specify the addends, the size of the arrays, and the number of threads and blocks. This makes it easier to explore the correctness and performance of these choices.

#include <stdio.h> 
#include <stdlib.h>
 
__global__ void add(int N, int *da, int *db, int *dc) {
   // This is a CUDA idiom called the grid-stride loop.
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (int i = index; i < N; i += stride)
      dc[i] = da[i] + db[i];
}

int main(int argc, char **argv) {
   // Read values from cmd line.
   if (argc < 6) {
      printf("Usage:\n %s a b N threads blocks\n", argv[0]);
      return(-1);
   }
   int a_in = atoi(argv[1]);
   int b_in = atoi(argv[2]);
   int N = atoi(argv[3]);
   int numThreads = atoi(argv[4]);
   int numBlocks = atoi(argv[5]);
   // Or to get the block count that covers N elements:
   // int numBlocks = (N + numThreads - 1) / numThreads;

   // Calculate size of arrays in bytes.
   int size = N * sizeof(int);
   // Allocate host storage.
   int *a, *b, *c;
   a = (int *)malloc(size);
   b = (int *)malloc(size);
   c = (int *)malloc(size);
   // Initialize the input vectors.
   for (int i=0; i<N; ++i) {
      a[i] = a_in; 
      b[i] = b_in;
      c[i] = 0;
   }

   // Allocate device storage.
   int *da, *db, *dc;
   cudaMalloc((void **)&da, size);
   cudaMalloc((void **)&db, size);
   cudaMalloc((void **)&dc, size);

   // Copy data to GPU.
   cudaMemcpy(da, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(db, b, size, cudaMemcpyHostToDevice);

   // Execute the kernel on the GPU.
   add<<<numBlocks,numThreads>>>(N, da, db, dc);
   cudaDeviceSynchronize();

   // Copy results back from GPU.
   cudaMemcpy(c, dc, size, cudaMemcpyDeviceToHost);
   
   // Print results from each end of the array.
   printf("%d plus %d equals %d\n", a[0], b[0], c[0]);
   printf(" ...\n");
   printf("%d plus %d equals %d\n", a[N-1], b[N-1], c[N-1]);

   // Check for stray errors somewhere in the middle.
   // We won't check them all, quit after first error.
   int expected = a_in + b_in;
   for (int i=0; i<N; ++i) {
      if (c[i] != expected) {
         printf("Wrong sum %d at element %d!\n", c[i], i);
         break;
      }
   }

   // Free all allocated memory.
   cudaFree(da); cudaFree(db); cudaFree(dc);
   free(a); free(b); free(c);
}

Compile it and invoke it like this:

$ nvcc addvec_final.cu -o final
$ srun --gres=gpu:1 final 2 2 10000 256 80

Remember, you can also use salloc in place of srun to get a shell prompt on a GPU node so you can more easily run multiple trials, and using nvprof final ... will give you performance information. (Except on our virtual cluster. Sorry!)

Other bits and pieces

You can define shared variables in your kernel functions that are visible to all running threads in a block. Maybe you want to have a flag to record whether some unusual condition arose while processing a block of data?

__global__ void add(int *a, int *b, int *c) {
   __shared__ int block_status;   // one copy per block, visible to every thread in that block
   // ... threads could set or test block_status here ...
   c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

You can also synchronize the execution of code by setting barriers to get threads to reach common points in the problem. Let’s say that you are evolving some system and need all of the threads to finish their work before going onto the next step. This is where synchronization comes into play.

// ... do the current time step ...
__syncthreads();   // every thread in the block waits here until all the others arrive
// ... go on to do the next time step ...

This helps avoid problems like race conditions, where incorrect data is being used in your calculation. GPU programming is basically a type of shared-memory programming, so the same problems and cautions we saw with OpenMP apply here.

Where to go next

This has been the barest of introductions to CUDA and GPU programming. Don’t forget the CUDA Programming Guide we mentioned earlier.

As mentioned in episode 1, there are other ways to program GPUs than CUDA; OpenACC tutorials are available, including one from the Alliance.

If you’re using GPUs, then performance obviously matters to you. A lot. There is a great blog post, written from a Deep Learning perspective but quite general in its thinking, that summarizes the most important performance issues around GPUs.

Key Points

  • A typical kernel indexes data using both blocks and threads