Introduction
Overview
Teaching: 10 min
Exercises: 5 minQuestions
What is a GPU, and why do we use them?
What is CUDA?
Objectives
To know the names of some GPU programming tools
To get access to a GPU node with Slurm
What is a GPU, and why?
A Graphics Processing Unit is a specialized piece of computer circuitry designed to accelerate the creation of images. GPUs are key to the performance of many current computer games; a machine with only CPUs cannot update the picture on the screen fast enough to make the game playable.
A GPU is effectively a small, highly specialized, parallel computer. “The GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations.” (CUDA C Programming Guide)
What is CUDA?
There are a number of packages you can use to program GPUs. Three prominent ones are CUDA, OpenCL, and OpenACC.
CUDA is an NVidia product. It once stood for Compute Unified Device Architecture, but not even NVidia uses that expansion any more. CUDA has two components. One part is a driver that actually handles communications with the card. The second part is a set of libraries that allow your code to interact with the driver portion. CUDA only works with Nvidia graphics cards.
Monitoring your graphics card
Along with drivers, library files, and compilers, CUDA comes with several utilities that you can use to manage your work. One very useful tool is ‘nvidia-smi’ which lets you check the basic status of your GPU card, like, “Do I have one?” and “Is it working”. There are also tools for profiling and debugging CUDA code.
nvidia-smi
Running on a GPU node
If you run nvidia-smi
on a login node or on a regular compute node, it will complain that
it can’t communicate with the NVIDIA driver. This should be no surprise: No NVIDIA driver
needs to be running on a node that has no GPU.
To get access to a GPU we have to go through
Slurm.
The “normal” way is to use sbatch
, which will queue up a job for execution later:
$ cat testjob
#!/bin/bash
#SBATCH --gres=gpu:1 # THIS IS THE KEY LINE
#SBATCH --cpus-per-task=10
#SBATCH --mem=40G
#SBATCH --time=0:5:0
nvidia-smi
$ sbatch testjob
That asks for one GPU card and 10 CPU cores. This would be perfect for the national
Béluga cluster,
since the GPU nodes there have 4 GPUs, 40 CPU cores, and more than 160GB of RAM.
Five minutes of run time is foolishly short for a production job, but we’re testing,
so this should be okay.
Alternatively we could use salloc
to request a GPU node, or part of one,
and start an interactive shell there. Because we don’t have enough nodes
on our virtual cluster to provide a GPU for each person in the course, we’ll use
yet a third form that you already saw in a previous week, srun
:
$ srun --gres=gpu:1 nvidia-smi
Fri Jun 19 14:40:32 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.56 Driver Version: 440.56 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID V100D-8C On | 00000000:00:05.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 560MiB / 8192MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Key Points
A GPU (Graphics Processing Unit) is best at data-parallel, arithmetic-intense calculations
CUDA is one of several programming interfaces for general-purpose computing on GPUs
Alliance clusters have special GPU-equipped nodes, which must be requested from the scheduler
Hello World
Overview
Teaching: 15 min
Exercises: 5 minQuestions
How to compile and run Hello World
Objectives
Compile and run some CUDA code
The traditional C-language Hello World program is:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
printf("Hello World\n");
}
The CUDA compiler is a C compiler that can generate binary code that will run on CPUs as well as code that will run on GPUs. The compiler is called nvcc
. Use it to compile this Hello World with the following command:
nvcc -o hello_world hello_world.c
Command not found?
If you get:
$ nvcc -o hello_world hello_world.c
-bash: nvcc: command not found
…you may have forgotten to run
module load nvhpc/22.7 cuda/11.4
.
To get this to run with a GPU, you need to add some code to create and launch a kernel. A kernel is a portion of code that can be transferred to the GPU and run there. A simple example would look like the following.
#include <stdio.h>
#include <stdlib.h>
__global__ void mykernel(void) {
}
int main(int argc, char **argv) {
mykernel<<<1,1>>>();
printf("Hello world\n");
}
You can compile this code with the same command as before, except the file extension must be ‘.cu’ instead of ‘.c’. We added two extra parts to the program. The first part, __global__, tells the compiler that this function will be something that runs on the GPU, but is called from the main program. The second part is the use of angle brackets added to the function call. This is extra syntax used to tell the compiler that this function is to be used as a kernel and to launch it on the GPU. In this case, our kernel function doesn’t do anything. We will add more to it in the next section.
The point of using a GPU is to do massively parallel programs. The two numbers within the angle brackets define how the work is to be divided up among the threads available. The first number defines the number of blocks to break the work up across. These blocks run in parallel. The second number defines the number threads that will work within each of the blocks. So if you have the following, in general,
mykernel<<<M,N>>>();
this sets up your work so that everything is divided into M blocks, with each block having N threads. Since both parameters provide ways of parallelizing your work, you may ask why are there two? Threads not only handle parallelism, but also provide ways of communicating and synchronizing between the threads. This gives you the tools you need to handle more complex algorithms.
Running on a GPU node
You could execute hello_world
on the login node simply by naming it:
$ ./hello_world
Hello World
But the login nodes on the cluster we’re using don’t have GPUs, so how does that work? You could run it on a GPU node using
$ srun --gres=gpu:1 hello_world
Hello World
as shown in the previous episode, but the result will be the same because this code doesn’t actually do anything with the GPU yet. Let’s fix that next.
Key Points
Use nvcc to compile
CUDA source files are suffixed .cu
Use salloc to get an interactive session on a GPU node for testing
Adding Two Integers
Overview
Teaching: 10 min
Exercises: 20 minQuestions
How does GPU memory work?
Objectives
To send data from CPU to GPU and back
In this section, we will write some code that will have the GPU add two numbers. Trivial, right? Not as trivial as you might think, because a GPU card has completely separate memory from the CPU. Things stored in one are not accessible from the other, they have to be copied back and forth.
So to add two numbers on the GPU that start on our keyboard, we need to first store them in the CPU (or “host”) memory, then move them from there to the GPU (or “device”) memory, and finally move the result back from the device memory to the host memory.
This language will come up again and again in CUDA: The CPU and its associated memory is the host, while the GPU and its associated memory is the device.
Here’s a kernel function that will do the addition on the GPU:
__global__ void add(int *da, int *db, int *dc) {
*dc = *da + *db;
}
Those asterisks may be unfamiliar to you if you haven’t done much C
programming. They mean that the parameters being supplied are not the
integers to be added, but instead pointers to where those integers
are in memory. This is because the kernel function is called from
the host CPU, but executes on the GPU and hence needs to point to
memory locations within the GPU memory. The line *dc = *da + *db
says
“take the values at addresses da
and db
, add them together, and
store the result at the address dc
.” So da, db
and dc
are locations
in memory, and *da, *db
and *dc
are the values stored at those locations.
I’ve prefixed all the names with “d” for “device” to remind us that they’re
locations in the GPU memory.
We’ll also need to determine the address (that is, the storage location)
of a few variables. The C operator to do that is the ampersand, &
.
&x
returns the address of x
, which is to say, a pointer to x
.
Memory Allocation
In C programs, there are two ways that memory for data can be allocated. The first is define it statically when a variable is declared.
int a;
…declares an integer
a
and the space to hold it.The second way is to allocate it dynamically and keep a pointer to that area of memory.
int *a; a = (int *)malloc(sizeof(int));
…declares a pointer to an integer, and then
malloc
finds space to put it and we put theaddress
of that space intoa
. This is what is almost always done for data arrays, since it allows you to choose the size of the array at run time rather than compile time.If you use
malloc()
or any of its cousins to allocate memory, you are responsible for giving it back again when you are done with thefree()
function.free(a);
There are CUDA variants for malloc
and free
that handle allocation of
memory on the GPU. These functions deal with pointers to pointers(!), so it
looks like the following:
int *d_a;
cudaMalloc((void **)&d_a, sizeof(int));
You then need to copy data from the CPU memory to the GPU memory with another function from the CUDA library. This looks like:
cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
The order of arguments here is
- destination address,
- source address,
- number of bytes, and
- cudaMemcpyKind.
That last is a symbolic constant defined by the CUDA library:
Either cudaMemcpyHostToDevice
or cudaMemcpyDeviceToHost
.
To copy results back to the host memory you use the same function, with
the destination and source addresses in the correct order and the correct
constant in the last position.
Here’s web documentation for:
Exercise: Complete the code
Using the template below and the bits and pieces just shown, build and test the code to have the GPU card to add two integers.
You’ll need these pieces:
- The kernel function
add()
at the top of this page- Patterns for
cudaMalloc()
andcudaMemcpy()
- You’ll need to allocate GPU memory for two inputs and one output
- Call the kernel function with
add<<<1,1>>>(...)
- Print the result with
printf("%d plus %d equals %d\n", a, b, c);
- Release the allocated GPU memory with
cudaFree()
/* TEMPLATE CODE */ #include <stdio.h> #include <stdlib.h> // ... define function 'add' ... int main(int argc, char **argv) { int a, b, c; // We've chosen static allocation here for host storage.. int *da, *db, *dc; // ...but device storage must be dynamically allocated a = atoi(argv[1]); // Read the addends from the command line args b = atoi(argv[2]); // ... manage memory ... // ... move data ... add<<<1,1>>>(da, db, dc); // call kernel function on GPU // ... move data ... printf("%d + %d -> %d\n", a, b, c); // ... manage memory ... }
Solution
#include <stdio.h> #include <stdlib.h> __global__ void add(int *da, int *db, int *dc) { *dc = *da + *db; } int main(int argc, char **argv) { int a, b, c; // We've chosen static allocation here for host storage.. int *da, *db, *dc; // ...but device storage must be dynamically allocated a = atoi(argv[1]); // Read the addends from the command line args b = atoi(argv[2]); cudaMalloc((void **)&da, sizeof(int)); cudaMalloc((void **)&db, sizeof(int)); cudaMalloc((void **)&dc, sizeof(int)); cudaMemcpy(da, &a, sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(db, &b, sizeof(int), cudaMemcpyHostToDevice); add<<<1,1>>>(da, db, dc); cudaMemcpy(&c, dc, sizeof(int), cudaMemcpyDeviceToHost); printf("%d + %d -> %d\n", a, b, c); cudaFree(da); cudaFree(db); cudaFree(dc); }
Compile with
nvcc add.cu -o add
, test withsrun --gres=gpu:1 add 1 2
Oh, and one more thing: We should add cudaDeviceSynchronize()
just before we
copy back the result. The CPU code is not required to wait for the GPU code to
complete before it continues. This call will force it to wait. This is another
one of those errors that will probably not occur until long after you’ve
stopped expecting it, and then when it does, you’ll have no clue what’s going
on.
Result is zero?
If you get:
$ srun --gres=gpu:1 ./add 1 2
1 + 2 -> 0
…you may have forgotten to load the CUDA module
module load cuda/11.4
.
We still haven’t done anything in parallel. That’s next.
Key Points
The CPU (the ‘host’) and the GPU (the ‘device’) have separate memory banks
This requires explicit copying of data to and from the GPU
Adding vectors with GPU threads
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How can we parallelize code on a GPU?
Objectives
Use CUDA threads
A GPU is not meant for doing one thing at a time; it’s meant for doing arithmetic on massive amounts of data all at the same time. It’s time we scaled up what we’re doing.
Let’s generalize the code we just wrote to add two vectors of integers,
instead of two integers. Instead of having statically defined variables to hold
single integers in the host (CPU) memory, we’ll declare pointers to integers
and then call malloc
to create space for the vectors:
int *a;
int bytes = N * sizeof(int);
a = (int *)malloc(bytes);
Because of the relation between pointers and arrays in C, we can store and
retrieve data from an allocated block of memory using array syntax,
expressions like a[i]
.
We’ll set the number of GPU threads to something larger than one by changing
the second argument in the <<<M,N>>>
when we call the kernel function. Back
in Episode 2 we mentioned that the first number, M
, is the block count and
the second, N
, is the thread count. For the moment let’s use 512 threads,
leave the block count at 1, and we’ll come back and discuss those numbers
later.
We also need to change the kernel function to match. The CUDA library provides
several variables for indexing. We’ll talk more about those later too, but for
now use threadIdx.x
. Change the kernel function definition to the following:
__global__ void add(int *da, int *db, int *dc) {
dc[threadIdx.x] = da[threadIdx.x] + db[threadIdx.x];
}
Notice that this kernel function just adds one pair of integers and stores
them in one matching location in array dc
. But because we’ve used the
CUDA-provided index variable, each thread will execute that kernel function on
the thread’s own little piece of the data. In other words, the kernel
function gets called potentially thousands of times in parallel— which is why
there’s no loop in it. The looping is implicit in the <<<M,N>>>
decorator
when we call the kernel.
Exercise: From single integers to vectors
Combine your code from the previous exercise and the pieces described above to add two vectors on the GPU. You can just print the first and last results. While it might be more realistic to populate the input vectors with random numbers or something, for simplicity you can just populate each one with
N
copies of the same number.If you’re not accustomed to C programming or confused by
malloc
, double up with another student who is familiar with C. They’ll help you.Solution
#include <stdio.h> #include <stdlib.h> __global__ void add(int *da, int *db, int *dc) { dc[threadIdx.x] = da[threadIdx.x] + db[threadIdx.x]; } int main(int argc, char **argv) { int a_in = atoi(argv[1]); // first addend int b_in = atoi(argv[2]); // second addend int N = atoi(argv[3]); // length of arrays int numThreads = 512; int *a, *b, *c; int *d_a, *d_b, *d_c; int size = N * sizeof(int); a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size); // Initialize the input vectors for (int i=0; i<N; ++i) { a[i] = a_in; b[i] = b_in; c[i] = 0; } cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); add<<<1,numThreads>>>(d_a, d_b, d_c); cudaDeviceSynchronize(); cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); printf("%d + %d = %d\n", a[0], b[0], c[0]); printf("...\n"); printf("%d + %d = %d\n", a[N-1], b[N-1], c[N-1]); free(a); free(b); free(c); }
Don’t forget: To get to a GPU you may need to do something like
srun --gres=gpu:1 addvec 1 2 512
.
Bonus exercise: (Move fast and) Break things
Once you’ve got the code working on 512 elements, vary the number of elements and see what happens. Do you have any idea why it does that?
For even more fun, vary the number of threads.
Key Points
Threads are the lowest level of parallelization on a GPU
A kernel function replaces the code inside the loop to be parallelized
The CUDA
<<<M,N>>>
notation replaces the loop itself
Using blocks instead of threads
Overview
Teaching: 10 min
Exercises: 20 minQuestions
What is the difference between blocks and threads?
Is one faster than the other?
Objectives
To use blocks to parallelize work
Examine CUDA performance
In the Hello World example we saw the <<<M,N>>>
syntax used in CUDA to call a
kernel function, and we told you that it creates M
blocks and N
threads per
block. In the Adding Vectors example we just finished, we created threads with
<<<1,N>>>
. Now we will use the first parameter to create blocks instead.
What’s a block? A GPU typically has several streaming multiprocessors (SMs). A block is handled by one SM, and each SM may handle many blocks in succession. Each SM supports up to 1024 threads, typically in multiples of 32 called “warps”. Threads can quickly access and share the data within a block. The pictures below from CUDA C Programming Guide should help:
Blocks and SMs | Thread Blocks |
---|---|
The P100 GPU model available at Graham Cedar has 56 SMs, each supporting 64 single-precision threads or 32 double-precision threads. So if you are doing double-precision calculations, each GPU has effectively 56*32 = 1792 cores. At Béluga there are newer V100 GPUs with 80 SMs, which again support 64 single-precision or 32 double-precision threads each, for 2560 effective cores.
But to take advantage of all these “CUDA cores” you need to use both blocks and threads.
We’ll start by doing something a little silly, and just switch from threads to
blocks. Change the kernel function to use CUDA’s block index,
blockIdx.x
, like so:
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Exercise: Change threads to blocks
Copy the Adding Vectors example you just finished, and change the copy to use blocks instead of threads. Verify that it still produces correct results.
Solution
#include <stdio.h> #include <stdlib.h> __global__ void add(int *da, int *db, int *dc) { dc[blockIdx.x] = da[blockIdx.x] + db[blockIdx.x]; } int main(int argc, char **argv) { int a_in = atoi(argv[1]); // first addend int b_in = atoi(argv[2]); // second addend int N = atoi(argv[3]); // length of arrays int numBlocks = 512; int *a, *b, *c; int *d_a, *d_b, *d_c; int size = N * sizeof(int); a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size); // Initialize the input vectors for (int i=0; i<N; ++i) { a[i] = a_in; b[i] = b_in; c[i] = 0; } cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size); cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice); add<<<numBlocks,1>>>(d_a, d_b, d_c); cudaDeviceSynchronize(); cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); printf("%d + %d = %d\n", a[0], b[0], c[0]); printf("...\n"); printf("%d + %d = %d\n", a[N-1], b[N-1], c[N-1]); free(a); free(b); free(c); }
Measuring speed
The point of using a GPU is speed, so how do we measure the speed of
our kernel code and the various CUDA calls like cudaMemcpy
? CUDA
provides the utility nvprof
for this. Here’s some sample output:
$ nvprof ./addvec_blocks 1 2 512
==6473== NVPROF is profiling process 6473, command: ./addvec_blocks
==6473== Profiling application: ./addvec_blocks
1 + 2 = 3
...
1 + 2 = 3
==6473== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 39.36% 3.1360us 2 1.5680us 1.3440us 1.7920us [CUDA memcpy HtoD]
38.96% 3.1040us 1 3.1040us 3.1040us 3.1040us add(int*, int*, int*)
21.69% 1.7280us 1 1.7280us 1.7280us 1.7280us [CUDA memcpy DtoH]
API calls: 99.20% 191.29ms 3 63.765ms 9.4530us 191.27ms cudaMalloc
0.35% 681.62us 1 681.62us 681.62us 681.62us cuDeviceTotalMem
0.21% 413.84us 96 4.3100us 108ns 167.14us cuDeviceGetAttribute
0.08% 158.83us 3 52.942us 4.9910us 139.28us cudaFree
0.07% 136.64us 1 136.64us 136.64us 136.64us cudaLaunchKernel
0.05% 90.485us 3 30.161us 14.610us 58.564us cudaMemcpy
0.03% 52.872us 1 52.872us 52.872us 52.872us cuDeviceGetName
0.01% 11.735us 1 11.735us 11.735us 11.735us cudaDeviceSynchronize
0.00% 3.5890us 1 3.5890us 3.5890us 3.5890us cuDeviceGetPCIBusId
0.00% 1.4540us 3 484ns 120ns 923ns cuDeviceGetCount
0.00% 651ns 2 325ns 189ns 462ns cuDeviceGet
0.00% 204ns 1 204ns 204ns 204ns cuDeviceGetUuid
This tells us the GPU spent 3.1040 microseconds in our add()
kernel,
but spent slightly longer (3.1360us) just copying the data from the
host to the device, and then half that again copying the results back.
You might also note that, among the “API calls”, cudaMalloc
takes 99%
of the time!
This should drive home what we meant when we said the GPU is for arithmetically intense operations. All the examples we’ve done in this workshop are too small. They could have been done just as quickly and even more simply on the CPU— but that is often the nature of example problems!
Getting your data on to and off of the GPU is the price you pay to access the massive parallelism of the GPU. To make that worth while, you have to need to do a lot of arithmetic! And your program should be designed in such a way that data movement between main memory and GPU is minimized. Ideally the data moves onto the GPU once, and moves off once.
Nonetheless, as long as you bear in mind which parts of these problems are
artificial, you can still use nvprof
to answer some interesting questions, like,
“Is it faster to use threads or blocks?”
Speed trial: Threads versus blocks
Compare the performance of the
add()
kernel in the two codes you just produced. Which is faster, using threads or using blocks? How much faster?What happens if you vary the size of the array? Make sure you still get correct answers!
Key Points
Use
nvprof
to profile CUDA functionsBlocks are the batches in which a GPU handles data
Blocks are handled by streaming multiprocessors (SMs)
Each block can have up to 1024 threads (on our current GPU cards)
Putting It All Together
Overview
Teaching: 10 min
Exercises: 20 minQuestions
How do blocks and threads work together?
What other GPU programming resources are there?
Objectives
To show how blocks and threads typically work together
To take advantage of all these “CUDA cores” you need to use both blocks and threads.
The family of CUDA variables defining blocks and threads can be explained by referring to this image from “An Even Easier Introduction to CUDA”:
The number of blocks is in gridDim.x
— we’ve been calling that numBlocks
in our CPU-side code— and the number of threads in a block is blockDim.x
which we’ve been calling numThreads
. CUDA also provides the index of the
current block in blockIdx.x
and the thread index within that block is
threadIdx.x
, both of which we saw earlier.
All the indexing is zero-based, like in C.
If you’re wondering about the .x
on all of these, it’s there because
you have the option of specifying 2- or 3-dimensional arrays of blocks, and
threads within blocks, for natural indexing of 2- and 3-dimensional data
structures likes matrices or volumes. See the
CUDA Programming Guide
if you want to learn more about this.
We’ll change our kernel function one more time, to escape the limitation of the thread count and get (we hope) maximal performance. We’ll also put a loop back in, in case the size of our data is greater than the number of (blocks times threads) we have:
__global__ void add(int n, int *a, int *b, int *c) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
c[i] = a[i] + b[i];
}
Now the size of dataset (i.e. the length of the vectors) we can handle correctly no longer depends on the number of blocks and threads.
We still need to choose a number of blocks and a thread count.
We’ve seen that threads-per-block can’t go higher than 1024, and 256 is a
fairly conventional choice. We could choose a number of blocks that exactly
covers the size of our arrays by dividing N / numthreads
, and adding an extra
block to handle any remainder from the division. But by introducing a loop
into the kernel we just wrote, it should be able to handle any choice of block
and thread count that we give it. So now we’re free to play with numBlocks
and numThreads
all we like, checking performance with nvprof
to see how
much difference various choices make.
Putting it all together
Introduce the
add()
kernel above into your code from the previous episodes, and adjust the rest of the program to make it work.
- Verify that it still produces correct results.
- Use
nvprof
to compare the performance to that of the two previous solutions.Solution
Here’s a version of the code that includes a few more “bells and whistles”. Most importantly, you need to specify the addends, the size of the arrays, and the number of threads and blocks. This makes it easier to explore the correctness and performance of these choices.
#include <stdio.h> #include <stdlib.h> __global__ void add(int N, int *da, int *db, int *dc) { // This is a CUDA idiom called the grid-stride loop. int index = blockIdx.x * blockDim.x + threadIdx.x; int stride = blockDim.x * gridDim.x; for (int i = index; i < N; i += stride) dc[i] = da[i] + db[i]; } int main(int argc, char **argv) { // Read values from cmd line. if (argc < 6) { printf("Usage:\n %s a b N threads blocks\n", argv[0]); return(-1); } int a_in = atoi(argv[1]); int b_in = atoi(argv[2]); int N = atoi(argv[3]); int numThreads = atoi(argv[4]); int numBlocks = atoi(argv[5]); // Or to get the block count that covers N elements: // int numBlocks = (N + numThreads - 1) / numThreads; // Calculate size of arrays in bytes. int size = N * sizeof(int); // Allocate host storage. int *a, *b, *c; a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size); // Initialize the input vectors. for (int i=0; i<N; ++i) { a[i] = a_in; b[i] = b_in; c[i] = 0; } // Allocate device storage. int *da, *db, *dc; cudaMalloc((void **)&da, size); cudaMalloc((void **)&db, size); cudaMalloc((void **)&dc, size); // Copy data to GPU. cudaMemcpy(da, a, size, cudaMemcpyHostToDevice); cudaMemcpy(db, b, size, cudaMemcpyHostToDevice); // Execute the kernel on the GPU. add<<<numBlocks,numThreads>>>(N, da, db, dc); cudaDeviceSynchronize(); // Copy results back from GPU. cudaMemcpy(c, dc, size, cudaMemcpyDeviceToHost); // Print results from each end of the array. printf("%d plus %d equals %d\n", a[0], b[0], c[0]); printf(" ...\n"); printf("%d plus %d equals %d\n", a[N-1], b[N-1], c[N-1]); // Check for stray errors somewhere in the middle. // We won't check them all, quit after first error. int expected = a_in + b_in; for (int i=0; i<N; ++i) { if (c[i] != expected) { printf("Wrong sum %d at element %d!\n", c[i], i); break; } } // Free all allocated memory. cudaFree(da); cudaFree(db); cudaFree(dc); free(a); free(b); free(c); }
Compile it and invoke it like this:
$ nvcc addvec_final.cu -o final $ srun --gres=gpu:1 final 2 2 10000 256 80
Remember, you can also use
salloc
in place ofsrun
to get a shell prompt on a GPU node so you can more easily run multiple trials, and usingnvprof final ...
will give you performance information. (Except on our virtual cluster. Sorry!)
Other bits and pieces
You can define shared variables in your kernel functions that are visible to all running threads in a block. Maybe you want to have a flag to record whether some unusual condition arose while processing a block of data?
__global__ void add(int *a, int *b, int *c) {
__shared__ int block_status;
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
You can also synchronize the execution of code by setting barriers to get threads to reach common points in the problem. Let’s say that you are evolving some system and need all of the threads to finish their work before going onto the next step. This is where synchronization comes into play.
# Do the current time step
__syncthreads();
# Go on to do the next time step
This helps avoid problems like race conditions, where incorrect data is being used in your calculation. GPU programming is basically a type of shared-memory programming, so the same problems and cautions we saw with OpenMP apply here.
Where to go next
This has been the barest of introductions to CUDA and GPU programming. Don’t forget the CUDA Programming Guide we mentioned earlier:
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Here are two shorter tutorials, from NVidia:
- https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
- https://devblogs.nvidia.com/even-easier-introduction-cuda/
As mentioned in episode 1, there are other ways to program GPUs than CUDA. Here are two OpenACC tutorials, including one from the Alliance:
- https://www.openacc.org/get-started (videos)
- https://docs.alliancecan.ca/wiki/OpenACC_Tutorial (text)
If you’re using GPUs, then performance obviously matters to you. A lot. Here’s a great blog post that summarizes the most important performance issues around GPUs. It’s from a perspective of Deep Learning, but the thoughts are quite general:
Key Points
A typical kernel indexes data using both blocks and threads