Putting It All Together
Overview
Teaching: 10 min
Exercises: 20 min
Questions
How do blocks and threads work together?
What other GPU programming resources are there?
Objectives
To show how blocks and threads typically work together
To take advantage of all these “CUDA cores” you need to use both blocks and threads.
The family of CUDA variables defining blocks and threads can be explained by referring to this image from “An Even Easier Introduction to CUDA”:
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
The number of blocks is in gridDim.x (we’ve been calling that numBlocks in our CPU-side code) and the number of threads in a block is blockDim.x, which we’ve been calling numThreads. CUDA also provides the index of the current block in blockIdx.x and the thread index within that block in threadIdx.x, both of which we saw earlier.
All the indexing is zero-based, like in C.
If you’re wondering about the .x on all of these, it’s there because you have the option of specifying 2- or 3-dimensional arrays of blocks, and threads within blocks, for natural indexing of 2- and 3-dimensional data structures like matrices or volumes. See the CUDA Programming Guide if you want to learn more about this.
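As a minimal sketch of what that looks like (not part of this lesson’s running example: the kernel name scale, the 16×16 block shape, and the N×N matrix are illustrative choices), a 2-dimensional launch could be written:
__global__ void scale(int n, float *m, float factor) {
  // 2-D indexing: .y picks the row, .x picks the column
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n)
    m[row * n + col] *= factor;
}

// Host side: dim3 describes 2-D (or 3-D) block and grid shapes
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + 15) / 16, (N + 15) / 16);
scale<<<numBlocks, threadsPerBlock>>>(N, m, 2.0f);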
We’ll change our kernel function one more time, to escape the limitation of the thread count and get (we hope) maximal performance. We’ll also put a loop back in, in case the size of our data is greater than the number of (blocks times threads) we have:
__global__ void add(int n, int *a, int *b, int *c) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
c[i] = a[i] + b[i];
}
Now the size of the dataset (i.e. the length of the vectors) that we can handle correctly no longer depends on the number of blocks and threads.
We still need to choose a number of blocks and a thread count.
We’ve seen that threads-per-block can’t go higher than 1024, and 256 is a fairly conventional choice. We could choose a number of blocks that exactly covers the size of our arrays by dividing N / numThreads and adding an extra block to handle any remainder from the division. But thanks to the loop we just introduced into the kernel, it should be able to handle any choice of block and thread count that we give it. So now we’re free to play with numBlocks and numThreads all we like, checking performance with nvprof to see how much difference various choices make.
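As a sketch of such a launch (assuming the add() kernel defined above; the array names a, b and c follow that kernel’s signature rather than the earlier x/y example):
int numThreads = 256;                              // a conventional threads-per-block choice
int numBlocks = (N + numThreads - 1) / numThreads; // round up so the grid covers any remainder
add<<<numBlocks, numThreads>>>(N, a, b, c);
cudaDeviceSynchronize();                           // wait for the kernel before checking results
Profiling the resulting executable with nvprof, exactly as in the previous episodes, then shows how sensitive the runtime is to these two numbers.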
Putting it all together
Introduce the add() kernel above into your code from the previous episodes, and adjust the rest of the program to make it work.
- Verify that it still produces correct results.
- Use nvprof to compare the performance to that of the two previous solutions.
Other bits and pieces
You can define shared variables in your kernel functions that are visible to all running threads in a block. Maybe you want to have a flag to record whether some unusual condition arose while processing a block of data?
__global__ void add(int *a, int *b, int *c) {
__shared__ int block_status;
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
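As a sketch of how such a flag could actually be used (this is not the lesson’s code: the extra status output array and the overflow check are invented for illustration, and the __syncthreads() barriers are explained just below), one thread initializes the flag, any thread may raise it, and one thread records it for the block:
__global__ void add(int n, int *a, int *b, int *c, int *status) {
  __shared__ int block_status;              // one copy per block, visible to all its threads

  if (threadIdx.x == 0)
    block_status = 0;                       // let a single thread initialize the flag
  __syncthreads();                          // everyone must see the initialized value

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    c[i] = a[i] + b[i];
    if (c[i] < 0 && a[i] > 0 && b[i] > 0)   // invented "unusual condition": signed overflow
      block_status = 1;                     // any thread may raise the flag
  }
  __syncthreads();                          // wait until every thread has had a chance to set it

  if (threadIdx.x == 0)
    status[blockIdx.x] = block_status;      // record one result per block
}
The host would allocate status with one int per block (numBlocks entries) to read the flags back.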
You can also synchronize the execution of code by setting barriers to get threads to reach common points in the problem. Let’s say that you are evolving some system and need all of the threads to finish their work before going on to the next step. This is where synchronization comes into play.
// Do the current time step
__syncthreads();
// Go on to do the next time step
This helps avoid problems like race conditions, where incorrect data is being used in your calculation. GPU programming is basically a type of shared-memory programming, so the same problems and cautions we saw with OpenMP apply here.
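Note that __syncthreads() is a barrier for the threads within one block only; if a step needs results from every block, the usual approach is to end the kernel and launch a new one for the next step. As a sketch of the within-block case (the kernel name smooth and the three-point averaging are invented for illustration, and the tile size assumes 256 threads per block):
__global__ void smooth(int n, float *in, float *out) {
  __shared__ float tile[256];                // assumes blockDim.x == 256 (illustrative)
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Step 1: every thread stages one element in shared memory.
  if (i < n)
    tile[threadIdx.x] = in[i];
  __syncthreads();                           // barrier: nobody reads the tile until everyone has written

  // Step 2: it is now safe to read neighbouring threads' staged values.
  if (i < n) {
    float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
    float right = (threadIdx.x < blockDim.x - 1 && i + 1 < n) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
    out[i] = (left + tile[threadIdx.x] + right) / 3.0f;
  }
}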
Key Points
A typical kernel indexes data using both blocks and threads