Introduction
A GPU (Graphics Processing Unit) is best at data-parallel, arithmetically intensive calculations
CUDA is one of several programming interfaces for general-purpose computing on GPUs
Alliance clusters have special GPU-equipped nodes, which must be requested from the scheduler
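For the last point, a minimal sketch of a Slurm job script that asks the scheduler for one GPU; the account name, resource amounts, and program name are placeholders, and the exact options can vary between clusters:

```bash
#!/bin/bash
#SBATCH --account=def-someuser   # placeholder: your allocation's account
#SBATCH --gpus-per-node=1        # request one GPU on the node
#SBATCH --mem=4G                 # host memory for the job
#SBATCH --time=0:15:0            # walltime
./my_cuda_program                # placeholder executable
```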
Hello World
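No key points are listed for this episode; a sketch of what a minimal CUDA "Hello World" typically looks like (compiled with nvcc, e.g. nvcc hello.cu -o hello):

```c
#include <cstdio>

// __global__ marks a kernel: code that runs on the GPU but is launched from the CPU
__global__ void hello()
{
    printf("Hello World from the GPU!\n");
}

int main()
{
    hello<<<1, 1>>>();        // launch one block containing one thread
    cudaDeviceSynchronize();  // wait for the kernel so its output is flushed
    return 0;
}
```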
Adding Two Integers
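Again no key points are listed; a sketch of the usual pattern for this episode, where two integers are copied to the GPU, added by a single thread, and the result copied back (variable names are illustrative):

```c
#include <cstdio>

__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;   // one thread adds the two values
}

int main()
{
    int a = 2, b = 7, c = 0;
    int *d_a, *d_b, *d_c;   // copies that live in GPU memory

    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));

    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, 1>>>(d_a, d_b, d_c);   // one block, one thread

    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```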
Adding vectors with GPU threads
Threads are the lowest level of parallelization on a GPU
A kernel function replaces the code inside the loop to be parallelized
The CUDA <<<M,N>>> notation replaces the loop itself (M blocks of N threads each)
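A sketch of how these three points fit together for vector addition (names are illustrative, not necessarily the lesson's own code):

```c
// Serial version: a loop walks over every element
//   for (int i = 0; i < N; i++)
//       c[i] = a[i] + b[i];

// CUDA version: the kernel holds the former loop body, and threadIdx.x takes
// the place of the loop index i
__global__ void add(int *a, int *b, int *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

// The <<<M,N>>> launch replaces the loop itself, here 1 block of N threads:
//   add<<<1, N>>>(d_a, d_b, d_c);
```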
Using blocks instead of threads
Use nvprof to profile CUDA functions
Blocks are the batches of threads in which a GPU processes data
Blocks are handled by streaming multiprocessors (SMs)
Each block can have up to 1024 threads (on our current GPU cards)
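The same vector addition expressed with blocks instead of threads, as a sketch; blockIdx.x now takes the place of the loop index:

```c
__global__ void add(int *a, int *b, int *c)
{
    int i = blockIdx.x;   // one thread per block, so the block index is the element index
    c[i] = a[i] + b[i];
}

// Launched with N blocks of 1 thread each:
//   add<<<N, 1>>>(d_a, d_b, d_c);
```

Profiling the resulting executable, e.g. nvprof ./vector_add (program name is a placeholder), reports how much time is spent in the kernel versus in the cudaMemcpy calls.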
Putting It All Together
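No key points are listed for this episode; presumably it combines blocks and threads, for which the usual indexing pattern looks like this sketch:

```c
// Each thread computes a unique global index from its block and thread IDs.
__global__ void add(int *a, int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // guard: the last block may be only partly filled
        c[i] = a[i] + b[i];
}

// Launch enough blocks of, say, 256 threads to cover all n elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   add<<<blocks, threads>>>(d_a, d_b, d_c, n);
```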
GPU Hardware Architecture
Compute capability is a version number that represents the features supported by a GPU.
Shared memory is a small but fast memory available on each multiprocessor; it needs to be managed manually.
Since compute capability 2.0, each block can consist of up to 1024 threads, which can further be organized in up to three dimensions.
Active threads within a warp can only ever execute the same instruction at the same time. If some threads take a different branch, they are masked off and executed later.
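Several of these properties can be queried at runtime with cudaGetDeviceProperties; a small sketch:

```c
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

    printf("Name:                  %s\n",   prop.name);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMs): %d\n",   prop.multiProcessorCount);
    printf("Shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads / block:   %d\n",   prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n",   prop.warpSize);
    return 0;
}
```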
Thread Addressing
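No key points are listed for this episode; a sketch of how thread addressing usually generalizes to more than one dimension (the kernel and its arguments are illustrative):

```c
// 2D grid of 2D blocks: each thread computes its own (row, col) pair.
__global__ void scale(float *m, int rows, int cols, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)
        m[row * cols + col] *= factor;
}

// Launched with dim3 dimensions, rounded up to cover the whole matrix:
//   dim3 threads(16, 16);
//   dim3 blocks((cols + 15) / 16, (rows + 15) / 16);
//   scale<<<blocks, threads>>>(d_m, rows, cols, 2.0f);
```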
Memory Performance
GPU memory has much higher bandwidth than CPU memory, but this comes at the cost of higher latency.
The time needed to copy data between CPU and GPU memory can be a large fraction of the total runtime.
Accessing consecutive memory addresses is much faster than strided access.
Shared memory can help when strided access cannot be avoided, but it has to be managed manually.
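A sketch contrasting the two access patterns (kernel names are illustrative):

```c
// Coalesced: thread i touches element i, so a warp reads consecutive addresses
// and the hardware can combine them into a few memory transactions.
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighbouring threads touch addresses `stride` elements apart,
// which splits each warp's access into many separate transactions.
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```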
Exercise: Julia Set
Where To Go Next?