ACENET Summer School - GPGPU: Glossary

Key Points

Introduction
  • A GPU (Graphics Processing Unit) is best at data-parallel, arithmetically intensive calculations

  • CUDA is one of several programming interfaces for general-purpose computing on GPUs

  • Alliance clusters have special GPU-equipped nodes, which must be requested from the scheduler

Hello World
  • Use nvcc to compile

  • CUDA source files are suffixed .cu

  • Use salloc to get an interactive session on a GPU node for testing
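
As a minimal sketch of these points (the file name hello.cu is an assumption), the following program can be compiled with nvcc hello.cu -o hello and run in an interactive session obtained with salloc (for example salloc --gres=gpu:1, though the exact flags vary by cluster):

    #include <cstdio>

    // kernel: runs on the GPU; device-side printf is supported on current cards
    __global__ void hello()
    {
        printf("Hello from GPU thread %d\n", threadIdx.x);
    }

    int main()
    {
        hello<<<1, 4>>>();        // launch one block of four threads
        cudaDeviceSynchronize();  // wait for the kernel so its output appears
        return 0;
    }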

Adding Two Integers
  • The CPU (the ‘host’) and the GPU (the ‘device’) have separate memory banks

  • This requires explicit copying of data to and from the GPU
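
A minimal sketch of this pattern, adding two integers on the device (variable and kernel names are illustrative):

    #include <cstdio>

    __global__ void add(int *a, int *b, int *c)
    {
        *c = *a + *b;   // a single thread performs the addition
    }

    int main()
    {
        int a = 2, b = 7, c;
        int *d_a, *d_b, *d_c;   // pointers into device memory

        cudaMalloc(&d_a, sizeof(int));
        cudaMalloc(&d_b, sizeof(int));
        cudaMalloc(&d_c, sizeof(int));

        // copy the inputs from host memory to device memory
        cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

        add<<<1, 1>>>(d_a, d_b, d_c);   // one block, one thread

        // copy the result back from device memory to host memory
        cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d + %d = %d\n", a, b, c);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        return 0;
    }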

Adding vectors with GPU threads
  • Threads are the lowest level of parallelization on a GPU

  • A kernel function replaces the code inside the loop to be parallelized

  • The CUDA <<<M,N>>> notation replaces the loop itself: M blocks of N threads each, as sketched below
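
For example, a serial loop like for (int i = 0; i < N; i++) c[i] = a[i] + b[i]; might become the following (a sketch; d_a, d_b and d_c are assumed to be device pointers set up as in the previous example):

    __global__ void add(int *a, int *b, int *c)
    {
        int i = threadIdx.x;   // each thread handles one array element
        c[i] = a[i] + b[i];
    }

    // the loop itself is replaced by the launch:
    // add<<<1, N>>>(d_a, d_b, d_c);   // 1 block of N threads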

Using blocks instead of threads
  • Use nvprof to profile CUDA functions

  • Blocks are the batches in which a GPU handles data

  • Blocks are handled by streaming multiprocessors (SMs)

  • Each block can have up to 1024 threads (on our current GPU cards)
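
The same vector addition can use one block per element instead of one thread per element; a sketch:

    __global__ void add(int *a, int *b, int *c)
    {
        int i = blockIdx.x;   // each block handles one array element
        c[i] = a[i] + b[i];
    }

    // launched with N blocks of 1 thread each:
    // add<<<N, 1>>>(d_a, d_b, d_c);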

Putting It All Together
  • A typical kernel indexes data using both blocks and threads
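
The conventional global index combines the two: blockIdx.x * blockDim.x + threadIdx.x. A typical sketch, including the guard needed when the grid is larger than the data:

    __global__ void add(int *a, int *b, int *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        if (i < n)   // guard: the grid may be larger than the data
            c[i] = a[i] + b[i];
    }

    // e.g. 256 threads per block, and enough blocks to cover n elements:
    // add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);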

GPU Hardware Architecture
  • Compute capability is a version number that represents the features supported by a GPU.

  • Shared memory is a small but fast memory available on each multiprocessor, and it must be managed manually.

  • Since compute capability 2.0, each block can consist of up to 1024 threads, which can be organized in up to three dimensions.

  • Active threads within a warp can only ever execute the same instruction at the same time. If some threads take a different branch, they are set aside and executed later.
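
The compute capability and shared memory size of an installed card can be queried at runtime; a small sketch using the CUDA runtime API:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("%s: compute capability %d.%d, %zu bytes of shared memory per block\n",
               prop.name, prop.major, prop.minor, prop.sharedMemPerBlock);
        return 0;
    }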

Thread Addressing
  • Using 2D or 3D grid and block definitions can make it easier to address multi-dimensional data.

  • CUDA provides the special type dim3 for defining multi-dimensional grids and blocks, as sketched below.
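
A sketch of 2D addressing (the kernel name set2d and the row-major layout are illustrative):

    __global__ void set2d(float *a, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            a[y * width + x] = x + y;   // row-major 2D addressing
    }

    // host side: 32 x 32 = 1024 threads per block (the maximum),
    // with enough blocks to cover the whole array:
    // dim3 block(32, 32);
    // dim3 grid((width + 31) / 32, (height + 31) / 32);
    // set2d<<<grid, block>>>(d_a, width, height);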

Memory Performance
  • GPU memory has much higher bandwidth than CPU memory, but this comes at the cost of higher latency.

  • The time to copy data between CPU and GPU memory can be a large fraction of total runtime.

  • Access of consecutive memory addresses is much faster than strided access.

  • Using shared memory can help when strided access cannot be avoided, but it must be managed manually; see the sketch below.
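
Matrix transposition is a classic case where strided access cannot be avoided in global memory; a sketch of staging a tile in shared memory so that both global reads and writes touch consecutive addresses (TILE and the kernel name are illustrative):

    #define TILE 32

    __global__ void transpose(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

        __syncthreads();   // wait until the whole tile has been loaded

        x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }

    // launched with dim3 block(TILE, TILE) and an (n/TILE) x (n/TILE) grid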

Exercise: Julia Set
Where To Go Next?
  • Many software libraries implement highly optimized solutions for common problems; see the example below.
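
For instance, the vector update y = alpha*x + y can be delegated to cuBLAS instead of writing a kernel by hand; a sketch (compile with nvcc saxpy.cu -lcublas; the file name is an assumption):

    #include <cstdio>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 1000;
        float alpha = 2.0f;
        float x[n], y[n];
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y
        cublasDestroy(handle);

        cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %.1f\n", y[0]);   // expect 4.0
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }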
