ACENET Summer School - GPGPU: Glossary

Key Points

Introduction
  • A GPU (Graphics Processing Unit) is best at data-parallel, arithmetically intensive calculations

  • CUDA is one of several programming interfaces for general-purpose computing on GPUs

  • Alliance clusters have special GPU-equipped nodes, which must be requested from the scheduler

Hello World
  • Use nvcc to compile

  • CUDA source files are suffixed .cu

  • Use salloc to get an interactive session on a GPU node for testing
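
As a minimal sketch of these points (the file name hello.cu is an assumption), the following program can be compiled with nvcc hello.cu -o hello and run in an interactive session obtained with salloc (for example salloc --gres=gpu:1, though the exact flags vary by cluster):

    #include <cstdio>

    // kernel: runs on the GPU; device-side printf is supported on current cards
    __global__ void hello()
    {
        printf("Hello from GPU thread %d\n", threadIdx.x);
    }

    int main()
    {
        hello<<<1, 4>>>();        // launch one block of four threads
        cudaDeviceSynchronize();  // wait for the kernel so its output appears
        return 0;
    }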

Adding Two Integers
  • The CPU (the ‘host’) and the GPU (the ‘device’) have separate memory banks

  • This requires explicit copying of data to and from the GPU
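
A minimal sketch of this pattern, adding two integers on the device (variable and kernel names are illustrative):

    #include <cstdio>

    __global__ void add(int *a, int *b, int *c)
    {
        *c = *a + *b;   // a single thread performs the addition
    }

    int main()
    {
        int a = 2, b = 7, c;
        int *d_a, *d_b, *d_c;   // pointers into device memory

        cudaMalloc(&d_a, sizeof(int));
        cudaMalloc(&d_b, sizeof(int));
        cudaMalloc(&d_c, sizeof(int));

        // copy the inputs from host memory to device memory
        cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

        add<<<1, 1>>>(d_a, d_b, d_c);   // one block, one thread

        // copy the result back from device memory to host memory
        cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d + %d = %d\n", a, b, c);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
        return 0;
    }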

Adding vectors with GPU threads
  • Threads are the lowest level of parallelization on a GPU

  • A kernel function replaces the code inside the loop to be parallelized

  • The CUDA <<<M,N>>> notation replaces the loop itself: M blocks of N threads each, as sketched below
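
For example, a serial loop like for (int i = 0; i < N; i++) c[i] = a[i] + b[i]; might become the following (a sketch; d_a, d_b and d_c are assumed to be device pointers set up as in the previous example):

    __global__ void add(int *a, int *b, int *c)
    {
        int i = threadIdx.x;   // each thread handles one array element
        c[i] = a[i] + b[i];
    }

    // the loop itself is replaced by the launch:
    // add<<<1, N>>>(d_a, d_b, d_c);   // 1 block of N threads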

Using blocks instead of threads
  • Use nvprof to profile CUDA functions

  • Blocks are the batches in which a GPU handles data

  • Blocks are handled by streaming multiprocessors (SMs)

  • Each block can have up to 1024 threads (on our current GPU cards)
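
The same vector addition can use one block per element instead of one thread per element; a sketch:

    __global__ void add(int *a, int *b, int *c)
    {
        int i = blockIdx.x;   // each block handles one array element
        c[i] = a[i] + b[i];
    }

    // launched with N blocks of 1 thread each:
    // add<<<N, 1>>>(d_a, d_b, d_c);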

Putting It All Together
  • A typical kernel indexes data using both blocks and threads
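
The conventional global index combines the two: blockIdx.x * blockDim.x + threadIdx.x. A typical sketch, including the guard needed when the grid is larger than the data:

    __global__ void add(int *a, int *b, int *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        if (i < n)   // guard: the grid may be larger than the data
            c[i] = a[i] + b[i];
    }

    // e.g. 256 threads per block, and enough blocks to cover n elements:
    // add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);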

GPU Hardware Architecture
  • Compute capability is a version number that represents the features supported by a GPU.

  • Shared memory is a small but fast memory available on each multiprocessor, and it must be managed manually.

  • Since compute capability 2.0, each block can consist of up to 1024 threads, which can be organized in up to three dimensions.

  • Active threads within a warp can only ever execute the same instruction at the same time. If some threads take a different branch, they are set aside and executed later.
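
The compute capability and shared memory size of an installed card can be queried at runtime; a small sketch using the CUDA runtime API:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("%s: compute capability %d.%d, %zu bytes of shared memory per block\n",
               prop.name, prop.major, prop.minor, prop.sharedMemPerBlock);
        return 0;
    }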

Thread Addressing
  • Using 2D or 3D grid and block definitions can make it easier to address multi-dimensional data.

  • CUDA provides the special type dim3 for defining multi-dimensional grids and blocks, as sketched below.
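
A sketch of 2D addressing (the kernel name set2d and the row-major layout are illustrative):

    __global__ void set2d(float *a, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            a[y * width + x] = x + y;   // row-major 2D addressing
    }

    // host side: 32 x 32 = 1024 threads per block (the maximum),
    // with enough blocks to cover the whole array:
    // dim3 block(32, 32);
    // dim3 grid((width + 31) / 32, (height + 31) / 32);
    // set2d<<<grid, block>>>(d_a, width, height);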

Memory Performance
  • GPU memory has much higher bandwidth than CPU memory, but this comes at the cost of higher latency.

  • The time to copy data between CPU and GPU memory can be a large fraction of total runtime.

  • Access of consecutive memory addresses is much faster than strided access.

  • Using shared memory can help when strided access cannot be avoided, but it must be managed manually; see the sketch below.
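
Matrix transposition is a classic case where strided access cannot be avoided in global memory; a sketch of staging a tile in shared memory so that both global reads and writes touch consecutive addresses (TILE and the kernel name are illustrative):

    #define TILE 32

    __global__ void transpose(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

        __syncthreads();   // wait until the whole tile has been loaded

        x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }

    // launched with dim3 block(TILE, TILE) and an (n/TILE) x (n/TILE) grid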

Exercise: Julia Set
Where To Go Next?
  • Many software libraries implement highly optimized solutions for common problems; see the example below.
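
For instance, the vector update y = alpha*x + y can be delegated to cuBLAS instead of writing a kernel by hand; a sketch (compile with nvcc saxpy.cu -lcublas; the file name is an assumption):

    #include <cstdio>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 1000;
        float alpha = 2.0f;
        float x[n], y[n];
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y
        cublasDestroy(handle);

        cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("y[0] = %.1f\n", y[0]);   // expect 4.0
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }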
