Programming GPUs with OpenMP
Overview
Teaching: 35 min
Exercises: 5 min
Questions
How do we program GPUs with OpenMP?
Objectives
Learn about tools available for programming GPUs
Learn about programming GPUs with OpenMP
Introduction to OpenMP Device Offload
- GPUs can deliver very high performance per compute node.
- Computing workloads are often accelerated by GPUs in HPC clusters.
The GPU Architecture
- CPUs are designed to handle a wide range of tasks quickly, but are limited in the number of tasks that can run concurrently.
- GPUs consist of thousands of smaller, more specialized processing units (cores).
Nvidia calls its parallel processing platform Compute Unified Device Architecture (CUDA).
- Nvidia H100 GPUs
- up to 132 Streaming Multiprocessors
- 128 FP32 CUDA cores per SM
- total 16896 CUDA cores.
- Intel® Xeon® Platinum 9282
- 56 cores
- 1 AVX-512 vector unit per core can process 16 FP32 numbers at a time.
Comparing the two, the H100 offers 16896 FP32 CUDA cores, while the Xeon can operate on 56 x 16 = 896 FP32 numbers at a time. This massively parallel architecture is what allows GPUs to process huge amounts of data at extremely high speeds.
Tools for Programming GPUs
There are several tools available for programming GPUs.
- CUDA is an NVIDIA-specific programming model and language. With CUDA you can maximize your program's performance, but it requires significant refactoring of existing C/C++ or Fortran code.
- PyCUDA. Gives access to CUDA functionality from Python code.
- OpenCL. Programs written in OpenCL can run on CPUs, GPUs, FPGAs, and other hardware accelerators. It is complex and hard to program, and is not widely used.
- OpenMP and OpenACC. Directive-based, offer a quick path to accelerated computing with less programming effort.
Read more about OpenMP on GPUs
Differences between OpenMP and OpenACC
- The OpenMP approach is more prescriptive: programmers tell the compiler explicitly where and how to parallelize the code.
- OpenACC programs are more descriptive: they express the available parallelism and rely more on the compiler to implement it efficiently (see the sketch below).
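To give a feel for the two styles, here is a minimal, illustrative sketch using the vector-addition loop developed later in this episode; the clauses are explained in the following sections.

/* OpenMP offload: prescriptive -- we state how to distribute the loop
   over teams and threads and how to move the data. */
#pragma omp target teams distribute parallel for \
        map(to: A[0:size], B[0:size]) map(from: C[0:size])
for (int i = 0; i < size; i++)
    C[i] = A[i] + B[i];

/* OpenACC: descriptive -- we mark the loop as parallel and let the
   compiler decide how to schedule it on the device. */
#pragma acc parallel loop copyin(A[0:size], B[0:size]) copyout(C[0:size])
for (int i = 0; i < size; i++)
    C[i] = A[i] + B[i];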
The OpenMP GPU programming model
GPU programming with OpenMP is based on a host-device model.
- The host is where the initial thread of a program begins to run.
- The host can be connected to one or more GPU devices.
- Each Device is composed of one or more Compute Units.
- Each Compute Unit is composed of one or more Processing Elements.
- Memory is divided into host memory and device memory.
How many devices are available?
The omp_get_num_devices routine returns the number of target devices.
#include <stdio.h>
#include <omp.h>

int main()
{
    int num_devices = omp_get_num_devices();
    printf("Number of devices: %d\n", num_devices);
}
Compile with gcc/9.3.0:
module load gcc/9.3.0
gcc device_count.c -fopenmp -o device_count
Run it on the login node. How many devices are available on the login node?
Then run it on a GPU node:
srun --gpus-per-node=1 ./device_count
Request two GPUs. How many devices are detected now?
How to offload to a GPU?
- Identify routines (compute kernels) suitable for accelerators.
- Explicitly express parallelism in the kernel.
- Manage data transfer between CPU and Device.
Example: compute sum of two vectors.
Start from the serial code vadd_gpu_template.c.
/* Compute sum of vectors A and B */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    double start, end;
    int size = 1e8;
    float *A;
    float *B;
    float *C;
    double sum = 0.0;
    int ncycles = atoi(argv[1]);

    A = (float *)malloc(size * sizeof(float));
    B = (float *)malloc(size * sizeof(float));
    C = (float *)malloc(size * sizeof(float));

    /* Initialize vectors */
    for (int i = 0; i < size; i++)
    {
        A[i] = (float)rand() / RAND_MAX;
        B[i] = (float)rand() / RAND_MAX;
    }

    start = omp_get_wtime();
    for (int k = 0; k < ncycles; k++)
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
    end = omp_get_wtime();

    printf("Time: %f seconds\n", end - start);
    sum = sum / size;
    printf("Sum = %f\n", sum / ncycles);

    free(A);
    free(B);
    free(C);
}
gcc vadd_gpu_template.c -fopenmp -O3 -o vadd_serial
In order to get accurate timing on GPUs, we need to run the summation multiple times. The number of repeats can be specified on the command line:
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_serial 10
Execution time of the serial code is around 1.6 sec.
How to Offload: the OpenMP Target Directive
- OpenMP offload constructs were introduced in OpenMP 4.0.
The OpenMP target directive instructs the compiler to generate a target task, which executes the enclosed code on the target device.
Make a copy of the code:
cp vadd_gpu_template.c vadd_gpu.c
Identify the code block we want to offload to the GPU and generate a target task:
...
    for (int k = 0; k < ncycles; k++)
        #pragma omp target
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
Compile and run on a GPU node:
gcc vadd_gpu.c -fopenmp -O3 -o vadd_gpu
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_gpu 10
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
The compiler created a target task and offloaded it to the GPU, but the data is still on the host!
Moving Data Between the Host and the Device: the OpenMP Map Clause.
map(map-type: list[0:N])
- map-type may be to, from, tofrom, or alloc. The clause defines how the variables in list are moved between the host and the device.
...
    for (int k = 0; k < ncycles; k++)
        #pragma omp target map(to: A[0:size],B[0:size]) map(from: C[0:size])
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
Compile and run on a GPU node:
gcc vadd_gpu.c -fopenmp -O3 -o vadd_gpu
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_gpu 1
Execution time of the offloaded code is slower, around 8 sec. This is expected because:
- By using the omp target directive, a target task is created and offloaded to the GPU, but it is not parallelized: it executes on a single thread.
- The processors on GPUs are slower than the CPUs.
- It takes some time for data to be transferred from the host to the device.
The OpenMP Teams directive.
- The teams construct creates a league of teams.
- The master thread of each team executes the code.
Using the teams construct is the first step to utilizing multiple threads on a device. A teams construct creates a league of teams. Each team consists of some number of threads; when the teams directive is encountered, the master thread of each team executes the code in the teams region.
...
    for (int k = 0; k < ncycles; k++)
        #pragma omp target teams map(to: A[0:size],B[0:size]) map(from: C[0:size])
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
Compile and run on a GPU node:
gcc vadd_gpu.c -fopenmp -O3 -o vadd_gpu
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_gpu 1
Execution time is the same!
Now we have multiple thread teams executing on the device; however, they are all performing the same work.
It is also important to note that only the master thread of each team is active.
Review
Right now we have:
- Multiple thread teams working on a device
- All thread teams are performing the same work
- Only the master thread of each team is executing
The OpenMP Distribute directive.
The distribute construct can be used to spread iterations of a loop among teams.
...
    for (int k = 0; k < ncycles; k++)
        #pragma omp target teams distribute \
            map(to: A[0:size],B[0:size]) map(from: C[0:size])
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
Compile and run on a GPU node:
gcc vadd_gpu.c -fopenmp -O3 -o vadd_gpu
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_gpu 1
Execution time is faster, about 1 sec, but how does it compare with the CPU version?
gcc vadd_gpu_template.c -O3 -fopenmp -o vadd_serial
srun --mem=5000 -c4 ./vadd_serial 1
This is disappointing: the serial version running on a single CPU is 5 times faster!
The master thread is the only one we have used so far out of all the threads available in a team!
At this point, we have teams performing independent iterations of this loop, however each team is only executing serially. In order to work in parallel within a team we can use threads as we normally would with the basic OpenMP usage. Composite constructs can help.
The distribute parallel for construct specifies a loop that can be executed in parallel by multiple threads that are members of multiple teams.
...
    for (int k = 0; k < ncycles; k++)
        #pragma omp target teams distribute \
            parallel for \
            map(to: A[0:size],B[0:size]) map(from: C[0:size])
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
Compile and run on a GPU node:
gcc vadd_gpu.c -fopenmp -O3 -o vadd_gpu
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_gpu 1
Now we have teams splitting the work into large chunks on the device, while threads within each team further split the work into smaller chunks and execute them in parallel.
We have two problems:
- It did not take less time to execute.
- The sum is not calculated correctly.
Maybe the work is not large enough for the GPU? To increase the amount of calculations, we could repeat the loop multiple times. Try running 20 cycles of addition. Did the execution time increase 20 times? It took only 7 seconds instead of the expected 20 seconds. We are on the right track, but what else can be improved in our code to make it more efficient?
The OpenMP Target Data directive.
Each time we iterate, we transfer data between the host and the device, and we pay a high cost for this data transfer. Arrays A and B need to be transferred only once, at the beginning of the calculation, and the results need to be returned to the host only once, after the calculations are complete. The target data directive can help with this.
The target data directive maps variables to a device data environment for the extent of the region.
...
    #pragma omp target data map(to: A[0:size], B[0:size]) map(from: C[0:size])
    for (int k = 0; k < ncycles; k++)
        #pragma omp target teams distribute \
            parallel for \
            map(tofrom: sum)
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
Compile and run on a GPU node for 20 cycles:
gcc vadd_gpu.c -fopenmp -O3 -o vadd_gpu
srun --gpus-per-node 1 --mem=5000 -c4 ./vadd_gpu 20
The time decreased from 7 to 1 sec! Try running for 100 cycles. Did the time increase proportionally to the work?
Let's take care of the issue with sum.
We get the wrong result for sum: it is zero, so it looks like it is not transferred back from the device.
...
    #pragma omp target data map(to: A[0:size], B[0:size]) map(from: C[0:size])
    for (int k = 0; k < ncycles; k++)
        #pragma omp target teams distribute \
            parallel for \
            map(tofrom: sum) reduction(+:sum)
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
...
We need to add a reduction clause and map sum tofrom the device since we parallelized the for loop.
Using Nvidia HPC compiler
The Nvidia HPC compiler generates a faster binary, but it does not run on the training cluster.
module load nvhpc/22.7
nvc vadd_gpu.c -O3 -fopenmp -mp=gpu -Minfo=mp
nvc vadd_gpu.c -O3 -fopenmp -mp=multicore -Minfo=mp
Getting info about available GPUs
Using the nvidia-smi command
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 >& gpu.log &
Using OpenMP functions
#include <stdio.h>
#include <omp.h>

int main()
{
    int nteams, nthreads;
    int num_devices = omp_get_num_devices();
    printf("Number of available devices %d\n", num_devices);

    #pragma omp target teams defaultmap(tofrom : scalar)
    {
        if (omp_is_initial_device())
            printf("Running on host\n");
        else
        {
            nteams = omp_get_num_teams();
            #pragma omp parallel
            #pragma omp single
            if (!omp_get_team_num())
            {
                nthreads = omp_get_num_threads();
                printf("Running on device with %d teams x %d threads\n", nteams, nthreads);
            }
        }
    }
}
Compile with gcc/9.3.0:
module load gcc/9.3.0
gcc get_device_info.c -fopenmp -o gpuinfo
srun --gpus-per-node=1 --mem=1000 ./gpuinfo
With gcc/11 you will get 216 teams x 8 threads (Quadro RTX 6000)
Compile with nvidia compiler:
module load nvhpc/22.7
nvc get_device_info.c -mp=gpu -o gpuinfo
srun --gpus-per-node=1 --mem=1000 ./gpuinfo
With nvhpc you’ll get 72 teams x 992 threads on the same GPU.
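If needed, the sizes can also be requested explicitly with the num_teams and thread_limit clauses, which the runtime treats as upper limits. A minimal sketch, applied to our vector-addition loop:

/* Sketch: ask for at most 128 teams of at most 64 threads each.
   The runtime may create fewer; these clauses are only upper bounds. */
#pragma omp target teams distribute parallel for \
        num_teams(128) thread_limit(64) \
        map(to: A[0:size], B[0:size]) map(from: C[0:size])
for (int i = 0; i < size; i++)
    C[i] = A[i] + B[i];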
Parallelizing with OpenACC
cp vadd_gpu_template.c vadd_gpu_acc.c
The acc kernels Construct
Let's add a single, simple OpenACC directive before the block containing the for loop nest.
#pragma acc kernels
This pragma tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case).
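Applied to the loop nest in vadd_gpu_acc.c, the placement could look like this sketch (only the timed loop nest is shown):

    start = omp_get_wtime();
    /* The kernels region covers the whole loop nest; the compiler
       decides which loops it can turn into GPU kernels. */
    #pragma acc kernels
    for (int k = 0; k < ncycles; k++)
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
    end = omp_get_wtime();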
Compiling with Nvidia HPC SDK
Load the Nvidia HPC module:
module load StdEnv/2020 nvhpc/22
To compile the code with OpenACC, we need to add the -acc flag:
nvc vadd_gpu_acc.c -O3 -acc
Let's run the program. It runs slowly, even compared to our serial CPU code.
Were the loops parallelized?
- enable printing information about the parallelization (add the option -Minfo=accel) and recompile:
nvc vadd_gpu_acc.c -O3 -acc -Minfo=accel
33, Generating implicit copyin(A[:100000000],B[:100000000]) [if not already present]
Generating implicit copy(sum) [if not already present]
Generating implicit copyout(C[:100000000]) [if not already present]
35, Complex loop carried dependence of B->,A-> prevents parallelization
Loop carried dependence due to exposed use of C prevents parallelization
Complex loop carried dependence of C-> prevents parallelization
Accelerator serial kernel generated
Generating NVIDIA GPU code
35, #pragma acc loop seq
36, #pragma acc loop seq
35, Complex loop carried dependence of C-> prevents parallelization
36, Complex loop carried dependence of B->,A->,C-> prevents parallelization
The __restrict Type Qualifier
Our data arrays (A, B, and C) are dynamically allocated and are accessed via pointers. Pointers may create data dependencies because:
- More than one pointer can access the same chunk of memory in the C language.
- Arithmetic operations on pointers are allowed.
Adding the __restrict qualifier to a pointer promises the compiler that the memory it points to is not accessed through any other pointer, which makes the loop safe to parallelize and enables better optimization.
float * __restrict A;
float * __restrict B;
float * __restrict C;
After this modification the compiler parallelizes the loop and even automatically determines what data should be copied to the device and what reduction operations are needed.
- It is important to examine optimization reports.
- It is best to add explicit data transfer directives and reduction clauses to ensure your program will work correctly when compiled with different compilers, as sketched below.
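For example, an explicit version of the OpenACC loop might look like the following sketch (one possible form, mirroring the map clauses we used with OpenMP; not the only correct way to write it):

    /* Explicit data movement and reduction instead of relying on the
       compiler's implicit choices. */
    #pragma acc data copyin(A[0:size], B[0:size]) copyout(C[0:size])
    for (int k = 0; k < ncycles; k++)
    {
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < size; i++)
        {
            C[i] = A[i] + B[i];
            sum += C[i];
        }
    }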
Key Points
OpenMP offers a quick path to accelerated computing with less programming effort