Introduction
Overview
Teaching: 30 min
Exercises: 0 minQuestions
What is shared-memory programming?
What is OpenMP?
Objectives
Understand the shared-memory programming model
Know that OpenMP is a standard
Parallel programs come in two broad flavors: shared-memory and message-passing. In this workshop, we will be looking at shared-memory programming, with a focus on OpenMP programming.
What is shared-memory programming? In any parallel program, the general idea is to have multiple threads of execution so that you can break up your problem and have each thread handle one part. These multiple threads need to be able to communicate with each other as your program runs. In a shared-memory program, this communication happens through the use of global variables stored in the global memory of the local machine. This means that communication between the various threads is extremely fast, as it happens at the speed of RAM access. But your program will be limited to a single physical machine, since all threads need to be able to see the same RAM.
OpenMP is one way of writing shared-memory parallel programs. OpenMP is actually a specification, which has been implemented by many interested groups.
The standard describes extensions to a C/C++ or FORTRAN compiler. This means that you need use a compiler that supports OpenMP. Unfortunately, there have been several versions of the OpenMP specification, and several sections of it are optional, so it is up to the programmer to investigate the compiler they want to use and see if it supports the parts of OpenMP that they wish to use. Luckily, the vast majority of OpenMP behaves the way you expect it to with most modern compilers. When possible, we will try and highlight any odd behaviors.
Since OpenMP is meant to be used with either C/C++ or FORTRAN, you will need to know how to work with at least one of these languages. This workshop will use C as the language for the examples. As a reminder, a simple hello world program in C would look like the following.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
printf("Hello World\n");
}
In order to compile this code, you would need to use the following command:
gcc -o hello_world hello_world.c
This gives you an executable file that will print out the text “Hello World”. You can do this with the command:
./hello_world
Hello World
GCC on Compute Canada
The default environment on the general purpose clusters (Beluga, Cedar, Graham) includes a gcc compiler. Not all systems are set up this way, though. On Niagara, for instance, the default environment is very minimal and you must load a module explicitly to access any compiler:
$ module load gcc $ gcc --version gcc (GCC) 7.3.0
Key Points
OpenMP programs are limited to a single physical machine
You use OpenMP with C, C++, or Fortran
Hello World
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How do you compile and run an OpenMP program?
What are OpenMP pragmas?
Objectives
Compiling and running an OpenMP program
Since OpenMP is an extension to the compiler, you need to be able to tell the compiler when and where to add the code necessary to create and use threads for the parallel sections. This is handled through special statements called pragmas. To a compiler that doesn’t understand OpenMP, pragmas look like comments. The basic forms are:
C/C++
#pragma omp ...
FORTRAN
!$OMP ...
Hello World
How do we add in parallelism to the basic hello world program? The very first pragma that we will look at is the parallel
pragma.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv) {
#pragma omp parallel
printf("Hello World\n");
}
To compile it, you’ll need to add an extra flag to tell the compiler to treat the source code as an OpenMP program.
gcc -fopenmp -o hello hello.c
If you prefer Intel compilers to GCC, use:
icc -qopenmp -o hello hello.c
When you run this program, you should see the output “Hello World” multiple times. But how many?
The standard says this is implementation dependent. But the usual default is, OpenMP will look at the machine that it is running on and see how many cores there are. It will then launch a thread for each core. You can control the number of threads, however, with environment variables. If you want only 3 threads, do the following:
export OMP_NUM_THREADS=3
./hello_world
Using multiple cores
Try running the hello world program with different numbers of threads. Can you use more threads than the cores on the machine?
OpenMP with Slurm
When you wish to submit an OpenMP job to the job scheduler Slurm, you can use the following boilerplate.
#SBATCH --account=acenet-wa #SBATCH --reservation=acenet-wr_cpu #SBATCH --time=0:15:0 #SBATCH --cpus-per-task=3 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ./hello_world
You could also ask for an interactive session with multiple cores like so:
salloc --account=acenet-wa --reservation=acenet-wr_cpu --cpus-per-task=4 --mem=12G --time=3:0:0
Identifying threads
How can you tell which thread is doing what? The OpenMP specification includes a number of functions that are made available through the included header file “omp.h”. One of them is the function “omp_get_thread_num()”, to get an ID of the thread running the code.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv) {
int id;
#pragma omp parallel
{
id = omp_get_thread_num();
printf("Hello World from thread %d\n", id);
}
}
Here, you will get each thread tagging their output with their unique ID, a number between 0 and NUM_THREADS-1.
Pragmas and code blocks
An OpenMP pragma applies to the following code block in C or C++. Code blocks are either a single line, or a series of lines wrapped by curly brackets. Because Fortran doesn’t have an analogous construction, many OpenMP pragmas in Fortran are paired with an “end” pragma, such as
!$omp parallel end
.
Thread ordering
What order do the threads write out their messages in? Try running the program a few times. What’s going on?
Solution
Conditional compilation
We said earlier that you should be able to use the same code for both OpenMP and serial work. Try compiling the code without the -fopenmp flag. What happens? Can you figure out how to fix it?
Hint: The compiler defines a preprocessor variable _OPENMP
Solution
Execution model
It’s a good idea to have a mental model of what happens when OpenMP does something in parallel. When the program begins there is only one thread active, called the master thread. When the code reaches a point where parallel operations are required, the master thread creates (or activates) additional threads. We call this “forking”. The master and the created threads work concurrently for a while. Then at the end of the parallel code the created threads die or are suspended, and the flow of control returns to the master thread. We call this “joining”.
OpenMP also divides the memory into two types: Global (or shared) memory, and thread-local memory. Every thread can read and write the global memory, but each thread also gets a little slice of memory that can only be read or written by that thread.
Key Points
Pragmas are directives to the compiler to parallelize something
Thread number is typically controlled with an environment variable, OMP_NUM_THREADS
Order of execution of parallel elements is not guaranteed.
If the compiler doesn’t recognize OpenMP pragmas, it will compile a single-threaded program. But you may need to escape OpenMP function calls.
Linear algebra
Overview
Teaching: 20 min
Exercises: 30 minQuestions
How do I parallelize a loop?
Objectives
Use the PARALLEL FOR pragma (or PARALLEL DO)
Use the PRIVATE clause
Be able to identify data dependencies
One of the classic applications of programming is linear algebra, in all of its beauty and complexity. We will use a few simple problems to see how to execute loops in parallel using OpenMP.
Multiply an array by a constant
The simplest problem is applying some function to an array of numbers. An example is multiplying each value by some constant number. In serial you would use a for loop (in C; a DO loop in Fortran) to do this multiplication for each element, like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(int argc, char **argv) {
struct timespec ts_start, ts_end;
int size = 1000000;
int multiplier = 2;
int a[size];
int c[size];
int i;
float time_total;
clock_gettime(CLOCK_MONOTONIC, &ts_start);
for (i = 0; i<size; i++) {
c[i] = multiplier * a[i];
}
clock_gettime(CLOCK_MONOTONIC, &ts_end);
time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
printf("Total time is %f ms\n", time_total/1000000);
}
We added some calls to ‘clock_gettime()’ from the ‘time.h’ header file to get the start and end times of the heavy work being done by the for loop. In this case, we get a count of how many seconds and how many nanoseconds elapsed, given in two parts of the time structure. We did some math to get the elapsed time in milliseconds.
Time and Size
What happens to the run time of your program through multiple runs? What happens if you change the size and recompile and rerun?
How complicated is it to turn this program into a parallel program?
Not complicated at all: We add two lines, #include <omp.h>
near the top,
and #pragma omp parallel for
just before the for (...)
statement:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
int main(int argc, char **argv) {
struct timespec ts_start, ts_end;
int size = 1000000;
int multiplier = 2;
int a[size];
int c[size];
int i;
float time_total;
clock_gettime(CLOCK_MONOTONIC, &ts_start);
#pragma omp parallel for
for (i = 0; i<size; i++) {
c[i] = multiplier * a[i];
}
clock_gettime(CLOCK_MONOTONIC, &ts_end);
time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
printf("Time in loop is %f ms. %d iterations.\n", time_total/1000000, size);
}
Compiling and running
Don’t forget to include the
-fopenmp
flag when compiling, and remember to setOMP_NUM_THREADS
when running.
Using more threads
How many threads did you use? What happens to the runtime when you change the number of threads?
In this case, the number of iterations around the for loop gets divided across the number of threads available. In order to do this, however, OpenMP needs to know how many iterations there are in the for loop. It also can’t change part way through the loops. To ensure this, there are some rules:
- You must not change the value of ‘size’ within one of the iterations.
- You must not use a call to ‘break()’ or ‘exit()’ within the for loop, either. These functions pop you out of the for loop before it is done.
Summing the values in a matrix
Now let’s try adding up the elements of a matrix. Moving up to two dimensions adds a new layer of looping. The basic code looks like the following.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
int main(int argc, char **argv) {
struct timespec ts_start, ts_end;
int size = 1000;
int a[size][size];
int i, j;
// Set the matrix values to 1
for (i=0; i<size; i++) {
for (j=0; j<size; j++) {
a[i][j] = 1;
}
}
int c[size];
// Zero the accumulator
for (i=0; i<size; i++) {
c[i] = 0;
}
int total = 0;
float time_total;
clock_gettime(CLOCK_MONOTONIC, &ts_start);
#pragma omp parallel for
for (i = 0; i<size; i++) {
for (j=0; j<size; j++) {
c[i] = c[i] + a[i][j];
}
}
for (i=0; i<size; i++) {
total = total + c[i];
}
clock_gettime(CLOCK_MONOTONIC, &ts_end);
time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
printf("Total is %d, time is %f ms\n", total, time_total/1000000);
}
Is the result correct?
What should be the result of this code? Is that what it does? If not, what might be wrong? Does it do the same for different values of OMP_NUM_THREADS?
Solution
Data Dependencies
Definition
If two statements read or write the same memory location, and at least one of the statements writes that memory location, then there is a ‘'’data dependency’’’ on that memory location between the two statements.
In turning a serial program into a parallel program, it’s vital to maintain the correctness of the serial program. (Assuming the serial program is correct, of course, but that’s a separate question.) ‘'’Data dependency’’’ is an important concept we use to guarantee that we’re transforming a serial program into an equivalent parallel program— equivalent in terms of its output, that is.
If there is a data dependency between two statements, then the order in which those statements are executed may affect the output of the program, and hence its correctness.
Consider this loop that computes a cumulative sum:
for ( i=2; i<N; i=i+1 ) {
a[i] = a[i] + a[i-1];
}
The iteration with i==2
for example reads locations a[1]
and a[2]
and
writes location a[2]
. The iteration with i==3
then reads locations
a[2]
and a[3]
. Since a[2]
was read by both and written by one, there is
a data dependency on a[2]
between the assignment statement in successive loop
iterations. So, the “statements” in the definition don’t need to be separate
lines of code.
Now think about what it would mean to run this loop in parallel: Different iterations would be carried out by different threads of execution, possibly at the same time. If any two iterations didn’t happen in the order dictated by the serial code, the results would be wrong.
There are three types of data dependencies:
- Flow dependencies, like the last example, when one statement uses the results of another
- Anti-dependencies, when one statement should write to a location only ‘‘after’’ another has read what’s there
- Output dependencies, when two statements write to a location, so the result will depend on which one wrote to it last There’s a wikipedia page on data depencies: https://en.wikipedia.org/wiki/Data_dependency
Is There a Dependency?
Which of the following loops have data dependencies?
/* loop #1 */ for ( i=2; i<N; i=i+2 ) { a[i] = a[i] + a[i-1]; } /* loop #2 */ for ( i=1; i<N/2; i=i+1 ) { a[i] = a[i] + a[i+N/2]; } /* loop #3 */ for ( i=1; i<N/2+1; i=i+1 ) { a[i] = a[i] + a[i+N/2]; } /* loop #4 */ for ( i=1; i<N; i=i+1 ) { a[idx[i]] = a[idx[i]] + b[idx[i]]; }
Solution
Thread-safe functions
Another important concept is that of a ‘'’thread-safe function’’’. A thread-safe function is one which can be called simultaneously by multiple threads of execution. You need to be aware when writing multi-threaded programs whether functions you are using are thread-safe or not. A classic example of this is when generating pseudo-random numbers. Most random number generators maintain state between calls to them. If they aren’t written to be thread-safe, then that internal state can get mixed up by multiple threads calling them at the same time. For more, see https://en.wikipedia.org/wiki/Thread_safety.
Key Points
The PARALLEL FOR (or PARALLEL DO) pragma makes a loop execute in parallel
A single variable accessed by different threads can cause wrong results
The PRIVATE clause makes a copy of a variable for each thread
Numeric Integration - Calculating Areas
Overview
Teaching: 15 min
Exercises: 5 minQuestions
How can we calculate integrals?
Objectives
Learn about critical sections
In this section, we will use the problem of numeric integration, i.e. calculating areas under curves, to look at how to control access to global variables. As our example, let’s say we wanted to integrate the sine function from 0 to Pi. This is the same as the area under the first half of a sine curve. The single-threaded version is below.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char **argv) {
int steps = 1000;
double delta = M_PI/steps;
double total = 0.0;
int i;
for (i=0; i<steps; i++) {
total = total + sin(delta*i) * delta;
}
printf("The integral of sine from 0 to Pi is %.12f\n", total);
}
Compiling with math
In order to include the math functions, you need to link in the math library. In GCC, you would use the following:
gcc -o pi-serial pi.c -lm ./pi
The answer in this case should be 2. It will be off by a small amount because of the limits of computer representations of numbers.
Step size
How would you change the step size. What happens if you do?
Solution
To see what happens to the time this program takes, we’ll use a new tool. Since
we just want to see the total time, we can use the program time
.
Timing
You can use the time utility to get the amount of time it takes for a program to run.
$ time ./pi-serial Using 1000 steps The integral of sine from 0 to Pi is 1.999998355066 real 0m0.005s user 0m0.000s sys 0m0.002s
The
real
output is the useful one; this example took 0.005 seconds to run. Theuser
andsys
lines describe how much time was spent in “user” code and how much in “system” code, a distinction that doesn’t interest us today.
Parallelizing numerical integration
How would you parallelize this code to get it to run faster?
Obviously, we could add #pragma parallel for
. But do we make total
private, or not?
The data dependency on total
leads to what we call a race condition. Since we are
updating a global variable, there is a race between the various threads as to
who can read and then write the value of total
. Multiple threads could read
the current value, before a working thread can write the result of its addition. So these
reading threads essentially miss out on some additions to the total. This can be
handled by adding a critical section. A critical section only allows one thread
at a time to run some code block.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
int main(int argc, char **argv) {
int steps = 1000;
float delta = M_PI/steps;
float total = 0.0;
int i;
#pragma omp parallel for
for (i=0; i<steps; i++) {
#pragma omp critical
total = total + sin(delta*i) * delta;
}
printf("The integral of sine from 0 to Pi is %f\n", total);
}
The critical
pragma is a very general construct that lets you ensure a code
line is executed exclusively. However, making a sum is a very common operation
in computing so OpenMP provides a specific mechanism to handle this case:
Reduction variables. We’ll look at those in the next section.
Key Points
You can use a critical section to control access to a global variable
Searching through data
Overview
Teaching: 20 min
Exercises: 15 minQuestions
How to search in parallel
Objectives
Use general parallel sections
Have a single thread execute code
Use a reduction operator
So far, we have looked at parallelizing loops. OpenMP also allows you to use general parallel sections where code blocks are executed by all of the available threads. In these cases, it is up to you as a programmer to manage what work gets done by each thread. A basic example would look like the following.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv) {
int id, size;
#pragma omp parallel private(id,size)
{
size = omp_get_num_threads();
id = omp_get_thread_num();
printf("This is thread %d of %d\n", id, size);
}
}
Private variables
What happens if you forget the private keyword?
Using this as a starting point, we could use this code to have each available thread do something interesting. For example, we could write the text out to a series of individual files.
Single threaded function
There are times when you may need to drop out of a parallel section in order to have a single one of the threads executing some code. There is a keyword available, ‘single’, that allows the first thread to see it to execute some single code chunk.
Which thread runs first?
Modify the following code to find out which thread gets to run first in the parallel section. You should be able to do it by adding only one line. Here’s a (rather dense) reference guide in which you can look for ideas: Directives and Constructs for C/C++.
#include <stdio.h> #include <stdlib.h> #include <omp.h> int main(int argc, char **argv) { int id, size; #pragma omp parallel private(id,size) { size = omp_get_num_threads(); id = omp_get_thread_num(); printf("This thread %d of %d is first\n", id, size); } }
Solution
Searching
Let’s say that we need to search through an array to find the largest value. How could we do this type of search in parallel? Let’s begin with the serial version.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
int size = 10000;
float rand_nums[size];
int i;
float curr_max;
for (i=0; i<size; i++) {
rand_nums[i] = rand();
}
curr_max = 0.0;
for (i=0; i<size; i++) {
if (curr_max < rand_nums[i]) {
curr_max = rand_nums[i];
}
}
printf("Max value is %f\n", curr_max);
}
The first stab would be to make the for loop a parallel for loop. You would want to make sure that each thread had a private copy of the ‘curr_max’ variable, since it will be written to. But, how do you find out which thread has the largest value?
Reduction Operators
Key Points
Reduction operators handle the common case of summation, and analogous operations
OpenMP can manage general parallel sections
You can use ‘pragma omp single’ to have a single thread execute something
Calculating Fibonacci numbers
Overview
Teaching: 10 min
Exercises: 5 minQuestions
How to recurse in parallel
Objectives
Use general parallel tasks
The last technique we will look at is the task structure in OpenMP. In this
case, you can use the task
keyword within a parallel section to define sets of
tasks that get queued up to run within the threads of the parallel section. They
could run in any order. If you need to have results from the tasks before you
can continue, you can use the taskwait
keyword to tell OpenMP to pause your
program until the tasks are done. This would look like the following:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int fib(int n) {
int i, j;
if (n<2)
return n;
else {
#pragma omp task shared(i)
i=fib(n-1);
#pragma omp task shared(j)
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
int main(int argc, char **argv){
int n, result;
char *a = argv[1];
n = atoi(a);
#pragma omp parallel
{
#pragma omp single
result = fib(n);
}
printf("Result is %d\n", result);
}
In the main function, you need to start the parallel section so that all of the threads are launched. Since we only want the parent call to fib done once, we need to use the single keyword.
Task gotchas
There are a few gotchas to be aware of. While the intention is that tasks will run in parallel, there is nothing in the specification that guarantees this behavior. You may need to check how your particular environment works.
Tasks?
How do tasks work with different number of threads?
Key Points
OpenMP can manage general parallel tasks
Drawing the Mandelbrot set
Overview
Teaching: 20 min
Exercises: 100 minQuestions
How do we handle irregular tasks?
Objectives
Learn about the schedule() clause
Here’s today’s only example in Fortran. Sorry.
The Mandelbrot set was a hot subject of computer art in the 1980s. The algorithm is quite simple: For each point on the screen, do an iterative calculation and decide whether the calculation diverges or not. Color that spot on the screen according to how many iterations it took to diverge — or black if it didn’t diverge in 1000 iterations.
For simplicity this implementation just prints out a number instead of coloring a pixel.
program mandel
! Mandelbrot set generator, after
! Chandra et al, "Parallel Programming in OpenMP", and
! http://en.wikipedia.org/wiki/Mandelbrot_set#For_programmers
! Ross Dickson, ACEnet, 2010.06.24
integer,parameter:: maxiter=999, m=30, n=20
double precision,parameter:: xmin=-2.d0,xmax=1.d0
double precision,parameter:: ymin=-1.d0,ymax=1.d0
integer depth(n,m)
integer iter, j, k
integer starttime, endtime, tickspersec
double precision x, y
call system_clock(starttime)
!$omp parallel do private(..?...)
do j = 1, m
do k = 1, n
x = xmin + j*(xmax-xmin)/m
y = ymin + k*(ymax-ymin)/n
depth(k, j) = mandel_val(x, y, maxiter)
end do
end do
!$omp end parallel do
call system_clock(endtime,tickspersec)
do k = n, 1, -1
write(*,'(30I4)') depth(k,:)
end do
write(*,*) 'working time: ',real(endtime-starttime)/real(tickspersec),' sec'
end program
integer function mandel_val(x0, y0, maxiter)
double precision x0, y0
integer maxiter
double precision x, y, xtemp
integer iter
x = 0.d0
y = 0.d0
iter = 0
do while (( x*x + y*y <= 4.d0 ) .and. ( iter < maxiter ))
xtemp = x*x - y*y + x0
y = 2*x*y + y0
x = xtemp
iter = iter + 1
end do
mandel_val = iter
end function
First, compile and run the program without OpenMP like so:
gfortran mandel.f90 -o mandel-serial
./mandel-serial
Appreciate the retro ASCII art, and note how long it took to run. A millisecond is not enough to get good performance measurements on.
Next, increase the dimensions m,n
to 3000,2000
and recompile. You don’t
want to dump 6,000,000 numbers to your screen, so send the output to a file:
./mandel-serial >mandel-serial.out
.
Check the run time: tail -1 mandel-serial.out
Now comes the parallelization.
Exercise 1: Parallelize the code
There is a
parallel do
directive in the code, exactly analogous to theparallel for
in C. But theprivate()
clause is not complete. Decide what variable or variables should be made private, and then compile and test the code like so:gfortran -fopenmp mandel.f90 -o mandel-omp OMP_NUM_THREADS=2 ./mandel-omp >mandel-2.out
Try a few different values of
OMP_NUM_THREADS
. How does the performance scale?
The schedule() clause
OpenMP loop directives (parallel for, parallel do
) can take several other
clauses besides the private()
clause we’ve already seen. One is schedule()
,
which allows us to specify how loop iterations are divided up among the
threads.
The default is static scheduling, in which all iterations are allocated to threads before they execute any loop iterations. In dynamic scheduling, only some of the iterations are allocated to threads at the beginning of the loop’s execution. Threads that complete their iterations are then eligible to get additional work. The allocation process continues until all the iterations have been distributed to threads.
There’s a tradeoff between overhead (i.e., how much time is spent setting up the schedule) and load balancing (i.e., how much time is spent waiting for the most heavily-worked thread to catch up). Static scheduling has low overhead but may be badly balanced; dynamic scheduling has higher overhead. Both can also take a chunk size; larger chunks mean less overhead and greater memory locality, smaller chunks may mean finer load balancing. You can omit the chunk size, it defaults to 1.
Bad load balancing might be what’s causing this Mandelbrot code not to
parallelize very well. Let’s add a schedule()
clause and see what happens.
!$omp parallel do private(...?...) schedule(dynamic,?)
Exercise 2: Play with the schedule() clause
Try different
schedule()
clauses and tabulate the run times with different thread numbers. What seems to work best for this problem?Does it change much if you grow the problem size? That is, if you make
m,n
bigger?There’s a third option,
guided
, which starts with large chunks and gradually decreases the chunk size as it works through the iterations. Try it out too, if you like. Withschedule(guided,<chunk>)
, the chunk parameter is the smallest chunk size it will try.
Key Points
Different loop scheduling may compensate for unbalanced loop iterations