ACENET Summer School - OpenMP

Introduction

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is shared-memory programming?

  • What is OpenMP?

Objectives
  • Understand the shared-memory programming model

  • Know that OpenMP is a standard

Parallel programs come in two broad flavors: shared-memory and message-passing. In this workshop, we will be looking at shared-memory programming, with a focus on OpenMP programming.

What is shared-memory programming? In any parallel program, the general idea is to have multiple threads of execution so that you can break up your problem and have each thread handle one part. These multiple threads need to be able to communicate with each other as your program runs. In a shared-memory program, this communication happens through the use of global variables stored in the global memory of the local machine. This means that communication between the various threads is extremely fast, as it happens at the speed of RAM access. But your program will be limited to a single physical machine, since all threads need to be able to see the same RAM.

OpenMP is one way of writing shared-memory parallel programs. OpenMP is actually a specification, which has been implemented by many interested groups.

OpenMP specifications

The standard describes extensions to a C/C++ or FORTRAN compiler. This means that you need to use a compiler that supports OpenMP. Unfortunately, there have been several versions of the OpenMP specification, and several sections of it are optional, so it is up to the programmer to investigate the compiler they want to use and see if it supports the parts of OpenMP that they wish to use. Luckily, the vast majority of OpenMP behaves the way you expect it to with most modern compilers. When possible, we will try to highlight any odd behaviors.

Since OpenMP is meant to be used with either C/C++ or FORTRAN, you will need to know how to work with at least one of these languages. This workshop will use C as the language for the examples. As a reminder, a simple hello world program in C would look like the following.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
   printf("Hello World\n");
}

In order to compile this code, you would need to use the following command:

gcc -o hello_world hello_world.c

This gives you an executable file that will print out the text “Hello World”. You can run it with the command:

./hello_world
Hello World

GCC on Compute Canada

The default environment on the general purpose clusters (Beluga, Cedar, Graham) includes a gcc compiler. Not all systems are set up this way, though. On Niagara, for instance, the default environment is very minimal and you must load a module explicitly to access any compiler:

$ module load gcc
$ gcc --version
gcc (GCC) 7.3.0

Key Points

  • OpenMP programs are limited to a single physical machine

  • You use OpenMP with C, C++, or Fortran


Hello World

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How do you compile and run an OpenMP program?

  • What are OpenMP pragmas?

Objectives
  • Compiling and running an OpenMP program

Since OpenMP is an extension to the compiler, you need to be able to tell the compiler when and where to add the code necessary to create and use threads for the parallel sections. This is handled through special statements called pragmas. A compiler that doesn’t understand OpenMP simply ignores them, treating them like comments. The basic forms are:

C/C++

#pragma omp ...

FORTRAN

!$OMP ...

Hello World

How do we add in parallelism to the basic hello world program? The very first pragma that we will look at is the parallel pragma.

#include <stdio.h>
#include <stdlib.h>

#include <omp.h>

int main(int argc, char **argv) {
   #pragma omp parallel
   printf("Hello World\n");
}

To compile it, you’ll need to add an extra flag to tell the compiler to treat the source code as an OpenMP program.

gcc -fopenmp -o hello hello.c

If you prefer Intel compilers to GCC, use:

icc -qopenmp -o hello hello.c

When you run this program, you should see the output “Hello World” multiple times. But how many?

The standard says this is implementation dependent. The usual default, however, is that OpenMP will look at the machine it is running on, see how many cores there are, and launch one thread per core. You can control the number of threads with the environment variable OMP_NUM_THREADS. If you want only 3 threads, do the following:

export OMP_NUM_THREADS=3
./hello
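
You can also set the number of threads from inside the program. The following is a minimal sketch (not part of the lesson code) using the num_threads() clause and the omp_set_num_threads() library function, both of which override OMP_NUM_THREADS:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   /* Request exactly 3 threads for this one parallel region */
   #pragma omp parallel num_threads(3)
   printf("Hello from a team of 3 threads\n");

   /* Change the default for any parallel regions that follow */
   omp_set_num_threads(2);
   #pragma omp parallel
   printf("Hello from a team of 2 threads\n");
}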

Using multiple cores

Try running the hello world program with different numbers of threads. Can you use more threads than there are cores on the machine?

OpenMP with Slurm

When you wish to submit an OpenMP job to the job scheduler Slurm, you can use the following boilerplate.

#SBATCH --account=acenet-wa
#SBATCH --reservation=acenet-wr_cpu
#SBATCH --time=0:15:0
#SBATCH --cpus-per-task=3
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./hello

You could also ask for an interactive session with multiple cores like so:

salloc --account=acenet-wa --reservation=acenet-wr_cpu --cpus-per-task=4 --mem=12G --time=3:0:0

Identifying threads

How can you tell which thread is doing what? The OpenMP specification includes a number of functions that are made available through the included header file “omp.h”. One of them is “omp_get_thread_num()”, which returns the ID of the thread running the code.

#include <stdio.h>
#include <stdlib.h>

#include <omp.h>

int main(int argc, char **argv) {
   #pragma omp parallel
   {
      int id = omp_get_thread_num();
      printf("Hello World from thread %d\n", id);
   }
}

Here, each thread tags its output with its unique ID, a number between 0 and the number of threads minus 1.

Pragmas and code blocks

An OpenMP pragma applies to the following code block in C or C++. A code block is either a single statement, or a series of statements wrapped in curly brackets. Because Fortran doesn’t have an analogous construction, many OpenMP directives in Fortran are paired with an “end” directive, such as !$omp end parallel.
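
For example, here is a small C program (a sketch, not part of the lesson code) showing the difference between a pragma that applies to a single statement and one that applies to a braced block:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   /* Without braces, the pragma applies only to the next statement */
   #pragma omp parallel
   printf("Every thread prints this line\n");

   /* With braces, the pragma applies to the whole block */
   #pragma omp parallel
   {
      int id = omp_get_thread_num();
      printf("Thread %d prints this line\n", id);
   }
}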

Thread ordering

What order do the threads write out their messages in? Try running the program a few times. What’s going on?

Solution

You should find that the messages are emitted in random order. This is an important rule of not only OpenMP programming, but parallel programming in general: Order of execution of parallel elements is not guaranteed.

Conditional compilation

We said earlier that you should be able to use the same code for both OpenMP and serial work. Try compiling the code without the -fopenmp flag. What happens? Can you figure out how to fix it?

Hint: The compiler defines a preprocessor variable _OPENMP

Solution


#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char **argv) {
   #pragma omp parallel
   {
#ifdef _OPENMP
      int id = omp_get_thread_num();
#else
      int id = 0;
#endif
      printf("Hello World from thread %d\n", id);
   }
}

Execution model

It’s a good idea to have a mental model of what happens when OpenMP does something in parallel. When the program begins there is only one thread active, called the master thread. When the code reaches a point where parallel operations are required, the master thread creates (or activates) additional threads. We call this “forking”. The master and the created threads work concurrently for a while. Then at the end of the parallel code the created threads die or are suspended, and the flow of control returns to the master thread. We call this “joining”.

[Figure: the OpenMP fork-join execution model]

OpenMP also divides the memory into two types: Global (or shared) memory, and thread-local memory. Every thread can read and write the global memory, but each thread also gets a little slice of memory that can only be read or written by that thread.
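
As a minimal sketch (assuming the usual default data-sharing rules), a variable declared before a parallel region lives in shared memory and is visible to every thread, while a variable declared inside the region is thread-local:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   int shared_value = 42;                /* in shared memory: every thread can read it */
   #pragma omp parallel
   {
      int my_id = omp_get_thread_num();  /* declared inside the region: each thread gets its own copy */
      printf("Thread %d sees shared_value = %d\n", my_id, shared_value);
   }
}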

Key Points

  • Pragmas are directives to the compiler to parallelize something

  • The number of threads is typically controlled with an environment variable, OMP_NUM_THREADS

  • Order of execution of parallel elements is not guaranteed.

  • If the compiler doesn’t recognize OpenMP pragmas, it will compile a single-threaded program. But you may need to guard calls to OpenMP library functions with #ifdef _OPENMP.


Linear algebra

Overview

Teaching: 20 min
Exercises: 30 min
Questions
  • How do I parallelize a loop?

Objectives
  • Use the PARALLEL FOR pragma (or PARALLEL DO)

  • Use the PRIVATE clause

  • Be able to identify data dependencies

One of the classic applications of programming is linear algebra, in all of its beauty and complexity. We will use a few simple problems to see how to execute loops in parallel using OpenMP.

Multiply an array by a constant

The simplest problem is applying some function to an array of numbers. An example is multiplying each value by some constant number. In serial you would use a for loop (in C; a DO loop in Fortran) to do this multiplication for each element, like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000000;
   int multiplier = 2;
   int a[size];
   int c[size];
   int i;
   float time_total;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   for (i = 0; i<size; i++) {
      c[i] = multiplier * a[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Total time is %f ms\n", time_total/1000000);
}

We added some calls to ‘clock_gettime()’ from the ‘time.h’ header file to get the start and end times of the heavy work being done by the for loop. In this case, we get a count of how many seconds and how many nanoseconds elapsed, given in two parts of the time structure. We did some math to get the elapsed time in milliseconds.
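
If you find yourself repeating that arithmetic, you could wrap it in a small helper. This is just an illustrative sketch; the helper name elapsed_ms() is made up for this example:

#include <stdio.h>
#include <time.h>

/* Combine the seconds and nanoseconds fields into milliseconds */
static double elapsed_ms(struct timespec start, struct timespec end) {
   double ns = (end.tv_sec - start.tv_sec)*1.0e9 + (end.tv_nsec - start.tv_nsec);
   return ns / 1.0e6;
}

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   double x = 0.0;
   int i;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   for (i = 0; i < 10000000; i++) {
      x = x + 0.5;                        /* some work to time */
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   printf("x = %f, loop took %f ms\n", x, elapsed_ms(ts_start, ts_end));
}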

Time and Size

What happens to the run time of your program through multiple runs? What happens if you change the size and recompile and rerun?

How complicated is it to turn this program into a parallel program? Not complicated at all: We add two lines, #include <omp.h> near the top, and #pragma omp parallel for just before the for (...) statement:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000000;
   int multiplier = 2;
   int a[size];
   int c[size];
   int i;
   float time_total;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   #pragma omp parallel for
   for (i = 0; i<size; i++) {
      c[i] = multiplier * a[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Time in loop is %f ms.  %d iterations.\n", time_total/1000000, size);
}

Compiling and running

Don’t forget to include the -fopenmp flag when compiling, and remember to set OMP_NUM_THREADS when running.

Using more threads

How many threads did you use? What happens to the runtime when you change the number of threads?

In this case, the iterations of the for loop get divided across the available threads. In order to do this, however, OpenMP needs to know how many iterations there are in the for loop, and that count can’t change part way through. To guarantee this, there are some rules:

  • the loop index must be an integer, and the loop must be in a simple form: the index is initialized once, compared against a bound that doesn’t change during the loop, and incremented by an amount that doesn’t change during the loop

  • the loop body must not change the loop index, and must not exit the loop early with break, goto, or return
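
As an illustrative sketch (not part of the lesson code), the first loop below satisfies those rules, while the commented-out loop would not be accepted because its iteration count cannot be known in advance:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   int size = 1000;
   int a[1000], c[1000];
   int multiplier = 2;
   int i;
   for (i = 0; i < size; i++) {
      a[i] = i;
   }
   /* Acceptable: integer index, fixed bounds, fixed increment */
   #pragma omp parallel for
   for (i = 0; i < size; i++) {
      c[i] = multiplier * a[i];
   }
   /* Not acceptable as a parallel for, because of the early exit:
      for (i = 0; i < size; i++) {
         if (c[i] > 100) break;
      }
   */
   printf("c[%d] = %d\n", size-1, c[size-1]);
}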

Summing the values in a matrix

Now let’s try adding up the elements of a matrix. Moving up to two dimensions adds a new layer of looping. The basic code looks like the following.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000;
   int a[size][size];
   int i, j;
   // Set the matrix values to 1
   for (i=0; i<size; i++) {
      for (j=0; j<size; j++) {
         a[i][j] = 1;
      }
   }
   int c[size];
   // Zero the accumulator 
   for (i=0; i<size; i++) {
      c[i] = 0;
   }
   int total = 0;
   float time_total;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   #pragma omp parallel for
   for (i = 0; i<size; i++) {
      for (j=0; j<size; j++) {
         c[i] = c[i] + a[i][j];
      }
   }
   for (i=0; i<size; i++) {
      total = total + c[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Total is %d, time is %f ms\n", total, time_total/1000000);
}

Is the result correct?

What should be the result of this code? Is that what it does? If not, what might be wrong? Does it do the same for different values of OMP_NUM_THREADS?

Solution

The elements all have value 1, and there are 1000*1000 of them, so the total should be 1,000,000. Why isn’t it?

Remember that OpenMP threads share memory. This means that every thread can see and access all of the memory of the process. In the code above, multiple threads are all using the same shared variable j at the same time, so their inner loops interfere with each other.

OpenMP includes a method to manage this correctly with the addition of a clause, private().

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000;
   int a[size][size];
   int i, j;
   // Set the matrix values to 1
   for (i=0; i<size; i++) {
      for (j=0; j<size; j++) {
         a[i][j] = 1;
      }
   }
   int c[size];
   // Zero the accumulator
   for (i=0; i<size; i++) {
      c[i] = 0;
   }
   int total = 0;
   float time_total;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   #pragma omp parallel for private(j)
   for (i=0; i<size; i++) {
      for (j=0; j<size; j++) {
         c[i] = c[i] + a[i][j];
      }
   }
   for (i=0; i<size; i++) {
      total = total + c[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Total is %d, time is %f ms\n", total, time_total/1000000);
}

This makes sure that every thread has its own private copy of j to be used for the inner for loop.
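
As an alternative to private(j), in C99 and later you can simply declare the inner loop variable inside the loop: anything declared inside the parallel region is automatically private to each thread. A sketch of that approach (not part of the lesson code):

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   int size = 1000;
   static int a[1000][1000];   /* static so the large array is not on the stack */
   int c[1000];
   int total = 0;
   int i;
   for (i = 0; i < size; i++) {
      c[i] = 0;
      for (int j = 0; j < size; j++) {
         a[i][j] = 1;
      }
   }
   #pragma omp parallel for
   for (i = 0; i < size; i++) {
      /* j is declared inside the loop, so each thread gets its own copy */
      for (int j = 0; j < size; j++) {
         c[i] = c[i] + a[i][j];
      }
   }
   for (i = 0; i < size; i++) {
      total = total + c[i];
   }
   printf("Total is %d\n", total);
}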

Data Dependencies

Definition

If two statements read or write the same memory location, and at least one of the statements writes that memory location, then there is a data dependency on that memory location between the two statements.

In turning a serial program into a parallel program, it’s vital to maintain the correctness of the serial program. (Assuming the serial program is correct, of course, but that’s a separate question.) Data dependency is an important concept we use to guarantee that we’re transforming a serial program into a parallel program that is equivalent in terms of its output.

If there is a data dependency between two statements, then the order in which those statements are executed may affect the output of the program, and hence its correctness.

Consider this loop that computes a cumulative sum:

for ( i=2; i<N; i=i+1 ) {
    a[i] = a[i] + a[i-1];
}

The iteration with i==2, for example, reads locations a[1] and a[2] and writes location a[2]. The iteration with i==3 then reads locations a[2] and a[3]. Since a[2] was read by both iterations and written by one, there is a data dependency on a[2] between the assignment statements in successive loop iterations. So the “statements” in the definition don’t need to be separate lines of code.

Now think about what it would mean to run this loop in parallel: Different iterations would be carried out by different threads of execution, possibly at the same time. If any two iterations didn’t happen in the order dictated by the serial code, the results would be wrong.

There are three types of data dependencies:

  • flow (or “true”) dependency: a statement reads a value that an earlier statement wrote (read after write)

  • anti-dependency: a statement writes a location that an earlier statement reads (write after read)

  • output dependency: two statements write the same location (write after write)
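
A small sketch of all three kinds (the loops and variable names here are only for illustration):

#include <stdio.h>

int main(int argc, char **argv) {
   int a[10];
   int last = 0;
   int i;
   for (i = 0; i < 10; i++) {
      a[i] = i;
   }
   /* Flow (true) dependency: iteration i reads a[i-1], which iteration i-1 wrote */
   for (i = 1; i < 10; i++) {
      a[i] = a[i] + a[i-1];
   }
   /* Anti-dependency: iteration i reads a[i+1], which iteration i+1 later overwrites */
   for (i = 0; i < 9; i++) {
      a[i] = a[i] + a[i+1];
   }
   /* Output dependency: every iteration writes the same variable 'last';
      in the serial program only the final write survives */
   for (i = 0; i < 10; i++) {
      last = a[i];
   }
   printf("a[9] = %d, last = %d\n", a[9], last);
}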

Is There a Dependency?

Which of the following loops have data dependencies?

/* loop #1 */
for ( i=2; i<N; i=i+2 ) {
    a[i] = a[i] + a[i-1]; }

/* loop #2 */
for ( i=0; i<N/2; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #3 */
for ( i=0; i<N/2+1; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #4 */
for ( i=1; i<N; i=i+1 ) {
    a[idx[i]] = a[idx[i]] + b[idx[i]]; }

Solution

Loop #1 does not.

Loop #2 does not.

Loop #3 does: the iteration with i==0 reads a[N/2], and the iteration with i==N/2 writes it.

Loop #4 might or might not, depending on the contents of array idx. If any two entries of idx are the same, then there’s a dependency.

Thread-safe functions

Another important concept is that of a thread-safe function. A thread-safe function is one which can be called simultaneously by multiple threads of execution. You need to be aware when writing multi-threaded programs whether functions you are using are thread-safe or not. A classic example of this is when generating pseudo-random numbers. Most random number generators maintain state between calls to them. If they aren’t written to be thread-safe, then that internal state can get mixed up by multiple threads calling them at the same time. For more, see https://en.wikipedia.org/wiki/Thread_safety.
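
For example, the ordinary rand() function keeps hidden internal state, so it is generally not thread-safe. One common workaround on POSIX systems is rand_r(), which keeps its state in a seed variable you supply, so each thread can own its own seed. A minimal sketch:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   #pragma omp parallel
   {
      int id = omp_get_thread_num();
      unsigned int seed = 42 + id;   /* each thread owns its own generator state */
      int r = rand_r(&seed);
      printf("Thread %d drew %d\n", id, r);
   }
}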

Key Points

  • The PARALLEL FOR (or PARALLEL DO) pragma makes a loop execute in parallel

  • A single variable accessed by different threads can cause wrong results

  • The PRIVATE clause makes a copy of a variable for each thread


Numeric Integration - Calculating Areas

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How can we calculate integrals?

Objectives
  • Learn about critical sections

In this section, we will use the problem of numeric integration, i.e. calculating areas under curves, to look at how to control access to global variables. As our example, let’s say we wanted to integrate the sine function from 0 to Pi. This is the same as the area under the first half of a sine curve. The single-threaded version is below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
   int steps = 1000;
   double delta = M_PI/steps;
   double total = 0.0;
   int i;
   for (i=0; i<steps; i++) {
      total = total + sin(delta*i) * delta;
   }
   printf("The integral of sine from 0 to Pi is %.12f\n", total);
}

Compiling with math

In order to include the math functions, you need to link in the math library. In GCC, you would use the following:

gcc -o pi-serial pi.c -lm
./pi-serial

The answer in this case should be 2. It will be off by a small amount because of the limits of computer representations of numbers.

Step size

How would you change the step size? What happens if you do?

Solution

You can decrease the step size by increasing the steps variable. We normally expect this to increase the accuracy of the result. Does it? Is there a noticeable effect on the run time?

To see what happens to the time this program takes, we’ll use a new tool. Since we just want to see the total time, we can use the program time.

Timing

You can use the time utility to get the amount of time it takes for a program to run.

$ time ./pi-serial
Using 1000 steps
The integral of sine from 0 to Pi is 1.999998355066

real    0m0.005s
user    0m0.000s
sys     0m0.002s

The real output is the useful one; this example took 0.005 seconds to run. The user and sys lines describe how much time was spent in “user” code and how much in “system” code, a distinction that doesn’t interest us today.

Parallelizing numerical integration

How would you parallelize this code to get it to run faster?

Obviously, we could add #pragma omp parallel for. But do we make total private, or not?

The data dependency on total leads to what we call a race condition. Since all the threads are updating the same shared variable, there is a race between them as to who can read and then write the value of total. A thread can read the current value of total before another thread has written back the result of its addition, so some additions effectively get lost. This can be handled by adding a critical section. A critical section only allows one thread at a time to run some code block.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
   int steps = 1000;
   float delta = M_PI/steps;
   float total = 0.0;
   int i;
   #pragma omp parallel for
   for (i=0; i<steps; i++) {
      #pragma omp critical
      total = total + sin(delta*i) * delta;
   }
   printf("The integral of sine from 0 to Pi is %f\n", total);
}

The critical pragma is a very general construct that lets you ensure that a block of code is executed by only one thread at a time. However, computing a sum is a very common operation, so OpenMP provides a specific mechanism to handle this case: reduction variables. We’ll look at those more closely in the next section.
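
As a preview, here is a sketch of the same integral using a reduction instead of a critical section: each thread accumulates its own partial sum, and the partial sums are combined at the end of the loop. Compile it with both -fopenmp and -lm.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
   int steps = 1000;
   double delta = M_PI/steps;
   double total = 0.0;
   int i;
   /* reduction(+:total) gives each thread a private partial sum */
   #pragma omp parallel for reduction(+:total)
   for (i=0; i<steps; i++) {
      total = total + sin(delta*i) * delta;
   }
   printf("The integral of sine from 0 to Pi is %.12f\n", total);
}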

Key Points

  • You can use a critical section to control access to a global variable


Searching through data

Overview

Teaching: 20 min
Exercises: 15 min
Questions
  • How do you search in parallel?

Objectives
  • Use general parallel sections

  • Have a single thread execute code

  • Use a reduction operator

So far, we have looked at parallelizing loops. OpenMP also allows you to use general parallel sections where code blocks are executed by all of the available threads. In these cases, it is up to you as a programmer to manage what work gets done by each thread. A basic example would look like the following.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id, size;
   #pragma omp parallel private(id,size)
   {
      size = omp_get_num_threads();
      id = omp_get_thread_num();
      printf("This is thread %d of %d\n", id, size);
   }
}

Private variables

What happens if you forget the private keyword?

Using this as a starting point, we could have each available thread do something interesting. For example, we could have each thread write its output to a separate file, as sketched below.
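
A minimal sketch (the file names output-0.txt, output-1.txt, … are made up for this example):

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   #pragma omp parallel
   {
      int id = omp_get_thread_num();
      char filename[64];
      snprintf(filename, sizeof(filename), "output-%d.txt", id);   /* one file per thread */
      FILE *f = fopen(filename, "w");
      if (f != NULL) {
         fprintf(f, "Hello from thread %d\n", id);
         fclose(f);
      }
   }
}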

Single threaded function

There are times when you may need to drop out of a parallel section in order to have a single one of the threads executing some code. The ‘single’ pragma does exactly that: the block it applies to is executed only by the first thread that reaches it, and the other threads skip it.

Which thread runs first?

Modify the following code to find out which thread gets to run first in the parallel section. You should be able to do it by adding only one line. Here’s a (rather dense) reference guide in which you can look for ideas: Directives and Constructs for C/C++.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id, size;
   #pragma omp parallel private(id,size)
   {
      size = omp_get_num_threads();
      id = omp_get_thread_num();
      printf("This thread %d of %d is first\n", id, size);
   }
}

Solution

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   int id, size;
   #pragma omp parallel private(id,size)
   {
      size = omp_get_num_threads();
      id = omp_get_thread_num();
      #pragma omp single
      printf("This thread %d of %d is first\n", id, size);
   }
}

Searching

Let’s say that we need to search through an array to find the largest value. How could we do this type of search in parallel? Let’s begin with the serial version.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
   int size = 10000;
   float rand_nums[size];
   int i;
   float curr_max;
   for (i=0; i<size; i++) {
      rand_nums[i] = rand();
   }
   curr_max = 0.0;
   for (i=0; i<size; i++) {
      if (curr_max < rand_nums[i]) {
         curr_max = rand_nums[i];
      }
   }
   printf("Max value is %f\n", curr_max);
}

The first stab would be to make the for loop a parallel for loop. You would want to make sure that each thread had a private copy of the ‘curr_max’ variable, since it will be written to. But, how do you find out which thread has the largest value?

Reduction Operators

You could create an array of curr_max values, one per thread, but getting that to work right would be messy. How would you adapt it to different numbers of threads?

The keys here are 1) to recognize the analogy with the problem of total from the last episode, and 2) to know about reduction variables.

A reduction variable is used to accumulate some value over parallel threads, like a sum, a global maximum, a global minimum, etc. The reduction operators that you can use are: +, *, -, &, |, ^, &&, ||, max, min.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
   int size = 10000;
   float rand_nums[size];
   int i;
   float curr_max;
   for (i=0; i<size; i++) {
      rand_nums[i] = rand();
   }
   curr_max = 0.0;
   #pragma omp parallel for reduction(max:curr_max)
   for (i=0; i<size; i++) {
      if (curr_max < rand_nums[i]) {
         curr_max = rand_nums[i];
      }
   }
   printf("Max value is %f\n", curr_max);
}

Key Points

  • Reduction operators handle the common case of summation, and analogous operations

  • OpenMP can manage general parallel sections

  • You can use ‘pragma omp single’ to have a single thread execute something


Calculating Fibonacci numbers

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How do you recurse in parallel?

Objectives
  • Use general parallel tasks

The last technique we will look at is the task structure in OpenMP. In this case, you can use the task keyword within a parallel section to define sets of tasks that get queued up to run within the threads of the parallel section. They could run in any order. If you need to have results from the tasks before you can continue, you can use the taskwait keyword to tell OpenMP to pause your program until the tasks are done. This would look like the following:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int fib(int n) {
   int i, j;
   if (n<2)
      return n;
   else {
      #pragma omp task shared(i)
      i=fib(n-1);
      #pragma omp task shared(j)
      j=fib(n-2);
      #pragma omp taskwait
      return i+j;
   }
}

int main(int argc, char **argv){
   int n, result;
   char *a = argv[1];
   n = atoi(a);
   
   #pragma omp parallel
   {
      #pragma omp single
      result = fib(n);
   }
   printf("Result is %d\n", result);
}

In the main function, you need to start the parallel section so that all of the threads are launched. Since we only want the parent call to fib done once, we need to use the single keyword.

Task gotchas

There is a gotcha to be aware of: while the intention is that tasks will run in parallel, there is nothing in the specification that guarantees this behavior. You may need to check how your particular environment works.

Tasks?

How do tasks work with different numbers of threads?

Key Points

  • OpenMP can manage general parallel tasks


Drawing the Mandelbrot set

Overview

Teaching: 20 min
Exercises: 100 min
Questions
  • How do we handle irregular tasks?

Objectives
  • Learn about the schedule() clause

Here’s today’s only example in Fortran. Sorry.

The Mandelbrot set was a hot subject of computer art in the 1980s. The algorithm is quite simple: For each point on the screen, do an iterative calculation and decide whether the calculation diverges or not. Color that spot on the screen according to how many iterations it took to diverge — or black if it didn’t diverge in 1000 iterations.

For simplicity this implementation just prints out a number instead of coloring a pixel.

program mandel

! Mandelbrot set generator, after
!   Chandra et al, "Parallel Programming in OpenMP", and
!   http://en.wikipedia.org/wiki/Mandelbrot_set#For_programmers
! Ross Dickson, ACEnet, 2010.06.24

integer,parameter:: maxiter=999, m=30, n=20
double precision,parameter:: xmin=-2.d0,xmax=1.d0
double precision,parameter:: ymin=-1.d0,ymax=1.d0
integer depth(n,m)
integer iter, j, k
integer starttime, endtime, tickspersec
double precision x, y

call system_clock(starttime)
!$omp parallel do private(..?...) 
do j = 1, m
    do k = 1, n
        x = xmin + j*(xmax-xmin)/m
        y = ymin + k*(ymax-ymin)/n
        depth(k, j) = mandel_val(x, y, maxiter)
    end do
end do
!$omp end parallel do
call system_clock(endtime,tickspersec)

do k = n, 1, -1
    write(*,'(30I4)') depth(k,:)
end do
write(*,*) 'working time: ',real(endtime-starttime)/real(tickspersec),' sec'

end program

integer function mandel_val(x0, y0, maxiter)
double precision x0, y0
integer maxiter
double precision x, y, xtemp
integer iter

x = 0.d0
y = 0.d0
iter = 0
do while (( x*x + y*y <= 4.d0 ) .and. ( iter < maxiter ))
   xtemp = x*x - y*y + x0
   y = 2*x*y + y0
   x = xtemp
   iter = iter + 1
end do
mandel_val = iter
end function

First, compile and run the program without OpenMP like so:

 gfortran mandel.f90 -o mandel-serial 
 ./mandel-serial

Appreciate the retro ASCII art, and note how long it took to run. A run time of a millisecond or so is too short to make good performance measurements.

Next, increase the dimensions m,n to 3000,2000 and recompile. You don’t want to dump 6,000,000 numbers to your screen, so send the output to a file: ./mandel-serial >mandel-serial.out.
Check the run time: tail -1 mandel-serial.out

Now comes the parallelization.

Exercise 1: Parallelize the code

There is a parallel do directive in the code, exactly analogous to the parallel for in C. But the private() clause is not complete. Decide what variable or variables should be made private, and then compile and test the code like so:

gfortran -fopenmp mandel.f90 -o mandel-omp
OMP_NUM_THREADS=2 ./mandel-omp >mandel-2.out

Try a few different values of OMP_NUM_THREADS. How does the performance scale?

The schedule() clause

OpenMP loop directives (parallel for, parallel do) can take several other clauses besides the private() clause we’ve already seen. One is schedule(), which allows us to specify how loop iterations are divided up among the threads.

The default is static scheduling, in which all iterations are allocated to threads before they execute any loop iterations. In dynamic scheduling, only some of the iterations are allocated to threads at the beginning of the loop’s execution. Threads that complete their iterations are then eligible to get additional work. The allocation process continues until all the iterations have been distributed to threads.

There’s a tradeoff between overhead (i.e., how much time is spent setting up the schedule) and load balancing (i.e., how much time is spent waiting for the most heavily-worked thread to catch up). Static scheduling has low overhead but may be badly balanced; dynamic scheduling has higher overhead. Both can also take a chunk size; larger chunks mean less overhead and greater memory locality, while smaller chunks may mean finer load balancing. If you omit the chunk size for dynamic scheduling, it defaults to 1.
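
For reference, in C a schedule clause looks like the sketch below. The loop here is made up for illustration: each iteration does a different amount of work, which is exactly the situation where dynamic scheduling helps.

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   long sums[16];
   int i;
   /* Iterations do very different amounts of work, so hand them out
      one at a time as threads become free */
   #pragma omp parallel for schedule(dynamic,1)
   for (i = 0; i < 16; i++) {
      long s = 0;
      for (long j = 0; j < (long)i * 10000000; j++) {
         s = s + j % 7;
      }
      sums[i] = s;
   }
   printf("sums[15] = %ld\n", sums[15]);
}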

Bad load balancing might be what’s causing this Mandelbrot code not to parallelize very well. Let’s add a schedule() clause and see what happens.

!$omp parallel do private(...?...) schedule(dynamic,?)

Exercise 2: Play with the schedule() clause

Try different schedule() clauses and tabulate the run times with different thread numbers. What seems to work best for this problem?

Does it change much if you grow the problem size? That is, if you make m,n bigger?

There’s a third option, guided, which starts with large chunks and gradually decreases the chunk size as it works through the iterations. Try it out too, if you like. With schedule(guided,<chunk>), the chunk parameter is the smallest chunk size it will try.

Key Points

  • Different loop scheduling may compensate for unbalanced loop iterations