Race Conditions with OpenMP
Overview
Teaching: 25 min
Exercises: 5 min
Questions
How can we calculate integrals in parallel?
How do we compile a program that uses math functions with gcc?
Objectives
Understand what a race condition is
Learn how to avoid race conditions
Learn how to use reduction variables
- A data race occurs when two threads access the same memory location without proper synchronization.
- A data race can cause a program to produce incorrect results when run in parallel.
Parallel Numerical Integration
Here is an example that integrates the sine function from 0 to $\pi$.
The serial version is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char **argv) {
    int steps = 1e7;             /* Number of rectangles */
    double delta = M_PI/steps;   /* Width of each rectangle */
    double total = 0.0;          /* Accumulator */
    int i;

    printf("Using %.0e steps\n", (float)steps);
    for (i=0; i<steps; i++) {
        total = total + sin(delta*i) * delta;
    }
    printf("The integral of sine from 0 to Pi is %.12f\n", total);
}
Make a copy of the file integrate_sin_serial.c to work with as you parallelize it. You can call it integrate_sin_omp.c.
Compiling Code with Mathematical Functions
To compile a C program that uses the math library with GCC, we need to link to it explicitly:
gcc -o integrate integrate_sin_omp.c -lm
Let’s run the program.
./integrate
Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000
- Use the time utility to get the execution time:
srun time -p ./integrate
Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000
real 0.94
user 0.94
sys 0.00
The real line is the useful one: this example took 0.94 seconds to run.
The user and sys lines describe how much time was spent in the “user” code and how much in the “system” code, a distinction that doesn’t interest us today.
Parallelizing Numerical Integration
Let’s parallelize the main loop and execute the code.
...
printf("Using %.0e steps\n", (float)steps);
#pragma omp parallel for
...
integrate_sin_omp.c
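For clarity, here is a sketch of what the parallelized loop looks like in full (the rest of integrate_sin_omp.c is unchanged from the serial version):

/* Naively parallelized loop: all threads update the shared variable
   'total' without any synchronization, which is the data race
   discussed below. */
#pragma omp parallel for
for (i=0; i<steps; i++) {
    total = total + sin(delta*i) * delta;
}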
- The result is incorrect because the data dependency on total leads to a race condition when the program is run in parallel.
How to Avoid Data Race Conditions?
- Data races can be avoided by synchronizing threads so that variables are accessed in the correct order.
The omp critical Directive
Race conditions can be avoided by adding a critical directive.
- A critical directive allows only one thread at a time to run a given block of code.
Add the omp critical directive to the statement in the main loop and rerun the code.
...
printf("Using %.0e steps\n", (float)steps);
#pragma omp parallel for
for (i=0; i<steps; i++) {
#pragma omp critical
...
integrate_sin_omp.c
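For reference, a sketch of the loop with the critical directive in place (the rest of the program is unchanged):

#pragma omp parallel for
for (i=0; i<steps; i++) {
    /* Only one thread at a time may execute the next statement */
    #pragma omp critical
    total = total + sin(delta*i) * delta;
}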
gcc -o integrate integrate_sin_omp.c -lm -fopenmp
srun -c1 time -p ./integrate
srun -c2 time -p ./integrate
- Using critical in this case is a poor choice because it serializes the whole loop.
The omp atomic Directive
- The omp atomic directive is similar to critical, but a thread performing an atomic operation does not block atomic operations on other memory locations, and the overhead is lower.
- The downsides:
- it can be used only to control a single statement
- the set of operations that atomic supports is restricted
- there are serialization costs associated with both omp critical and omp atomic.
Examples demonstrating how to use atomic:
#pragma omp atomic update
k += n*mass; /* update k atomically */
#pragma omp atomic read
tmp = c; /* read c atomically */
#pragma omp atomic write
count = n*m; /* write count atomically */
#pragma omp atomic capture
{   /* capture original value of v in d, then atomically update v */
    d = v;
    v += n;
}
The atomic clauses: update (default), write, read, capture
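Applied to our integration loop, the accumulation could be protected with an atomic update instead of a critical section. A minimal sketch (rest of the program unchanged):

#pragma omp parallel for
for (i=0; i<steps; i++) {
    /* Atomically add this rectangle's area to the shared accumulator */
    #pragma omp atomic update
    total += sin(delta*i) * delta;
}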
Parallel Performance
- Replace the critical directive with the atomic directive, recompile the code, and run it on more than one thread. Try different numbers of threads (1-8) and record the execution time.
- How does the execution time compare to the code using critical?
Solution
Version                       Threads   Time
serial                                  0.94
parallelized using critical   1         1.11
parallelized using critical   2         1.32
parallelized using critical   4         1.71
parallelized using atomic     1         1.03
parallelized using atomic     2         0.69
parallelized using atomic     4         0.72
The Optimal Way to Parallelize Integration
The best option is to let each thread work on its own chunk of data and then combine the partial sums from all threads. The reduction clause lets you specify thread-private variables that are subject to a reduction operation at the end of the parallel region.
- the reduction operation for summation is "+"
- at the end of the parallel construct, the values of all private copies of the shared variable will be added together
#pragma omp parallel for reduction(+: total)
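Putting it together, a sketch of what the complete integrate_sin_omp.c might look like with the reduction clause (compile with both -fopenmp and -lm):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
    int steps = 1e7;             /* Number of rectangles */
    double delta = M_PI/steps;   /* Width of each rectangle */
    double total = 0.0;          /* Accumulator */
    int i;

    printf("Using %.0e steps\n", (float)steps);
    /* Each thread sums into its own private copy of 'total';
       the copies are added together at the end of the parallel region. */
    #pragma omp parallel for reduction(+: total)
    for (i=0; i<steps; i++) {
        total = total + sin(delta*i) * delta;
    }
    printf("The integral of sine from 0 to Pi is %.12f\n", total);
}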
Recompile the code and run it. Now we get the right answer, and a ~3.7x speedup with 4 threads!
- Other reduction operators:
- subtraction
- multiplication
- minimum, maximum
- logical operators.
Finding the Maximum Value in an Array
Imagine that you are trying to find the largest value in an array. Is there a way to do this type of search in parallel? Begin with the serial version:
/* --- File array_max_serial.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <math.h>

int main(int argc, char **argv) {
    int size = 1e8;
    int *rand_nums;
    int i;
    int curr_max;
    time_t t;

    rand_nums = malloc(size*sizeof(int));

    /* Initialize random number generator */
    srand((unsigned) time(&t));

    /* Initialize array with random values */
    for (i=0; i<size; i++) {
        rand_nums[i] = rand();
    }

    /* Find maximum */
    curr_max = 0;
    for (i=0; i<size; i++) {
        curr_max = fmax(curr_max, rand_nums[i]);
    }

    printf("Max value is %d\n", curr_max);
}
Solution
This problem is analogous to the summation.
- Because the curr_max variable will be written to, each thread should have its own copy.
- Once all threads have found the maximum value in their share of data, you need to determine which thread has the largest value.
...
curr_max = 0;
#pragma omp parallel for reduction(max:curr_max)
...
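A sketch of the complete parallelized search (the rest of the program is unchanged from array_max_serial.c):

/* Each thread keeps a private curr_max for its share of the data;
   the largest of the private copies is selected at the end. */
curr_max = 0;
#pragma omp parallel for reduction(max:curr_max)
for (i=0; i<size; i++) {
    curr_max = fmax(curr_max, rand_nums[i]);
}
printf("Max value is %d\n", curr_max);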
Key Points
Race conditions can be avoided by using the omp critical or the omp atomic directives
The best option to parallelize summation is to use the reduction clause