Race Conditions with OpenMP
Overview
Teaching: 25 min
Exercises: 5 min
Questions
How can we calculate integrals in parallel?
How do we compile a program that uses math functions with gcc?
Objectives
Understand what a race condition is
Learn how to avoid race conditions
Learn how to use reduction variables
- A data race occurs when two threads access the same memory location without proper synchronization.
- A data race can cause a program to produce incorrect results when run in parallel.
Parallel Numerical Integration
Here is an example that integrates the sine function from 0 to $\pi$.
The serial version is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char **argv) {
    int steps = 1e7;             /* Number of rectangles */
    double delta = M_PI/steps;   /* Width of each rectangle */
    double total = 0.0;          /* Accumulator */
    int i;

    printf("Using %.0e steps\n", (float)steps);
    for (i=0; i<steps; i++) {
        total = total + sin(delta*i) * delta;
    }
    printf("The integral of sine from 0 to Pi is %.12f\n", total);
}
Make a copy of the file integrate_sin_serial.c to work with as you parallelize it. You can call it integrate_sin_omp.c.
Compiling Code with Mathematical Functions
To compile a C program that uses the math library with GCC, we need to link to it explicitly:
gcc -o integrate integrate_sin_omp.c -lm
Let’s run the program.
./integrate
Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000
- Use the time utility to get the execution time:
srun time -p ./integrate
Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000
real 0.94
user 0.94
sys 0.00
The real line is the useful one: this example took 0.94 seconds to run.
The user and sys lines describe how much time was spent in the “user” code and how much in the “system” code, a distinction that doesn’t interest us today.
Parallelizing Numerical Integration
Let’s parallelize the main loop and execute the code.
...
printf("Using %.0e steps\n", (float)steps);
#pragma omp parallel for
...
integrate_sin_omp.c
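For clarity, here is a sketch of what the parallelized loop looks like in full (the rest of integrate_sin_omp.c is unchanged from the serial version):

/* Naively parallelized loop: all threads update the shared variable
   'total' without any synchronization, which is the data race
   discussed below. */
#pragma omp parallel for
for (i=0; i<steps; i++) {
    total = total + sin(delta*i) * delta;
}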
- The result is incorrect because the data dependency on total leads to a race condition when the program is run in parallel.
How to Avoid Data Race Conditions?
- Data races can be avoided by synchronizing threads so that variables are accessed in the correct order.
The omp critical Directive
Race conditions can be avoided by adding a critical directive.
- A critical directive allows only one thread at a time to run a given block of code.
Add the omp critical directive to the statement in the main loop and rerun the code.
...
printf("Using %.0e steps\n", (float)steps);
#pragma omp parallel for
for (i=0; i<steps; i++) {
#pragma omp critical
...
integrate_sin_omp.c
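For reference, a sketch of the loop with the critical directive in place (the rest of the program is unchanged):

#pragma omp parallel for
for (i=0; i<steps; i++) {
    /* Only one thread at a time may execute the next statement */
    #pragma omp critical
    total = total + sin(delta*i) * delta;
}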
gcc -o integrate integrate_sin_omp.c -lm -fopenmp
srun -c1 time -p ./integrate
srun -c2 time -p ./integrate
- Using critical in this case is a poor choice because it serializes the whole loop.
The omp atomic Directive
- The omp atomic directive is similar to critical, but a thread performing an atomic operation does not block atomic operations on other memory locations, and the overhead is lower.
- The downsides:
- it can be used only to control a single statement
- the set of operations that atomic supports is restricted
- there are serialization costs associated with both omp critical and omp atomic.
Examples demonstrating how to use atomic:
#pragma omp atomic update
k += n*mass; /* update k atomically */
#pragma omp atomic read
tmp = c; /* read c atomically */
#pragma omp atomic write
count = n*m; /* write count atomically */
#pragma omp atomic capture
{   /* capture original value of v in d, then atomically update v */
    d = v;
    v += n;
}
The atomic clauses: update (default), write, read, capture
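Applied to our integration loop, the accumulation could be protected with an atomic update instead of a critical section. A minimal sketch (rest of the program unchanged):

#pragma omp parallel for
for (i=0; i<steps; i++) {
    /* Atomically add this rectangle's area to the shared accumulator */
    #pragma omp atomic update
    total += sin(delta*i) * delta;
}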
Parallel Performance
- Replace the critical directive with the atomic directive, recompile the code, and run it on more than one thread. Try different numbers of threads (1-8) and record the execution time.
- How does the execution time compare to the code using critical?
Solution
Version                       Threads   Time
serial                                  0.94
parallelized using critical   1         1.11
parallelized using critical   2         1.32
parallelized using critical   4         1.71
parallelized using atomic     1         1.03
parallelized using atomic     2         0.69
parallelized using atomic     4         0.72
The Optimal Way to Parallelize Integration
The best option is to let each thread work on its own chunk of data and then combine the partial sums from all threads. The reduction clause lets you specify thread-private variables that are subject to a reduction operation at the end of the parallel region.
- the reduction operation for summation is "+"
- at the end of the parallel construct, the values of all private copies of the shared variable will be added together
#pragma omp parallel for reduction(+: total)
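Putting it together, a sketch of what the complete integrate_sin_omp.c might look like with the reduction clause (compile with both -fopenmp and -lm):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
    int steps = 1e7;             /* Number of rectangles */
    double delta = M_PI/steps;   /* Width of each rectangle */
    double total = 0.0;          /* Accumulator */
    int i;

    printf("Using %.0e steps\n", (float)steps);
    /* Each thread sums into its own private copy of 'total';
       the copies are added together at the end of the parallel region. */
    #pragma omp parallel for reduction(+: total)
    for (i=0; i<steps; i++) {
        total = total + sin(delta*i) * delta;
    }
    printf("The integral of sine from 0 to Pi is %.12f\n", total);
}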
Recompile the code and run it. Now we get the right answer, and a ~3.7x speedup with 4 threads!
- Other reduction operators:
- subtraction
- multiplication
- minimum, maximum
- logical operators.
Finding the Maximum Value in an Array
Imagine that you are trying to find the largest value in an array. Is there a way to do this type of search in parallel? Begin with the serial version:
/* --- File array_max_serial.c --- */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <math.h>

int main(int argc, char **argv) {
    int size = 1e8;
    int *rand_nums;
    int i;
    int curr_max;
    time_t t;

    rand_nums = malloc(size*sizeof(int));

    /* Initialize random number generator */
    srand((unsigned) time(&t));

    /* Initialize array with random values */
    for (i=0; i<size; i++) {
        rand_nums[i] = rand();
    }

    /* Find maximum */
    curr_max = 0;
    for (i=0; i<size; i++) {
        curr_max = fmax(curr_max, rand_nums[i]);
    }

    printf("Max value is %d\n", curr_max);
}
Solution
This problem is analogous to the summation.
- Because the curr_max variable will be written to, each thread should have its own copy.
- Once all threads have found the maximum value in their share of data, you need to determine which thread has the largest value.
...
curr_max = 0;
#pragma omp parallel for reduction(max:curr_max)
...
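A sketch of the complete parallelized search (the rest of the program is unchanged from array_max_serial.c):

/* Each thread keeps a private curr_max for its share of the data;
   the largest of the private copies is selected at the end. */
curr_max = 0;
#pragma omp parallel for reduction(max:curr_max)
for (i=0; i<size; i++) {
    curr_max = fmax(curr_max, rand_nums[i]);
}
printf("Max value is %d\n", curr_max);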
Key Points
Race conditions can be avoided by using the omp critical or the omp atomic directives
The best option to parallelize summation is to use the reduction clause