Linear algebra

Overview

Teaching: 20 min
Exercises: 30 min
Questions
  • How do I parallelize a loop?

Objectives
  • Use the PARALLEL FOR pragma (or PARALLEL DO)

  • Use the PRIVATE clause

  • Be able to identify data dependencies

One of the classic applications of programming is linear algebra, in all of its beauty and complexity. We will use a few simple problems to see how to execute loops in parallel using OpenMP.

Multiply an array by a constant

The simplest problem is applying some function to an array of numbers. An example is multiplying each value by some constant number. In serial you would use a for loop (in C; a DO loop in Fortran) to do this multiplication for each element, like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000000;
   int multiplier = 2;
   int a[size];
   int c[size];
   int i;
   float time_total;
   // Fill the input array so the loop works on defined values
   for (i = 0; i<size; i++) {
      a[i] = i;
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   for (i = 0; i<size; i++) {
      c[i] = multiplier * a[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Total time is %f ms\n", time_total/1000000);
}

We added some calls to clock_gettime() from the time.h header file to record the start and end times around the heavy work done by the for loop. Each call fills a timespec structure with the current time, split into whole seconds and nanoseconds, and a little arithmetic converts the elapsed time into milliseconds.

Time and Size

What happens to the run time of your program through multiple runs? What happens if you change the size and recompile and rerun?

How complicated is it to turn this program into a parallel program? Not complicated at all: We add two lines, #include <omp.h> near the top, and #pragma omp parallel for just before the for (...) statement:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000000;
   int multiplier = 2;
   int a[size];
   int c[size];
   int i;
   float time_total;
   // Fill the input array so the loop works on defined values
   for (i = 0; i<size; i++) {
      a[i] = i;
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   #pragma omp parallel for
   for (i = 0; i<size; i++) {
      c[i] = multiplier * a[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Time in loop is %f ms.  %d iterations.\n", time_total/1000000, size);
}

Compiling and running

Don’t forget to include the -fopenmp flag when compiling, and remember to set OMP_NUM_THREADS when running.
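
For example, with GCC (the file name array_multiply.c is just a placeholder for whatever you called the source file):

gcc -fopenmp array_multiply.c -o array_multiply
OMP_NUM_THREADS=4 ./array_multiply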

Using more threads

How many threads did you use? What happens to the runtime when you change the number of threads?

In this case, the iterations of the for loop are divided up among the available threads. To do this, however, OpenMP needs to know how many iterations the loop will execute, and that number must not change partway through the loop. To guarantee this, the loop has to follow some rules:

  • the number of iterations must be computable from the for statement before the loop starts;

  • the loop counter must not be modified inside the loop body;

  • the loop must not be exited early with break, return, or goto.
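
If you want to see how the iterations are shared out, a small sketch like the following (not part of the exercise) prints which thread handled each iteration, using omp_get_thread_num():

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   int i;
   // Each iteration reports the number of the thread that executed it
   #pragma omp parallel for
   for (i = 0; i < 16; i++) {
      printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
   }
}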

Summing the values in a matrix

Now let’s try adding up the elements of a matrix. Moving up to two dimensions adds a new layer of looping. The basic code looks like the following.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000;
   int a[size][size];
   int i, j;
   // Set the matrix values to 1
   for (i=0; i<size; i++) {
      for (j=0; j<size; j++) {
         a[i][j] = 1;
      }
   }
   int c[size];
   // Zero the accumulator 
   for (i=0; i<size; i++) {
      c[i] = 0;
   }
   int total = 0;
   float time_total;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   #pragma omp parallel for
   for (i = 0; i<size; i++) {
      for (j=0; j<size; j++) {
         c[i] = c[i] + a[i][j];
      }
   }
   for (i=0; i<size; i++) {
      total = total + c[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Total is %d, time is %f ms\n", total, time_total/1000000);
}

Is the result correct?

What should be the result of this code? Is that what it does? If not, what might be wrong? Does it do the same for different values of OMP_NUM_THREADS?

Solution

The elements all have value 1, and there are 1000*1000 of them, so the total should be 1,000,000. Why isn’t it?

Remember that OpenMP threads share memory. This means that every thread can see and access all of the memory for the process. In the code above, multiple threads are all reading and updating the same shared variable j at the same time, so the inner loops interfere with each other.

OpenMP provides a way to manage this correctly: the private() clause.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char **argv) {
   struct timespec ts_start, ts_end;
   int size = 1000;
   int a[size][size];
   int i, j;
   // Set the matrix values to 1
   for (i=0; i<size; i++) {
      for (j=0; j<size; j++) {
         a[i][j] = 1;
      }
   }
   int c[size];
   // Zero the accumulator
   for (i=0; i<size; i++) {
      c[i] = 0;
   }
   int total = 0;
   float time_total;
   clock_gettime(CLOCK_MONOTONIC, &ts_start);
   #pragma omp parallel for private(j)
   for (i=0; i<size; i++) {
      for (j=0; j<size; j++) {
         c[i] = c[i] + a[i][j];
      }
   }
   for (i=0; i<size; i++) {
      total = total + c[i];
   }
   clock_gettime(CLOCK_MONOTONIC, &ts_end);
   time_total = (ts_end.tv_sec - ts_start.tv_sec)*1000000000 + (ts_end.tv_nsec - ts_start.tv_nsec);
   printf("Total is %d, time is %f ms\n", total, time_total/1000000);
}

This ensures that every thread has its own private copy of j to use for the inner for loop.
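
If your compiler accepts C99-style declarations, an alternative is to declare the loop counters inside the for statements themselves. Variables declared inside the parallel region are automatically private to each thread, so no private() clause is needed. A sketch of the same program written that way:

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
   int size = 1000;
   int a[size][size];
   int c[size];
   int total = 0;
   // Set the matrix values to 1 and zero the accumulator
   for (int i = 0; i < size; i++) {
      c[i] = 0;
      for (int j = 0; j < size; j++) {
         a[i][j] = 1;
      }
   }
   // i and j are declared inside the loops, so each thread automatically
   // gets its own copies and no private() clause is needed
   #pragma omp parallel for
   for (int i = 0; i < size; i++) {
      for (int j = 0; j < size; j++) {
         c[i] = c[i] + a[i][j];
      }
   }
   for (int i = 0; i < size; i++) {
      total = total + c[i];
   }
   printf("Total is %d\n", total);
}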

Data Dependencies

Definition

If two statements read or write the same memory location, and at least one of the statements writes that memory location, then there is a data dependency on that memory location between the two statements.

In turning a serial program into a parallel program, it's vital to preserve the correctness of the serial program. (Assuming the serial program is correct in the first place, of course, but that's a separate question.) Data dependency is the key concept we use to guarantee that we're transforming a serial program into an equivalent parallel program, equivalent in terms of its output, that is.

If there is a data dependency between two statements, then the order in which those statements are executed may affect the output of the program, and hence its correctness.

Consider this loop that computes a cumulative sum:

for ( i=2; i<N; i=i+1 ) {
    a[i] = a[i] + a[i-1];
}

The iteration with i==2, for example, reads locations a[1] and a[2] and writes location a[2]. The iteration with i==3 then reads locations a[2] and a[3]. Since a[2] is read by both iterations and written by one of them, there is a data dependency on a[2] between the assignment statements executed in successive loop iterations. So the "statements" in the definition don't need to be distinct lines of code; they can be the same statement executed at different times.

Now think about what it would mean to run this loop in parallel: Different iterations would be carried out by different threads of execution, possibly at the same time. If any two iterations didn’t happen in the order dictated by the serial code, the results would be wrong.

There are three types of data dependencies:

  • Flow (or "true") dependency: one statement writes a location that a later statement reads.

  • Anti-dependency: one statement reads a location that a later statement writes.

  • Output dependency: two statements both write the same location, so the final value depends on which one runs last.
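
A minimal sketch of the three kinds, using plain assignments (the variable names are just for illustration):

#include <stdio.h>

int main(int argc, char **argv) {
   int a = 1, b = 2, x, y;

   // Flow (true) dependency: the second statement reads x, which the first writes
   x = a + 1;
   y = x * 2;

   // Anti-dependency: the first statement reads b, which the second then overwrites
   x = b + 1;
   b = 7;

   // Output dependency: both statements write y, so the final value depends on their order
   y = a;
   y = b;

   printf("x=%d y=%d b=%d\n", x, y, b);
}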

Is There a Dependency?

Which of the following loops have data dependencies?

/* loop #1 */
for ( i=2; i<N; i=i+2 ) {
    a[i] = a[i] + a[i-1]; }

/* loop #2 */
for ( i=1; i<N/2; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #3 */
for ( i=0; i<N/2+1; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #4 */
for ( i=1; i<N; i=i+1 ) {
    a[idx[i]] = a[idx[i]] + b[idx[i]]; }

Solution

Loop #1 does not: it updates only even-indexed elements, using odd-indexed elements (a[i-1]) that are never written.

Loop #2 does not: each a[i] is read and written only by its own iteration, and the elements a[i+N/2] that are read are never written.

Loop #3 does: for example, the iteration with i==0 reads a[N/2], while the iteration with i==N/2 writes it.

Loop #4 might or might not, depending on the contents of array idx. If any two entries of idx are the same, then there’s a dependency.

Thread-safe functions

Another important concept is that of a thread-safe function. A thread-safe function is one that can safely be called simultaneously by multiple threads of execution. When writing multi-threaded programs, you need to be aware of whether the functions you are using are thread-safe. A classic example is generating pseudo-random numbers: most random number generators maintain internal state between calls, and if they aren't written to be thread-safe, that state can get corrupted when multiple threads call them at the same time. For more, see https://en.wikipedia.org/wiki/Thread_safety.
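
For example, rand() keeps hidden internal state between calls, so it may not be safe to call from several threads at once, while the POSIX rand_r() takes its state as an argument, so each thread can keep its own seed. A minimal sketch (the seed values are arbitrary):

#define _POSIX_C_SOURCE 200112L  // expose the POSIX rand_r() declaration
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
   #pragma omp parallel
   {
      // Each thread owns its seed, so concurrent calls to rand_r() don't interfere
      unsigned int seed = 12345 + omp_get_thread_num();
      int r = rand_r(&seed);
      printf("Thread %d drew %d\n", omp_get_thread_num(), r);
   }
}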

Key Points

  • The PARALLEL FOR (or PARALLEL DO) pragma makes a loop execute in parallel

  • A single variable accessed by different threads can cause wrong results

  • The PRIVATE clause makes a copy of a variable for each thread