Parallel Operations with Arrays
Overview
Teaching: 20 min
Exercises: 20 min
Questions
How do I parallelize a loop?
How do I ensure that the parallel code gives correct results?
Objectives
- Learn to use the parallel for pragma
- Learn to use the private clause
- Be able to identify data dependencies
- Operations on large arrays can be time-consuming on a single CPU core
- Large-scale linear algebra problems can be solved efficiently with OpenMP
C arrays - static and dynamic
Declaring static arrays:
float A[2048][2048];
Disadvantages:
- Allocated when the program starts
- Occupies memory until the program ends
- Allocated on the stack, and there may be a limit on the size of the stack allocation
- The stack memory limit can be checked with the command:
ulimit -s
- To modify the stack memory limit, use the same ulimit command:
ulimit -s unlimited
- On compute nodes the stack is unlimited
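In contrast, dynamic arrays are allocated on the heap at run time with malloc and released with free, so they occupy memory only for as long as they are needed. Below is a minimal sketch; the 2048 x 2048 size and float type are chosen only for illustration:
/* Dynamic (heap) allocation of an n x n matrix -- illustrative sketch */
#include <stdlib.h>

int main(int argc, char **argv)
{
  int i, n = 2048;
  float **A;

  A = malloc(n * sizeof(float *));        /* one pointer per row */
  for (i = 0; i < n; i++)
    A[i] = malloc(n * sizeof(float));     /* allocate each row on the heap */

  /* ... use A[i][j] here ... */

  for (i = 0; i < n; i++)                 /* release the memory explicitly */
    free(A[i]);
  free(A);
  return 0;
}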
Testing static and dynamic memory allocation with a simple program:
/* --- File test_stack.c --- */
#include <stdio.h>
#include <stdlib.h>

#define N_BYTES 10000000 // 10 MB

void test_heap(int N) {
  char *B;
  B = malloc(N * sizeof(char)); // Allocating on heap
  printf("Allocation of %i bytes on heap succeeded\n", N);
  free(B);
}

void test_stack(int N) {
  char A[N]; // Allocating on stack
  printf("Allocation of %i on stack succeeded\n", N);
}

int main(int argc, char **argv) {
  test_heap(N_BYTES);
  test_stack(N_BYTES);
}
Multiplying an Array by a Constant
- Serial code loops over all array elements as follows:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
double start, end;
int size = 5e8;
int multiplier = 2;
int *A, *C;
int i;
/* Allocate memory for arrays */
A = malloc(size * sizeof(int));
C = malloc(size * sizeof(int));
start = omp_get_wtime();
/* Multiply array A by the multiplier */
for (i = 0; i < size; i++)
{
C[i] = multiplier * A[i];
}
end = omp_get_wtime();
printf("Total time is %f s\n", end-start);
}
- The omp_get_wtime() function is used to determine the start and end times for the loop.
Compiling and Running a Serial Version
Compile the program.
gcc array_multiply_serial.c -o array_multiply_serial -fopenmp
Run it on the cluster.
srun --mem=5000 array_multiply_serial
Execution time and array size
Run the program several times and observe execution time.
- How does the program perform over multiple runs?
Change the size of the array, recompile and rerun the program. Record execution time.
- How did the size of the array affect execution time?
Creating a Parallel Version
- We need to tell the compiler that we want to parallelize the main for loop.
...
/* Multiply array a by multiplier */
#pragma omp parallel for /* <--- OpenMP parallel for loop --- */
for (i = 0; i < size; i++)
...
In this example, the iterations of the loop are distributed among the threads.
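If you are curious how the work is divided, a small sketch like the following (not part of the lesson code) prints which thread executes which iteration using omp_get_thread_num(); with GCC's default schedule each thread typically receives a contiguous chunk of iterations:
/* Illustrative sketch: show which thread handles which iteration */
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int i;
#pragma omp parallel for
  for (i = 0; i < 8; i++)
    printf("Iteration %d is handled by thread %d\n", i, omp_get_thread_num());
  return 0;
}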
Compiling and Running a Parallel Version
Compile the program.
gcc array_multiply_omp.c -o array_multiply_omp -fopenmp
Run it on the cluster with 4 threads.
srun -c4 --mem=5000 array_multiply_omp
Using Threads
Run the program with different number of threads and observe how the runtime changes.
- What happens to the runtime when you change the number of threads?
For a loop to be parallelized, it must meet certain requirements:
- An iteration count must be defined before the loop starts
- After the loop begins, the number of iterations cannot change
- The loop cannot contain break statements or calls to the exit() function
Parallelizing while loops is not possible, since their number of iterations is unknown in advance.
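For example, a search loop that exits early cannot be parallelized as written, because its iteration count is not fixed (a, N and target are hypothetical names used only for illustration):
/* Cannot be parallelized as written: break makes the iteration count unknown */
int found = -1;
for (i = 0; i < N; i++) {
  if (a[i] == target) {
    found = i;
    break;
  }
}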
Calculating the Sum of Matrix Elements
- Here is a basic example code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
double start, end;
int size = 1e4;
int **A, *C;
int i, j;
/* Allocate memory */
C = malloc(size * sizeof(int));
A = (int **)malloc(size * sizeof(int *));
for (i = 0; i < size; i++)
A[i] = malloc(size * sizeof(int));
/* Set all matrix elements to 1 */
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
A[i][j] = 1;
/* Zero the accumulator */
for (i = 0; i < size; i++)
C[i] = 0;
start = omp_get_wtime();
#pragma omp parallel for
/* Each thread sums one row of the matrix */
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
C[i] += A[i][j];
int total = 0;
/* Add the row sums together */
for (i = 0; i < size; i++)
total += C[i];
end = omp_get_wtime();
printf("Total is %d, time is %f s\n", total, end-start);
}
Are the matrix elements added up correctly?
- What result should this code produce?
- Is the code producing the expected result?
- What might be the problem? How can this be fixed?
Solution
- The elements all have value 1, and there are 1e4*1e4 of them, so the total should be 1e8.
- The computed sum of the matrix elements is incorrect.
- OpenMP threads share memory. This means that every thread has read and write access to all of the process memory. In this example, multiple threads read and modify the same loop variable j. When one thread resets j, other threads may repeat or skip iterations, so some elements are counted more than once or not at all. For the code to work correctly, each thread must have its own copy of the variable j.
The omp private clause
- Private variables are hidden from other threads.
#pragma omp parallel for private(j)
- In every thread, a private variable is created locally.
- When entering a parallel region, private copies of a variable are created from the original object.
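Applied to the matrix sum above, the corrected loop could look like this:
#pragma omp parallel for private(j)
  /* Each thread sums its own rows; j is now private to every thread */
  for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
      C[i] += A[i][j];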
Variables can also be made shared or private implicitly:
- Variables declared outside of a parallel region are implicitly shared
- Variables declared inside are implicitly private.
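For example, declaring the inner loop variable inside the loop itself makes it implicitly private, which is an alternative to the private(j) clause (a sketch using the same matrix sum):
#pragma omp parallel for
  for (i = 0; i < size; i++)
    for (int j = 0; j < size; j++)  /* j declared inside the loop: implicitly private */
      C[i] += A[i][j];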
Be Cautious About Data Dependencies
- If there is a data dependency between two statements, then the order in which those statements are executed may affect the output of the program, and hence its correctness.
Consider this loop that computes a cumulative sum:
for ( i = 2; i < N; i = i + 1 ) {
a[i] = a[i] + a[i-1];
}
The result will be incorrect if any two iterations are not performed in the right order
Types of Data Dependencies:
- FLOW (READ after WRITE), like the last example, when one statement uses the results of another.
- ANTI (WRITE after READ), when one statement should write to a location only after another has read what's there.
- OUTPUT (WRITE after WRITE), when two statements write to a location, so the result will depend on which one wrote to it last.
- INPUT (READ after READ). Both statements read a variable. Since neither statement writes, they can run in either order.
See the Wikipedia page on data dependencies for more details.
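To make the ANTI and OUTPUT types more concrete, here are two minimal illustrative loops (a, x and N are hypothetical names):
/* ANTI (write after read): iteration i reads a[i+1], which iteration i+1 overwrites */
for (i = 0; i < N-1; i++)
    a[i] = a[i] + a[i+1];

/* OUTPUT (write after write): every iteration writes to x, so the final
   value of x depends on which iteration runs last */
for (i = 0; i < N; i++)
    x = a[i];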
Identifying data dependencies
What loops have data dependencies?
/* loop #1 */
for ( i=2; i<N; i=i+2 ) {
    a[i] = a[i] + a[i-1];
}

/* loop #2 */
for ( i=1; i<N/2; i=i+1 ) {
    a[i] = a[i] + a[i+N/2];
}

/* loop #3 */
for ( i=0; i<N/2+1; i=i+1 ) {
    a[i] = a[i] + a[i+N/2];
}

/* loop #4 */
for ( i=1; i<N; i=i+1 ) {
    a[idx[i]] = a[idx[i]] + b[idx[i]];
}
Solution
Loop #1 does not. The increment of this loop is 2, so in the first step we compute a[2]=a[2]+a[1]. In the next step we compute a[4]=a[4]+a[3] … etc.
Loop #2 does not. In this example we are summing array elements separated by half the array size. Each iteration writes only to its own element a[i] in the lower half of the array, while the elements a[i+N/2] it reads from the upper half are never written.
Loop #3 does. The last iteration (i=N/2) writes to a[N/2], which is read by the first iteration (i=0):
i=N/2: a[N/2] = a[N/2] + a[N]
i=0:   a[0]   = a[0]   + a[N/2]
Loop #4 might or might not, depending on the contents of the array idx. If any two entries of idx are the same, then there’s a dependency.
- Take a look at your code for data dependencies and eliminate them where you can.
- One approach, sketched below, is to write the results into a new array and then update the old array from the new one after each iteration.
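For example, the ANTI dependency in the following loop disappears if the results are written into a second array (anew is a hypothetical array of the same size) and copied back after the loop:
/* Original loop: ANTI dependency, each iteration reads a[i+1],
   which a later iteration overwrites */
for (i = 0; i < N-1; i++)
    a[i] = a[i] + a[i+1];

/* Dependency removed: read only from a, write only to the new array anew */
#pragma omp parallel for
for (i = 0; i < N-1; i++)
    anew[i] = a[i] + a[i+1];
/* ...then copy anew back into a after the loop */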
Thread-safe Functions
Consider this code:
#pragma omp parallel for
for (i = 0; i < N; i++)
    y[i] = func(x[i]);
Will this parallel code give the same results as the serial version?
- It depends on what the function does.
Thread-safe functions are those that can be called concurrently by multiple threads and still produce correct results.
Consider the following function for calculating distance between two points in 2D:
float funct(float *p1, float *p2)
{
float dx, dy;
dx = p2[0]-p1[0];
dy = p2[1]-p1[1];
return(sqrt(dx*dx + dy*dy));
}
- This function is thread-safe because dx and dy are local variables: each thread that calls the function gets its own private copies on its own stack.
Take a look at a modified version of the code in which dx and dy are defined externally:
float dx, dy;
float funct(float *p1, float *p2)
{
dx = p2[0]-p1[0];
dy = p2[1]-p1[1];
return(sqrt(dx*dx + dy*dy));
}
- This function is not thread-safe because dx and dy are global and can be modified by any thread.
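If such global variables cannot simply be made local, one possible workaround is OpenMP's threadprivate directive, which gives every thread its own copy of them. This is only a sketch; using local variables as in the first version remains the cleaner fix:
float dx, dy;
#pragma omp threadprivate(dx, dy)  /* each thread gets its own copy of dx and dy */

float funct(float *p1, float *p2)
{
    dx = p2[0]-p1[0];
    dy = p2[1]-p1[1];
    return(sqrt(dx*dx + dy*dy));
}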
It is important to be aware of whether the functions you use are thread-safe.
For more info, see Thread safety.
Optimizing Performance
CPU Cache and Data Locality
Let’s do a quick experiment. Compile our matrix_multiply_omp.c code with GCC at optimization level -O1:
gcc matrix_multiply_omp.c -fopenmp -O1
Run on 4 CPUs and record timing. On our training cluster you will get something like:
srun --mem-per-cpu=2000 -c4 ./a.out
Total is 100000000, time is 0.014977 s
Then swap the i and j indexes in the array access inside the main for loop:
...
#pragma omp parallel for
for (i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        C[i] += A[j][i]; /* <-- Swapped indexes --- */
...
Recompile the program and run it again.
Total is 100000000, time is 0.391304 s
What is going on?
- CPU cache helps to accelerate computations by eliminating memory bottlenecks.
- To take advantage of CPU cache, an array we are looping through must be stored sequentially as a contiguous memory block.
If the inner loop iterates through the elements of a row, then a chunk of that row is loaded into cache upon the first access:

   1    2 ... 1000  ...   -->  cache: 1    2 ... 1000
1001 1002 ... 2000  ...   -->  cache: 1001 1002 ... 2000
2001 2002 ... 3000  ...
...
This will not happen if the inner loop iterates through the elements of a column, because a column is not a contiguous memory block:

   1    2 ... 1000  ...   -->  access order: 1, 1001, 2001, ..., 2, 1002, 2002, ...
1001 1002 ... 2000  ...
2001 2002 ... 3000  ...
...
In this case, the CPU needs to go to main memory to load each matrix element.
Avoiding Parallel Overhead at Low Iteration Counts
- Use conditional parallelization so that the loop runs in parallel only when the iteration count is large enough to outweigh the overhead:
#pragma omp parallel for if(N > 1000)
for(i = 0; i < N; i++){
a[i] = k*b[i] + c[i];
}
Minimizing Parallelization Overhead
If the inner loop is parallelized, a new parallel region is created in each iteration of the outer loop. This causes parallelization overhead.
- It is better to create the team of threads once and give each thread a longer task, for example by parallelizing the outer loop, as sketched below.
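Here is a schematic sketch of that idea (B is a hypothetical output matrix; only the placement of the pragma differs):
/* Parallelizing the inner loop creates a parallel region 'size' times */
for (i = 0; i < size; i++) {
#pragma omp parallel for
    for (int j = 0; j < size; j++)
        B[i][j] = 2 * A[i][j];
}

/* Parallelizing the outer loop creates the parallel region only once */
#pragma omp parallel for
for (i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        B[i][j] = 2 * A[i][j];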
Key Points
- With the parallel for pragma, a loop is executed in parallel
- When a variable is accessed and modified by different threads, incorrect results can be produced
- The private clause creates a separate copy of a variable for each thread