
Parallel Operations with Arrays

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • How do I parallelize a loop?

  • How do I ensure that the parallel code gives correct results?

Objectives
  • Learn to use the parallel for pragma

  • Learn to use the private clause

  • Be able to identify data dependencies



C arrays - static and dynamic

Declaring static arrays:

int A[2048][2048];

Disadvantages:

  • Allocated when a program starts
  • Occupies memory until a program ends.
  • Allocated on the stack. The stack size may be limited.
  • The stack memory limit can be checked with the command:
    ulimit -s
    
  • To change the stack memory limit, use the same command:
    ulimit -s unlimited
    
  • On compute nodes the stack is unlimited

Testing static and dynamic memory allocation with a simple program:

/* --- File test_stack.c --- */
#include <stdio.h>
#include <stdlib.h>

#define N_BYTES 10000000 // 10 MB

void test_heap(int N) {
 char *B;
 B = malloc(N*sizeof(char)); // Allocating on heap
 if (B == NULL) {
  printf("Allocation of %i bytes on heap failed\n", N);
  return;
 }
 printf("Allocation of %i bytes on heap succeeded\n", N);
 free(B);
}

void test_stack(int N) {
 char A[N]; // Allocating on stack
 A[N-1] = 0; // Touch the array so the allocation really happens
 printf("Allocation of %i bytes on stack succeeded\n", N);
}

int main(int argc, char **argv) {
 test_heap(N_BYTES);
 test_stack(N_BYTES);
 return 0;
}

Multiplying an Array by a Constant

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
        double start, end;
        int size = 5e8;
        int multiplier = 2;
        int *A, *C;
        int i;

        /* Allocate memory for arrays.
           A is left uninitialized; only the timing matters here. */
        A = malloc(size * sizeof(int));
        C = malloc(size * sizeof(int));

        start = omp_get_wtime();
        /* Multiply array A by multiplier */
        for (i = 0; i < size; i++)
        {
                C[i] = multiplier * A[i];
        }
        end = omp_get_wtime();
        printf("Total time is %f s\n", end - start);

        free(A);
        free(C);
        return 0;
}

array_multiply_serial.c

Compiling and Running a Serial Version

Compile the program.

gcc array_multiply_serial.c -o array_multiply_serial -fopenmp

Run it on the cluster.

srun --mem=5000 array_multiply_serial

Execution time and array size

Run the program several times and observe execution time.

  • How does the program perform over multiple runs?

Change the size of the array, recompile and rerun the program. Record execution time.

  • How did the size of the array affect execution time?

Creating a Parallel Version

...
        /* Multiply array A by multiplier */
        #pragma omp parallel for /* <--- OpenMP parallel for loop --- */
        for (i = 0; i < size; i++)
...

Iterations of the loop are distributed between the threads in this example.

Compiling and Running a Parallel Version

Compile the program.

gcc array_multiply_omp.c -o array_multiply_omp -fopenmp

Run it on the cluster with 4 threads.

srun -c4 --mem=5000 array_multiply_omp 

Using Threads

Run the program with different numbers of threads and observe how the runtime changes.

  • What happens to the runtime when you change the number of threads?

For a loop to be parallelized, it must meet certain requirements:

  • The number of iterations must be known before the loop starts.
  • The iterations must not depend on each other.

Parallelizing while loops is not possible because the number of iterations is unknown when the loop starts.
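When the trip count is actually known in advance, the loop can often be rewritten as a for loop so that OpenMP can divide the iterations; a hypothetical sketch (the array names a, b and the scale factor are illustrative):

```c
#include <omp.h>

/* while version -- cannot carry the parallel for pragma:
 *     i = 0;
 *     while (i < N) { a[i] = 2 * b[i]; i++; }
 * Equivalent for version -- the trip count N is known at entry: */
void scale(int *a, const int *b, int N)
{
        int i;
        #pragma omp parallel for
        for (i = 0; i < N; i++)
                a[i] = 2 * b[i];
}
```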

Calculating the Sum of Matrix Elements

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv)
{
    double start, end;
    int size = 1e4;
    int **A, *C;
    int i, j;

    /* Allocate memory */
    C = malloc(size * sizeof(int));
    A = (int **)malloc(size * sizeof(int *));
    for (i = 0; i < size; i++)
        A[i] = malloc(size * sizeof(int));
    /* Set all matrix elements to 1 */
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
            A[i][j] = 1;
    /* Zero the accumulator */
    for (i = 0; i < size; i++)
        C[i] = 0;

    start = omp_get_wtime();
    #pragma omp parallel for
    /* Each thread sums one column */
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++)
            C[i] += A[i][j];
        
    int total = 0;
    /* Add sums of all columns together */
    for (i = 0; i < size; i++)
        total += C[i];

    end = omp_get_wtime();
    printf("Total is %d, time is %f s\n", total, end-start);
}

matrix_sum_omp.c

Are the matrix elements added up correctly?

  1. What result should this code produce?
  2. Is the code producing the expected result?
  3. What might be the problem? How can this be fixed?

Solution

  1. The elements all have value 1, and there are 1e4*1e4 of them, so the total should be 1e8.
  2. No. The computed total is incorrect, and it typically changes from run to run.
  3. OpenMP threads share memory. This means that every thread has read and write access to all of the process memory. In this example, multiple threads are all accessing the loop variable j. Threads that started last will reset j and as a result threads that started earlier will count some elements more than once. For the code to work correctly each thread must have its own copy of the variable j.

The omp private clause

#pragma omp parallel for private(j)

Variables can also become private implicitly: the loop variable of a parallel for loop and variables declared inside the parallel region are private to each thread, while variables declared outside the region are shared by default.

Be Cautious About Data Dependencies

Consider this loop that computes a cumulative sum:

for ( i = 2; i < N; i = i + 1 ) {
    a[i] = a[i] + a[i-1];
}

Because iteration i reads the value of a[i-1] that iteration i-1 has just written, the result will be incorrect if the iterations are not performed in order.

Types of Data Dependencies:

Data dependencies are commonly classified as flow (true), anti, and output dependencies; see the Wikipedia page on data dependencies.

Identifying data dependencies

What loops have data dependencies?

/* loop #1 */
for ( i=2; i<N; i=i+2 ) {
    a[i] = a[i] + a[i-1]; }

/* loop #2 */
for ( i=1; i<N/2; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #3 */
for ( i=0; i<N/2+1; i=i+1 ) {
    a[i] = a[i] + a[i+N/2]; }

/* loop #4 */
for ( i=1; i<N; i=i+1 ) {
    a[idx[i]] = a[idx[i]] + b[idx[i]]; } 

Solution

Loop #1 does not. The increment of this loop is 2, so in the first step we compute a[2]=a[2]+a[1]. In the next step we compute a[4]=a[4]+a[3] … etc.
Loop #2 does not. In this example we are summing array elements separated by half array size. In this range of i values each thread modifies only one element of the array a.
Loop #3 does. The iteration with i=0 reads a[N/2], while the iteration with i=N/2 writes to it:

i=N/2; a[N/2] = a[N/2] + a[N]
i=0;   a[0] = a[0] + a[N/2]

Loop #4 might or might not, depending on the contents of the array idx. If any two entries of idx are the same, then there’s a dependency.

Thread-safe Functions

Consider this code:

#pragma omp parallel for
for (i = 0; i < N; i++)
	y[i] = func(x[i]);

Will this parallel code give the same results as the serial version?

Thread-safe functions are those that can be called concurrently by multiple threads and still produce correct results.

Consider the following function for calculating distance between two points in 2D:

#include <math.h>

float funct(float *p1, float *p2)
{
	float dx, dy; /* local variables */
	dx = p2[0]-p1[0];
	dy = p2[1]-p1[1];
	return(sqrt(dx*dx + dy*dy));
}

This version is thread-safe: dx and dy are local, so each call (and thus each thread) gets its own copies.

Take a look at a modified version of the code in which dx and dy are defined externally:

float dx, dy; /* global variables, shared by all threads */
float funct(float *p1, float *p2)
{
	dx = p2[0]-p1[0];
	dy = p2[1]-p1[1];
	return(sqrt(dx*dx + dy*dy));
}

This version is not thread-safe: one thread can overwrite dx or dy between another thread's assignment and its use in the return statement.

It is important to be aware of whether the functions you use are thread-safe.

For more info, see Thread safety.

Optimizing Performance

CPU Cache and Data Locality

Let's do a quick experiment. Compile our matrix_sum_omp.c code with GCC:

gcc matrix_sum_omp.c -fopenmp -O1

Run on 4 CPUs and record timing. On our training cluster you will get something like:

srun --mem-per-cpu=2000 -c4 ./a.out
Total is 100000000, time is 0.014977 s

Then swap the i and j indexes in the array access in the main for loop:

...
	#pragma omp parallel for
	for (i = 0; i < size; i++)
	    for (int j = 0; j < size; j++)
	        C[i] += A[j][i]; /* <-- Swapped indexes --- */
...

Recompile the program and run it again.

Total is 100000000, time is 0.391304 s

What is going on?

If the inner loop iterates through the elements of a row, a chunk of that row is loaded into cache on the first access:

                                      inner loop j (rows)
                             j=1               j=2   
   1    2 ... 1000 ... -->  1 2 ... 1000 
1001 1002 ... 2000 ... -->                 1001 1002 ... 2000 
2001 2002 ... 3000 ...
...

This will not happen if the inner loop steps down a column, because columns are not contiguous memory blocks: consecutive elements of a column are separated by an entire row.

                                      inner loop traverses column i
                              i=1                 i=2   
   1    2 ... 1000 ... -->  1 1001 2001 ...   2 1002 2002 ...   
1001 1002 ... 2000 ...
2001 2002 ... 3000 ...
...

In this case, the CPU needs to access main memory to load each matrix element.

Avoiding Parallel Overhead at Low Iteration Counts

Creating and synchronizing threads takes time. The if clause tells OpenMP to run the loop in parallel only when the iteration count is large enough to make this overhead worthwhile:

#pragma omp parallel for if(N > 1000) 
for(i = 0; i < N; i++){
	a[i] = k*b[i] + c[i];
}

Minimizing Parallelization Overhead

If the inner loop of a loop nest is parallelized, a new parallel region is created in every iteration of the outer loop, and each region pays the thread start-up overhead. Parallelize the outer loop where possible.
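A sketch of the two placements (the array and sizes are illustrative):

```c
#include <omp.h>
#define N 1000
#define M 1000

void fill(double a[N][M])
{
    int i, j;

    /* Good: one parallel region for the whole loop nest */
    #pragma omp parallel for private(j)
    for (i = 0; i < N; i++)
        for (j = 0; j < M; j++)
            a[i][j] = i + j;

    /* Bad: a parallel region is created (and torn down) N times */
    for (i = 0; i < N; i++) {
        #pragma omp parallel for
        for (j = 0; j < M; j++)
            a[i][j] = i + j;
    }
}
```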

Key Points

  • With the parallel for pragma, a loop is executed in parallel.

  • When a variable is written by multiple threads, incorrect results can be produced.

  • The private clause gives each thread its own copy of a variable.