Parallel Operations with Arrays
Overview
Teaching: 20 min
Exercises: 20 min
Questions
How do I parallelize a loop?
How do I ensure that the parallel code gives correct results?
Objectives
- Learn to use the parallel for pragma
- Learn to use the private clause
- Be able to identify data dependencies
- Operations on large arrays can be time-consuming on a single CPU core
- Large-scale linear algebra problems can be solved efficiently with OpenMP
C arrays - static and dynamic
Declaring static arrays:
float A[2048][2048];
Disadvantages:
- Allocated when the program starts
- Occupies memory until the program ends
- Allocated on the stack, and there may be a limit on the size of the stack allocation
- The stack memory limit can be checked with the command:
ulimit -s
- To modify the stack memory limit, use the same ulimit command:
ulimit -s unlimited
- On compute nodes the stack is unlimited
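In contrast, dynamic arrays are allocated on the heap at run time with malloc and released with free, so they occupy memory only for as long as they are needed. Below is a minimal sketch; the 2048 x 2048 size and float type are chosen only for illustration:
/* Dynamic (heap) allocation of an n x n matrix -- illustrative sketch */
#include <stdlib.h>

int main(int argc, char **argv)
{
  int i, n = 2048;
  float **A;

  A = malloc(n * sizeof(float *));        /* one pointer per row */
  for (i = 0; i < n; i++)
    A[i] = malloc(n * sizeof(float));     /* allocate each row on the heap */

  /* ... use A[i][j] here ... */

  for (i = 0; i < n; i++)                 /* release the memory explicitly */
    free(A[i]);
  free(A);
  return 0;
}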
Testing static and dynamic memory allocation with a simple program:
/* --- File test_stack.c --- */
#include <stdio.h>
#include <stdlib.h>

#define N_BYTES 10000000 // 10 MB

void test_heap(int N) {
  char *B;
  B = malloc(N * sizeof(char)); // Allocating on heap
  printf("Allocation of %i bytes on heap succeeded\n", N);
  free(B);
}

void test_stack(int N) {
  char A[N]; // Allocating on stack
  printf("Allocation of %i on stack succeeded\n", N);
}

int main(int argc, char **argv) {
  test_heap(N_BYTES);
  test_stack(N_BYTES);
}
Multiplying an Array by a Constant
- Serial code loops over all array elements as follows:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
double start, end;
int size = 5e8;
int multiplier = 2;
int *A, *C;
int i;
/* Allocate memory for arrays */
A = malloc(size * sizeof(int));
C = malloc(size * sizeof(int));
start = omp_get_wtime();
/* Multiply array A by the multiplier */
for (i = 0; i < size; i++)
{
C[i] = multiplier * A[i];
}
end = omp_get_wtime();
printf("Total time is %f s\n", end-start);
}
- The omp_get_wtime() function is used to determine the start and end times for the loop.
Compiling and Running a Serial Version
Compile the program.
gcc array_multiply_serial.c -o array_multiply_serial -fopenmp
Run it on the cluster.
srun --mem=5000 array_multiply_serial
Execution time and array size
Run the program several times and observe execution time.
- How does the program perform over multiple runs?
Change the size of the array, recompile and rerun the program. Record execution time.
- How did the size of the array affect execution time?
Creating a Parallel Version
- We need to tell the compiler that we want to parallelize the main for loop.
...
/* Multiply array a by multiplier */
#pragma omp parallel for /* <--- OpenMP parallel for loop --- */
for (i = 0; i < size; i++)
...
In this example, the iterations of the loop are distributed among the threads.
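If you are curious how the work is divided, a small sketch like the following (not part of the lesson code) prints which thread executes which iteration using omp_get_thread_num(); with GCC's default schedule each thread typically receives a contiguous chunk of iterations:
/* Illustrative sketch: show which thread handles which iteration */
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv)
{
  int i;
#pragma omp parallel for
  for (i = 0; i < 8; i++)
    printf("Iteration %d is handled by thread %d\n", i, omp_get_thread_num());
  return 0;
}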
Compiling and Running a Parallel Version
Compile the program.
gcc array_multiply_omp.c -o array_multiply_omp -fopenmp
Run it on the cluster with 4 threads.
srun -c4 --mem=5000 array_multiply_omp
Using Threads
Run the program with different number of threads and observe how the runtime changes.
- What happens to the runtime when you change the number of threads?
For a loop to be parallelized, it must meet certain requirements:
- An iteration count must be defined before the loop starts
- After the loop begins, the number of iterations cannot change
- The loop cannot contain break statements or calls to the exit() function
Parallelizing while loops is not possible, since their number of iterations is unknown in advance.
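For example, a search loop that exits early cannot be parallelized as written, because its iteration count is not fixed (a, N and target are hypothetical names used only for illustration):
/* Cannot be parallelized as written: break makes the iteration count unknown */
int found = -1;
for (i = 0; i < N; i++) {
  if (a[i] == target) {
    found = i;
    break;
  }
}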
Calculating the Sum of Matrix Elements
- Here is a basic example code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
double start, end;
int size = 1e4;
int **A, *C;
int i, j;
/* Allocate memory */
C = malloc(size * sizeof(int));
A = (int **)malloc(size * sizeof(int *));
for (i = 0; i < size; i++)
A[i] = malloc(size * sizeof(int));
/* Set all matrix elements to 1 */
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
A[i][j] = 1;
/* Zero the accumulator */
for (i = 0; i < size; i++)
C[i] = 0;
start = omp_get_wtime();
#pragma omp parallel for
/* Each thread sums one row of the matrix */
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
C[i] += A[i][j];
int total = 0;
/* Add the row sums together */
for (i = 0; i < size; i++)
total += C[i];
end = omp_get_wtime();
printf("Total is %d, time is %f s\n", total, end-start);
}
Are the matrix elements added up correctly?
- What result should this code produce?
- Is the code producing the expected result?
- What might be the problem? How can this be fixed?
Solution
- The elements all have value 1, and there are 1e4*1e4 of them, so the total should be 1e8.
- The computed sum of the matrix elements is incorrect.
- OpenMP threads share memory. This means that every thread has read and write access to all of the process memory. In this example, multiple threads read and modify the same loop variable j. When one thread resets j, other threads may repeat or skip iterations, so some elements are counted more than once or not at all. For the code to work correctly, each thread must have its own copy of the variable j.
The omp private clause
- Private variables are hidden from other threads.
#pragma omp parallel for private(j)
- In every thread, a private variable is created locally.
- When entering a parallel region, private copies of a variable are created from the original object.
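Applied to the matrix sum above, the corrected loop could look like this:
#pragma omp parallel for private(j)
  /* Each thread sums its own rows; j is now private to every thread */
  for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
      C[i] += A[i][j];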
Variables can also be made shared or private implicitly:
- Variables declared outside of a parallel region are implicitly shared
- Variables declared inside are implicitly private.
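For example, declaring the inner loop variable inside the loop itself makes it implicitly private, which is an alternative to the private(j) clause (a sketch using the same matrix sum):
#pragma omp parallel for
  for (i = 0; i < size; i++)
    for (int j = 0; j < size; j++)  /* j declared inside the loop: implicitly private */
      C[i] += A[i][j];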
Be Cautious About Data Dependencies
- If there is a data dependency between two statements, then the order in which those statements are executed may affect the output of the program, and hence its correctness.
Consider this loop that computes a cumulative sum:
for ( i = 2; i < N; i = i + 1 ) {
a[i] = a[i] + a[i-1];
}
The result will be incorrect if any two iterations are not performed in the right order
Types of Data Dependencies:
- FLOW (READ after WRITE), like the last example, when one statement uses the results of another.
- ANTI (WRITE after READ), when one statement should write to a location only after another has read what's there.
- OUTPUT (WRITE after WRITE), when two statements write to a location, so the result will depend on which one wrote to it last.
- INPUT (READ after READ). Both statements read a variable. Since neither statement writes, they can run in either order.
See the Wikipedia page on data dependencies for more details.
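To make the ANTI and OUTPUT types more concrete, here are two minimal illustrative loops (a, x and N are hypothetical names):
/* ANTI (write after read): iteration i reads a[i+1], which iteration i+1 overwrites */
for (i = 0; i < N-1; i++)
    a[i] = a[i] + a[i+1];

/* OUTPUT (write after write): every iteration writes to x, so the final
   value of x depends on which iteration runs last */
for (i = 0; i < N; i++)
    x = a[i];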
Identifying data dependencies
What loops have data dependencies?
/* loop #1 */
for ( i=2; i<N; i=i+2 ) {
    a[i] = a[i] + a[i-1];
}

/* loop #2 */
for ( i=1; i<N/2; i=i+1 ) {
    a[i] = a[i] + a[i+N/2];
}

/* loop #3 */
for ( i=0; i<N/2+1; i=i+1 ) {
    a[i] = a[i] + a[i+N/2];
}

/* loop #4 */
for ( i=1; i<N; i=i+1 ) {
    a[idx[i]] = a[idx[i]] + b[idx[i]];
}
Solution
Loop #1 does not. The increment of this loop is 2, so in the first step we compute a[2]=a[2]+a[1]. In the next step we compute a[4]=a[4]+a[3] … etc.
Loop #2 does not. In this example we are summing array elements separated by half the array size. Each iteration writes only to its own element a[i] in the lower half of the array, while the elements a[i+N/2] it reads from the upper half are never written.
Loop #3 does. The last iteration (i=N/2) writes to a[N/2], which is read by the first iteration (i=0):
i=N/2: a[N/2] = a[N/2] + a[N]
i=0:   a[0]   = a[0]   + a[N/2]
Loop #4 might or might not, depending on the contents of the array idx. If any two entries of idx are the same, then there’s a dependency.
- Take a look at your code for data dependencies and eliminate them where you can.
- One approach, sketched below, is to write the results into a new array and then update the old array from the new one after each iteration.
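For example, the ANTI dependency in the following loop disappears if the results are written into a second array (anew is a hypothetical array of the same size) and copied back after the loop:
/* Original loop: ANTI dependency, each iteration reads a[i+1],
   which a later iteration overwrites */
for (i = 0; i < N-1; i++)
    a[i] = a[i] + a[i+1];

/* Dependency removed: read only from a, write only to the new array anew */
#pragma omp parallel for
for (i = 0; i < N-1; i++)
    anew[i] = a[i] + a[i+1];
/* ...then copy anew back into a after the loop */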
Thread-safe Functions
Consider this code:
#pragma omp parallel for
for (i = 0; i < N; i++)
    y[i] = func(x[i]);
Will this parallel code give the same results as the serial version?
- It depends on what the function does.
Thread-safe functions are those that can be called concurrently by multiple threads and still produce correct results.
Consider the following function for calculating distance between two points in 2D:
float funct(float *p1, float *p2)
{
float dx, dy;
dx = p2[0]-p1[0];
dy = p2[1]-p1[1];
return(sqrt(dx*dx + dy*dy));
}
- This function is thread-safe because dx and dy are local variables: each thread that calls the function gets its own private copies on its own stack.
Take a look at a modified version of the code in which dx and dy are defined externally:
float dx, dy;
float funct(float *p1, float *p2)
{
dx = p2[0]-p1[0];
dy = p2[1]-p1[1];
return(sqrt(dx*dx + dy*dy));
}
- This function is not thread-safe because dx and dy are global and can be modified by any thread.
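If such global variables cannot simply be made local, one possible workaround is OpenMP's threadprivate directive, which gives every thread its own copy of them. This is only a sketch; using local variables as in the first version remains the cleaner fix:
float dx, dy;
#pragma omp threadprivate(dx, dy)  /* each thread gets its own copy of dx and dy */

float funct(float *p1, float *p2)
{
    dx = p2[0]-p1[0];
    dy = p2[1]-p1[1];
    return(sqrt(dx*dx + dy*dy));
}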
It is important to be aware of whether the functions you use are thread-safe.
For more info, see Thread safety.
Optimizing Performance
CPU Cache and Data Locality
Let’s do a quick experiment. Compile our matrix_multiply_omp.c code with GCC at optimization level -O1:
gcc matrix_multiply_omp.c -fopenmp -O1
Run on 4 CPUs and record timing. On our training cluster you will get something like:
srun --mem-per-cpu=2000 -c4 ./a.out
Total is 100000000, time is 0.014977 s
Then swap the i and j indexes in the array access inside the main for loop:
...
#pragma omp parallel for
for (i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        C[i] += A[j][i]; /* <-- Swapped indexes --- */
...
Recompile the program and run it again.
Total is 100000000, time is 0.391304 s
What is going on?
- CPU cache helps to accelerate computations by eliminating memory bottlenecks.
- To take advantage of CPU cache, an array we are looping through must be stored sequentially as a contiguous memory block.
If the inner loop iterates through the elements of a row, then a chunk of that row is loaded into cache upon the first access:

   1    2 ... 1000  ...   -->  cache: 1    2 ... 1000
1001 1002 ... 2000  ...   -->  cache: 1001 1002 ... 2000
2001 2002 ... 3000  ...
...
This will not happen if the inner loop iterates through the elements of a column, because a column is not a contiguous memory block:

   1    2 ... 1000  ...   -->  access order: 1, 1001, 2001, ..., 2, 1002, 2002, ...
1001 1002 ... 2000  ...
2001 2002 ... 3000  ...
...
In this case, the CPU needs to go to main memory to load each matrix element.
Avoiding Parallel Overhead at Low Iteration Counts
- Use conditional parallelization so that the loop runs in parallel only when the iteration count is large enough to outweigh the overhead:
#pragma omp parallel for if(N > 1000)
for(i = 0; i < N; i++){
a[i] = k*b[i] + c[i];
}
Minimizing Parallelization Overhead
If the inner loop is parallelized, a new parallel region is created in each iteration of the outer loop. This causes parallelization overhead.
- It is better to create the team of threads once and give each thread a longer task, for example by parallelizing the outer loop, as sketched below.
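Here is a schematic sketch of that idea (B is a hypothetical output matrix; only the placement of the pragma differs):
/* Parallelizing the inner loop creates a parallel region 'size' times */
for (i = 0; i < size; i++) {
#pragma omp parallel for
    for (int j = 0; j < size; j++)
        B[i][j] = 2 * A[i][j];
}

/* Parallelizing the outer loop creates the parallel region only once */
#pragma omp parallel for
for (i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        B[i][j] = 2 * A[i][j];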
Key Points
- With the parallel for pragma, a loop is executed in parallel
- When a variable is accessed and modified by different threads, incorrect results can be produced
- The private clause creates a separate copy of a variable for each thread