
OpenMP Tasks

Overview

Teaching: 25 min
Exercises: 5 min
Questions
  • How to recurse in parallel

Objectives
  • Use general parallel tasks

There are some fundamental limitations to the OpenMP constructs we have considered so far: in particular, worksharing loops require the number of iterations to be known when the loop starts, and they cannot easily express recursively generated work.

Common computational patterns such as recursive algorithms, pointer-chasing (for example, traversing a linked list), and while loops do not fit within these limitations.

The omp task

What is a task?

A task packages a block of code together with its data environment. The thread that encounters a task construct may execute the task immediately or defer it, in which case any thread in the team can pick it up and execute it later.

Task execution model

In the most common pattern, one thread (inside a single construct within a parallel region) generates tasks and places them in a task pool, while the other threads of the team take tasks from the pool and execute them.

A typical program using tasks would follow the code shown below:

/* Create threads */
#pragma omp parallel 
{
    #pragma omp single 
    { 
        for (int i = 0; i < 1000; i++) /* Generate tasks */
           #pragma omp task
           work_on(i);
    }
}
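
The same pattern also handles loops whose length is not known in advance, such as traversing a linked list. Here is a minimal sketch (node_t, head, and process() are placeholder names, not part of the lesson's code):

#pragma omp parallel
#pragma omp single
for (node_t *p = head; p != NULL; p = p->next)
{
    #pragma omp task firstprivate(p)
    process(p); /* each list node is handled as a separate task */
}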

If you need the results from the tasks before you can continue, you can use the taskwait directive to pause your program until the tasks are done.

Finding the Smallest Factor

Let’s use tasks to find the smallest factor of a large number (using 4993*5393 as a test case). Start with the serial code:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    long N = 26927249; /* = 4993 * 5393 */
    long f;
    for (f = 2; f <= N; f++)
    {
        if (f % 200 == 0) /* Print progress */
        {
            fprintf(stdout, "checking: %li\n", f);
            fflush(stdout);
        }
        /* Check if f is a factor */
        if (N % f == 0)
        { // The remainder is 0, found factor!
            fprintf(stdout, "Factor: %li\n", f);
            exit(0); /* Terminate the program */
        }
        for (int i = 1; i < 4e6; i++)
            ; /* Burn some CPU cycles */
        /* End of factor finding block */    
    }
}

find_factor_serial.c

Exercise

Make the following changes to the code:

  • Turn the factor finding block into a task.
  • Generate a task for each trial factor.
  • Once a factor is identified, stop generating tasks.
  • Print the progress of task generation and task execution separately.

Run the program in parallel and compare execution time with the serial version.

time srun ./factor
time srun -c4 ./factor
  • How many tasks are in the pool?
  • How many tasks will be created if you submit a job with 100 and 400 threads?
  • Are all tasks generated right away?
time srun -c4 --export OMP_NUM_THREADS=100 ./factor

Solution

/* File find_factor_omp.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
  long N = 4993 * 5393;
  long f;
#pragma omp parallel
#pragma omp single
  for (f = 2; f <= N; f++) /* Loop generating tasks */
  {
    if (f % 200 == 0) /* Print progress */
    {
      fprintf(stdout, "%li tasks generated\n", f);
      fflush(stdout);
    }
#pragma omp task firstprivate(f) /* each task needs its own copy of f */
    { /* Check if f is a factor */
      if (f % 200 == 0) /* Print progress */
        fprintf(stdout, "    %li tasks done\n", f);
      if (N % f == 0)
      { // The remainder is 0, found factor!
        fprintf(stdout, "Factor: %li\n", f);
        exit(0); /* Terminate the program */
      }
      else
        for (int i = 1; i < 4e6; i++)
          ; /* Burn some CPU cycles */
    }
  } 
}

Not all tasks are generated right away. To avoid overloading the system, the OpenMP runtime limits the size of the task pool. As soon as the list of deferred tasks gets too long, the implementation suspends generation of new tasks and switches the threads in the team to executing already generated tasks.
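
For reference, one way to build and try the parallel solution (assuming GCC and Slurm, as in the exercise above; compiling without optimization keeps the cycle-burning loop from being optimized away):

gcc -fopenmp find_factor_omp.c -o factor
time srun -c4 ./factor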

Task Synchronization

Some applications require tasks to be executed in a certain order.

For example, consider the code:

x = f();
#pragma omp task
{ y1 = g1(x); }
#pragma omp task
{ y2 = g2(x); }
z = h(y1) + h(y2);

Here the last statement may execute before the two tasks have finished, so y1 and y2 may not yet contain the results of g1() and g2(). Inserting a taskwait directive before z is computed fixes the problem:

x = f();
#pragma omp task
{ y1 = g1(x); }
#pragma omp task
{ y2 = g2(x); }
#pragma omp taskwait
z = h(y1) + h(y2);

This corrected code generates two tasks and waits for both of them to finish before computing z.

Controlling Task Execution Order

In some cases a task depends on another specific task, but not on all generated tasks. In such cases the taskwait directive is too restrictive because it waits for every previously generated task. For example:

#pragma omp task /* task A */
preprocess_data(A);
#pragma omp task /* task B */
preprocess_data(B);
#pragma omp taskwait
#pragma omp task /* task C */
compute_stats(A);
#pragma omp task /* task D */
compute_stats(B);

In the example above, task C only needs to wait for task A, but it will not start until the execution of task B is also finished.

The depend clause of the task directive enforces additional constraints on the scheduling of tasks. These constraints apply only to sibling tasks, that is, tasks generated within the same parent task or region.

The previous example rewritten with task dependencies will be:

#pragma omp task depend(out:A) /* task A */
preprocess_data(A);
#pragma omp task depend(out:B) /* task B */
preprocess_data(B);
#pragma omp task depend(in:A) /* task C */
compute_stats(A);
#pragma omp task depend(in:B) /* task D */
compute_stats(B);

This example illustrates a READ after WRITE dependency: task A writes A, which is then read by task C (and likewise task D reads B only after task B has written it). With the dependencies declared, the explicit taskwait is no longer needed.

Controlling Order of Data Access

Consider the following loop:

for (i = 1; i < N; i++)
  for (j = 1; j < N; j++)
    x[i][j] = x[i-1][j] + x[i][j-1];

This loop cannot be parallelized with the parallel for construct because of its loop-carried dependency: each element x[i][j] depends on x[i-1][j] and x[i][j-1], so the task computing it must run after the tasks that update those two elements.

Exercise

  • Use tasks with dependencies to parallelize this code.
  • You should be able to correct the code by only adding the task depend clause.
  • Start with the incorrectly parallelized code:
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{

 int N = 8;
 int x[N][N], y[N][N];
 int i, j;

 /* Initialize x,y */
 for (i = 0; i < N; i++)
 {
   x[0][i] = x[i][0] = y[0][i] = y[i][0] = i;
 }

 /* Serial computation */
 for (i = 1; i < N; i++)
 {
   for (j = 1; j < N; j++)
     x[i][j] = x[i - 1][j] + x[i][j - 1];
 }

 /* Parallel computation */
#pragma omp parallel
#pragma omp single
 for (i = 1; i < N; i++)
 {
   for (j = 1; j < N; j++)
#pragma omp task
     y[i][j] = y[i - 1][j] + y[i][j - 1];
 }

 printf("Serial result:\n");
 for (i = 0; i < N; i++)
 {
   for (j = 0; j < N; j++)
     printf("%6d", x[i][j]);
   printf("\n");
 }
 printf("Parallel result:\n");
 for (i = 0; i < N; i++)
 {
   for (j = 0; j < N; j++)
     printf("%6d", y[i][j]);
   printf("\n");
 }
}

task_depend_omp.c

This example works correctly with gcc/9.3.0 but not with gcc/10 or gcc/11.

Solution

The parallel tasks write data that other tasks later read, so the dependence should be declared on the array being written: depend(out:y).
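
Applied to the task directive in the parallel section, the fix would look something like this (following the answer above; the out dependence on y makes each new task wait for the previously generated ones):

#pragma omp task depend(out: y)
     y[i][j] = y[i - 1][j] + y[i][j - 1];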

Computing Fibonacci Numbers

The next example shows how task and taskwait directives can be used to compute Fibonacci numbers recursively. The Fibonacci Sequence is the series of numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, … The next number is found by adding up the two numbers before it.

  • In the parallel construct, the single directive calls fib(n).
  • The call to fib(n) generates two tasks. One of the tasks computes fib(n-1) and the other computes fib(n-2).
  • The two return values are added together to produce the value returned by fib(n).
  • Each of the calls to fib(n-1) and fib(n-2) will in turn generate two tasks. Tasks are generated recursively until the argument passed to fib() is less than 2. The taskwait directive ensures that both tasks generated in fib() have computed i and j before they are added and the result is returned.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Usage: ./fib [n] */
/* Default n=10 */

int fib(int n) {
  int i, j;
  if (n<2)
     return n;
  else {
     #pragma omp task shared(i)
     i=fib(n-1);
     #pragma omp task shared(j)
     j=fib(n-2);
     #pragma omp taskwait
     return i+j;
  }
}

int main(int argc, char **argv){
  int n=10, result;
  if(argc==2) {
      char *a = argv[1];
      n = atoi(a);
  }

  #pragma omp parallel
  {
     #pragma omp single
     result = fib(n);
  }
  printf("Fibonacci number %d is %d\n", n, result);
}

fib_omp.c
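
Assuming the same toolchain as above, the Fibonacci example can be built and run with, for example:

gcc -fopenmp fib_omp.c -o fib
OMP_NUM_THREADS=4 ./fib 20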

Tasks?

How do tasks work with different numbers of threads? Can OpenMP use more than one thread to execute a single task?

Solution

Each task region will be executed by one thread. OpenMP will not use more threads to execute a single task.

Task gotchas

There are a few gotchas to be aware of. While the intention is that tasks will run in parallel, there is nothing in the specification that guarantees this behavior. You may need to check how your particular environment works.
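
One way to check is to print which thread executes each task; here is a minimal sketch (not part of the lesson's code):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < 8; i++)
    {
        #pragma omp task firstprivate(i)
        printf("Task %d executed by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}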

Review

OpenMP pragma, function, or clause – Concepts

  • #pragma omp parallel – Parallel region, teams of threads, structured block, interleaved execution across threads
  • int omp_get_thread_num(), int omp_get_num_threads() – Create threads with a parallel region and split up the work using the number of threads and thread ID
  • double omp_get_wtime() – Speedup and Amdahl’s law; assessing performance
  • setenv OMP_NUM_THREADS N – Internal control variables; setting the default number of threads with an environment variable
  • #pragma omp barrier, #pragma omp critical – Synchronization and race conditions
  • #pragma omp for, #pragma omp parallel for – Worksharing, parallel loops, loop-carried dependencies
  • reduction(op:list) – Reductions of values across a team of threads
  • schedule(dynamic [,chunk]), schedule(static [,chunk]) – Loop schedules, loop overheads and load balance
  • private(list), firstprivate(list), shared(list) – Data environment
  • nowait – Disabling implied barriers on workshare constructs; the high cost of barriers
  • #pragma omp single – Workshare with a single thread
  • #pragma omp task, #pragma omp taskwait – Tasks, including the data environment for tasks

Key Points

  • OpenMP can manage general parallel tasks

  • Tasks allow the parallelization of applications exhibiting irregular parallelism, such as recursive algorithms and pointer-based data structures.