
OpenMP Tasks

Overview

Teaching: 25 min
Exercises: 5 min
Questions
  • How to recurse in parallel

Objectives
  • Use general parallel tasks

There are some fundamental limitations to the OpenMP constructs we have considered so far: in particular, worksharing loops require the number of iterations to be known when the loop starts, and they cannot easily express recursively generated work.

Common computational patterns such as recursive algorithms, pointer-chasing (for example, traversing a linked list), and while loops do not fit within these limitations.

The omp task

What is a task?

A task packages a block of code together with its data environment. The thread that encounters a task construct may execute the task immediately or defer it, in which case any thread in the team can pick it up and execute it later.

Task execution model

In the most common pattern, one thread (inside a single construct within a parallel region) generates tasks and places them in a task pool, while the other threads of the team take tasks from the pool and execute them.

A typical program using tasks would follow the code shown below:

/* Create threads */
#pragma omp parallel 
{
    #pragma omp single 
    { 
        for (int i = 0; i < 1000; i++) /* Generate tasks */
           #pragma omp task
           work_on(i);
    }
}
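
The same pattern also handles loops whose length is not known in advance, such as traversing a linked list. Here is a minimal sketch (node_t, head, and process() are placeholder names, not part of the lesson's code):

#pragma omp parallel
#pragma omp single
for (node_t *p = head; p != NULL; p = p->next)
{
    #pragma omp task firstprivate(p)
    process(p); /* each list node is handled as a separate task */
}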

If you need the results from the tasks before you can continue, you can use the taskwait directive to pause your program until the tasks are done.

Finding the Smallest Factor

Let’s use tasks to find the smallest factor of a large number (using 4993*5393 as a test case). Start with the serial code:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    long N = 26927249; /* = 4993 * 5393 */
    long f;
    for (f = 2; f <= N; f++)
    {
        if (f % 200 == 0) /* Print progress */
        {
            fprintf(stdout, "checking: %li\n", f);
            fflush(stdout);
        }
        /* Check if f is a factor */
        if (N % f == 0)
        { // The remainder is 0, found factor!
            fprintf(stdout, "Factor: %li\n", f);
            exit(0); /* Terminate the program */
        }
        for (int i = 1; i < 4e6; i++)
            ; /* Burn some CPU cycles */
        /* End of factor finding block */    
    }
}

find_factor_serial.c

Exercise

Make the following changes to the code:

  • Turn the factor finding block into a task.
  • Generate a task for each trial factor.
  • Once a factor is identified, stop generating tasks.
  • Print the progress of task generation and task execution separately.

Run the program in parallel and compare execution time with the serial version.

time srun ./factor
time srun -c4 ./factor
  • How many tasks are in the pool?
  • How many tasks will be created if you submit a job with 100 and 400 threads?
  • Are all tasks generated right away?
time srun -c4 --export OMP_NUM_THREADS=100 ./factor

Solution

/* File find_factor_omp.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
  long N = 4993 * 5393;
  long f;
#pragma omp parallel
#pragma omp single
  for (f = 2; f <= N; f++) /* Loop generating tasks */
  {
    if (f % 200 == 0) /* Print progress */
    {
      fprintf(stdout, "%li tasks generated\n", f);
      fflush(stdout);
    }
#pragma omp task firstprivate(f) /* each task needs its own copy of f */
    { /* Check if f is a factor */
      if (f % 200 == 0) /* Print progress */
        fprintf(stdout, "    %li tasks done\n", f);
      if (N % f == 0)
      { // The remainder is 0, found factor!
        fprintf(stdout, "Factor: %li\n", f);
        exit(0); /* Terminate the program */
      }
      else
        for (int i = 1; i < 4e6; i++)
          ; /* Burn some CPU cycles */
    }
  } 
}

Not all tasks are generated right away. To avoid overloading the system, the OpenMP runtime limits the size of the task pool. As soon as the list of deferred tasks gets too long, the implementation suspends generation of new tasks and switches the threads in the team to executing already generated tasks.
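
For reference, one way to build and try the parallel solution (assuming GCC and Slurm, as in the exercise above; compiling without optimization keeps the cycle-burning loop from being optimized away):

gcc -fopenmp find_factor_omp.c -o factor
time srun -c4 ./factor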

Task Synchronization

Some applications require tasks to be executed in a certain order.

For example, consider the code:

x = f();
#pragma omp task
{ y1 = g1(x); }
#pragma omp task
{ y2 = g2(x); }
z = h(y1) + h(y2);

Here the last statement may execute before the two tasks have finished, so y1 and y2 may not yet contain the results of g1() and g2(). Inserting a taskwait directive before z is computed fixes the problem:

x = f();
#pragma omp task
{ y1 = g1(x); }
#pragma omp task
{ y2 = g2(x); }
#pragma omp taskwait
z = h(y1) + h(y2);

This corrected code generates two tasks and waits for both of them to finish before computing z.

Controlling Task Execution Order

In some cases a task depends on another specific task, but not on all generated tasks. In such cases the taskwait directive is too restrictive because it waits for every previously generated task. For example:

#pragma omp task /* task A */
preprocess_data(A);
#pragma omp task /* task B */
preprocess_data(B);
#pragma omp taskwait
#pragma omp task /* task C */
compute_stats(A);
#pragma omp task /* task D */
compute_stats(B);

In the example above, task C only needs to wait for task A, but it will not start until the execution of task B is also finished.

The depend clause of the task directive enforces additional constraints on the scheduling of tasks. These constraints apply only to sibling tasks, that is, tasks generated within the same parent task or region.

The previous example rewritten with task dependencies will be:

#pragma omp task depend(out:A) /* task A */
preprocess_data(A);
#pragma omp task depend(out:B) /* task B */
preprocess_data(B);
#pragma omp task depend(in:A) /* task C */
compute_stats(A);
#pragma omp task depend(in:B) /* task D */
compute_stats(B);

This example illustrates a READ after WRITE dependency: task A writes A, which is then read by task C (and likewise task D reads B only after task B has written it). With the dependencies declared, the explicit taskwait is no longer needed.

Controlling Order of Data Access

Consider the following loop:

for (i = 1; i < N; i++)
  for (j = 1; j < N; j++)
    x[i][j] = x[i-1][j] + x[i][j-1];

This loop cannot be parallelized with the parallel for construct because of its loop-carried dependency: each element x[i][j] depends on x[i-1][j] and x[i][j-1], so the task computing it must run after the tasks that update those two elements.

Exercise

  • Use tasks with dependencies to parallelize this code.
  • You should be able to correct the code by only adding the task depend clause.
  • Start with the incorrectly parallelized code:
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv)
{

 int N = 8;
 int x[N][N], y[N][N];
 int i, j;

 /* Initialize x,y */
 for (i = 0; i < N; i++)
 {
   x[0][i] = x[i][0] = y[0][i] = y[i][0] = i;
 }

 /* Serial computation */
 for (i = 1; i < N; i++)
 {
   for (j = 1; j < N; j++)
     x[i][j] = x[i - 1][j] + x[i][j - 1];
 }

 /* Parallel computation */
#pragma omp parallel
#pragma omp single
 for (i = 1; i < N; i++)
 {
   for (j = 1; j < N; j++)
#pragma omp task
     y[i][j] = y[i - 1][j] + y[i][j - 1];
 }

 printf("Serial result:\n");
 for (i = 0; i < N; i++)
 {
   for (j = 0; j < N; j++)
     printf("%6d", x[i][j]);
   printf("\n");
 }
 printf("Parallel result:\n");
 for (i = 0; i < N; i++)
 {
   for (j = 0; j < N; j++)
     printf("%6d", y[i][j]);
   printf("\n");
 }
}

task_depend_omp.c

This example works correctly with gcc/9.3.0 but not with gcc/10 or gcc/11.

Solution

The parallel tasks write data that other tasks later read, so the dependence should be declared on the array being written: depend(out:y).
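
Applied to the task directive in the parallel section, the fix would look something like this (following the answer above; the out dependence on y makes each new task wait for the previously generated ones):

#pragma omp task depend(out: y)
     y[i][j] = y[i - 1][j] + y[i][j - 1];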

Computing Fibonacci Numbers

The next example shows how task and taskwait directives can be used to compute Fibonacci numbers recursively. The Fibonacci Sequence is the series of numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, … The next number is found by adding up the two numbers before it.

  • In the parallel construct, the single directive calls fib(n).
  • The call to fib(n) generates two tasks. One of the tasks computes fib(n-1) and the other computes fib(n-2).
  • The two return values are added together to produce the value returned by fib(n).
  • Each of the calls to fib(n-1) and fib(n-2) will in turn generate two tasks. Tasks are generated recursively until the argument passed to fib() is less than 2. The taskwait directive ensures that both tasks generated in fib() have computed i and j before they are added and the result is returned.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Usage: ./fib [n] */
/* Default n=10 */

int fib(int n) {
  int i, j;
  if (n<2)
     return n;
  else {
     #pragma omp task shared(i)
     i=fib(n-1);
     #pragma omp task shared(j)
     j=fib(n-2);
     #pragma omp taskwait
     return i+j;
  }
}

int main(int argc, char **argv){
  int n=10, result;
  if(argc==2) {
      char *a = argv[1];
      n = atoi(a);
  }

  #pragma omp parallel
  {
     #pragma omp single
     result = fib(n);
  }
  printf("Fibonacci number %d is %d\n", n, result);
}

fib_omp.c
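
Assuming the same toolchain as above, the Fibonacci example can be built and run with, for example:

gcc -fopenmp fib_omp.c -o fib
OMP_NUM_THREADS=4 ./fib 20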

Tasks?

How do tasks work with different numbers of threads? Can OpenMP use more than one thread to execute a single task?

Solution

Each task region will be executed by one thread. OpenMP will not use more threads to execute a single task.

Task gotchas

There are a few gotchas to be aware of. While the intention is that tasks will run in parallel, there is nothing in the specification that guarantees this behavior. You may need to check how your particular environment works.
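
One way to check is to print which thread executes each task; here is a minimal sketch (not part of the lesson's code):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < 8; i++)
    {
        #pragma omp task firstprivate(i)
        printf("Task %d executed by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}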

Review

OpenMP pragma, function, or clause – Concepts

  • #pragma omp parallel – Parallel region, teams of threads, structured block, interleaved execution across threads
  • int omp_get_thread_num(), int omp_get_num_threads() – Create threads with a parallel region and split up the work using the number of threads and thread ID
  • double omp_get_wtime() – Speedup and Amdahl’s law; assessing performance
  • setenv OMP_NUM_THREADS N – Internal control variables; setting the default number of threads with an environment variable
  • #pragma omp barrier, #pragma omp critical – Synchronization and race conditions
  • #pragma omp for, #pragma omp parallel for – Worksharing, parallel loops, loop-carried dependencies
  • reduction(op:list) – Reductions of values across a team of threads
  • schedule(dynamic [,chunk]), schedule(static [,chunk]) – Loop schedules, loop overheads and load balance
  • private(list), firstprivate(list), shared(list) – Data environment
  • nowait – Disabling implied barriers on workshare constructs; the high cost of barriers
  • #pragma omp single – Workshare with a single thread
  • #pragma omp task, #pragma omp taskwait – Tasks, including the data environment for tasks

Key Points

  • OpenMP can manage general parallel tasks

  • Tasks allow the parallelization of applications exhibiting irregular parallelism, such as recursive algorithms and pointer-based data structures.