Independent Tasks and Job Schedulers

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How do I run several independent jobs in parallel?

Objectives
  • Be able to identify a problem made up of independent tasks

  • Use a scheduler like Slurm to submit an array job

Some problems are easy to parallelize. If the sub-tasks don’t need information from each other, or if the sub-tasks could be started in any order (which is more-or-less the same thing), then things are pretty simple.

As an example, there was a researcher who had about 10,000 images of the seafloor collected by a submersible robot, and a single-threaded program which quantified the features of one image. (E.g., how many blobs are there? How big are they?) Running the program on a single image took about an hour, so doing the whole data set sequentially would take about 10,000 hours. But there was no reason the program couldn’t be run on different CPU cores or even different computers at the same time.

Computer scientists sometimes call these “embarrassingly parallel” problems. We call them “perfectly parallel” because they can be extremely efficient. A “parameter sweep” is a common example of this type of problem.

Independent tasks in your field?

Can you think of a problem in your field that can be formulated as a set of independent tasks?

We don’t need a fancy parallel programming interface to handle these. We can use a simple task scheduler like GNU Parallel or a distributed resource manager (DRM) like Slurm, LSF, PBS, Torque, Grid Engine, etc. Most DRMs support “job arrays”, also called “array jobs” or “task arrays”.
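
For instance, with a hypothetical single-image program called process_image and a directory of images (both placeholders, not part of the lesson’s code), GNU Parallel can keep one copy of the program running per CPU core:

$ parallel ./process_image {} ::: images/*.jpg

GNU Parallel replaces {} with each file name in turn and starts a new task as soon as a core frees up, until the list after ::: is exhausted.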

Here’s an example of a submission script for an array job with the Slurm scheduler.

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --time=0-00:01:00

echo "This is task $SLURM_ARRAY_TASK_ID on $(hostname) at $(date)"

If the above is saved into a script called array_job_submit.sh, it can be submitted to the Slurm scheduler with:

$ sbatch array_job_submit.sh
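
A task array becomes useful once each task ID is mapped to a piece of real work. Here is a minimal sketch along the lines of the seafloor-image example, assuming a hypothetical program process_image and a file image_list.txt containing one image file name per line:

#!/bin/bash
#SBATCH --array=1-100
#SBATCH --time=0-01:30:00

# Select the input file whose line number matches this task's ID
# (image_list.txt and process_image are hypothetical placeholders)
IMAGE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" image_list.txt)
./process_image "$IMAGE"

Task 1 processes the first file on the list, task 2 the second, and so on; Slurm decides when and where each of the 100 tasks runs.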

The creation of a Julia set image that we did in the last episode was also nearly perfectly parallel:

Why not use a DRM or similar tool?

Given the above, why didn’t we do the Julia set exercise using Slurm or GNU Parallel?

Discussion

Two reasons:

  • The Julia set problem is not 100% independent: each sub-task needs to write its result back to the image array, which is a shared piece of memory. It’s not practical to manage access to that piece of shared memory from multiple processes.

  • There are costs to scheduling, starting, and cleaning up a DRM job or a GNU Parallel task. On Alliance clusters we advise people to estimate that a Slurm job might incur as much as a minute of overhead, so the sub-tasks or “work units” should be significantly larger than a minute if you’re going to use Slurm to manage them: an hour or more is preferable.

Key Points
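  • If sub-tasks don’t need information from each other, they can run in any order and in parallel.

  • A job scheduler like Slurm can run many independent tasks as a single array job.

  • Scheduling each task carries overhead, so each work unit should be much larger than that overhead: an hour or more for a Slurm job.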