Introduction
Overview
Teaching: 15 min
Exercises: 0 minQuestions
What is Dask?
How does it work?
Objectives
Dask is a flexible library for parallel computing in Python. Python code is (or was) a bit notoriously difficult to parallelize because of something called the global interpreter lock (GIL) which essentially meant Python couldn’t be multi-threaded within a single process. Dask was started because of a desire to parallelize the existing SciPy stack and libraries spun off from that. It started by attempting to parallelize the NumPY library because it forms the basis on which SciPy was built. NumPy was also difficult to use when working with large datasets that didn’t fit nicely into memory but that fit nicely onto disk.
Dask tries to be familiar
If you are already used to working with python module such as NumPy, Pandas, or PySpark Dask provides similar interfaces allowing python code already using these modules to be converted to use Dask parallel constructs easily.
How does Dask work?
Dask creates task graphs
Dask creates a graph of tasks, which show the inter-dependencies of the tasks. These tasks are usually functions that operate on some input. Below is an example of a very basic visualization of one of these task graphs for a particular set of tasks. The circles represent the function and the arrows point from the function to boxes which represent the output of the function. These outputs then have arrows which point to further functions which use these outputs as inputs.
Dask schedules tasks
With the task graph in hand, Dask will schedule tasks on workers/threads and ensure the outputs of those tasks get from one task to the next in the graph. There are a number of different ways Dask can schedule these tasks from serial, to multi-threaded, to across multiple distributed compute nodes on an HPC cluster.
Beginnings of Dask
Dask is a fairly new project. The First commit to the Dask project occurred in Dec, 2014. Dask started small, see how short the first commit is. The core.py
file is only 21 lines including white space. If you look at the test_comprehensive_user_experience
function you can see how it was to be used very early on. It already looked a lot like Dask Delayed does now, as we shall see.
In this test function it records the functions inc
and add
and their arguments and then later executes those calculations using the get
function to produce the final result of the calculation.
Key Points