Wrapping Up
Overview

Teaching: 5 min
Exercises: 0 min

Questions
- What did we learn?
- What next?

Objectives

Summary
- Dask uses `dask.delayed` to build task graphs.
- These task graphs can then be used to compute the result in either a multi-threaded or a distributed way (see the first sketch after this list).
- Multi-threaded Python only benefits from the parallelism if it does considerable work outside the Python interpreter (e.g. in NumPy), because of the GIL.
- Distributed parallel Python works well for most Python code, but adds some overhead for message passing and for starting up and tearing down the Dask cluster.
- Pure Python code is slow, but Cython can compile computationally intensive parts into faster C code with little effort.
- Dask Array parallelizes NumPy arrays, either in a multi-threaded way for faster computation, or in a distributed way to reduce each compute node's memory requirements while still giving a reasonable speed-up over serial execution (see the second sketch after this list).
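As a refresher, here is a minimal sketch of the `dask.delayed` workflow; the `inc` and `add` functions and the input values are made up for illustration.

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Calling delayed functions only records tasks in a graph; nothing has run yet.
total = add(inc(1), inc(2))

if __name__ == "__main__":
    # The same graph can be run by different schedulers:
    # threads share memory but are limited by the GIL for pure-Python work,
    print(total.compute(scheduler="threads"))    # 5
    # while processes sidestep the GIL at the cost of message passing.
    print(total.compute(scheduler="processes"))  # 5
```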
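And a minimal Dask Array sketch; the array shape and chunk sizes here are arbitrary illustrative choices.

```python
import dask.array as da

# A 10000x10000 array split into 1000x1000 chunks; each chunk is a NumPy
# array, and chunks can be computed in parallel without ever holding the
# whole array in one node's memory.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Array operations build a task graph, just like dask.delayed.
result = (x + x.T).mean(axis=0)

# .compute() executes the graph and returns a plain NumPy array.
print(result.compute().shape)  # (10000,)
```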
Further reading
There is more to learn about the topics we have covered. Check out these docs.
- Dask docs
- General best practices
- Dask examples
- Dask Delayed Docs
- Delayed best practices
- Dask Array Docs
- Best practices for Dask array
There are also some major features of Dask we didn't cover:
- Dask DataFrames, which provide a Pandas-like interface (see the first sketch below). See also Data Frames best practices.
- Dask Bag, which implements operations like map, filter, fold, and groupby on collections of generic Python objects, similar to PyToolz or a Pythonic version of PySpark (see the second sketch below).
- Dask Futures, which supports a real-time task framework similar to `dask.delayed`, but immediate rather than lazy (see the third sketch below).
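A minimal taste of Dask DataFrames, assuming Pandas and the `dask.dataframe` module are installed; the tiny table here is made up for illustration.

```python
import pandas as pd
import dask.dataframe as dd

# A tiny Pandas frame, split into two partitions as a Dask DataFrame.
pdf = pd.DataFrame({"name": ["a", "b", "a", "b"], "x": [1, 2, 3, 4]})
df = dd.from_pandas(pdf, npartitions=2)

# The familiar Pandas API builds a task graph; .compute() returns Pandas objects.
print(df.groupby("name")["x"].mean().compute())  # a: 2.0, b: 3.0
```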
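A minimal Dask Bag sketch; the input sequence and the chained operations are illustrative.

```python
import dask.bag as db

# A bag of generic Python objects, split across two partitions.
b = db.from_sequence(range(10), npartitions=2)

# map/filter-style operations build a graph lazily.
evens_squared = b.filter(lambda x: x % 2 == 0).map(lambda x: x ** 2)

if __name__ == "__main__":
    # Bags default to the multiprocessing scheduler, hence the __main__ guard.
    print(evens_squared.sum().compute())  # 120
```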
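And a minimal Futures sketch, assuming the `dask.distributed` package is installed; `square` is a made-up example function.

```python
from dask.distributed import Client

def square(x):
    return x ** 2

if __name__ == "__main__":
    client = Client()                   # start a local Dask cluster
    future = client.submit(square, 10)  # work starts immediately (eager, not lazy)
    print(future.result())              # block until the result arrives: 100
    client.close()
```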