
Real Computations

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • What happens when I use Dask for “real” computations?

  • How can I see CPU efficiency of a SLURM Job?

Objectives

So far we have seen how to use Dask delayed to speed up or parallelize the time.sleep() function. Let’s now try it with some “real” computations.

Let’s start with our hello.py script and add some functions that do some computations, using Dask delayed to parallelize them.

$ cp hello.py compute.py
$ nano compute.py

Then edit it to contain the following.

import time
import dask

def elapsed(start):
  return str(time.time()-start)+"s"

# sum the integers 0..size-1 with a plain Python loop
def computePart(size):
  part=0
  for i in range(size):
    part=part+i
  return part

def main():

  size=40000000
  numParts=4

  # build a task graph: numParts delayed calls to computePart,
  # combined by a delayed call to sum
  parts=[]
  for i in range(numParts):
    part=dask.delayed(computePart)(size)
    parts.append(part)
  sumParts=dask.delayed(sum)(parts)

  start=time.time()
  result=sumParts.compute()
  computeTime=elapsed(start)

  print()
  print("=======================================")
  print("result="+str(result))
  print("Compute time: "+computeTime)
  print("=======================================")
  print()

if __name__=="__main__":
  start=time.time()
  main()
  wallClock=elapsed(start)
  print()
  print("----------------------------------------")
  print("wall clock time:"+wallClock)
  print("----------------------------------------")
  print()

compute.py
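
As an aside (a sketch of my own, not part of the lesson’s scripts): the dask.delayed calls only build up a task graph; no actual work is done until compute() is called. You can see this in a Python session using a small size:

import dask

def computePart(size):
  part=0
  for i in range(size):
    part=part+i
  return part

part=dask.delayed(computePart)(10)
print(part)           # something like Delayed('computePart-...'); nothing has run yet
print(part.compute()) # 45; the loop actually runs here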

Let’s run compute.py.

$ srun --cpus-per-task=1 python compute.py

=======================================
result=3199999920000000
Compute time: 11.632155656814575s
=======================================


----------------------------------------
wall clock time:11.632789134979248s
----------------------------------------

More cores!

Take the compute.py code and run it with different numbers of cores like we did for the delayed.py code. What do you expect to happen? What actually happens?

Solution

1 core

$ srun --cpus-per-task=1 python compute.py
=======================================
result=3199999920000000
Compute time: 11.632155656814575s
=======================================


----------------------------------------
wall clock time:11.632789134979248s
----------------------------------------

2 cores

$ srun --cpus-per-task=2 python compute.py
=======================================
result=3199999920000000
Compute time: 11.144386768341064s
=======================================


----------------------------------------
wall clock time:11.14497971534729s
----------------------------------------

4 cores

$ srun --cpus-per-task=4 python compute.py
=======================================
result=3199999920000000
Compute time: 11.241060972213745s
=======================================


----------------------------------------
wall clock time:11.241632223129272s
----------------------------------------

Shouldn’t the cores be approximately dividing up the work? Yet the times are all about the same.

In the last exercise we learned that our “computations” do not parallelize as well as the time.sleep() function did previously. To explore this, let’s take a look at the CPU efficiency of our jobs. We can use the seff command, which, given a JobID, outputs statistics for a completed job, including its CPU efficiency. But where do we get the JobID? We can run our jobs in the background by appending the & character to our srun command; once the job is running in the background, we can check on it in the queue to get the JobID.

To make it a little easier to see what is going on in our queue, let’s create an alias for the squeue command with some extra options.

$ alias sqcm="squeue -u $USER -o'%.7i %.9P %.8j %.7u %.2t %.5M %.5D %.4C %.5m %N'"

Now we can run the command sqcm and see only our jobs (not everyone’s), along with additional information about them, for example how many cores and how much memory were requested.
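
If you are curious what the individual codes in the -o format string mean, here is an annotated breakdown (my own notes; see man squeue on your cluster for the authoritative list):

# %i JobID       %P partition    %j job name   %u user            %t state (compact)
# %M time used   %D node count   %C CPU count  %m minimum memory  %N node list
# A leading number (e.g. %.7i) sets the column width, and the dot right-justifies it.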

Let’s run a job now and try it out.

$ srun --cpus-per-task=1 python compute.py&
$ sqcm
 JOBID PARTITION     NAME   USER ST  TIME NODES CPUS MIN_M NODELIST
    964 cpubase_b   python user49  R  0:10     1    1  256M node-mdm1
$
=======================================
Compute time: 11.121154069900513s
=======================================


----------------------------------------
wall clock time:11.121747255325317s
----------------------------------------

I got the output from our sqcm command, followed a little later by the output of our job. It is important to run the sqcm command before the job completes, or you won’t get to see the JobID.

Our sqcm command showed that our job was running with one CPU on one node and 256M of memory. It also showed how long the job had been running and, most importantly, its JobID.

When the srun command runs in the background, I found I had to press the return key to get my prompt back; at the same time I got a message about my background srun job completing.

[1]+  Done                    srun --cpus-per-task=1 python compute.py
$

With the JobID we can use the seff command to see what the CPU efficiency was.

$ seff 964
Job ID: 964
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:11
CPU Efficiency: 84.62% of 00:00:13 core-walltime
Job Wall-clock time: 00:00:13
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 256.00 MB (256.00 MB/core)

We got almost 85% CPU efficiency, not too bad. (seff computes this by dividing the CPU time used, 11 s, by the core-walltime, which is cores × wall-clock time, here 1 × 13 s.)

CPU efficiency

Given what we just learned about how to check on a job’s efficiency, let’s re-run our compute.py jobs with different numbers of cores (1, 2, and 4) and see what the CPU efficiency is.

Solution

1 core

$ srun --cpus-per-task=1 python compute.py&
$ sqcm
  JOBID PARTITION     NAME   USER ST  TIME NODES CPUS MIN_M NODELIST
    966 cpubase_b   python user49  R  0:04     1    1  256M node-mdm1
$ seff 966
Job ID: 966
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:11
CPU Efficiency: 91.67% of 00:00:12 core-walltime
Job Wall-clock time: 00:00:12
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 256.00 MB (256.00 MB/core)

2 cores

$ srun --cpus-per-task=2 python compute.py&
$ sqcm
  JOBID PARTITION     NAME   USER ST  TIME NODES CPUS MIN_M NODELIST
    967 cpubase_b   python user49  R  0:04     1    2  256M node-mdm1
$ seff 967
Job ID: 967
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:12
CPU Efficiency: 50.00% of 00:00:24 core-walltime
Job Wall-clock time: 00:00:12
Memory Utilized: 12.00 KB
Memory Efficiency: 0.00% of 512.00 MB

4 cores

$ srun --cpus-per-task=4 python compute.py&
$ sqcm
  JOBID PARTITION     NAME   USER ST  TIME NODES CPUS MIN_M NODELIST
    968 cpubase_b   python user49  R  0:04     1    4  256M node-mdm1
$ seff 968
Job ID: 968
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:12
CPU Efficiency: 25.00% of 00:00:48 core-walltime
Job Wall-clock time: 00:00:12
Memory Utilized: 4.00 KB
Memory Efficiency: 0.00% of 1.00 GB

The efficiencies are roughly 92%, 50%, and 25% for 1, 2, and 4 CPUs respectively. In other words, the total CPU time stays about the same no matter how many cores we request: very little, if any, of this work is running in parallel.

What is going on? Where did our parallelism go? Remember the Global Interpreter Lock (GIL) I mentioned way back at the beginning? Well, that’s what is going on! When a thread needs to use the Python interpreter, it locks access to the interpreter so that no other thread can use it; when it is done, it releases the lock and another thread can take its turn. This means that only one thread can execute Python code at any given time!
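
To see that this is a property of Python threads rather than of Dask in particular, here is a small sketch of my own (not one of the lesson’s scripts) that times the same pure-Python loop run serially and then with a thread pool; because of the GIL, the two timings come out roughly the same even on a multi-core node:

import time
from concurrent.futures import ThreadPoolExecutor

def computePart(size):
  part=0
  for i in range(size):
    part=part+i
  return part

size=10000000

# run four calls one after another
start=time.time()
serial=[computePart(size) for i in range(4)]
print("serial:   "+str(time.time()-start)+"s")

# run the same four calls in four threads
start=time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
  threaded=list(pool.map(computePart,[size]*4))
print("threaded: "+str(time.time()-start)+"s")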

But why did our sleep function work? When we used the sleep function, the waiting happened outside the Python interpreter. Each thread could happily wait in parallel, since the ‘work’ inside the sleep function didn’t need the Python interpreter to become free. The same goes for any other function call that doesn’t run in the Python interpreter: for example, NumPy operations often run as well-optimized C code and don’t need the Python interpreter while they work. However, it can be tricky to know exactly when and how the Python interpreter will be needed.
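
For example, a hypothetical NumPy-based variant of computePart (not part of the lesson’s compute.py) pushes the summation into NumPy’s compiled loop instead of the Python interpreter. How much it then benefits from extra threads depends on the NumPy build and on memory bandwidth, so treat it as an experiment rather than a guaranteed speed-up:

import numpy as np
import dask

def computePartNumpy(size):
  # sum of 0..size-1, computed by NumPy in C rather than in Python bytecode
  return int(np.arange(size,dtype=np.int64).sum())

parts=[dask.delayed(computePartNumpy)(40000000) for i in range(4)]
print(dask.delayed(sum)(parts).compute())  # 3199999920000000, the same result as before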

Does this mean Dask is not that great for parallelism? The way we have used it so far can still be useful in certain circumstances, for example with NumPy operations as mentioned above. However, there is another way you can use Dask, and that is in a distributed way. Let’s look at that now!
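
As a small aside before we do (my own note, separate from the distributed approach covered next): Dask also ships a local “processes” scheduler that sidesteps the GIL by running each task in a separate Python process, at the cost of some start-up and data-transfer overhead. You could experiment with it in compute.py by telling compute() which scheduler to use:

  # in main(), instead of result=sumParts.compute():
  result=sumParts.compute(scheduler="processes")

Because this starts extra Python processes, the if __name__=="__main__": guard that compute.py already has matters on some platforms.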

Key Points