Real Computations
Overview
Teaching: 15 min
Exercises: 10 min
Questions
What happens when I use Dask for “real” computations?
How can I see the CPU efficiency of a SLURM job?
Objectives
So far we have seen how to use Dask delayed to speed up or parallelize the time.sleep() function. Let’s now try it with some “real” computations.
Let’s start with our hello.py script and add some functions that do some computations, using Dask delayed to parallelize them.
$ cp hello.py compute.py
$ nano compute.py
Then edit it to contain the following.
import time
import dask

def elapsed(start):
    # Return the time elapsed since `start`, as a string in seconds
    return str(time.time() - start) + "s"

def computePart(size):
    # Sum the integers from 0 to size-1 with a pure-Python loop
    part = 0
    for i in range(size):
        part = part + i
    return part

def main():
    size = 40000000
    numParts = 4

    # Build a graph of delayed computations; nothing runs yet
    parts = []
    for i in range(numParts):
        part = dask.delayed(computePart)(size)
        parts.append(part)
    sumParts = dask.delayed(sum)(parts)

    # Trigger the actual computation and time it
    start = time.time()
    result = sumParts.compute()
    computeTime = elapsed(start)

    print()
    print("=======================================")
    print("result=" + str(result))
    print("Compute time: " + computeTime)
    print("=======================================")
    print()

if __name__ == "__main__":
    start = time.time()
    main()
    wallClock = elapsed(start)

    print()
    print("----------------------------------------")
    print("wall clock time:" + wallClock)
    print("----------------------------------------")
    print()
Let’s run it.
$ srun --cpus-per-task=1 python compute.py
=======================================
result=3199999920000000
Compute time: 11.632155656814575s
=======================================
----------------------------------------
wall clock time:11.632789134979248s
----------------------------------------
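Notice that the compute time and the wall-clock time are almost identical: building the graph of delayed calls costs next to nothing, and essentially all of the time is spent inside .compute(). If you want to convince yourself of this, here is a small standalone sketch (not part of compute.py) that times the two phases separately:
import time
import dask

def computePart(size):
    part = 0
    for i in range(size):
        part = part + i
    return part

# Phase 1: build the task graph (lazy, so this should be nearly instant)
start = time.time()
parts = [dask.delayed(computePart)(40000000) for i in range(4)]
total = dask.delayed(sum)(parts)
print("graph build: " + str(time.time() - start) + "s")

# Phase 2: actually run the tasks; this is where the real work happens
start = time.time()
total.compute()
print("compute:     " + str(time.time() - start) + "s")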
More cores!
Take the compute.py code and run it with different numbers of cores like we did for the delayed.py code. What do you expect to happen? What actually happens?
Solution
1 core
$ srun --cpus-per-task=1 python compute.py
=======================================
result=3199999920000000
Compute time: 11.632155656814575s
=======================================
----------------------------------------
wall clock time:11.632789134979248s
----------------------------------------
2 cores
$ srun --cpus-per-task=2 python compute.py
=======================================
result=3199999920000000
Compute time: 11.144386768341064s
=======================================
----------------------------------------
wall clock time:11.14497971534729s
----------------------------------------
4 cores
$ srun --cpus-per-task=4 python compute.py
=======================================
result=3199999920000000
Compute time: 11.241060972213745s
=======================================
----------------------------------------
wall clock time:11.241632223129272s
----------------------------------------
Shouldn’t they be approximately dividing up the work? Yet the times are all about the same.
In the last exercise we learned that our “computations” do not parallelize as well as the time.sleep() function did previously. To explore this, let’s take a look at the CPU efficiency of our jobs. We can do this with the seff command, which, given a JobID, prints statistics for a completed job, including its CPU efficiency. But where do we get the JobID? We can run our jobs in the background by appending the & character to our srun command; once a job is running in the background we can check on it in the queue to get its JobID.
To make it a little easier to see what is going on in our queue, let’s create an alias for the squeue command with some extra options.
$ alias sqcm="squeue -u $USER -o'%.7i %.9P %.8j %.7u %.2t %.5M %.5D %.4C %.5m %N'"
Now we can run the command sqcm to see only our own jobs (not everyone’s), along with additional information about them, for example how many cores and how much memory were requested.
Let’s run a job now and try it out.
$ srun --cpus-per-task=1 python compute.py&
$ sqcm
JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_M NODELIST
964 cpubase_b python user49 R 0:10 1 1 256M node-mdm1
$
=======================================
result=3199999920000000
Compute time: 11.121154069900513s
=======================================
----------------------------------------
wall clock time:11.121747255325317s
----------------------------------------
I got the output from our sqcm
command followed a little later by the output of our job. It is important to run the sqcm
command before the job completes or you won’t get to see the JobID.
Our sqcm command showed us that our job was running with one CPU on one node and with 256M of memory. It also showed us how long the job had been running and let us see the JobID.
When the srun command is running in the background, I found I had to press the return key to get my prompt back, at which point I also got a message about my background srun job completing.
[1]+ Done srun --cpus-per-task=1 python compute.py
$
With the JobID we can use the seff command to see what the CPU efficiency was.
$ seff 965
Job ID: 965
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:11
CPU Efficiency: 84.62% of 00:00:13 core-walltime
Job Wall-clock time: 00:00:13
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 256.00 MB (256.00 MB/core)
We got almost 85% CPU efficiency, not too bad.
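As a quick sanity check, the CPU efficiency that seff reports is essentially the CPU time actually used divided by the core-seconds allocated (cores multiplied by wall-clock time). Plugging in the numbers from the output above (a rough check only, since seff rounds to whole seconds):
# Numbers taken from the seff output above
cpu_utilized = 11      # seconds of CPU time the job actually used
cores = 1              # cores allocated to the job
wall_clock = 13        # seconds the job existed
efficiency = cpu_utilized / (cores * wall_clock)
print("CPU efficiency: " + str(round(efficiency * 100, 2)) + "%")   # 84.62%, matching seff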
CPU efficiency
Given what we just learned about how to check on a job’s efficiency, let’s re-run our compute.py jobs with different numbers of cores (1, 2, and 4) and see what the CPU efficiency is.
Solution
1 core
$ srun --cpus-per-task=1 python compute.py&
$ sqcm
JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_M NODELIST
966 cpubase_b python user49 R 0:04 1 1 256M node-mdm1
$ seff 966
Job ID: 966
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:11
CPU Efficiency: 91.67% of 00:00:12 core-walltime
Job Wall-clock time: 00:00:12
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 256.00 MB (256.00 MB/core)
2 cores
$ srun --cpus-per-task=2 python compute.py&
$ sqcm
JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_M NODELIST
967 cpubase_b python user49 R 0:04 1 2 256M node-mdm1
$ seff 967
Job ID: 967
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:12
CPU Efficiency: 50.00% of 00:00:24 core-walltime
Job Wall-clock time: 00:00:12
Memory Utilized: 12.00 KB
Memory Efficiency: 0.00% of 512.00 MB
4 cores
$ srun --cpus-per-task=4 python compute.py&
$ sqcm
JOBID PARTITION NAME USER ST TIME NODES CPUS MIN_M NODELIST
968 cpubase_b python user49 R 0:04 1 4 256M node-mdm1
$ seff 968
Job ID: 968
Cluster: pcs
User/Group: user49/user49
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:12
CPU Efficiency: 25.00% of 00:00:48 core-walltime
Job Wall-clock time: 00:00:12
Memory Utilized: 4.00 KB
Memory Efficiency: 0.00% of 1.00 GB
The efficiencies are roughly 92%, 50%, and 25% for 1, 2, and 4 CPUs respectively. It looks like little, if any, of this work is actually running in parallel.
What is going on, and where did our parallelism go? Remember the Global Interpreter Lock (GIL) I mentioned way back at the beginning? Well, that’s what is going on! When a thread needs to use the Python interpreter, it locks access to the interpreter so no other thread can use it; when it is done, it releases the lock and another thread can take a turn. This means that no two threads can ever execute Python code at the same time!
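To see the GIL in action without Dask at all, here is a small standalone sketch (hypothetical, not part of compute.py) that runs the same pure-Python loop first serially and then in two threads. On a typical machine the threaded version is no faster, because only one thread can hold the interpreter lock at a time:
import threading
import time

def computePart(size):
    # Same pure-Python loop as in compute.py
    part = 0
    for i in range(size):
        part = part + i
    return part

size = 40000000

# Run the work twice in a single thread
start = time.time()
computePart(size)
computePart(size)
print("serial:   " + str(time.time() - start) + "s")

# Run the same work in two threads "at the same time"
start = time.time()
threads = [threading.Thread(target=computePart, args=(size,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threaded: " + str(time.time() - start) + "s")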
But why did our sleep function work? When we use the sleep function, the waiting happens outside the Python interpreter, so each thread can happily wait in parallel: the “work” inside sleep doesn’t have to wait for the Python interpreter to become free. The same goes for any other function call that does not run in the Python interpreter. For example, NumPy operations often run as well-optimized C code and don’t need the Python interpreter while they operate; however, it can be tricky to know exactly when and how the Python interpreter will be needed.
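To make that concrete, here is a hypothetical variant of computePart that does its summing inside NumPy: the heavy lifting then runs as compiled C code that can release the GIL, giving the threaded scheduler a chance to overlap the parts. This is a sketch only; whether you actually see a speedup depends on your NumPy build and on memory bandwidth.
import numpy as np
import dask

def computePartNumpy(size):
    # The summing loop now runs inside NumPy's C code, which can release the GIL
    return int(np.arange(size, dtype=np.int64).sum())

parts = [dask.delayed(computePartNumpy)(40000000) for i in range(4)]
total = dask.delayed(sum)(parts)
print(total.compute())   # 3199999920000000, the same result as compute.py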
Does this mean Dask is not that great for parallelism? The way we have used it so far can still be useful in certain circumstances, for example with the NumPy operations mentioned above. However, there is another way you can use Dask: in a distributed way. Let’s look at that now!
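Before we do, it is worth knowing that the base dask package also includes a multiprocessing scheduler that sidesteps the GIL by running each task in its own Python process (and therefore with its own interpreter and its own GIL). Here is a minimal sketch as an aside; the distributed approach we look at next is more flexible.
import dask

def computePart(size):
    part = 0
    for i in range(size):
        part = part + i
    return part

if __name__ == "__main__":
    parts = [dask.delayed(computePart)(40000000) for i in range(4)]
    total = dask.delayed(sum)(parts)
    # "processes" asks Dask to use its multiprocessing scheduler instead of threads
    print(total.compute(scheduler="processes"))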
Key Points