Input and Output

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How is input/output organized on HPC clusters?

  • How do I optimize input/output in an HPC environment?

Objectives
  • Sketch the storage structure of a generic, typical cluster

  • Understand the terms “local disk”, “IOPS”, “SAN”, “metadata”, “parallel I/O”

Cluster storage

Cluster Architecture

Here is the basic architecture of an HPC cluster:

Broadly, you have two choices: You can do I/O to the node-local disk (if there is any), or you can do I/O to the SAN. Local disk suffers little or no contention but is inconvenient.

Local disk

What are the inconveniences of a local disk? What sort of work patterns suffer most from this? What suffers the least? That is, what sort of jobs can use local disk most easily?

Discussion

In a shared computing environment managed by a distributed resource manager (DRM) such as Slurm, you cannot know in advance which machine will run your job, so you can’t stage the input data onto the node ahead of time. Moving the input data onto the node and the output data off the node must be incorporated into the job script.
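The stage-in, compute, stage-out pattern described above can be sketched as follows. A real job script would usually do this with shell commands; this is only a minimal Python illustration. The function name `run_with_local_staging`, the output file name `result.out`, and the use of a `SLURM_TMPDIR` environment variable to locate node-local scratch space are all assumptions, so check your site's documentation for the actual location of local disk.

```python
import os
import shutil
import tempfile
from pathlib import Path


def run_with_local_staging(input_file, output_dir, compute):
    """Copy input to node-local disk, run compute() there, copy the result back.

    compute is any callable taking (local_input_path, local_output_path).
    """
    # Assumption: the site exports a per-job local scratch directory in
    # SLURM_TMPDIR. If it is not set, fall back to the system temp directory.
    local_root = os.environ.get("SLURM_TMPDIR", tempfile.gettempdir())

    input_file = Path(input_file)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # The working directory is deleted automatically when the block exits.
    with tempfile.TemporaryDirectory(dir=local_root) as workdir:
        workdir = Path(workdir)

        # Stage in: network storage -> local disk.
        local_input = workdir / input_file.name
        shutil.copy2(input_file, local_input)

        # Compute: all reads and writes happen on local disk.
        local_output = workdir / "result.out"  # hypothetical file name
        compute(local_input, local_output)

        # Stage out: local disk -> network storage.
        final_output = output_dir / local_output.name
        shutil.copy2(local_output, final_output)

    return final_output
```

Because the temporary directory is removed when the job finishes, anything not copied back in the stage-out step is lost, which is exactly the inconvenience discussed here.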

If you’re running a distributed-memory job that uses more than one node, then the data has to be distributed to and collected from all the nodes, or else all I/O has to be done by a single node, which may become a performance bottleneck.

Some distributed-memory programs use parallel I/O, which means that different processes may write to a single file at the same time. This is not possible with node-local storage since the nodes don’t share files.

Jobs that use a lot of temporary files which are discarded at the end of the job are very well-suited to node-local storage.

Jobs that do many small I/O operations (e.g. writing detailed progress information to a log file) are very poorly suited to network storage (SAN), and so are well-suited to node-local storage by comparison.
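If a job must log to network storage anyway, one way to reduce the number of small operations is to collect messages in memory and write them in larger batches. The sketch below is illustrative only: the function name `write_batched` and the default batch size are assumptions, and Python's file objects already do some buffering of their own, so the gain in a real program depends on the language and I/O library in use.

```python
def write_batched(path, messages, batch_size=1000):
    """Write messages to path in batches; return the number of write calls."""
    calls = 0
    pending = []
    with open(path, "w") as f:
        for msg in messages:
            pending.append(msg)
            if len(pending) >= batch_size:
                # One larger write instead of batch_size small ones.
                f.write("".join(pending))
                calls += 1
                pending.clear()
        if pending:
            # Flush whatever is left over at the end.
            f.write("".join(pending))
            calls += 1
    return calls
```

The trade-off is that messages still held in memory are lost if the job is killed before the next batch is written.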

Mixed use of node and network storage is also possible, e.g. reading data from the SAN but writing results to local disk before staging them back at the end of the calculation, or at infrequent intervals during the calculation.

Local disk performance depends on a lot of things. If you’re interested you can get an idea about Intel data center solid-state drives here.

For that particular disk model, sequential read/write speed is 3200/3200 MB/sec. It is also capable of 654K/220K IOPS (I/O operations per second) for 4K random read/write.
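To compare the IOPS figures with the sequential speeds, it helps to convert them into a data rate. The small sketch below assumes that "4K" means 4096 bytes and that a megabyte is 10^6 bytes; the function name `iops_to_mb_per_s` is made up for this example.

```python
def iops_to_mb_per_s(iops, block_bytes=4096):
    """Convert I/O operations per second into MB/sec for a given block size."""
    return iops * block_bytes / 1e6


random_read = iops_to_mb_per_s(654_000)   # about 2679 MB/sec
random_write = iops_to_mb_per_s(220_000)  # about 901 MB/sec
```

Under these assumptions, 4K random reads approach the sequential rate on this disk, while 4K random writes reach less than a third of it.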

Storage Area Network

Most input and output on a cluster goes through the SAN. Many architectural choices go into constructing a SAN.

This is all the domain of the sysadmins, but what should you as the user do about input/output?

Programming input/output

If you’re doing parallel computing you have choices about how you do input and output.

Measuring I/O rates

The most important thing to know is that the parallel filesystem is optimized for storing large shared files that may be accessed from many compute nodes. If used to store many small files, it will perform poorly. As you may have been told in our new-user seminar, we strongly recommend that you do not generate millions of small files.
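One rough way to see this effect for yourself is to time the same volume of data written as one large file and as many small files. The following is a minimal sketch, not a proper benchmark: the function name `time_writes`, the `part_NNNNNN.bin` file names, and the sizes in the demo are all assumptions, and results will vary with caching and with how busy the filesystem is. Keep the file counts small when testing on shared storage.

```python
import os
import tempfile
import time
from pathlib import Path


def time_writes(directory, n_files, bytes_per_file):
    """Write n_files files of bytes_per_file bytes; return (seconds, MB/sec)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    payload = b"\0" * bytes_per_file

    start = time.perf_counter()
    for i in range(n_files):
        with open(directory / f"part_{i:06d}.bin", "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to storage
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)

    total_mb = n_files * bytes_per_file / 1e6
    return elapsed, total_mb / elapsed


if __name__ == "__main__":
    # Same total volume (about 16 MB), written two different ways.
    with tempfile.TemporaryDirectory() as scratch:
        print("one large file:  ", time_writes(Path(scratch) / "big", 1, 16 * 2**20))
        print("many small files:", time_writes(Path(scratch) / "small", 4096, 4096))
```

To test a particular filesystem, pass a directory on that filesystem instead of the temporary directory used in the demo.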

Takeaways:

Moving data on and off a cluster

Estimating transfer times

  1. How long to move a gigabyte at 100Mbit/s?
  2. How long to move a terabyte at 1Gbit/s?

Solution

Remember a byte is 8 bits, a megabyte is 8 megabits, etc.

  1. 1024 MByte/GByte * 8 bit/Byte / 100 Mbit/sec = 81.92 sec, or about 82 seconds
  2. 1024 GByte/TByte * 8 bit/Byte / 1 Gbit/sec = 8192 sec, or about 2 hours and 17 minutes

Remember these are “theoretical maximums”! There is almost always some other bottleneck or contention that reduces this!
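The arithmetic in the solution can be captured in a small helper, sketched here in Python. It follows the same convention as the solution (1024 MByte per GByte, with size and rate sharing the same prefix). The function names and the `efficiency` parameter are assumptions added for illustration; `efficiency` is simply a way to model the bottlenecks and contention mentioned above.

```python
def transfer_seconds(size, rate, efficiency=1.0):
    """Seconds needed to move `size` at `rate`.

    size: amount of data in MByte (or GByte)
    rate: link speed in Mbit/sec (or Gbit/sec), same prefix as size
    efficiency: fraction of the theoretical rate actually achieved (0 to 1]
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return size * 8 / (rate * efficiency)


def format_hms(seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


one_gigabyte = transfer_seconds(1024, 100)  # 1024 MByte at 100 Mbit/sec
one_terabyte = transfer_seconds(1024, 1)    # 1024 GByte at 1 Gbit/sec

print(format_hms(one_gigabyte))  # 0h 01m 22s
print(format_hms(one_terabyte))  # 2h 16m 32s
```

For example, `transfer_seconds(1024, 100, efficiency=0.5)` doubles the first estimate to about 164 seconds.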

Key Points