To help us provide you with the most relevant content, a survey link has been added to the newsletter.
You are very welcome to provide contributions and suggestions about the Technical Newsletter via the survey or by emailing firstname.lastname@example.org
Enabling Distributed Machine Learning at Pawsey
Author: Cristian Di Pietrantonio
Summary. The Pawsey Supercomputing Applications team is pleased to support Machine Learning and Deep Learning projects. This support begins with user documentation that describes how to run different distributed training tasks using TensorFlow and Horovod.
Machine Learning at Pawsey - hardware and expertise
Pawsey has always observed and actively interacted with the scientific community to understand its needs. When the deep learning trend rapidly emerged, our technical and administrative staff recognised it as one of several rising applications that required a supercomputer with a radically new configuration. This is how Topaz came to be - a supercomputer that puts GPUs centre stage and complements CPU-centric systems like Magnus.
Topaz is a 22-node supercomputer; each node is equipped with two NVIDIA V100s, among the most performant GPUs available on the market today. Users can use Topaz to run GPU-intensive jobs, such as machine learning/deep learning training and inference tasks.
However, a powerful supercomputer is not enough: one must use it properly to make the most of it. For this reason, Pawsey's Services team has been investing in keeping its members' expertise up to date with best practices and the latest methodologies and frameworks.
As an initial result of this effort, today we are providing a tutorial, with more to follow, on how to use TensorFlow in the distributed environment offered by Topaz.
Distributed TensorFlow - a quick peek
The complete tutorial, which will be gradually expanded, is available here. This is what you can expect to read and learn:
The proposed solution makes use of the NVIDIA TensorFlow container. Jupyter notebook support is still under investigation, so for now ML tasks are run as classic Python scripts.
Efforts to run TensorFlow on multiple nodes started even before TensorFlow itself provided this feature. As a result, while TensorFlow currently provides only experimental support for distributed computation, third-party frameworks already do the job well. One of these, Horovod, stands out for its ease of use and performance. The tutorial illustrates Horovod's basics and its usage with TensorFlow.
Multiple Models - Multiple GPUs / Single Model - Multiple GPUs
To address user requirements, the tutorial discusses how to train different models in parallel, each one on a different GPU, to speed up prototyping and model optimisation. A second approach is also presented: using multiple GPUs following the "Data Parallelism" strategy, which is particularly useful when you have too much data to fit in the memory of one device. You can partition the data among the pool of available devices, fit the same model architecture on the different partitions, and blend the resulting parameters together to collect the knowledge acquired by each model replica.
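The data-parallel idea can be illustrated without any GPU or framework. The following is a minimal sketch in plain Python, not the tutorial's code: it assumes a toy one-parameter linear model, and the names shard, local_gradient and data_parallel_step are hypothetical. Each worker computes a gradient on its own partition of the data, and the gradients are averaged so that every replica applies the same update, which is one common way to blend what the replicas have learned.

```python
def shard(data, n_workers):
    """Split the data into n_workers interleaved partitions."""
    return [data[i::n_workers] for i in range(n_workers)]


def local_gradient(w, samples):
    """Gradient of the mean squared error of y = w * x on one partition."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_step(w, shards, lr):
    """One update: each worker computes a gradient, then they are averaged."""
    grads = [local_gradient(w, s) for s in shards]  # in practice, one per GPU
    return w - lr * sum(grads) / len(grads)


# Toy data generated from y = 3 * x, split across two hypothetical workers
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = shard(data, n_workers=2)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 3))  # the learned slope approaches 3
```

With partitions of equal size, the averaged gradient equals the gradient computed on the whole dataset, so the replicas stay in step while each one only ever touches its own share of the data.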
A brief history of Machine Learning and Deep Learning
It is not a secret that Machine Learning (ML) has contributed considerably to scientific and industrial progress in the past decade. Although most Machine Learning algorithms and ideas are older than their practical applications, their adoption has grown considerably in recent times. There are two main reasons for this:
- The advent of many-core processors
- The abundance of data produced by several new technologies: internet-connected commodity devices (IoT), web applications (especially social networks), growing scientific instrument precision
Early usage of machine learning frameworks by researchers and companies was supported by commodity hardware, such as workstations and general-purpose servers. Moreover, the widespread presence of Graphics Processing Units (GPUs) in consumer machines made great initial advancements possible in a very short time: their high instruction throughput and parallelism were a perfect fit for the kind of computations required by the most popular machine learning model, the neural network (NN). Over time, researchers discovered that the larger the model, the higher the volume of data it can process and the better its resulting accuracy. This pushed professionals and academics to use the cloud computing services that were developing in response to the fast-growing demand for computing power, and also pushed governments to finance national HPC facilities targeting these emerging needs.
With resources no longer a limiting factor, Deep Learning (DL) algorithms emerged as an answer to the question of whether the important features in data can be automatically identified and exploited, given that a sufficient amount of data and computing power is available. It turned out that "a sufficient amount" means a huge quantity of data, ranging from hundreds of gigabytes to several petabytes today, processed by tens or even hundreds of GPUs.
Slurm Resources Allocation - A Primer
Author: Marco De La Pierre
As this is a recurring topic in the Helpdesk tickets, we thought of providing a refresher on how to select compute resources with the Slurm scheduler.
Here is a sample job script that we will discuss throughout this article:
(pawsey0001 in the above script needs to be replaced with your Pawsey project ID.)
Header and script body
First of all, let us recall that any job script is made up of two sections, which are processed at distinct times:
- The header section is at the top of the script, and is made up of all the lines starting with #SBATCH. This section is processed by Slurm at the time the script is submitted using sbatch and goes into the queue. It is used by the scheduler to determine how many compute resources need to be allocated in the system for the job.
- The section that follows the header is like any other Unix script, and includes all the commands that make up the job. This section is processed when the job goes from a queued to a running status (when it gets executed). In order to actually use the multiple cores/nodes that were requested at the allocation stage, commands need to be prepended by srun.
Let us focus on the following Slurm options, which can be used both in the header section (using #SBATCH) and in the task submission line (following srun):
If no options are provided to srun, the settings provided through #SBATCH will be used. Any option that is not specified anywhere will default to 1.
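The precedence rule just described can be written down as a small lookup. The Python sketch below models only the rule as stated in this article, not Slurm's full behaviour; the function effective_options is hypothetical, and the three option names are an assumption based on the tasks, nodes and cpus discussed in the rest of the article.

```python
# Option names assumed from this article; the resolution rule is simplified.
OPTIONS = ("nodes", "ntasks", "cpus-per-task")


def effective_options(sbatch, srun=None):
    """Return the settings one srun command would see under the stated rule."""
    srun = srun or {}
    resolved = {}
    for name in OPTIONS:
        if name in srun:        # given on the srun line
            resolved[name] = srun[name]
        elif name in sbatch:    # inherited from the #SBATCH header
            resolved[name] = sbatch[name]
        else:                   # specified nowhere
            resolved[name] = 1
    return resolved


header = {"nodes": 2, "ntasks": 48}
print(effective_options(header))
print(effective_options(header, srun={"nodes": 1, "ntasks": 1}))
```

The first call shows a plain srun inheriting everything from the header; the second shows a single step narrowing the request for itself without changing the overall allocation.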
How do we use these?
Well, first of all we should ask ourselves what type of program we are going to run. To answer this, if it is the first time we are using an application, it is good practice to have a look at its manual.
A serial application can only use one core, so that is what you are going to ask Slurm for:
Serial commands, including shell commands, do not need srun, and get executed by the first core of the first node in the allocation. Using srun will be accepted by Slurm too, though.
No serial jobs on Magnus
You should only run serial jobs on Zeus, where you can ask for a 1 core allocation.
On Magnus and Galaxy you will be allocated a full compute node anyway, so running a serial job will result in a waste of core hours from your allocation.
If you really need to run serial jobs on Magnus, we suggest you make use of job packing to maximise resource usage; see links at the end of this article.
An MPI application can run multiple processes, either within the same node or across multiple nodes. In the Slurm jargon, MPI processes are mapped to tasks. So, for instance, if you run the application with 48 processes on Magnus you'll write:
Alternatively, you can ask for the same resources in terms of nodes (remember Magnus has 24 cores per node):
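The arithmetic behind this equivalence is simple enough to sketch in Python. This is only an illustration, not a Pawsey tool: the helper nodes_needed is hypothetical, and it assumes one core per task by default and that a task never spans more than one node.

```python
import math

CORES_PER_NODE = 24  # Magnus, as stated above


def nodes_needed(ntasks, cpus_per_task=1, cores_per_node=CORES_PER_NODE):
    """Smallest number of whole nodes that can hold the requested tasks."""
    tasks_per_node = cores_per_node // cpus_per_task
    if tasks_per_node == 0:
        raise ValueError("a task cannot use more cores than one node has")
    return math.ceil(ntasks / tasks_per_node)


print(nodes_needed(48))  # 48 MPI tasks on 24-core nodes: prints 2
```

Because allocations on Magnus are made in whole nodes, a request for 49 tasks would already need a third node under the same assumptions.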
A multi-threaded application can run multiple shared-memory threads within the same compute node. Somewhat confusingly, in the Slurm jargon threads are mapped to cpus. For example, this is how you set up a run with 24 threads for a multi-threaded application using OpenMP:
In this example, the variable OMP_NUM_THREADS is also being set, as most OpenMP applications use it to figure out how many threads they can spawn.
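To illustrate how a program might work out its thread count from the environment, here is a minimal Python sketch. It does not describe how any particular OpenMP runtime is implemented, and the helper thread_count is hypothetical. It prefers OMP_NUM_THREADS and, as an assumed fallback, reads SLURM_CPUS_PER_TASK, which Slurm sets in the job environment when --cpus-per-task is requested. Only a single positive integer is handled; comma-separated values for nested parallelism are ignored.

```python
import os


def thread_count(env=None):
    """Pick a thread count from the environment, defaulting to 1."""
    env = os.environ if env is None else env
    for name in ("OMP_NUM_THREADS", "SLURM_CPUS_PER_TASK"):
        value = env.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return 1  # nothing usable was set


# Example with an explicit environment, as a 24-thread job might see it
print(thread_count({"OMP_NUM_THREADS": "24", "SLURM_CPUS_PER_TASK": "24"}))
```

Passing the environment as a dictionary keeps the helper easy to try out on a workstation, where neither variable is normally set.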
Multiple applications in the same job script
So far we didn't really need to use any option in the srun line of the script. Now imagine we need to run a job consisting of several steps, each one with different resource requirements. srun options will then be required to specify them. For descriptions of the various steps, see the comments in the script:
Pawsey User Documentation resources
A wider variety of job script examples can be found in our online documentation, in particular on these pages:
- Various types of applications: Example Job Scripts
- Hybrid MPI/OpenMP
- Articulated workflows: Example Workflows
- Packing multiple serial jobs
- General Job Scheduling: Job Scheduling
New Options for Molecular Visualisation with VMD
Author: Marco De La Pierre
Installation of VMD on Zeus and Topaz (both compute and visualisation nodes) was updated during the April maintenance.
Under the hood, support was added for MPI, GPUs with CUDA (Zeus only), FFmpeg, Tachyon and Tcl/Tk; additional plugins were enabled.
As a result, several new post-processing and visualisation features are now available to users, including but not restricted to:
- Movie Making (see these docs)
- Analyses using Tcl/Tk, such as PME Electrostatics
- MPI distributed post-processing (scripting only, no vis, see these docs and these others; before running vmd, the shell variable SET_VMDNOMPI needs to be
- CUDA accelerated post-processing, e.g. radial distribution functions and rendering of molecular orbitals (Zeus only, start here and here)
- Interactive Molecular Dynamics (experimental support for NAMD, LAMMPS, Gromacs, start here)