In this tutorial, you will set up the environment on Topaz for developing machine learning systems using the NVIDIA TensorFlow Docker image, Singularity and GPU resources.
Prerequisites/Before you begin
To execute the examples in this tutorial and develop your own distributed machine learning systems, you need to download the NVIDIA TensorFlow image. Pawsey advocates the use of Singularity as the container system, which can automatically convert Docker images to Singularity images. In the following commands, the $MY_IMAGES environment variable contains the path where the TensorFlow image will be saved (ideally a directory where you keep all of your images).
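For example, you could keep your images in a directory under your group folder; the exact path below is only a suggestion, and any directory you own will work:

$ mkdir -p $MYGROUP/singularity/images
$ export MY_IMAGES=$MYGROUP/singularity/images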
Steps
Connect to Topaz and allocate an interactive session with salloc (one node, one task). Then, download the NVIDIA TensorFlow container image using Singularity.

Terminal 1. Create an interactive session and download the TensorFlow image.
$ salloc -p gpuq-dev -N 1 -n 1
$ module load singularity
$ singularity pull --dir $MY_IMAGES docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
Change to the directory you would like to use as the work directory. You are going to create a couple of files to get started. To test that the image in fact contains TensorFlow and Horovod, you can use the following simple Python script that we will call test_image.py.

Listing 1. Contents of test_image.py.
import tensorflow as tf
import horovod as hvd

print("Tensorflow version is", tf.__version__)
print("Horovod version is", hvd.__version__)
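If you also want to confirm that the GPUs are visible from inside the container, you could optionally append a small check to test_image.py. This is only a sketch, assuming the tf.config.list_physical_devices API that ships with TensorFlow 2.1; it is not part of the original listing:

import tensorflow as tf  # already imported at the top of test_image.py

# Optional: list the GPUs TensorFlow can see from inside the container.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", len(gpus))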
Create a Slurm job script to execute the Python code. Its main purpose is to define the resources needed, configure the NVIDIA container, and finally execute the Python script with Singularity. You can take this chance to write the script so that it can launch different Python programs with little effort, as shown below.
The script assumes that you have defined and exported the following environment variables:
- SINGULARITY_WORK_DIR - path to a folder that Singularity will map to your home folder in the container. Usually this is the work directory where you keep your project's files.
- MY_IMAGES - path to the directory containing the NVIDIA TensorFlow image.
Since sbatch will start a process with a clean environment by default (for the sake of reproducibility), exporting the aforementioned variables is not enough. You will need to set and export the SBATCH_EXPORT variable with the list of environment variables that you wish sbatch to pass on to the child process. With the terminal in the work directory, you can do the following:

Terminal 2. Setting environment variables.
$ # MY_IMAGES was already set previously
$ export MY_IMAGES
$ export SINGULARITY_WORK_DIR=`pwd`
$ export SBATCH_EXPORT=MY_IMAGES,SINGULARITY_WORK_DIR
Now it is time to take a look at the job script.
Listing 2. Content of distributed_tf.sh.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:2
#SBATCH --partition=gpuq
#SBATCH --account=<your-account-here>

module load singularity

TENSORFLOW_IMAGE=$MY_IMAGES/tensorflow_20.03-tf2-py3.sif
export SINGULARITYENV_CUDA_HOME=$CUDA_HOME

srun singularity run --nv -B $SINGULARITY_WORK_DIR:$HOME $TENSORFLOW_IMAGE python "$@"
The script just outlined will help us launch the TensorFlow distributed computation examples that will follow. Both GPUs are used on every allocated node (Topaz has two GPUs per node) and two serial tasks (processes) per node are launched so that each of them will use one GPU. Therefore, the tasks-per-node, cpus-per-task and gres parameters are constants. However, you can override the nodes parameter to add more resources to the computation. The script then loads the singularity module and exports SINGULARITYENV_CUDA_HOME so that CUDA_HOME is defined inside the container; the latter is used by Horovod internally to manage the communication among processes. Finally, the last line starts the parallel job, invoking the Python interpreter within the TensorFlow container so that it executes whatever Python script (followed by its arguments, if any) is passed as the first argument to distributed_tf.sh.
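To see how the trailing "$@" makes the script reusable, consider a hypothetical training script; train.py and its --epochs option below are placeholders, not files from this tutorial:

$ sbatch distributed_tf.sh train.py --epochs 10
# Inside the job, the last line of distributed_tf.sh then runs:
#   srun singularity run --nv -B $SINGULARITY_WORK_DIR:$HOME $TENSORFLOW_IMAGE python train.py --epochs 10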
Now let's see how to run the test Python script.
Terminal 3. Submit the job to Slurm.
$ ls distributed_tf.sh test_image.py
$ export MY_IMAGES=$MYGROUP/singularity/images
$ export SINGULARITY_WORK_DIR=`pwd`
$ export SBATCH_EXPORT=MY_IMAGES,SINGULARITY_WORK_DIR
$ # do the above exports once every session
$ sbatch distributed_tf.sh test_image.py
As you can see, once we have the setup in place, launching a job is very easy. Let's take a look at the job output. You should see a lot of logging information from TensorFlow, but at the very end there will be the output of our print calls.

Listing 3. Example job output.
Tensorflow version is 2.1.0
Horovod version is 0.19.0
Tensorflow version is 2.1.0
Horovod version is 0.19.0
Tensorflow version is 2.1.0
Horovod version is 0.19.0
Tensorflow version is 2.1.0
Horovod version is 0.19.0
Notice how each string is printed four times. This is because srun launched four processes (two nodes with two tasks each), each executing your Python code.

As the output is going to be lengthy in the next tutorial, it is recommended to create a separate output file for each task in the job. This is accomplished by passing -o slurm-%j_%t.out to srun in the distributed_tf.sh script, where %j expands to the job ID and %t to the task rank.

Listing 4. Edit distributed_tf.sh.
srun -o slurm-%j_%t.out singularity run --nv -B $SINGULARITY_WORK_DIR:$HOME $TENSORFLOW_IMAGE python "$@"
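With this change, each of the four tasks writes to its own file. Assuming a hypothetical job ID of 123456, the output files would be named:

slurm-123456_0.out
slurm-123456_1.out
slurm-123456_2.out
slurm-123456_3.out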
As a final remark, you can change the number of nodes by passing the --nodes option to sbatch on the command line (e.g. sbatch --nodes=1 distributed_tf.sh test_image.py).
Related pages
- How to Interact with Multiple GPUs using TensorFlow
- How to Use Horovod for Distributed Training in Parallel using TensorFlow