In this tutorial, you will set up the environment on Topaz for developing machine learning systems using the NVIDIA TensorFlow Docker image, Singularity and GPU resources.

Prerequisites/Before you begin

To run the examples in this tutorial and develop your own distributed machine learning systems, you need to download the NVIDIA TensorFlow image. Pawsey advocates the use of Singularity as the container system; Singularity can automatically convert Docker images to Singularity images. In the following commands, the $MY_IMAGES environment variable contains the path where the TensorFlow image will be saved (ideally a directory where you keep all of your container images).
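
For example, to match the path used later in this tutorial, you could create a directory under your group folder and point MY_IMAGES at it (the exact location is your choice):

$ export MY_IMAGES=$MYGROUP/singularity/images
$ mkdir -p $MY_IMAGES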

Steps

  1. Connect to Topaz and allocate an interactive session with salloc (one node, one task). Then, download the NVIDIA TensorFlow container image using Singularity.

    Terminal 1. Create an interactive session and download the TensorFlow image.
    $ salloc -p gpuq-dev -N 1 -n 1
    $ module load singularity
    $ singularity pull --dir $MY_IMAGES docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
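
    After the pull completes, you should find the converted image in $MY_IMAGES (Singularity names the file after the image name and tag):

    $ ls $MY_IMAGES
    tensorflow_20.03-tf2-py3.sif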
  2. Change to the directory you would like to use as the work directory. You are going to create a couple of files to get started. To test that the image in fact contains TensorFlow and Horovod, you can use the following simple Python script that we will call test_image.py.

    Listing 1. Contents of test_image.py.
    import tensorflow as tf
    import horovod as hvd

    # If both imports succeed, the image provides TensorFlow and Horovod;
    # print their versions as a quick sanity check
    print("Tensorflow version is", tf.__version__)
    print("Horovod version is", hvd.__version__)
  3. Create a Slurm job script to execute the Python code. Its main purpose is to define the resources needed, configure the NVIDIA container, and finally execute the Python script with Singularity. It is worth writing the script so that it can be reused to launch different Python programs with little effort; you will see how below.

    The script assumes that you have defined and exported the following environment variables:

    • SINGULARITY_WORK_DIR - path to a folder that Singularity will map to your home folder in the container. Usually this folder is the work directory where you keep your project's files
    • MY_IMAGES - path to the directory containing the NVIDIA TensorFlow image

    Since sbatch starts the job with a clean environment by default (for the sake of reproducibility), exporting the aforementioned variables is not enough. You also need to set and export the SBATCH_EXPORT variable with the list of environment variables that you want sbatch to pass on to the job. So, with your terminal in the work directory, you could do the following:

    Terminal 2. Setting environment variables.
    $ # MY_IMAGES was already set previously
    $ export MY_IMAGES
    $ export SINGULARITY_WORK_DIR=`pwd`
    $ export SBATCH_EXPORT=MY_IMAGES,SINGULARITY_WORK_DIR
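
    Equivalently, the same list can be passed at submission time with sbatch's --export option, which has the same effect as setting SBATCH_EXPORT (using the job script introduced in the next step):

    $ sbatch --export=MY_IMAGES,SINGULARITY_WORK_DIR distributed_tf.sh test_image.py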
  4. Now it is time to take a look at the job script.

    Listing 2. Content of distributed_tf.sh.
    #!/bin/bash
    #SBATCH --nodes=2                 # default; can be overridden at submission time
    #SBATCH --ntasks-per-node=2       # one task per GPU
    #SBATCH --cpus-per-task=1
    #SBATCH --gres=gpu:2              # request both GPUs on each Topaz node
    #SBATCH --partition=gpuq
    #SBATCH --account=<your-account-here>

    module load singularity

    # Image pulled in step 1; pass the host CUDA location into the container
    TENSORFLOW_IMAGE=$MY_IMAGES/tensorflow_20.03-tf2-py3.sif
    export SINGULARITYENV_CUDA_HOME=$CUDA_HOME

    # One container instance per task; the first argument is the Python script
    # to run, any remaining arguments are forwarded to it
    srun singularity run --nv -B $SINGULARITY_WORK_DIR:$HOME $TENSORFLOW_IMAGE python "$@"

    The script just outlined will help us launch the TensorFlow distributed computation examples that follow. Both GPUs are used on every allocated node (Topaz has two GPUs per node), and two tasks (processes) are launched per node so that each of them uses one GPU. The ntasks-per-node, cpus-per-task and gres parameters are therefore fixed, while the nodes parameter can be overridden to add more resources to the computation. The script then loads the singularity module and passes the host's CUDA_HOME into the container through the SINGULARITYENV_CUDA_HOME variable; CUDA is used by Horovod internally to manage the communication among processes. Finally, the last line starts the parallel job, invoking the Python interpreter within the TensorFlow container so that it executes whatever Python script (followed by its arguments, if any) is passed as the first argument to distributed_tf.sh.
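
    For example, since the script simply forwards its arguments, a hypothetical training program with its own flags (train.py and --epochs are placeholders, not files from this tutorial) would be launched in exactly the same way:

    $ sbatch distributed_tf.sh train.py --epochs 10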

  5. Let's see now how to run the test Python script.

    Terminal 3. Submit the job to Slurm.
    $ ls
    distributed_tf.sh  test_image.py
    $ export MY_IMAGES=$MYGROUP/singularity/images
    $ export SINGULARITY_WORK_DIR=`pwd`
    $ export SBATCH_EXPORT=MY_IMAGES,SINGULARITY_WORK_DIR
    $ # do the above exports once every session
    $ sbatch distributed_tf.sh test_image.py
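
    While the job is queued or running, you can check its state with the standard Slurm tools:

    $ squeue -u $USER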
  6. As you can see, once we have the setup in place, launching a job is very easy. Let's take a look at the job output. You should see a lot of logging information from TensorFlow, but at the very end there will be the output of our print calls.

    Listing 3. Example job output.
    Tensorflow version is 2.1.0
    Horovod version is 0.19.0
    Tensorflow version is 2.1.0
    Horovod version is 0.19.0
    Tensorflow version is 2.1.0
    Horovod version is 0.19.0
    Tensorflow version is 2.1.0
    Horovod version is 0.19.0

    Notice how each string is printed four times. This is because srun launched four processes (two nodes with two tasks each), every one of them executing your Python code.
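
    When the processes later do real distributed work, the usual way to tell them apart is the Horovod rank. As a minimal sketch using only standard Horovod calls (hvd.init(), hvd.rank(), hvd.size()), you could replace the prints in test_image.py with:

    import horovod.tensorflow as hvd

    hvd.init()  # set up Horovod's communication layer
    print("Hello from rank", hvd.rank(), "of", hvd.size())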

  7. As the output is going to be lengthy in the next tutorial, it is recommended to create a different output file for each task in the job. This is accomplished by passing -o slurm-%j_%t.out to srun in the distributed_tf.sh script (%j expands to the job ID and %t to the task ID).

    Listing 4. Edit distributed_tf.sh.
    srun -o slurm-%j_%t.out singularity run --nv -B $SINGULARITY_WORK_DIR:$HOME $TENSORFLOW_IMAGE python "$@"
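
    With this change, each of the four tasks writes to its own file; after a run you would see something like the following (the job ID shown is illustrative):

    $ ls slurm-*_*.out
    slurm-123456_0.out  slurm-123456_1.out  slurm-123456_2.out  slurm-123456_3.out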

    As a last remark, you can change the number of nodes by passing the --nodes option to sbatch on the command line (e.g. sbatch --nodes=1 distributed_tf.sh test_image.py).
