In this tutorial, you will set up the environment on Topaz for developing machine learning systems using the NVIDIA TensorFlow Docker image, Singularity and GPU resources.
Prerequisites/Before you begin
To execute the examples in this tutorial and develop your own distributed machine learning systems, you need to download the NVIDIA TensorFlow image. Pawsey advocates the use of Singularity as the container system, which can automatically convert Docker images to Singularity images. In the following commands, the MY_IMAGES environment variable contains the path where the TensorFlow image will be saved (ideally a directory where you keep all of your container images).
Connect to Topaz and allocate an interactive session with salloc (one node, one task). Then download the NVIDIA TensorFlow container image using Singularity.
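A minimal sketch of these two steps follows; the salloc options, module name and NGC image tag are illustrative, and MY_IMAGES is assumed to already point at your image directory.

```bash
# Interactive allocation: one node, one task (time limit is illustrative).
salloc --nodes=1 --ntasks=1 --time=01:00:00

# Pull the NVIDIA TensorFlow image from NGC and convert it to a Singularity
# image file under $MY_IMAGES (module name and image tag are illustrative).
module load singularity
singularity pull "$MY_IMAGES/tensorflow_22.03-tf2-py3.sif" \
    docker://nvcr.io/nvidia/tensorflow:22.03-tf2-py3
```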
Change to the directory you would like to use as your work directory. You are going to create a couple of files to get started. To test that the image does in fact contain TensorFlow and Horovod, you can use a simple Python script, which we will call test_image.py.
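A minimal version of such a script might look like this (the exact messages printed are just illustrative):

```python
# test_image.py - verify that TensorFlow and Horovod are usable inside the container.
import tensorflow as tf
import horovod.tensorflow as hvd  # the import alone confirms Horovod's TensorFlow bindings are present

print("TensorFlow version:", tf.__version__)
print("GPUs visible to this process:", len(tf.config.list_physical_devices("GPU")))
```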
Create a Slurm job script to execute the Python code. Its main purpose is to define the resources needed, configure the NVIDIA container and finally execute the Python script with Singularity. This is also a good opportunity to write the script so that it can launch different Python programs with little effort, as shown below.
The script assumes that you have defined and exported the following environment variables:
- SINGULARITY_WORK_DIR - path to a folder that Singularity will map to your home folder inside the container. Usually this is the work directory where you keep your project's files.
- MY_IMAGES - path to the directory containing the NVIDIA TensorFlow image.
Because sbatch starts a process with a clean environment by default (for the sake of reproducibility), exporting the aforementioned variables is not enough. You also need to set and export the SBATCH_EXPORT variable with the list of environment variables that you wish sbatch to pass on to the child process. So, with the terminal in the work directory, you could do the following:
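For example (the image directory below is hypothetical; point it at wherever you pulled the image):

```bash
# Values are illustrative; adjust the paths to your own setup.
export SINGULARITY_WORK_DIR="$(pwd)"              # current directory as the work directory
export MY_IMAGES="$MYGROUP/singularity/images"    # hypothetical image directory; use your own
export SBATCH_EXPORT="SINGULARITY_WORK_DIR,MY_IMAGES"
```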
Now it is time to take a look at the job script.
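Below is a sketch of what this job script, called distributed_tf.sh later in this tutorial, might look like. The partition, time limit, module names and image filename are assumptions to adapt to your own setup; the resource requests and the final srun line follow the description in the next paragraph.

```bash
#!/bin/bash --login
# distributed_tf.sh - sketch of a reusable job script; partition, time limit and
# module names are illustrative and should be adapted to your account and setup.
#SBATCH --job-name=distributed_tf
#SBATCH --nodes=2                 # default; override with sbatch --nodes=N
#SBATCH --ntasks-per-node=2       # one task per GPU
#SBATCH --gres=gpu:2              # Topaz nodes have two GPUs each
#SBATCH --partition=gpuq
#SBATCH --time=00:10:00

# Load the container runtime and the MPI stack used by Horovod for communication.
module load singularity
module load openmpi-ucx-gpu

# Image filename is illustrative; it must match the image pulled into $MY_IMAGES.
IMAGE="$MY_IMAGES/tensorflow_22.03-tf2-py3.sif"

# Start one process per task. Inside the container the work directory is bound
# over the home directory; the Python script (and its arguments) is whatever
# was passed to this job script on the command line.
srun singularity exec --nv \
    -B "$SINGULARITY_WORK_DIR:$HOME" --pwd "$HOME" \
    "$IMAGE" python "$@"
```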
The script just outlined will help us launch the TensorFlow distributed computation examples that will follow. Both GPUs are used on every allocated node (Topaz has two GPUs per node) and two serial tasks (processes) per node are launched, so that each of them uses one GPU. Therefore, the ntasks-per-node and gres parameters are constants. However, you can override the nodes parameter to add more resources to the computation. The script then proceeds to load the singularity and MPI modules; the latter is used by Horovod internally to manage the communication among processes. Finally, the last line starts the parallel job, invoking the Python interpreter within the TensorFlow container so that it executes whatever Python script (followed by its arguments, if any) is passed as the first argument to the job script (distributed_tf.sh).
Let's now see how to run the test Python script.
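With the variables exported as above, submitting the test is a single command; the file names match the ones used in this tutorial:

```bash
# Submit the job: the job script receives the Python script to run as its first argument.
sbatch distributed_tf.sh test_image.py
```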
As you can see, once the setup is in place, launching a job is very easy. Let's take a look at the job output. You should see a lot of logging information from TensorFlow, but at the very end there will be the output of our test_image.py script.
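By default, sbatch writes everything to a slurm-&lt;jobid&gt;.out file in the directory you submitted from; for example, to inspect the most recent one:

```bash
# Print the most recently modified Slurm output file in the current directory.
cat "$(ls -t slurm-*.out | head -n 1)"
```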
Notice how each string is printed four times. This is because srun launched four processes (two tasks on each of the two allocated nodes), each executing your Python code.
As the output is going to be lengthy in the next tutorial, it is recommended to create a different output file for each task in the job. This can be accomplished by passing an output file pattern with a per-task placeholder to srun, as in the sketch below.
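One way to do this, assuming the srun line from the job script sketch above, is to add an --output option with the %t task placeholder, so that each task writes its own file:

```bash
# %t expands to the task rank, so with four tasks you get results_0.log ... results_3.log.
srun --output="results_%t.log" singularity exec --nv \
    -B "$SINGULARITY_WORK_DIR:$HOME" --pwd "$HOME" \
    "$IMAGE" python "$@"
```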
As a last remark, you can change the number of nodes by passing the --nodes option to sbatch on the command line (e.g. sbatch --nodes=1 distributed_tf.sh test_image.py).
- How to Interact with Multiple GPUs using TensorFlow
- How to Use Horovod for Distributed Training in Parallel using TensorFlow