Check this page regularly, as it will be updated frequently over the coming months as the deployment of the software progresses.
In particular, currently:
- GPU-supported software modules are still being deployed
This page summarises the information needed to start using the Setonix GPU partitions.
Overview
The GPU partition of Setonix is made up of 192 nodes, 38 of which are high-memory nodes (512 GB RAM instead of 256 GB). Each GPU node features 4 AMD MI250X GPUs, as depicted in Figure 1. Each MI250X comprises 2 Graphics Compute Dies (GCDs), each of which is effectively seen as a standalone GPU by the system. A 64-core AMD Trento CPU is connected to the four MI250X GPUs with the AMD Infinity Fabric interconnect, the same interconnect used between the GPU cards, with a peak bandwidth of 200 Gb/s. For more information refer to the Setonix General Information.
Figure 1. A GPU node of Setonix
Supported Applications
Several scientific applications are already able to offload computations to the MI250X, and many others are in the process of being ported to AMD GPUs. Here is a list of the main ones and their current status.
Name | AMD GPU Acceleration | Module on Setonix |
---|---|---|
Amber | Yes | |
Gromacs | Yes | |
LAMMPS | Yes | |
NAMD | Yes | |
NekRS | Yes | |
PyTorch | Yes | |
ROMS | No | |
Tensorflow | Yes | |
Table 1. List of popular applications
Module names of AMD GPU-enabled applications end with the suffix amd-gfx90a. The most up-to-date list is given by the module command:
$ module avail gfx90a
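As a sketch of how such a module would then be loaded (the application name and version below are placeholders, not real module names; use the names reported by module avail):
$ module avail gfx90a                               # list all GPU-enabled modules
$ module load <application>-amd-gfx90a/<version>    # load the one you need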
Tensorflow
Tensorflow is available as a container at the following location,
/software/setonix/2022.11/containers/sif/amdih/tensorflow/rocm5.0-tf2.7-dev/tensorflow-rocm5.0-tf2.7-dev.sif
but no module has been created for it yet.
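A minimal sketch of running the container with Singularity is shown below. The Singularity module version and the script name are assumptions, and --rocm is the standard Singularity option for exposing the AMD GPUs to the container:
# Sketch only: module version and script name are placeholders.
module load singularity/<version>
singularity exec --rocm \
    /software/setonix/2022.11/containers/sif/amdih/tensorflow/rocm5.0-tf2.7-dev/tensorflow-rocm5.0-tf2.7-dev.sif \
    python3 my_training_script.py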
Supported Numerical Libraries
Popular numerical routines and functions have been implemented by AMD to run on their GPU hardware. All of the following are available when loading a rocm module (e.g. the default rocm/5.2.3).
Name | Description |
---|---|
rocFFT | rocFFT is the AMD library for computing Fast Fourier Transforms (FFTs) on the ROCm platform. Documentation pages (external site). |
rocBLAS | rocBLAS is the AMD library for Basic Linear Algebra Subprograms (BLAS) on the ROCm platform. Documentation pages (external site). |
rocSOLVER | rocSOLVER is a work-in-progress implementation of a subset of LAPACK functionality on the ROCm platform. Documentation pages (external site). |
Table 2. Popular GPU numerical libraries.
Each of the above libraries has an equivalent HIP wrapper that enables compilation on both ROCm and NVIDIA platforms.
A complete list of available libraries can be found on this page (external site).
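As an illustrative sketch of linking against one of these libraries (the source and binary names are placeholders, and it assumes the Cray CC wrapper is available and that the rocm module defines ROCM_PATH, as Cray environments typically do):
# Load the ROCm stack, then compile and link against rocBLAS.
module load rocm/5.2.3
CC -O2 my_solver.cpp -o my_solver -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lrocblas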
AMD ROCm installations
The default ROCm installation is rocm/5.2.3, provided by HPE Cray. In addition, Pawsey staff have installed the more recent rocm/5.4.3 from source (see ROCm-from-source). This is an experimental installation, and users might encounter compilation or linking errors; you are encouraged to explore it during development and to report any issues. For production jobs, however, we currently recommend using the default rocm/5.2.3.
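A quick sketch of choosing between the two installations (module names taken from the text above; load only one of the two):
module load rocm/5.2.3     # default, HPE Cray-provided ROCm (recommended for production)
module load rocm/5.4.3     # experimental, source-built ROCm (development and testing only)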
Submitting Jobs
You can submit GPU jobs to the gpu, gpu-dev and gpu-highmem Slurm partitions using your GPU allocation.
Note that you will need to use a different project code for the --account / -A option. More specifically, it is your project code followed by the -gpu suffix. For instance, if your project code is project1234, then you will have to use project1234-gpu.
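For example, a short interactive session on the development partition could be requested as sketched below (project1234 is the placeholder project code from above; adjust the resources and time to your needs):
# Interactive session on gpu-dev with a single GPU (GCD) on one node.
salloc --account=project1234-gpu --partition=gpu-dev --nodes=1 --gpus-per-node=1 --time=00:10:00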
New way of request (#SBATCH) and use (srun) of resources for GPU nodes
The request of resources for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used for the request of resources, via salloc or #SBATCH directives, from the options used for the use of resources during execution of the code, via srun.
With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on GPU nodes should be thought of as requesting a number of "allocation packs". Each "allocation pack" consists of:
- 1 GCD (Slurm GPU)
- 8 CPU cores (1 whole chiplet) in direct physical connection to that GCD
- 29.44 GB of RAM
Accordingly, the request of resources only needs the number of nodes (--nodes, -N) and the number of GPUs per node (--gpus-per-node). The total number of requested GCDs (equivalent to Slurm GPUs), resulting from the multiplication of these two parameters, will be interpreted as the total number of requested "allocation packs".
In the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores, so do not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives) or in the request options of salloc. If, for some reason, the job requirements are dictated by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation packs" that meets their needs. The "allocation pack" is the minimal unit of resources that can be managed, so all allocation requests should indeed be multiples of this basic unit.
The use/management of resources with srun is another story. After the requested resources are allocated, the srun command should be explicitly provided with enough parameters indicating how resources are to be used by the srun step and the spawned tasks. The real management of resources is therefore performed by the command line options of srun; no default parameters should be assumed for srun.
There are now two methods to achieve optimal binding of the GPUs:
- Method 1: the use of the srun parameters --gpus-per-task=<number> together with --gpu-bind=closest
- Method 2: "manual" binding of the GPUs
The first method is simpler, but the use of --gpu-bind=closest may NOT work for all applications. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
The following table provides some examples that will serve as a guide for requesting resources on the GPU nodes:
Required Resources per Job | New "simplified" way of requesting resources | Total Allocated resources | Charge per hour | The use of full explicit srun options is now required (only the 1st method for optimal binding is listed here) |
---|---|---|---|---|
1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU) | #SBATCH --nodes=1<br>#SBATCH --gpus-per-node=1 | 1 allocation pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM | 64 SU | *1 export OMP_NUM_THREADS=1<br>srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1 |
14 CPU threads all controlling the same 1 GCD | #SBATCH --nodes=1<br>#SBATCH --gpus-per-node=2 | 2 allocation packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM | 128 SU | *2 export OMP_NUM_THREADS=14<br>srun -N 1 -n 1 -c 16 --gpus-per-node=1 --gpus-per-task=1 |
3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | #SBATCH --nodes=1<br>#SBATCH --gpus-per-node=3 | 3 allocation packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM | 192 SU | *3 export MPICH_GPU_SUPPORT_ENABLED=1<br>export OMP_NUM_THREADS=1<br>srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest |
2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication | #SBATCH --nodes=1<br>#SBATCH --gpus-per-node=4 | 4 allocation packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM | 256 SU | *4 export MPICH_GPU_SUPPORT_ENABLED=1<br>export OMP_NUM_THREADS=1<br>srun -N 1 -n 2 -c 16 --gpus-per-node=4 --gpus-per-task=2 --gpu-bind=closest |
8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | #SBATCH --nodes=1<br>#SBATCH --exclusive | 8 allocation packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM | 512 SU | export MPICH_GPU_SUPPORT_ENABLED=1<br>export OMP_NUM_THREADS=1<br>srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest |
Notes for the request of resources:
- The request is expressed only in terms of the number of "allocation packs", set via the number of nodes (--nodes, -N) and the number of GPUs per node (--gpus-per-node).
- No other Slurm allocation option (such as --ntasks, --cpus-per-task or --mem) should be given in the #SBATCH directives or in the options of salloc.
Notes for the use/management of resources with srun:
- The real management of resources at execution time is performed via the srun options; no default values should be assumed.
- The use of --gpu-bind=closest may NOT work for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
- The value of the -c option should be set to multiples of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation packs", "reserving" whole chiplets per srun task even if the real number of threads per task is 1. The real number of threads is controlled with the OMP_NUM_THREADS variable.
General notes:
- *1: srun may work fine with default inherited option values. Nevertheless, it is good practice to use full explicit srun options to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task, and the real number of threads is controlled with the OMP_NUM_THREADS variable.
- *2: The settings explicitly "reserve" a "two-chiplets-long" space (-c 16) for the srun task, and the number of threads is controlled with the OMP_NUM_THREADS variable.
- *3: The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
- *4: Note the use of -c 16 to "reserve" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task is in direct communication with the two logical GPUs (GCDs) in the MI250X card that has the optimal connection to each of these chiplets.
An extensive explanation on the use of the GPU nodes (including request by "allocation packs" and the "manual" binding) is in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
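As a quick illustration, a complete batch script for the third row of the table above (three MPI tasks, each driving its own GCD) is sketched below; the account name and ./program executable are placeholders, as in the other examples on this page:
#!/bin/bash --login
#SBATCH --account=project-gpu       #your project code with the -gpu suffix
#SBATCH --partition=gpu
#SBATCH --nodes=1                   #1 node in this example
#SBATCH --gpus-per-node=3           #3 GPUs per node (3 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
#----
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1  #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1            #This controls the real number of threads per task
#----
#Execution
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./program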
Compiling software
If you are using ROCm libraries, such as rocFFT, to offload computations to GPUs, you should be able to use any compiler to link those to your code.
For HIP code, use hipcc. For code making use of OpenMP offloading, you must use:
- hipcc for C/C++
- ftn (the wrapper for the Cray Fortran compiler from PrgEnv-cray) for Fortran. This compiler also allows GPU offloading with OpenACC.
When using hipcc, note that the locations of the MPI headers and libraries are not automatically included (contrary to the automatic inclusion when using the Cray wrapper scripts). Therefore, if your code also requires MPI, the location of the MPI headers and libraries must be provided to hipcc, as well as the GPU Transport Layer libraries:
-I${MPICH_DIR}/include -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa
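Putting this together, a compilation could look like the following sketch (the source and binary names are placeholders; the --offload-arch=gfx90a flag is the usual way to target the MI250X with hipcc, though your build may set the architecture differently):
# Sketch: compiling a HIP + MPI source with hipcc.
module load rocm craype-accel-amd-gfx90a
hipcc --offload-arch=gfx90a my_hip_mpi_code.cpp -o my_hip_mpi_code \
      -I${MPICH_DIR}/include \
      -L${MPICH_DIR}/lib -lmpi \
      -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa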
Also, to ensure proper use of GPU-to-GPU MPI communication, codes must be compiled and run with the following environment variable set:
export MPICH_GPU_SUPPORT_ENABLED=1
Accounting
Each MI250X GCD, which corresponds to a Slurm GPU, is charged 64 SU per hour. This means the use of an entire GPU node is charged 512 SU per hour. In general, a job is charged the largest proportion of core, memory, or GPU usage, rounded up to the nearest 1/8th of a node (corresponding to an individual MI250X GCD). Note that GPU node usage is accounted against GPU allocations with the -gpu suffix, which are separate from CPU allocations.
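For example, a job using 3 GCDs (3 allocation packs) for two hours is charged 3 × 64 × 2 = 384 SU.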
Programming AMD GPUs
You can program AMD MI250X GPUs using HIP, which is the programming framework equivalent to NVIDIA's CUDA. The HIP platform is available after loading the rocm module.
The complete AMD documentation on how to program with HIP can be found here (external site).
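As a minimal sketch of getting started (assuming the standard hipcc and hipconfig tools shipped with ROCm):
# Make the HIP toolchain available and confirm it targets the AMD platform.
module load rocm
hipcc --version       # prints the HIP/Clang compiler version
hipconfig --platform  # should report "amd" on Setonix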
Example Jobscripts
The following are some brief examples of requesting GPUs via Slurm batch scripts on Setonix. For more detail, particularly regarding using shared nodes and the CPU binding for optimal placement relative to GPUs, refer to Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
#!/bin/bash --login
#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation pack" in total for the job)
#SBATCH --time=00:05:00
#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list
#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1           #This controls the real number of threads per task
#----
#Execution
srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./program
#!/bin/bash --login
#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list
#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable
#----
#Execution
srun -N 1 -n 1 -c 64 --gpus-per-node=8 --gpus-per-task=8 ./program
#!/bin/bash --login
#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list
#----
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real number of threads per task
#----
#Execution
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./program
Method 1 may fail for some applications.
The use of --gpu-bind=closest may not work for all codes. For those codes, "manual" binding may be the only reliable method if they rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempt to use GPU-to-GPU enabled MPI communication.