The Slurm scheduler has been updated to version 22.05.8 and a new CLI filter has been installed on the GPU nodes in order to provide optimal binding of GPUs.
Slurm use for CPU-only nodes
The request and use of resources for the CPU-only nodes have not changed, so users may keep using their already working Slurm batch scripts for submitting jobs.

The only recommendation we raise at this point is that Slurm has announced the "separation" of the request of resources and the srun launcher. This implies that, in future versions of Slurm, srun will not inherit the exact parameters requested for the allocation. In recent versions of Slurm, this has already happened to the --cpus-per-task (or -c) option, which needs to be explicitly set in each srun command, independently of its setting during the request for resources. Therefore, we recommend users to be aware of the upcoming changes in Slurm and to adopt, as a best practice, the explicit setting of all the srun parameters that indicate the resources to be used in the command, rather than assuming that these parameters are inherited correctly by default. (Indeed, this practice has now become a requirement for the use of the GPU nodes in Setonix, as you can read in the following section.)
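As an illustration of this best practice, the following is a minimal sketch of a CPU-only batch script; the project name, partition name, wall time, resource figures and executable are placeholders rather than values taken from this page:

#!/bin/bash --login
#SBATCH --account=project123          # placeholder project
#SBATCH --partition=work              # placeholder CPU partition
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

# Match the number of OpenMP threads to the cores given to each task.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Repeat the relevant resource parameters explicitly on the srun line;
# in particular, -c (--cpus-per-task) is not inherited from the allocation.
srun -N 1 -n 4 -c 8 ./my_cpu_executable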
Slurm use for GPU nodes
The request of resources for the GPU nodes has changed dramatically. The main reason for this change has to do with Pawsey's efforts to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used for the request of resources via salloc or (#SBATCH pragmas) and the options for the use of resources during execution of the code via srun.

New way of request (#SBATCH) and use (srun) of resources for GPU nodes

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources in GPU nodes should be thought of as requesting a number of "allocation packs". Each "allocation pack" consists of 1 GCD (Slurm GPU), 8 CPU cores (1 whole chiplet) and 29.44 GB of RAM.

For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of GPUs per node (--gpus-per-node). The total number of requested GCDs (equivalent to Slurm GPUs), resulting from the multiplication of these two parameters, will be interpreted as an indication of the total number of requested "allocation packs".

In the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores, so don't use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives) or in the request options of salloc. If, for some reason, the job requirements are dictated by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation packs" that meets their needs. The "allocation pack" is the minimal unit of resources that can be managed, so all allocation requests should indeed be multiples of this basic unit.

The use/management of resources with srun is another story. After the requested resources are allocated, the srun command should be explicitly provided with enough parameters indicating how resources are to be used by the srun step and the spawned tasks. So the real management of resources is performed by the command line options of srun. No default parameters should be assumed for srun.
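Putting the request and the use of resources together, a full batch script for the GPU nodes could look like the following sketch; the project and partition names, wall time and executable are placeholders, and the resource figures correspond to the third example in the table further below (3 MPI tasks, each with 1 GCD):

#!/bin/bash --login
#SBATCH --account=project123-gpu      # placeholder project
#SBATCH --partition=gpu               # placeholder GPU partition
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3             # 3 allocation packs
#SBATCH --time=01:00:00

# The use of the allocated resources is managed entirely by the explicit srun options.
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./my_gpu_executable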
--gpu-bind=closest may NOT work for all applications

There are now two methods to achieve optimal binding of GPUs:
1. The use of the srun parameters for optimal binding: --gpus-per-task=<number> together with --gpu-bind=closest
2. "Manual" optimal binding of GPUs and chiplets

The first method is simpler, but may not work for all codes. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
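For reference, the idea behind the "manual" binding can be sketched as follows; the wrapper name, the core list and the executable are illustrative placeholders only (the exact core ordering for Setonix is given in the page linked at the end of this section), and the sketch is simplified to one single-threaded task per GCD:

cat << 'EOF' > select_gpu.sh
#!/bin/bash
# Expose to each task only the GCD that matches its local rank on the node.
export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID}
exec "$@"
EOF
chmod +x select_gpu.sh

# Pin each task to one core of the chiplet directly connected to its GCD
# (placeholder core list; not the real Setonix ordering).
srun -N 1 -n 8 --gpus-per-node=8 --cpu-bind=map_cpu:0,8,16,24,32,40,48,56 ./select_gpu.sh ./my_gpu_executable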
The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes. For each case it lists: the required resources per job, the new "simplified" way of requesting resources, the total allocated resources, the charge per hour, and the srun options (the use of full explicit srun options is now required; only the 1st method for optimal binding is listed here).

Example 1: 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
Total allocated resources: 1 allocation pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM
Charge per hour: 64 SU
srun options (*1):
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1

Example 2: 14 CPU threads all controlling the same 1 GCD
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
Total allocated resources: 2 allocation packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM
Charge per hour: 128 SU
srun options (*2):
export OMP_NUM_THREADS=14
srun -N 1 -n 1 -c 16 --gpus-per-node=1 --gpus-per-task=1

Example 3: 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3
Total allocated resources: 3 allocation packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM
Charge per hour: 192 SU
srun options (*3):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest

Example 4: 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
Total allocated resources: 4 allocation packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM
Charge per hour: 256 SU
srun options (*4):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 2 -c 16 --gpus-per-node=4 --gpus-per-task=2 --gpu-bind=closest

Example 5: 8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --exclusive
Total allocated resources: 8 allocation packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM
Charge per hour: 512 SU
srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest
Notes for the request of resources:
- The request (#SBATCH directives or salloc options) only defines the number of "allocation packs" to be allocated; the real management of the resources within the allocation is performed later through the explicit srun options.
- The same rules apply to interactive allocations requested with salloc.

Notes for the use/management of resources with srun:
- --gpu-bind=closest may NOT work for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
- The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation packs", "reserving" whole chiplets per srun task, even if the real number is 1 thread per task. The real number of threads is controlled with the OMP_NUM_THREADS variable.
- *1: For this simple case, srun may work fine with default inherited option values. Nevertheless, it is good practice to use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS variable.
- *2: Two whole chiplets (-c 16) are "reserved" for the single srun task and the number of threads is controlled with the OMP_NUM_THREADS variable.
- *3: A whole chiplet (-c 8) is "reserved" for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
- *4: Note the use of -c 16 to "reserve" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task is in direct communication with the two logical GPUs in the MI250X card that has optimal connection to its chiplet.
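As a quick sanity check that the intended binding was obtained, something like the following can be run inside a GPU allocation; this is only an illustrative sketch, assuming the taskset utility is available on the compute nodes and that Slurm exports ROCR_VISIBLE_DEVICES for the AMD GPUs:

export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest \
     bash -c 'echo "rank=${SLURM_PROCID} cpus=$(taskset -cp $$ | cut -d: -f2) gcd=${ROCR_VISIBLE_DEVICES}"'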
General notes:
- An extensive explanation on the use of the GPU nodes (including these updates and the "manual" binding) is in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
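For interactive work, the same "allocation pack" logic applies. As a hedged illustration (partition and project names are placeholders), an interactive session with one allocation pack could be obtained and used as follows:

salloc --nodes=1 --gpus-per-node=1 --partition=gpu --account=project123-gpu --time=00:30:00
# once the allocation is granted, manage the resources explicitly with srun:
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1 ./my_gpu_executable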