The Pawsey Supercomputing Centre has installed a new GPU-enabled system called Garrawarla, a Wajarri word meaning "Spider", to enable our Murchison Widefield Array (MWA) researchers to produce scientific outcomes while the Pawsey Supercomputing System is being procured. This MWA compute cluster provides the latest generation of CPUs and GPUs, high memory bandwidth, and increased memory per node to allow MWA researchers to effectively process large datasets.
Garrawarla has the following hardware characteristics:
78 "HPE XL190 Gen10" compute nodes, each with:
- 2x Intel Xeon Gold 6230 20 core 2.1GHz CPU, code name "Cascade Lake".
- 384GB RAM (only 370GB is available through Slurm; the remainder is reserved for the OS)
- 1x 240GB SSD boot drive
- 1x 960GB NVMe drive
- 1x HDR100/Ethernet 100Gb ConnectX-6 with single QSFP56 port
- 1x V100 32GB NVIDIA GPU
Mounted Global File Systems
The Astronomy Filesystem (/astro), which is dedicated to MWA, is the only Lustre file system mounted on this cluster. It is provided by HPE, with 3 PB of usable space and capable of reading/writing at 30 GB/s.
Contact the Pawsey Service Desk Portal if you need assistance in migrating your workflows to Garrawarla.
Currently, the /group file system has many clients on it, and it is mounted on several locations. As a result it is the most impacted resource at Pawsey due to high metadata load. By not mounting /group on Garrawarla, any performance degradation issues with /group will not have an impact on Garrawarla because there will be no Lustre client connections from Garrawarla to the /group Lustre servers.
This approach will cause some disruption to workflows, as some steps of a job may need to stage data in from or out to /group via the copyq on Zeus. The Data Workflows page presents an example that uses the copy queue on Zeus to copy results to local storage at the end of a simulation on Magnus. You can modify that script to copy data from /group before a simulation on Garrawarla, or to /group after it.
Garrawarla includes the following software characteristics:
- SLES12 with SP5 Operating System
- Slurm 20.02.4 queueing system
- Compilers: gcc/8.3.0, gcc/5.5.0, gcc/10.1.0, intel/19.0.5, pgi/20.1, clang/10.0.0
- MPI: OpenMPI/4.0.3, OpenMPI/4.0.2, IntelMPI/19.0.5
- Profilers: ARM Forge/19.0.3, ARM Forge/20.0.3, Intel VTune/19.0.5, NVIDIA Nsight/2019.5.2
Python/2.7.17 and its associated packages are installed in the /pawsey/mwa/software/mwa_sles12sp4 directory.
IMPORTANT: Run "module use /pawsey/mwa/software/mwa_sles12sp4/modulefiles" before loading these modules.
Interaction with Garrawarla is done remotely using SSH (Secure Shell version 2, SSH-2):
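A login command might look like the sketch below. The hostname garrawarla.pawsey.org.au and the username are assumptions for illustration; use the address and credentials given in your Pawsey account details.

```shell
# Assumed hostname and placeholder username; replace both with your own details.
ssh your_username@garrawarla.pawsey.org.au
```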
More information about SSH-based access: https://support.pawsey.org.au/documentation/display/US/Logging+in
Select and use a compiler
There are two families of supported software compilers on Garrawarla:
- GNU: GNU Compiler Collection 8.3.0 is loaded by default.
- Intel: Intel compilers 19.0.5 are available through the intel/19.0.5 module.
It is up to you, as the user, to decide which programming environment is most suitable for the task at hand. To know the available gcc versions:
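The available versions can be listed with the standard module commands, for example:

```shell
# List every gcc module that can be loaded.
module avail gcc

# Show the modules that are currently loaded in this session.
module list
```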
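In the module listing, a bracketed L marks a module that is currently loaded, and a bracketed D marks the default version, which is the one loaded when no version is given to module load.
Compiler executables are named as follows: gcc, g++ and gfortran for the GNU C, C++ and Fortran compilers, and icc, icpc and ifort for their Intel counterparts.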
Type the "man" command followed by the compiler name to load the corresponding manual page.
Compiling MPI code
MPI libraries can be loaded using the corresponding modules. Use of OpenMPI with Unified Communication X is recommended for normal use cases, and can be achieved by loading the appropriate module:
Once the MPI library is loaded, MPI wrappers are available for the currently selected compiler. Example commands follow:
|C||mpicc hello_mpi.c||C||mpicc hello_mpi.c||C||mpiicc hello_mpi.c||C||mpicc hello_mpi.c|
|C++||mpicxx hello_mpi.cpp||C++||mpicxx hello_mpi.cpp||C++||mpiicpc hello_mpi.cpp||C++||mpicxx hello_mpi.cpp|
|Fortran||mpif90 hello_mpi.f90||Fortran||mpif90 hello_mpi.f90||Fortran||mpiifort hello_mpi.f90||Fortran||mpif90 hello_mpi.f90|
IMPORTANT: Always use srun to launch an MPI executable, regardless of whether it is OpenMPI or Intel-MPI.
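As a minimal sketch of the sequence described above, assuming the OpenMPI-with-UCX module is named openmpi-ucx (check module avail for the exact name) and that the commands are run inside a Slurm allocation:

```shell
# Assumed module name; confirm it with "module avail openmpi".
module load openmpi-ucx

# Build with the MPI wrapper for the currently loaded compiler.
mpicc -o hello_mpi hello_mpi.c

# Launch with srun rather than mpirun; 4 tasks is an arbitrary example.
srun -n 4 ./hello_mpi
```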
Compiling OpenMP code
To compile code for OpenMP multi-threading, add specific flags at compile time. The syntax differs depending on the selected compiler:
|Language||Intel||GNU|
|C||icc -qopenmp hello_omp.c||gcc -fopenmp hello_omp.c|
|C++||icpc -qopenmp hello_omp.cpp||g++ -fopenmp hello_omp.cpp|
|Fortran||ifort -qopenmp hello_omp.f90||gfortran -fopenmp hello_omp.f90|
Refer to the "Useful compiler options" section of the Compiling on Zeus page for compiler options that are also useful when compiling code on Garrawarla.
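For instance, a GNU build and a short test run could look like the following sketch. The thread count of 4 is an arbitrary choice, and the run is assumed to take place inside a Slurm allocation.

```shell
# Compile with OpenMP enabled (GNU compiler).
gcc -fopenmp -o hello_omp hello_omp.c

# Tell the OpenMP runtime how many threads to start.
export OMP_NUM_THREADS=4

# One task with 4 cores, matching the thread count.
srun -n 1 -c 4 ./hello_omp
```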
Compiling GPU code
GPU code compilation should occur on the compute nodes in the gpuq partition, either interactively for simple programs or via a job script for larger software suites.
Compiler compatibility notice: CUDA versions up to 10.2 are compatible with Intel compilers and GCC compilers.
All the nodes in Garrawarla are equipped with NVIDIA V100 GPUs, which are based on the NVIDIA Volta architecture and are accessible from the gpuq partition.
Compiling a CUDA application
1. To compile interactively, submit a job request with salloc:
The terminal appears to hang until the job starts.
Once the job has started, your login prompt displays the compute node in the gpuq you are now on.
2. Ensure that the module for the desired compiler is loaded. The current default on Garrawarla is gcc/8.3.0. A different GNU version or the Intel compiler can be loaded with module swap, e.g.:
3. Load the CUDA module:
4. To compile MPI-enabled CUDA code, load the OPENMPI-UCX-GPU module as well:
5. Execute compile and link stages jointly or separately:
5a. Execute compilation commands jointly, e.g.:
5b. Alternatively, if you require separate compilation and link stages, compile with the "-c" option first, e.g.:
5c. Continue with the link stage. Make sure the link stage takes place using the host compiler, and includes the CUDA run time library via "-lcudart":
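The steps above can be sketched as the following command sequence. The module names cuda and openmpi-ucx-gpu, the CUDA_HOME variable, the source file hello_cuda.cu and the 30-minute time limit are assumptions for illustration; -arch=sm_70 targets the Volta generation of the V100.

```shell
# 1. Request an interactive session on a GPU node (assumed limits).
salloc --partition=gpuq --nodes=1 --gres=gpu:1 --time=00:30:00

# The remaining commands are typed on the compute node once the job starts.

# 2. Optionally swap the default gcc/8.3.0 for another version.
module swap gcc gcc/5.5.0

# 3. Load CUDA (assumed module name).
module load cuda

# 4. Only needed for MPI-enabled CUDA code (assumed module name).
module load openmpi-ucx-gpu

# 5a. Compile and link in one step.
nvcc -arch=sm_70 -o hello_cuda hello_cuda.cu

# 5b. Or compile only, producing an object file.
nvcc -arch=sm_70 -c hello_cuda.cu -o hello_cuda.o

# 5c. Link with the host compiler and the CUDA runtime library.
#     CUDA_HOME is assumed to be set by the cuda module.
g++ hello_cuda.o -o hello_cuda -L"$CUDA_HOME/lib64" -lcudart
```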
Compiling using a job script
To compile via a job script:
1. Prepare the script to request a node, load the relevant environment, and execute the compilation commands. For example, create a script file named compile.slurm which contains:
2. Submit the script called compile.slurm to the queue:
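A minimal sketch of both steps follows. The account name your_project, the cuda module name, the source file hello_cuda.cu and the 10-minute time limit are hypothetical placeholders, not values taken from the Garrawarla configuration.

```shell
# Step 1: write the job script to compile.slurm.
cat > compile.slurm <<'EOF'
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=gpuq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00

# Assumed module name for the CUDA toolkit.
module load cuda

# Compile and link for the V100 (Volta, sm_70).
nvcc -arch=sm_70 -o hello_cuda hello_cuda.cu
EOF

# Step 2: submit the script where Slurm is available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch compile.slurm
fi
```
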
Compiling an OpenACC application
The PGI compiler (v20.1) is available for compiling code that contains OpenACC directives.
C, C++ and Fortran compilers are invoked using pgcc, pgc++ and pgfortran, respectively.
You can compile either interactively via salloc or in a batch job.
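An interactive OpenACC build might look like this sketch, run on a gpuq compute node. The source file name hello_acc.c is hypothetical; -ta=tesla:cc70 selects the V100's compute capability and -Minfo=accel prints what the compiler offloaded.

```shell
# PGI version as listed in the software summary.
module load pgi/20.1

# Enable OpenACC, target the V100, and report accelerator decisions.
pgcc -acc -ta=tesla:cc70 -Minfo=accel -o hello_acc hello_acc.c
```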
Compiling using a batch job
Review the queuing system configuration
Garrawarla resources are managed by the Slurm queueing system. For detailed information about Slurm, refer to: https://support.pawsey.org.au/documentation/display/US/Job+Scheduling
Garrawarla has two overlapping partitions with all 78 nodes available in both partitions (queues):
- workq - for CPU-only jobs; only 38 cores are available on each node (mwa001-mwa078), max. 24h walltime. The remaining 2 cores on each node are reserved to support GPU jobs.
- gpuq - for GPU-only jobs; with all 40 cores and single GPU available in each node (mwa001-mwa078), max. 24h walltime. You can either request the entire node (with 40 cores and single GPU) or only 2 cores + single GPU for your GPU jobs. Any non-GPU job requests are automatically rejected from this partition by Slurm.
Submit your job
Garrawarla compute nodes belong to both partitions and are configured as a shared resource. It is therefore especially important that every request, including GPU requests, specifies the number of tasks and the amount of main memory the job requires. If these are not specified, a job is allocated a single CPU core, no GPU and around 9GB of RAM by default.
It is recommended that all jobs request the following:
|Set the account to which the job is to be charged. A default account is configured for each user.|
|Specify the total number of nodes.|
|Specify the total number of tasks (processes).|
|Specify GPUs per node|
|Specify the number of tasks per node.|
|Specify the number of tasks per socket.|
|Specify the number of cores per socket. Note: Each node has two CPU sockets with 18 cores and 20 cores, respectively, to support GPU workflows.|
|Specify the number of threads per process for multi-threaded jobs.|
|Specify the memory required per node. Note: If this option is not used, the scheduler allocates approximately 9GB of memory per process.|
|Request an allocation on the specified partition. If not specified, jobs are submitted to the default partition.|
|Set the wall-clock time limit for the job.|
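The sketch below pairs each description above, in the same order, with the standard Slurm option that matches it. The pairing and all values are assumptions for illustration; your_project is a hypothetical account name.

```shell
#!/bin/bash -l
#SBATCH --account=your_project   # account to charge
#SBATCH --nodes=1                # total number of nodes
#SBATCH --ntasks=2               # total number of tasks (processes)
#SBATCH --gres=gpu:1             # GPUs per node
#SBATCH --ntasks-per-node=2      # tasks per node
#SBATCH --ntasks-per-socket=1    # tasks per socket
#SBATCH --cores-per-socket=18    # minimum cores per socket on the node
#SBATCH --cpus-per-task=2        # threads per process
#SBATCH --mem=20G                # memory per node
#SBATCH --partition=gpuq         # partition to run in
#SBATCH --time=01:00:00          # wall-clock limit
```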
Refer to the following examples, which demonstrate different job allocation modes, including how to access the local NVMe storage.
Batch Job Examples
Requesting NVMe resources in SLURM
Each node in Garrawarla has an attached NVMe device with 890GB usable space mounted as /nvmetmp.
Request a specific amount of NVMe storage in your job script using either a --gres request or the --tmp=<some-value>g directive; up to 890GB can be requested. If both are used, only --gres is applied. You should not be able to use more NVMe space than what has been allocated to you. By default, without any explicit NVMe request, a job should be allocated 1GB of /nvmetmp on the NVMe device.
The NVMe device (or the portion used by a job) is cleaned up after the job completes. IMPORTANT: Migrate any valuable results from the NVMe device before the job completes.
|To request 200GB NVMe space||salloc -N 1 --tmp=200g|
salloc: Nodes mwa001 are ready for job
wdavey@mwa001:~> df -h | grep /nvmetmp
/dev/nvme0n1p1 200G 0 200G 0% /nvmetmp
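In a batch job the same request, together with the copy-out step that the cleanup policy makes necessary, could be sketched as follows. The account name, the serial_code executable, the output directory name and the /astro destination path are hypothetical.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=workq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --tmp=200g               # 200GB of /nvmetmp for this job
#SBATCH --time=01:00:00

# Hypothetical destination on the Lustre file system.
outdir=/astro/your_project/$USER/results
mkdir -p "$outdir"

# Work inside the node-local NVMe space.
cd /nvmetmp
srun --export=all -n 1 "$HOME/serial_code"

# Copy results off the NVMe device before the job ends and it is wiped.
cp -r /nvmetmp/output "$outdir"/
```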
Serial job using a single CPU core
In the following example, we assume that serial_code is a serial application that is to be run on a single core in the workq partition. The amount of memory available for the job is adjusted, since by default it would be given only about 9GB per process.
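A job script along these lines might look like the sketch below. The account name, the 20GB memory request and the one-hour time limit are assumed values.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=workq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=20G                # assumed requirement; the default is about 9GB
#SBATCH --time=01:00:00

# Run one copy of the serial program on the allocated core.
srun --export=all -n 1 ./serial_code
```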
OpenMP code using all available CPU cores per node
In the following example, we assume that cpu_code is an OpenMP code using all 20 CPU cores of a socket. Note: Each node has 2 CPUs, with 20 cores each. A single process with 20 OpenMP threads is run on one CPU, leaving the other CPU available for other jobs. The amount of memory is adjusted to 180GB, since by default the job would be given only 9GB per process. In this example, you are allocated the CPU cores and memory of a single CPU (single NUMA node).
Note: To facilitate GPU workflows, only 38 cores are available on a node in the workq partition, with 18 cores on CPU-1 and 20 on CPU-2. For best performance with OpenMP applications, it is recommended to launch threads within a single CPU/NUMA node.
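One possible job script is sketched below. The account name and time limit are assumptions, and which socket the 20 cores land on is left to the scheduler.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=workq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=180G
#SBATCH --time=01:00:00

# One OpenMP thread per allocated core, kept close together.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close

srun --export=all -n 1 -c "$OMP_NUM_THREADS" ./cpu_code
```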
Non-MPI code using a single GPU
In the following example, we assume that gpu_code is a non-MPI application that can use a single GPU. The amount of memory available for the job is adjusted, since by default the job is given about 9GB per process.
Note: For best performance with OpenMP applications, it is recommended to launch threads within a single CPU (or NUMA node or socket).
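A single-GPU job script might be sketched as follows. The account name, the cuda module name, the 20GB memory request and the time limit are assumptions.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=gpuq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1             # the node's single V100
#SBATCH --mem=20G                # assumed requirement; the default is about 9GB
#SBATCH --time=01:00:00

# Assumed module name for the CUDA runtime.
module load cuda

srun --export=all -n 1 ./gpu_code
```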
MPI code using more than one GPU
In the following example, we assume that gpu_code is an MPI application that can use a single GPU per process. Two processes are run, one per node, and the amount of memory per node is adjusted, since by default the job is given 9GB per process.
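A two-node sketch is shown below. The account name, both module names, the 20GB per-node memory request and the time limit are assumptions.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=gpuq
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1      # one MPI rank per node
#SBATCH --gres=gpu:1             # one GPU on each node
#SBATCH --mem=20G                # per node; assumed requirement
#SBATCH --time=01:00:00

# Assumed module names.
module load cuda
module load openmpi-ucx-gpu

srun --export=all -N 2 -n 2 ./gpu_code
```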
OpenMP code using a single GPU and all available CPU cores
In the following example, we assume that gpu_code is an OpenMP code using a single GPU and all 20 CPU cores of a socket. Note: Each node has 2 CPUs, with 20 cores each. A single process with 20 OpenMP threads is run on one CPU, leaving the other CPU available for other jobs. The amount of memory is adjusted to 180GB, since by default the job is given 9GB per process. In this example, the job is allocated the CPU cores and memory of a single CPU (single NUMA node).
Note: For best performance with OpenMP applications, it is recommended to launch threads within a single CPU (or NUMA node or socket).
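A corresponding job script could be sketched like this. The account name, the cuda module name and the time limit are assumptions.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=gpuq
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:1
#SBATCH --mem=180G
#SBATCH --time=01:00:00

# Assumed module name for the CUDA runtime.
module load cuda

# One OpenMP thread per allocated core.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --export=all -n 1 -c "$OMP_NUM_THREADS" ./gpu_code
```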
MPI + OpenMP code using the GPU and all available CPU cores per node
In the following example, we assume that gpu_code is an MPI + OpenMP code using a single GPU per process and capable of using OpenMP multi-threading to additionally use all CPU cores in a node. Note: There are 2 CPUs per node, with 20 cores each. The code is run on two nodes with one process per node, each using a single GPU. The amount of memory is adjusted to 180GB, since by default the job would be given 9GB per process.
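A hybrid job script might be sketched as follows. The account name, module names and time limit are assumptions, and the choice of 40 threads per process reflects "all CPU cores in a node"; reducing --cpus-per-task to 20 would keep each process within one socket.

```shell
#!/bin/bash -l
#SBATCH --account=your_project
#SBATCH --partition=gpuq
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1      # one MPI rank per node
#SBATCH --cpus-per-task=40       # all cores of a node for each rank
#SBATCH --gres=gpu:1
#SBATCH --mem=180G               # per node
#SBATCH --time=01:00:00

# Assumed module names.
module load cuda
module load openmpi-ucx-gpu

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --export=all -N 2 -n 2 -c "$OMP_NUM_THREADS" ./gpu_code
```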
Run a job using interactive mode
As on other Pawsey systems, you can use the salloc command to run interactive sessions. You can pass salloc the same options as the #SBATCH directives mentioned above to specify the interactive job parameters. For example, to run an OpenMP code using 1 GPU, you can open an interactive session with the following command:
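Such a request might look like the sketch below, where the core count, memory and time limit are assumed values chosen to match the OpenMP examples above.

```shell
# One task with 20 cores, one GPU and 180GB on a gpuq node for one hour.
salloc --partition=gpuq --nodes=1 --ntasks=1 --cpus-per-task=20 \
       --gres=gpu:1 --mem=180G --time=01:00:00
```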
For all interactive sessions, after salloc has run and you are on a compute node, use the srun command to execute your commands. This applies to all commands. For example, use srun to run the nvidia-smi command on the interactive node:
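Inside the interactive allocation, that is simply:

```shell
# Query the GPU assigned to this job.
srun -n 1 nvidia-smi
```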
Pawsey provides a tailored suite of tools called pawseytools which is already configured to be a default module upon login. The pawseyAccountBalance utility in pawseytools shows the current state of the default group's usage against their allocation and also the /astro storage quota and usage. For example:
Troubleshooting & Good Practices
Using singularity with GPUs
Use the --nv option when using Singularity on Garrawarla compute nodes.
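For example, assuming a singularity module is available and using a hypothetical image file my_image.sif:

```shell
# Assumed module name.
module load singularity

# --nv makes the host NVIDIA driver and GPU visible inside the container.
srun -n 1 singularity exec --nv my_image.sif nvidia-smi
```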
Segmentation fault while running CUDA/OpenACC applications with UCX support
A segmentation fault can occur if applications are either statically linked to CUDA libraries or allocate memory before MPI_Init. As a workaround, disable the UCX memory type cache by exporting UCX_MEMTYPE_CACHE=n.
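In a job script the workaround is set before the launch line, for example (gpu_code and the task count are placeholders):

```shell
# Disable the UCX memory type cache for this job.
export UCX_MEMTYPE_CACHE=n

# --export=all passes the variable on to every task.
srun --export=all -n 2 ./gpu_code
```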
Using ramdisk support on the compute nodes
Each node on Garrawarla has up to 50% of its memory (~185GB) mounted in /dev/shm and available as a ramdisk, which can be used to speed up large I/O-intensive computations. This resource is not tracked by Slurm, so you should clean up /dev/shm before the job exits; otherwise, the leftover files reduce the memory available to subsequent jobs on that node. Also, to keep system usage fair, request cores in proportion to your ramdisk usage. For example, by default only 9GB is available per core; therefore, to use 90GB of ramdisk you should request an additional 10 cores to avoid issues for other jobs running on the same node.
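The core calculation and the cleanup can be sketched as below. The figures come from the paragraph above; the scratch directory name is hypothetical, and the trap is just one way to make sure the ramdisk is emptied when the script exits.

```shell
#!/bin/bash
ramdisk_gb=90        # ramdisk space the job intends to use
mem_per_core_gb=9    # default memory accounted per core

# Round up so that a partial multiple still reserves a whole core.
extra_cores=$(( (ramdisk_gb + mem_per_core_gb - 1) / mem_per_core_gb ))
echo "Request ${extra_cores} additional cores"

# Private scratch directory on the ramdisk, removed when the script exits.
scratch=$(mktemp -d /dev/shm/job_scratch.XXXXXX)
trap 'rm -rf "$scratch"' EXIT
```
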
Requesting only the required memory to allow jobs on the overlapping partitions
Each compute node has 384GB of CPU memory, of which only ~371GB is available to users' jobs through Slurm. By default, Slurm allocates only 9GB for each core requested. The workq partition provides only 38 cores on each node, so if a job requests all 38 cores of a node from the workq partition, Slurm automatically allocates 342GB of memory (= 38x9GB) to that job. This leaves only ~29GB (= 371-342) of memory for any GPU job that runs on the same, overlapping node. It is therefore recommended to explicitly request only the amount of memory your job requires using the --mem directive, so that the nodes can be used effectively by both CPU and GPU workflows.
The following interactive job requested 38 tasks from a single node in the workq partition. SLURM allocated mwa024 and by default provided 342GB (9GB per each task) for this job.
Now only 29GB (= 371-342) remains on mwa024 for any GPU jobs on this node. Slurm will therefore fail to allocate resources to a gpuq job requesting more than 29GB on this node; it can only honor jobs requesting 29GB or less.
So, to allow jobs to run on both overlapping partitions, users are recommended to request only the memory their jobs require.
Now, the following interactive job requested 38 tasks from a single node in the workq partition but explicitly requested 200g memory using the --mem directive:
This allowed another job to be launched from the gpuq partition on the same node, requesting up to 171GB (= 371-200):
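The arithmetic in this section can be reproduced with the short sketch below. The salloc lines in the comments are assumed forms of the two requests described above, not commands copied from the system.

```shell
#!/bin/bash
node_mem_gb=371      # memory per node available to jobs through Slurm
mem_per_core_gb=9    # default allocation per requested core
workq_cores=38       # cores per node offered by workq

# Default case: memory follows the core count.
default_alloc_gb=$(( workq_cores * mem_per_core_gb ))
left_after_default_gb=$(( node_mem_gb - default_alloc_gb ))

# Explicit case: the workq job asks for 200GB with --mem.
explicit_mem_gb=200
left_after_explicit_gb=$(( node_mem_gb - explicit_mem_gb ))

echo "Default allocation for 38 cores:    ${default_alloc_gb}GB"        # 342GB
echo "Left for gpuq jobs by default:      ${left_after_default_gb}GB"   # 29GB
echo "Left for gpuq jobs with --mem=200g: ${left_after_explicit_gb}GB"  # 171GB

# Assumed forms of the corresponding requests:
#   salloc --partition=workq --nodes=1 --ntasks=38 --mem=200g
#   salloc --partition=gpuq --nodes=1 --ntasks=1 --gres=gpu:1 --mem=171g
```
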