The scheduling system used at Pawsey is SLURM, the Simple Linux Utility for Resource Management. Both Magnus and Galaxy now use native SLURM, which handles not only the reservation and scheduling of resources on the system, but also the launch and placement of user jobs on the back-end compute nodes.
A batch job submitted to the scheduling system on the front end via sbatch will, when run, execute the commands in the batch script serially on the lowest-numbered node allocated by SLURM. The srun command is used to launch multiple instances of an executable and run them in parallel.
The recommended way to submit a job is with a script that supplies two pieces of information:
- First, SLURM directives prefixed with "#SBATCH", placed in the header of the script, specify the job requirements needed to reserve appropriate resources and schedule the job. These should include at least the number of nodes, the number of tasks, and the walltime required by the job.
- Second, srun flags such as -n, -N, and -c specify the exact placement of the application onto those resources. Based on these instructions, the scheduler decides how to launch the job: how many instances of the executable to start, how many processes and threads to use, and how to place tasks on the physical and logical cores available on a node.
This approach requires users to think about the resource requirements (the number of nodes, as well as a time limit) in advance, and to specify the appropriate placement explicitly to srun. Note that the resources specified to SLURM and the placement instructions specified to srun must be consistent, otherwise unexpected behaviour can occur; this requires a good understanding of how the application runs and how many resources the workflow needs.
Specifying as many job requirements as possible in the header of the SLURM script is recommended. This has the advantage that the user is notified of any incorrect oversubscription of resources at submission time rather than during execution of the job.
Likewise, for interactive jobs started with salloc, the srun launcher is needed to run multiple instances of an executable in parallel. Serial commands in a salloc session can run without the srun launcher.
It is important to understand that some SLURM terms have meanings that differ from similar terms in other batch or resource schedulers.
|CPU||The term CPU describes the smallest physical consumable. On multi-core machines this is a core; on multi-core machines with hyper-threading enabled it is a hardware thread.|
|Task||A task under SLURM is a synonym for a process, and is often the number of MPI processes that are required.|
|Account||The term account is used to describe the entity to which used resources are charged to.|
|Partition||SLURM groups nodes into sets called partitions. Jobs are submitted to a partition to run. In other batch systems the term queue is used.|
The most useful user commands are listed in the table below, and explained further through example.
|sbatch||Submit a batch script|
|salloc||Allocate resources for an interactive job|
|srun||Run an executable or job|
|sinfo||Report the state of the cluster nodes/partitions|
|squeue||Show the contents of the queue|
|sacct||Show accounting data for a completed job|
|scontrol show job||Report details of a job|
|scancel||Cancel a pending or running job|
|scontrol hold||Hold a job|
|scontrol release||Release a job|
Submitting a job in SLURM is performed by running the sbatch command and specifying a job script.
The sbatch command accepts options that override those specified inside the job script. A correctly formed batch script, job.script, is submitted to the queue with the sbatch command; upon success, a unique job identifier is returned. By default, a job is submitted to the workq partition, and both standard output and standard error appear in the file slurm-[jobid].out in the directory from which the script was submitted.
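For example, a submission might look like this sketch (the script name is illustrative):

```shell
sbatch job.script
# prints something like: Submitted batch job 123456
```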
To submit a script to a particular partition, use the -p option:
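A sketch of such a submission (the partition name is taken from the following sentence; the script name is illustrative):

```shell
sbatch -p debug job.script
```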
This will submit the job to the debug partition.
It is possible to run serial or parallel jobs interactively through SLURM. This is a very useful feature when developing parallel codes, debugging applications or running X applications. The salloc command is used to obtain a SLURM job allocation for a set of nodes. It takes similar options to the sbatch command to specify the resources required.
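A minimal sketch of requesting an interactive allocation (the node count, task count and time limit are illustrative):

```shell
salloc --nodes=1 --ntasks=4 --time=00:10:00
```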
For interactive jobs, the command argument is optional, and by default the job will launch the user’s shell.
Applications can be started within this shell without further configuration. Exiting the shell completes the interactive job, releasing the job allocation in the process.
SLURM provides its own job launcher called srun. srun provides similar functionality as other job launchers, such as OpenMPI’s mpirun, and will run the specified executable on resources allocated from the sbatch or salloc command.
srun can also be launched outside of a batch script or an interactive job by explicitly specifying the resource options and the executable to run, although this is not supported on Magnus and Galaxy.
srun options: Application Placement
The job launcher srun takes a number of options to specify how the problem is mapped to the available hardware. Common options include:
Specify ntasks as the number of instances of the executable to launch. For MPI jobs, this is the total number of MPI tasks. For other types of job, -n must still be present and should be set to -n 1 (ntasks is one).
For MPI jobs, node-count specifies the number of nodes allocated to run the MPI job. For other types of job, -N should be set to 1.
For OpenMP or otherwise (p-)threaded jobs, cpus-per-task specifies the number of threads (the "depth"). For OpenMP applications this should correspond with the OMP_NUM_THREADS value.
The --cpu_bind option can be used to bind tasks to CPUs. Supported bindings include cores, threads, rank, sockets and map_cpu.
Use this option to enable hyper-threading of physical cores.
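Combining these placement options, a hybrid MPI/OpenMP launch might look like the following sketch (the executable name and counts are illustrative, assuming 24-core nodes):

```shell
# 2 nodes, 8 MPI tasks in total, 6 OpenMP threads per task
export OMP_NUM_THREADS=6
srun -N 2 -n 8 -c 6 ./hybrid.out
```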
For more information, please visit the official SLURM documentation for srun, or refer to the srun man page.
The status of the cluster nodes and partitions can be viewed with the following command.
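That command is sinfo:

```shell
sinfo
```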
The default output shows the status for each partition, together with the configuration and state.
Within Pawsey, several clusters are defined in SLURM: Magnus, Galaxy and Zeus. These mostly correspond to the supercomputers, although the Zeus cluster also contains several data-mover partitions. To see the status of another cluster, use:
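A sketch using the -M (--clusters) option; the cluster name is illustrative:

```shell
sinfo -M galaxy
```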
The status of all jobs being managed by the SLURM scheduler can be viewed with the squeue command. Without any options all jobs are displayed.
An example output from the squeue command is shown below.
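The original output is not reproduced here; a purely illustrative listing with squeue's default columns might look like:

```
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 123456     workq    myjob    user1  R       1:02      2 nid000[10-11]
 123457     workq   myjob2    user2 PD       0:00      4 (Priority)
```

The job entries above are invented for illustration only.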
Results can be filtered by user, account or job list. A summary of common options is shown in the following table.
|-A||Filter results based on an account|
|-r||Display job arrays one element per line|
|-j||Comma-separated list of job IDs to display|
|-l||Display output in long format|
|-n||Filter results based on job name|
|-p||Comma-separated list of partitions to display|
|-u||Display results based on the listed user names|
The fields displayed can be fine tuned with the --format option;
or by setting the SQUEUE_FORMAT environment variable:
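An illustrative selection of format tokens (%i job id, %u user, %a account, %j name, %t state, %M time used; the same string can be passed to --format):

```shell
export SQUEUE_FORMAT="%.8i %.10u %.10a %.20j %.3t %.10M"
```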
For detailed information on squeue output formatting please refer to manual pages (type man squeue on Magnus, Galaxy or Zeus).
Viewing Details of Jobs
To view detailed job information, the scontrol subcommand show job can be used.
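For example (the job id is illustrative):

```shell
scontrol show job 123456
```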
Information such as resources requested, submit/start time, node lists and more are available.
Please note that this information is available only for queued and running jobs. For gathering information about completed jobs please refer to the sacct description below.
If for some reason you wish to stop a job that is running or delete a job in the queue, use the scancel command:
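For example (the job id is illustrative):

```shell
scancel 123456
```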
This will send a signal to the job specified (via unique identifier) to stop. If running, the job will be terminated; if queued, the job will be removed.
Flexible filtering options also permit Job IDs to be automatically selected based on account, job name or user name, or any combination of those.
Arbitrary signals may also be sent using the --signal=[signal name] option. Signals may be specified by either name or number.
To hold a job manually in order to prevent the job being scheduled for execution, the scontrol subcommand hold can be used.
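For example (the job id is illustrative):

```shell
scontrol hold 123456
```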
It is not possible to hold a job that has already begun its execution.
To release a job that was previously held manually, the subcommand release is used.
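For example (the job id is illustrative):

```shell
scontrol release 123456
```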
The following SLURM directives are a minimum requirement for specifying the resources needed by a job.
Request nnodes nodes for the job. On Magnus and Galaxy this allocates all the cores on each node, whereas on Zeus it allocates only one core per node by default.
Request ntasks cores for the job. This should be large enough to accommodate the requirements specified to srun. On Magnus and Galaxy, this should be a multiple of the total number of cores available on a node for efficient use of resources.
Request a wall clock time limit for the job in hours:minutes:seconds format. If the job exceeds this time limit, it will be terminated.
A valid project code must be specified by replacing [your-project].
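Taken together, a minimal header might look like this sketch (the values are illustrative):

```shell
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --time=01:00:00
#SBATCH --account=[your-project]
```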
Controlling standard output and standard error
By default, output that would have appeared on the terminal in an interactive job (both standard output and standard error) is sent to a file
in the working directory from which the job was submitted (with [jobid] being replaced by the appropriate numeric SLURM job id). The name of this standard output file may be controlled via:
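For example (the file name is illustrative):

```shell
#SBATCH --output=myjob.log
```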
The unique job id may be included by using the special token "%j":
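For example:

```shell
#SBATCH --output=myjob-%j.log
```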
resulting in the creation of, e.g., a myjob-012345.log file in the working directory for a job with SLURM id 012345.
The destination of standard error may be controlled via:
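For example (the file name is illustrative):

```shell
#SBATCH --error=myjob-%j.err
```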
In addition to these directives, the following can be used to provide email notification of a job completion.
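A sketch of the notification directives (the address is illustrative):

```shell
#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.com
```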
Here is the list of the SLURM directives commonly required for jobs to run at Pawsey:
|Set the account to which the job is to be charged. A default account is configured for each user.|
|Specify the total number of nodes.|
|Specify the total number of tasks (processes).|
|Specify the number of tasks per node.|
|Specify the number of cores (physical or logical) per task.|
|Specify the memory required per node.|
|Specify the minimum memory required per CPU core.|
|Set the wall-clock time limit for the job.|
|Set the job name (as it appears under squeue).|
|Set the (standard) output file name.|
Use the token "%j" to include the unique job id in the file name.
|Set the (standard) error file name.|
|Request an allocation on the specified partition. If not specified, jobs will be submitted to the default partition.|
|Specify an array job with the defined indices.|
|Specify a job dependency.|
|Request an e-mail notification for events in list. The list may include BEGIN, END, FAIL, or a comma-separated combination of valid tokens.|
|Specify an e-mail address for notifications.|
|Controls which environment variables are propagated to the batch job. The recommended option is NONE.|
The basic unit of allocation on all Magnus and Galaxy queues is a node. The default is
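The default referred to here is presumably:

```shell
#SBATCH --nodes=1
```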
which will grant all the cores on the node and available memory to the job.
Galaxy also has 64 NVIDIA K20X ‘Kepler’ GPU nodes. To request a single K20X Kepler GPU node:
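A sketch of such a request; note that the partition name gpuq is an assumption here, and the exact form may differ on Galaxy:

```shell
#SBATCH --partition=gpuq   # partition name is an assumption
#SBATCH --gres=gpu:1
```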
The basic unit of allocation in the work queue is a task, so users should request resources on the basis of nodes and tasks. The default is
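The default referred to here is presumably:

```shell
#SBATCH --ntasks=1
```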
which will grant one core and 4GB of memory to the job.
Additional cores can be requested via --ntasks, --cpus-per-task, or a combination of the two.
For instance, to obtain a whole node use:
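One possible sketch; the per-node core count is illustrative, since Zeus nodes are heterogeneous:

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=28   # core count illustrative; varies by node type
```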
As the system is not homogeneous, further information may be needed if multiple nodes with uniform hardware are required; otherwise, the scheduler will allocate nodes as they become available (subject to certain algorithmic constraints).
If a certain amount of memory per node is required, this should be specified in GB, e.g.:
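For example (the value is illustrative):

```shell
#SBATCH --mem=64G
```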
Topaz has two different types of GPU resources (P100 and V100). Compute nodes in the gpuq partition are configured as a shared resource, whereas nodes in the nvlinkq partition are configured as a non-shared (exclusive) resource. It is therefore especially important to specify the number of GPUs, the number of tasks and the amount of memory required by the job. If these are not specified, by default the job will be allocated a single CPU core, no GPUs and around 10 GB of RAM.
It is recommended that all jobs request the following:
- the number of nodes, with --nodes,
- the number of GPUs per node, with --gres=gpu:N (this should always be used, except when compiling),
- the number of processes, with --ntasks,
- the number of threads per process, with --cpus-per-task (for multithreaded jobs),
- the amount of memory per node, with --mem (note that if this option is not used, the scheduler will allocate approximately 10 GB of memory per process),
- the walltime, with --time,
- the partition, with --partition,
- the Pawsey project ID, with --account.
Example job scripts for Topaz are provided here. Users can request the allocation of a specific type of GPU resource by using the "--constraint" keyword in the script, where the value can be either p100 or v100.
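A sketch of such a request:

```shell
#SBATCH --gres=gpu:1
#SBATCH --constraint=p100
```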
This requests a node with a single GPU, specifically a Tesla P100 GPU card.
As on other Pawsey systems, the salloc command can be used to run interactive sessions. The #SBATCH options mentioned above can be used to specify interactive job parameters; for example, to run an MPI code utilising 2 GPUs, one can open an interactive session with the following command:
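A sketch of such a session (the partition name and time limit are illustrative):

```shell
salloc --partition=gpuq --nodes=1 --ntasks=2 --gres=gpu:2 --time=01:00:00
```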
For all interactive sessions, after salloc has run and you are on a compute node, you will need to use srun to execute your commands. This applies to all commands; for instance, srun must be used to run the nvidia-smi command on the interactive node:
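For example:

```shell
srun -n 1 nvidia-smi
```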
An Example Batch Script
A job script has a header section which specifies the resources that are required to run the job followed by the commands that must be executed. Specifying as many job requirements as possible in the header of the SLURM script is recommended. This has an advantage that any incorrect oversubscription of resources will be notified to the user during the submission rather than execution of the job. An example script is shown below.
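A sketch of such a script, reconstructed to match the line-by-line walkthrough that follows (the job name, file names and echo diagnostics are illustrative):

```shell
#!/bin/bash -l
# Example SLURM batch script for an MPI job
#SBATCH --job-name=myjob
#SBATCH --time=02:00:00
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=1
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err
#SBATCH --account=[your-project]

#-----------------------------------------------------
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME})"
echo "Submitted from:  ${SLURM_SUBMIT_DIR}"
echo "Nodes allocated: ${SLURM_JOB_NODELIST}"
echo "Number of nodes: ${SLURM_JOB_NUM_NODES}"
echo "Tasks allocated: ${SLURM_NTASKS}"
echo "Started at:      $(date)"
srun -N 2 -n 48 ./a.out
#-----------------------------------------------------
```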
The first line specifies that the bash shell will be used to interpret the contents of this script. Invoking with the -l option will ensure that the standard Pawsey environment is loaded. The second line is a comment. At line 3, the SLURM directives, specified by #SBATCH begin, which are used to state the resources required and properties of the job. Line 3 gives the job a name. Line 4 requests 2 hours of walltime. Line 5 requests 2 nodes, line 6 requests 48 MPI processes, line 7 requests 24 processes per node and line 8 requests only one CPU core per task. Line 9 specifies the name of the output file (%j is the job number), and line 10 specifies the file to which errors should be written out. Line 11 gives the account to which this walltime should be charged. Line 13 is a separator (optional) between the script directive preamble and the actions in the script. Lines 14-19 are useful (but optional) diagnostic information that will be printed out. Line 20 invokes srun to run the code (./a.out), and line 21 is a separator (optional) demarcating the end of the script.
- SLURM copies the entire environment from the shell where a job is submitted. This may break existing batch scripts that require a different environment than, say, the login environment. To guard against this, "#SBATCH --export=NONE" should be specified in each batch script so that each job starts in a fresh environment, making it reproducible both by yourself and by Pawsey support staff.
- SLURM does not set OMP_NUM_THREADS in the environment of a job. Users should set this manually in their batch scripts; it is normally the same as the value specified with --cpus-per-task.
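A common sketch for this is to derive the value from SLURM's own variable (the fallback to 1 is an illustrative safeguard for when the variable is unset):

```shell
# Match the OpenMP thread count to --cpus-per-task; default to 1 if unset
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
```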
Environment for srun-launched Executables
By default, srun will propagate the user environment to the launched executable. However, when srun is used within sbatch, this default behaviour can change: "#SBATCH --export=" sets the SLURM_EXPORT_ENV environment variable, which srun will default to using if it exists. Since we recommend setting "#SBATCH --export=NONE", this must be overridden on the srun command line, otherwise srun will not propagate any environment variables.
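A sketch of the override on the srun command line (the task count and executable are illustrative):

```shell
srun --export=ALL -n 48 ./a.out
```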
SLURM Environment Variables
SLURM sets environment variables that your running jobscript can use:
|SLURM_SUBMIT_DIR||The directory that the job was submitted from|
|SLURM_JOB_NAME||The name of the job (such as specified with --job-name=)|
|SLURM_JOB_ID||The unique identifier (job id) for this job|
|SLURM_JOB_NODELIST||List of node names assigned to the job|
|SLURM_NTASKS||Number of tasks allocated to the job|
|SLURM_JOB_CPUS_PER_NODE||Number of CPUs per node available to the job|
|SLURM_JOB_NUM_NODES||Number of nodes allocated to the job|
|SLURM_ARRAY_TASK_ID||This task's ID in the job array|
|SLURM_ARRAY_JOB_ID||The master job id for the job array|
|SLURM_PROCID||Uniquely identifies each task. This ranges from 0 to the number of tasks minus 1|
See the man page for sbatch for more environment variables.
With prior arrangement, for special cases, a user may request a resource reservation under SLURM. If successful, a named reservation will be created, and you may submit jobs during the allocated time period. See the Extraordinary Resource Requests policy.
You may also view reservations on the system with the scontrol command,
or view jobs in the queue for a reservation.
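Sketches of both commands (the reservation name is illustrative):

```shell
scontrol show reservation
squeue --reservation=myresv
```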
There are a couple of ways to check a group's or user's usage against their allocation, whether compute time or storage (currently only the /group file system is subject to quotas).
Time Allocation vs Usage
Users can use SLURM commands such as sacct and sreport to get their project's usage. For more details on that, please consult the general SLURM documentation.
Pawsey also provides a tailored suite of tools called pawseytools, which is configured as a default module upon login. The pawseyAccountBalance utility in pawseytools shows the current state of the default group's usage against its allocation. For example:
gives a list of the quarterly usage of members in project123 in service units.
This prints the usage of project123 for whole year, by quarter.
Accounting data for historical, queued and running jobs can be displayed with the sacct command.
Accounting data for a specific job can be displayed with the -j option. The time window for searching defaults to 00:00:00 of the current day. To find jobs from earlier than that, the scope of the time window can be expanded with the -S option.
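For example (the job id and date are illustrative):

```shell
sacct -j 123456 -S 2020-01-01
```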
Additional filtering options are supported by sacct that can be used to limit the jobs that are displayed.
The fields displayed can be fine tuned with the --format option;
or by setting the SACCT_FORMAT environment variable:
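An illustrative field selection (these are standard sacct field names):

```shell
export SACCT_FORMAT="jobid,jobname,partition,account,alloccpus,state,elapsed"
```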
The sreport command can also be used to generate similar reports from the accounting data stored in the SLURM database. Its values may differ from sacct's on systems with hyper-threading enabled; the value reported by sreport may need to be divided by the number of hyper-threads per core.
Storage Allocation vs Usage
The only file system with quotas at the time of writing is the /group file system. It is a Lustre file system, so some generic commands are available for users to check their usage. See the Lustre File System page for more information.
The pawseyAccountBalance utility also provides a way to check one's group usage versus their quota. For example,
which shows the current usage and quota in GB for project123.
For more information, please check the help page by "pawseyAccountBalance -h".