Scheduling System
The scheduling system used at Pawsey is called SLURM, the Simple Linux Utility for Resource Management. Both Magnus and Galaxy now use native SLURM, which handles not only the reservation and scheduling of resources on the system, but also the launch and placement of user jobs on the back-end compute nodes.
A batch job submitted to the scheduling system on the front-end via sbatch will, when run, execute the commands in the batch script serially on the lowest-numbered node allocated by SLURM. The srun command is used to launch multiple instances of an executable and run them in parallel.
The recommended way of submitting a job is a script that supplies two kinds of information:
- First, SLURM directives prefixed with "#SBATCH", placed in the header of the script, specify the job requirements necessary to reserve appropriate resources and schedule the job. These should include at least the number of nodes, the number of tasks and the walltime required by the job.
- Second, srun flags such as -n, -N and -c specify the exact placement of the application onto those resources. Based on these instructions, srun decides how the job is actually launched, for example how many instances of the executable to start, how many processes/threads to use, and how to place tasks on the physical and logical cores available on a node.
This approach requires users to think about the resource requirements (the number of nodes as well as a time limit) in advance, and to specify the appropriate placement explicitly to srun. Note that the resources requested from SLURM and the placement instructions given to srun must be consistent so that unexpected behaviour does not occur; this requires users to have a good understanding of how the application runs and how much resource the workflow needs.
Specifying as many job requirements as possible in the header of the SLURM script is recommended. This has the advantage that the user is notified of any incorrect oversubscription of resources at submission rather than at execution of the job.
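As an illustrative sketch of this two-part approach (the node and task counts, the executable ./a.out and the [your-project] placeholder are assumptions for illustration only), the header directives reserve the resources and the srun line places the application onto them:
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --time=01:00:00
#SBATCH --account=[your-project]
# placement given explicitly to the launcher; it must fit within the reservation above
srun -N 2 -n 48 ./a.out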
Also, for interactive jobs started with salloc, the srun launcher must be used to run multiple instances of an executable in parallel. Serial jobs within a salloc session can run without the srun launcher.
SLURM Glossary
It is important to understand that some SLURM terms have meanings that differ from those of similar terms in other batch or resource schedulers.
Term | Description |
---|---|
CPU | The term CPU describes the smallest physical consumable unit; on multi-core machines this is a core. On multi-core machines with hyper-threading enabled, it is a hardware thread. |
TASK | A task under SLURM is a synonym for a process; for MPI jobs, the number of tasks is typically the number of MPI processes required. |
Account | The term account describes the entity to which used resources are charged. |
Partition | SLURM groups nodes into sets called partitions. Jobs are submitted to a partition to run. In other batch systems the term queue is used. |
SLURM Commands
The most useful user commands are listed in the table below and explained further through examples.
Purpose | Command |
---|---|
Submit a batch script | sbatch |
Allocate resources for an interactive job | salloc |
Run an executable or job | srun |
Report state of the cluster nodes/partition | sinfo |
Show contents of queue | squeue |
Show accounting data for a completed job | sacct |
Report details of a job | scontrol show job |
Cancel pending or running job | scancel |
Hold a job | scontrol hold |
Release a job | scontrol release |
Submitting Jobs
Submitting a job in SLURM is performed by running the sbatch command and specifying a job script:
sbatch [options] job.script
The sbatch command accepts options that override those specified inside the job script. A correctly formed batch script job.script is submitted to the queue with the sbatch command. Upon success, a unique job identifier is returned. By default, a job will be submitted to the workq partition. By default, both standard output and standard error for this job will appear in the file slurm-jobid.out in the directory from which the script was submitted.
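For instance, a sketch of overriding the walltime at submission time (the one-hour value is a placeholder) is:
sbatch --time=01:00:00 job.script
Here the command-line time limit takes precedence over any --time directive inside job.script.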
To submit a script to a particular partition, use the -p option:
sbatch -p debugq job.script
This will submit the job to the debugq partition.
Interactive Jobs
It is possible to run serial or parallel jobs interactively through SLURM. This is a very useful feature when developing parallel codes, debugging applications or running X applications. The salloc command is used to obtain a SLURM job allocation for a set of nodes. It takes similar options to the sbatch command to specify the resources required.
salloc [options] [command]
For interactive jobs, the command argument is optional, and by default the job will launch the user’s shell.
~> salloc --ntasks=16
salloc: Granted job allocation 103
~>
Applications can be started within this shell without further configuration. Exiting the shell completes the interactive job, releasing the job allocation in the process.
~> exit
exit
salloc: Relinquishing job allocation 103
salloc: Job allocation 103 has been revoked.
~>
Launching Executables
SLURM provides its own job launcher called srun. It provides similar functionality to other job launchers, such as OpenMPI's mpirun, and will run the specified executable on the resources allocated by the sbatch or salloc command.
srun [options] [executable]
srun can also be launched outside of a batch script or an interactive job by explicitly specifying the resource options and executable to run, except on Magnus and Galaxy.
srun options: Application Placement
The job launcher srun takes a number of options to specify how the problem is mapped to the available hardware. Common options include:
-n ntasks
Specify ntasks, the number of instances of the executable to be launched. For MPI jobs, this is the total number of MPI tasks. For other types of job, -n must still be present and should be set to -n 1 (ntasks is one).
-N node-count
For MPI jobs, node-count specifies the number of nodes allocated to run the MPI job. For other types of job, -N should be set to 1.
-c cpus-per-task
For OpenMP or (p-)threaded jobs, cpus-per-task specifies the number of threads per task (the "depth"). For OpenMP applications this should correspond to the value of OMP_NUM_THREADS (see the sketch after this list).
--cpu_bind=[{quiet,verbose},]type
Binds tasks to CPUs. Supported types include cores, threads, rank, sockets and map_cpu, among others.
--hint=multithread
Use this option to allow tasks to be placed on the hyper-threads (logical cores) of the physical cores.
For more information, please visit the official SLURM documentation for srun, or refer to the srun man page.
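As a hedged illustration of how these options combine, a hybrid MPI/OpenMP launch on two 24-core nodes might look like the following sketch (the executable ./hybrid.x and the task/thread counts are assumptions for illustration):
export OMP_NUM_THREADS=6
# 8 MPI tasks spread over 2 nodes, 6 threads per task, tasks bound to cores
srun -N 2 -n 8 -c 6 --cpu_bind=cores ./hybrid.x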
Cluster Status
The status of the cluster nodes and partitions can be viewed with the following command.
sinfo [options]
The default output shows the status for each partition, together with the configuration and state.
Within Pawsey, several clusters are defined in SLURM: Magnus, Galaxy and Zeus. These mostly correspond to the supercomputers, but the last also contains several data-mover partitions. To see the status of any other cluster, use
sinfo -M <clustername>
Monitoring Jobs
The status of all jobs being managed by the SLURM scheduler can be viewed with the squeue command. Without any options all jobs are displayed.
squeue [options]
An example output from the squeue command is shown below.
~> squeue
  JOBID   USER    ACCOUNT      NAME  EXEC_HOST ST   REASON   START_TIME     END_TIME  TIME_LEFT NODES PRIORITY
2985935  reaper pawsey0001  slurm.sh   nid01235  R     None     13:19:48 Tomorr 13:19   23:30:14    32     5348
2988682  reaper pawsey0001  slurm.sh        n/a PD Priority          N/A          N/A 1-00:00:00    32     5229
~>
Results can be filtered by user, account or job list. A summary of common options is shown in the following table.
squeue option | Description |
---|---|
--account=<account list> | Filter results based on an account |
--array | Job arrays are displayed one element per line |
--jobs=<job list> | Comma separated list of Job IDs to display |
--long | Display output in long format |
--name=<name list> | Filter results based on job name |
--partition=<partition> | Comma separated list of partitions to display |
--user=<user> | Display results based on the listed user names |
The fields displayed can be fine-tuned with the --format option:
~> squeue --format="%.6i %.10P %.8u %15a %.15j %.3t %9r %19S %.10M %.10L %.5D %.4C %Q %N"
or by setting the SQUEUE_FORMAT environment variable:
~> export SQUEUE_FORMAT="%.6i %.10P %.8u %15a %.15j %.3t %9r %19S %.10M %.10L %.5D %.4C %Q %N"
~> squeue
 JOBID  PARTITION     USER ACCOUNT                 NAME  ST REASON    START_TIME             TIME  TIME_LEFT NODES CPUS PRIORITY NODELIST
  4679      workq    user1 director100              run   R None      2018-04-17T00:00:46 9:03:52    2:56:08    16  256 3736 nid000[39-54]
  4680      workq    user1 director100              run   R None      2018-04-17T00:01:14 9:03:24    2:56:36    32  512 3737 nid00[151-182]
  4682      workq    user2 director100     script.20.00   R None      2018-04-17T07:24:18 1:40:20    5:19:40     2   32 3764 nid00[144-145]
For detailed information on squeue output formatting please refer to manual pages (type man squeue on Magnus, Galaxy or Zeus).
Viewing Details of Jobs
To view detailed job information, the scontrol subcommand show job can be used.
scontrol show job [job id]
Information such as resources requested, submit/start time, node lists and more are available.
Please note that this information is available only for queued and running jobs. For information about completed jobs, please refer to the sacct description below.
Deleting Jobs
If for some reason you wish to stop a job that is running or delete a job in the queue, use the scancel command:
scancel [job id [ job id] ...]
This will send a signal to the job specified (via unique identifier) to stop. If running, the job will be terminated; if queued, the job will be removed.
Flexible filtering options also permit job IDs to be selected automatically based on account, job name or user name, or any combination of these.
scancel --account=[account]
scancel --name=[job name]
scancel --user=[user]
Arbitrary signals may also be sent using the --signal=[signal name] option. Signals may be specified either by name or by number.
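For example, a sketch of asking a running application to checkpoint by sending it SIGUSR1 (the job id shown is a placeholder):
scancel --signal=USR1 123456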
Holding Jobs
To hold a job manually in order to prevent the job being scheduled for execution, the scontrol subcommand hold can be used.
scontrol hold [job id]
It is not possible to hold a job that has already begun its execution.
Releasing Jobs
To release a job that was previously held manually, the subcommand release is used.
scontrol release [job id]
SLURM Directives
Essential directives
The following SLURM directives are a minimum requirement for specifying the resources needed by a job.
#SBATCH --nodes=nnodes
Request nnodes nodes for the job. On Magnus and Galaxy, this will allocate all the cores on each node, whereas on Zeus it will allocate only one core per node by default.
#SBATCH --ntasks=ntasks
Request ntasks tasks (cores) for the job. This should be large enough to accommodate the requirements specified to srun. On Magnus and Galaxy, this should be a multiple of the total number of cores available on a node for efficient use of resources.
#SBATCH --time=hh:mm:ss
Request a wall-clock time limit for the job in hours:minutes:seconds format. If the job exceeds this time limit, it will be terminated.
#SBATCH --account=[your-project]
A valid project code must be specified by replacing [your-project].
Controlling standard output and standard error
By default, output that would have appeared on the terminal in an interactive job (both standard output and standard error) is sent to a file
slurm-[jobid].out
in the working directory from which the job was submitted (with [jobid] being replaced by the appropriate numeric SLURM job id). The name of this standard output file may be controlled via:
#SBATCH --output=myjob.log
The unique job id may be included by using the special token "%j":
#SBATCH --output=myjob-%j.log
resulting in the creation of, e.g., a file myjob-012345.log in the working directory for a job with SLURM id 012345.
The destination of standard error may be controlled via:
#SBATCH --error=myjob-%j.err
In addition to these directives, the following can be used to provide e-mail notification of job events such as completion:
#SBATCH --mail-type=ALL
#SBATCH --mail-user=myaddress@myorg.edu.au
Here is the list of SLURM directives commonly required for jobs run at Pawsey:
Option | Purpose |
---|---|
--account=account | Set the account to which the job is to be charged. A default account is configured for each user. |
--nodes=nnodes | Specify the total number of nodes. |
--ntasks=number | Specify the total number of tasks (processes). |
--ntasks-per-node=number | Specify the number of tasks per node. |
--cpus-per-task=number | Specify the number of cores (physical or logical) per task. |
--mem=size | Specify the memory required per node. |
--mem-per-cpu=size | Specify the minimum memory required per CPU core. |
--time=hh:mm:ss | Set the wall-clock time limit for the job. |
--job-name=name | Set the job name (as it appears under squeue ). This defaults to the script name. |
--output=filename | Set the (standard) output file name. Use the token " %j " to include jobid. |
--error=filename | Set the (standard) error file name. |
--partition=partition | Request an allocation on the specified partition. If not specified, jobs will be submitted to the default partition. |
--array=list | Specify an array job with the defined indices. |
--dependency=list | Specify a job dependency. |
--mail-type=list | Request an e-mail notification for events in list. The list may include BEGIN, END, FAIL, or a comma-separated combination of valid tokens. |
--mail-user=address | Specify an e-mail address for notifications. |
--export=variables | Controls which environment variables are propagated to the batch job. The recommended setting is NONE. |
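As a brief sketch of the --array and --dependency directives in use (the script names and the job id 123456 are placeholders; afterok is a standard SLURM dependency type, see the sbatch man page), a ten-element job array and a post-processing job that waits for another job to complete successfully could be submitted as:
sbatch --array=0-9 array_job.script
sbatch --dependency=afterok:123456 postprocess.script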
Specifying Resources
On Magnus/Galaxy
The basic unit of allocation on all Magnus and Galaxy queues is a node. The default is
#SBATCH --nodes=1
which will grant all the cores on the node and available memory to the job.
Galaxy also has 64 NVIDIA K20X ‘Kepler’ GPU nodes. To request a single K20X Kepler GPU node:
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=kepler
#SBATCH --partition=gpuq
#SBATCH --export=NONE
On Zeus
The basic unit of allocation in the work queue is a task, so users should request resources on the basis of nodes and tasks. The default is
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4096MB
which will grant one core and 4 GB of memory to the job.
Additional cores can be requested via --ntasks, --cpus-per-task, or a combination of the two.
For instance, to obtain a whole node use:
#SBATCH --nodes=1
#SBATCH --ntasks=28
or
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
As the system is not homogeneous, further information may need to be supplied if more than one node with uniform hardware is required. Otherwise, the scheduler will allocate nodes as they become available (subject to certain algorithmic constraints).
If a certain amount of memory per node is required, this should be specified in GB, e.g.:
#SBATCH --nodes=1
#SBATCH --mem=125GB
On Topaz
Topaz has two different types of GPU resources (p100 and v100). Compute nodes in the gpuq partition are configured as a shared resource, whereas nodes in the nvlinkq partition are configured as a non-shared (exclusive) resource. This means it is especially important to specify the number of GPUs, the number of tasks and the amount of memory required by the job. If these are not specified, by default the job will be allocated a single CPU core, no GPUs and around 10 GB of RAM.
It is recommended that all jobs request the following:
- the number of nodes with --nodes,
- the number of GPUs per node with --gres=gpu:N (this should always be used, except when compiling),
- the number of processes with --ntasks-per-node and --ntasks-per-socket,
- the number of threads per process with --cpus-per-task (in the case of multithreaded jobs),
- the amount of memory per node with --mem (please note that if this option is not used, the scheduler will allocate approx. 10 GB of memory per process),
- the walltime with --time,
- the partition with --partition,
- the Pawsey project ID with --account.
Example job scripts for Topaz are provided here. Users can request allocation of a specific type of GPU resource by using the "--constraint" keyword in the script, where the value can be either p100 or v100.
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=p100
This requests a node with a single GPU, specifically a Tesla P100 GPU card.
As on other Pawsey systems, the salloc command can be used to run interactive sessions. The #SBATCH options mentioned above can be used to specify various interactive job parameters; for example, to run an MPI code utilising 2 GPUs, one can open an interactive session with the following command:
salloc --nodes=1 --gres=gpu:2 --ntasks-per-node=2 --ntasks-per-socket=1 --mem=180gb --time=00:05:00 --partition=gpuq --account=[your-project]
For all interactive sessions, after salloc has run and you are on a compute node, you will need to use the srun command to execute your commands. This applies to all commands; for instance, srun must be used in order to run the nvidia-smi command on the interactive node:
$ salloc -N 1 -pgpuq --gres=gpu:1 --ntasks-per-node=1
salloc: Granted job allocation 1861
salloc: Waiting for resource configuration
salloc: Nodes t016 are ready for job
bash-4.2$ nvidia-smi
No devices were found
bash-4.2$ srun -n1 nvidia-smi
Tue Nov 12 11:54:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
An Example Batch Script
A job script has a header section, which specifies the resources required to run the job, followed by the commands to be executed. Specifying as many job requirements as possible in the header of the SLURM script is recommended; this has the advantage that the user is notified of any incorrect oversubscription of resources at submission rather than at execution of the job. An example script is shown below.
01 #!/bin/bash -l
02 # 2 nodes, 24 MPI processes/node, 48 MPI processes total
03 #SBATCH --job-name="myjob"
04 #SBATCH --time=02:00:00
05 #SBATCH --nodes=2
06 #SBATCH --ntasks=48
07 #SBATCH --ntasks-per-node=24
08 #SBATCH --cpus-per-task=1
09 #SBATCH --output=myjob.%j.o
10 #SBATCH --error=myjob.%j.e
11 #SBATCH --account=pawsey0001
12 #SBATCH --export=NONE
13 #======START=====
14 echo "The current job ID is $SLURM_JOB_ID"
15 echo "Running on $SLURM_JOB_NUM_NODES nodes"
16 echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
17 echo "A total of $SLURM_NTASKS tasks is used"
18 echo "Node list:"
19 sacct --format=JobID,NodeList%100 -j $SLURM_JOB_ID
20 srun --export=ALL -u ./a.out
21 #=====END====
The first line specifies that the bash shell will be used to interpret the contents of this script. Invoking with the -l option will ensure that the standard Pawsey environment is loaded. The second line is a comment. At line 3, the SLURM directives, specified by #SBATCH begin, which are used to state the resources required and properties of the job. Line 3 gives the job a name. Line 4 requests 2 hours of walltime. Line 5 requests 2 nodes, line 6 requests 48 MPI processes, line 7 requests 24 processes per node and line 8 requests only one CPU core per task. Line 9 specifies the name of the output file (%j is the job number), and line 10 specifies the file to which errors should be written out. Line 11 gives the account to which this walltime should be charged. Line 13 is a separator (optional) between the script directive preamble and the actions in the script. Lines 14-19 are useful (but optional) diagnostic information that will be printed out. Line 20 invokes srun to run the code (./a.out), and line 21 is a separator (optional) demarcating the end of the script.
- SLURM will copy the entire environment from the shell from which the job is submitted. This may break existing batch scripts that require a different environment from, say, the login environment. To guard against this, "#SBATCH --export=NONE" should be specified in each batch script so that every job starts in a fresh environment and is reproducible both by yourself and by Pawsey support staff.
- SLURM does not set OMP_NUM_THREADS in the environment of a job. Users should set this manually in their batch scripts; it is normally the same as the value specified with --cpus-per-task (see the sketch below).
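A minimal sketch of the second point, assuming the job was submitted with --cpus-per-task (the thread count and executable name are placeholders):
#SBATCH --cpus-per-task=8
# SLURM sets SLURM_CPUS_PER_TASK from the directive above; mirror it for OpenMP
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --export=ALL -n 1 -c $OMP_NUM_THREADS ./openmp_code.x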
Environment for srun-launched Executables
By default, srun will propagate the user environment to the launched executable. However, when srun is used within sbatch, the default behaviour may change: "#SBATCH --export=" sets the SLURM_EXPORT_ENV environment variable, which srun will use by default if it exists. Since we recommend setting "#SBATCH --export=NONE", this must be overridden on the srun command line, otherwise srun will not propagate any environment variables.
#!/bin/bash -l
#SBATCH --export=NONE
srun --export=ALL ./a.out
SLURM Environment Variables
SLURM sets environment variables that your running jobscript can use:
Variable | Description |
---|---|
SLURM_SUBMIT_DIR | The directory that the job was submitted from |
SLURM_JOB_NAME | The name of the job (such as specified with --job-name=) |
SLURM_JOB_ID | The unique identifier (job id) for this job |
SLURM_JOB_NODELIST | List of node names assigned to the job |
SLURM_NTASKS | Number of tasks allocated to the job |
SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node available to the job |
SLURM_JOB_NUM_NODES | Number of nodes allocated to the job |
SLURM_ARRAY_TASK_ID | This task's index within the job array |
SLURM_ARRAY_JOB_ID | The master job id for the job array |
SLURM_PROCID | Uniquely identifies each task. This ranges from 0 to the number of tasks minus 1 |
See the man page for sbatch for more environment variables.
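As an illustrative sketch of these variables in use, a job-array script might select its input file from SLURM_ARRAY_TASK_ID (the file-naming scheme, executable ./a.out and resource values are placeholders):
#!/bin/bash -l
#SBATCH --array=0-9
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --account=[your-project]
#SBATCH --export=NONE
# each array element processes its own input file
srun --export=ALL -n 1 ./a.out input_${SLURM_ARRAY_TASK_ID}.dat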
Reservations
With prior arrangement, for special cases, a user may request a resource reservation under SLURM. If successful, a named reservation will be created, and you may submit jobs during the allocated time period. See the Extraordinary Resource Requests policy.
sbatch --reservation=[name] job.script
You may also view reservations on the system with the scontrol command,
scontrol show reservations
or view jobs in the queue for a reservation.
squeue -R [name]
Project Accounting
There are a couple of ways to check a group's or user's usage against their allocation, whether of compute time or of storage (currently only the /group file system is subject to a quota).
Time Allocation vs Usage
Users can use SLURM commands such as sacct and sreport to query their project's usage. For more details, please consult the general SLURM documentation.
Pawsey also provides a tailored suite of tools called pawseytools, which is configured as a default module upon login. The pawseyAccountBalance utility in pawseytools shows the current state of the default group's usage against its allocation. For example:
pawseyAccountBalance -p project123 -users
Compute Information
-------------------
 Project ID     Allocation      Usage     % used
 ----------     ----------      -----     ------
 project123        1000000     372842       37.3
    --user1                     356218       35.6
    --user2                       7699        0.8
gives a list of the quarterly usage of the members of project123, in service units.
pawseyAccountBalance -p project123 -yearly
Compute Information
-------------------
 Project ID     Period        Usage
 ----------     ------        -----
 project123     2018Q1       372842
 project123     2018Q2       250000
 project123     2018Q1-2     622842
This prints the usage of project123 for the whole year, by quarter.
Accounting data for historical, queued and running jobs can be displayed with the sacct command.
~> sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
14276              bash      workq director1+         16  COMPLETED      0:0
14277         script.sh      workq director1+         16     FAILED      1:0
14354              bash      workq director1+         16 CANCELLED+      0:0
14355              bash      workq director1+         32    RUNNING      0:0
Accounting data for a specific job can be displayed with the -j option. The time window for searching defaults to 00:00:00 of the current day. To find jobs from earlier than that, the scope of the time window can be expanded with the -S option.
sacct -j [job id]
sacct -j [job id] -S [YYYY-MM-DD]
Additional filtering options are supported by sacct that can be used to limit the jobs that are displayed.
sacct --account=[account]
sacct --name=[job name]
sacct --user=[user]
The fields displayed can be fine-tuned with the --format option:
~> sacct --format=jobid,jobname,partition,user,account%16,alloccpus,nnodes,elapsed,cputime,state,exitcode
or by setting the SACCT_FORMAT environment variable:
~> export SACCT_FORMAT="jobid,jobname,partition,user,account%16,alloccpus,nnodes,elapsed,cputime,state,exitcode"
~> sacct
       JobID    JobName  Partition      User          Account  AllocCPUS   NNodes    Elapsed    CPUTime      State ExitCode
------------ ---------- ---------- --------- ---------------- ---------- -------- ---------- ---------- ---------- --------
14276              bash      workq     pryan      director100         16        1   00:00:28   00:07:28  COMPLETED      0:0
14277         script.sh      workq     pryan      director100         16        1   00:00:00   00:00:00     FAILED      1:0
14278              bash      workq     pryan      director100         16        1   00:00:05   00:01:20  COMPLETED      0:0
14279              bash      workq     pryan      director100         16        1   00:00:04   00:01:04  COMPLETED      0:0
14354              bash      workq     pryan      director100         16        2   00:00:00   00:00:00 CANCELLED+      0:0
14355              bash      workq     pryan      director100         32        2   00:06:51   03:39:12    RUNNING      0:0
The sreport command can also be used to generate similar reports from the accounting data stored in the SLURM database. Its figures may differ from the sacct information on systems with hyper-threading enabled; the value reported by sreport may need to be divided by the number of hyper-threads.
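As a hedged sketch (the project name and date range are placeholders), a per-user utilisation report for a quarter might be generated with:
sreport cluster AccountUtilizationByUser account=project123 start=2018-01-01 end=2018-04-01 -t hours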
Storage Allocation vs Usage
The only file system with a quota at the time of writing is the /group file system. It is a Lustre file system, so some generic commands are available for users to check their usage. See the Lustre File System page for more information.
The pawseyAccountBalance utility also provides a way to check one's group usage versus their quota. For example,
pawseyAccountBalance -p project123 -storage
...
Storage Information
-------------------
/group usage for project123, used = 899.54 GiB, quota = 1024.00 GiB
which shows the current usage and quota, in GiB, for project123.
For more information, please check the help output of "pawseyAccountBalance -h".