Scheduling System 


The scheduling system used at Pawsey is SLURM, the Simple Linux Utility for Resource Management. Both Magnus and Galaxy now use native SLURM, which handles not only the reservation and scheduling of resources on the system, but also the launch and placement of user jobs on the back-end compute nodes.

A batch job submitted to the scheduling system on the front end via sbatch will, when run, execute the commands in the batch script serially on the lowest-numbered node allocated by SLURM. The srun command is used to launch multiple instances of an executable and run them in parallel.

The recommended way to submit a job is with a script that supplies two kinds of information:

  • First, SLURM directives prefixed with "#SBATCH", placed in the header of the script, specify the job requirements needed to reserve appropriate resources and schedule the job. These should include at least the number of nodes, the number of tasks and the walltime required by the job.
  • Second, srun flags such as -n, -N and -c specify the exact placement of the application onto those resources: how many instances of the executable to launch, how many processes/threads to use, and how to place tasks on the physical and logical cores available on a node.

This approach requires users to think about the resource requirements (the number of nodes, as well as a time limit) in advance, and to specify the appropriate placement explicitly to srun. Note that the resources requested from SLURM and the placement instructions given to srun must coincide, otherwise unexpected behaviour may occur; this requires a good understanding of how the application runs and how much resource the workflow needs.
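As an illustrative sketch (assuming 24-core nodes, as on Magnus), the following header requests exactly the resources that the matching srun line then uses:

#SBATCH --nodes=2
#SBATCH --ntasks=48

srun -N 2 -n 48 ./a.out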

Specifying as many job requirements as possible in the header of the SLURM script is recommended. This has the advantage that the user is notified of any incorrect oversubscription of resources at submission time rather than when the job executes.

Similarly, for interactive jobs started with salloc, the srun launcher must be used to run multiple instances of an executable in parallel. Serial commands in a salloc session can run without the srun launcher.

SLURM Glossary 


It is important to understand that some SLURM terms have meanings that differ from the same terms in other batch or resource schedulers.

Term       Description
---------  -----------
CPU        The smallest physical consumable. On multi-core machines this is a core; on machines with hyper-threading enabled it is a hardware thread.
Task       A task under SLURM is a synonym for a process, and is often the number of MPI processes required.
Account    The entity to which used resources are charged.
Partition  SLURM groups nodes into sets called partitions, and jobs are submitted to a partition to run. Other batch systems use the term queue.

SLURM Commands 


The most useful user commands are listed in the table below, and explained further through example.

Purpose                                       Command
--------------------------------------------  -----------------
Submit a batch script                         sbatch
Allocate resources for an interactive job    salloc
Run an executable or job                      srun
Report the state of cluster nodes/partitions  sinfo
Show the contents of the queue                squeue
Show accounting data for a completed job      sacct
Report details of a job                       scontrol show job
Cancel a pending or running job               scancel
Hold a job                                    scontrol hold
Release a job                                 scontrol release

Submitting Jobs

A job is submitted to SLURM by running the sbatch command with a job script:

sbatch [options] job.script

The sbatch command accepts options that override those specified inside the job script. A correctly formed batch script job.script is submitted to the queue with the sbatch command; upon success, a unique job identifier is returned. By default, the job is submitted to the workq partition, and both standard output and standard error appear in the file slurm-[jobid].out in the directory from which the script was submitted.
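For example, to override the walltime given inside the script at submission time:

sbatch --time=00:30:00 job.script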

To submit a script to a particular partition, use the -p option:

sbatch -p debugq job.script

This will submit the job to the debugq partition.

Interactive Jobs

It is possible to run serial or parallel jobs interactively through SLURM. This is a very useful feature when developing parallel codes, debugging applications, or running X applications. The salloc command is used to obtain a SLURM job allocation for a set of nodes. It takes similar options to the sbatch command to specify the resources required.

salloc [options] [command]

For interactive jobs, the command argument is optional, and by default the job will launch the user’s shell.

~> salloc --ntasks=16
salloc: Granted job allocation 103
~>
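Parallel executables are launched within the allocation using srun (described below); for example, to run the 16 tasks requested above:

srun -n 16 ./a.out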

Applications can be started within this shell without further configuration. Exiting the shell completes the interactive job, releasing the job allocation in the process.

~> exit
exit
salloc: Relinquishing job allocation 103
salloc: Job allocation 103 has been revoked.
~>

Launching Executables

SLURM provides its own job launcher, srun. It provides similar functionality to other job launchers, such as OpenMPI's mpirun, and runs the specified executable on resources allocated by the sbatch or salloc command.

srun [options] [executable]

srun can also be used outside of a batch script or an interactive job by explicitly specifying the resource options and the executable to run, except on Magnus and Galaxy.

srun options: Application Placement

The job launcher srun takes a number of options to specify how the problem is mapped to the available hardware. Common options include:

-n ntasks

Specify ntasks, the number of instances of the executable to be launched. For MPI jobs, this is the total number of MPI tasks. For other types of job, -n must still be present and should be set to 1 (i.e. -n 1).

-N node-count

For MPI jobs, node-count will specify the number of nodes allocated to run the MPI job. For other types of job, -N is set to 1.

-c cpus-per-task

For OpenMP or otherwise threaded jobs, cpus-per-task specifies the number of threads per task (the "depth"). This should correspond to the OMP_NUM_THREADS value for OpenMP applications.

--cpu_bind=[{quiet,verbose},]type

Can be used to bind tasks to CPUs. Supported types include cores, threads, rank, sockets, map_cpu, etc.

--hint=multithread

Use this option to enable hyper-threading of physical cores.

Even though hyper-threading is enabled at the hardware level on Magnus and Galaxy, it is disabled at the SLURM level (the SLURM_HINT environment variable is set to nomultithread by default).
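As a sketch of how these options combine (assuming 24-core nodes, as on Magnus, and a hybrid MPI/OpenMP binary hybrid.x, a placeholder name), the following launches 8 MPI tasks with 6 OpenMP threads each across two nodes:

export OMP_NUM_THREADS=6
srun -N 2 -n 8 -c 6 ./hybrid.x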

For more information, please see the official documentation for srun, or refer to the srun man page.

Cluster Status

The status of the cluster nodes and partitions can be viewed with the following command.

sinfo [options]

The default output shows the status for each partition, together with the configuration and state.

Within Pawsey, several clusters are defined in SLURM: Magnus, Galaxy and Zeus. These mostly correspond to the supercomputers, though Zeus also contains several data-mover partitions. To see the status of another cluster, use

sinfo -M <clustername>

Monitoring Jobs

The status of all jobs being managed by the SLURM scheduler can be viewed with the squeue command. Without any options all jobs are displayed.

squeue [options]

An example output from the squeue command is shown below.

~> squeue
JOBID    USER    ACCOUNT     NAME      EXEC_HOST  ST  REASON    START_TIME  END_TIME      TIME_LEFT   NODES  PRIORITY
2985935  reaper  pawsey0001  slurm.sh  nid01235   R   None      13:19:48    Tomorr 13:19  23:30:14    32     5348
2988682  reaper  pawsey0001  slurm.sh  n/a        PD  Priority  N/A         N/A           1-00:00:00  32     5229
~>

Results can be filtered by user, account or job list. A summary of common options is shown in the following table.

squeue option             Description
------------------------  ---------------------------------------------
--account=<account list>  Filter results by account
--array                   Display job arrays one element per line
--jobs=<job list>         Comma-separated list of job IDs to display
--long                    Display output in long format
--name=<name list>        Filter results by job name
--partition=<partition>   Comma-separated list of partitions to display
--user=<user>             Filter results by the listed user names

The fields displayed can be fine-tuned with the --format option:

~> squeue --format="%.6i %.10P %.8u %15a %.15j %.3t %9r %19S %.10M %.10L %.5D %.4C %Q %N"

or by setting the SQUEUE_FORMAT environment variable:

~> export SQUEUE_FORMAT="%.6i %.10P %.8u %15a %.15j %.3t %9r %19S %.10M %.10L %.5D %.4C %Q %N"
~> squeue
JOBID  PARTITION  USER   ACCOUNT      NAME          ST  REASON  START_TIME           TIME     TIME_LEFT  NODES  CPUS  PRIORITY  NODELIST
4679   workq      user1  director100  run           R   None    2018-04-17T00:00:46  9:03:52  2:56:08    16     256   3736      nid000[39-54]
4680   workq      user1  director100  run           R   None    2018-04-17T00:01:14  9:03:24  2:56:36    32     512   3737      nid00[151-182]
4682   workq      user2  director100  script.20.00  R   None    2018-04-17T07:24:18  1:40:20  5:19:40    2      32    3764      nid00[144-145]

For detailed information on squeue output formatting please refer to manual pages (type man squeue on Magnus, Galaxy or Zeus).

Viewing Details of Jobs

To view detailed job information, the scontrol subcommand show job can be used.

scontrol show job [job id]

Information such as resources requested, submit/start time, node lists and more are available.

Please note that this information is available only for queued and running jobs. For information about completed jobs, refer to the sacct description below.

Deleting Jobs

If for some reason you wish to stop a job that is running or delete a job in the queue, use the scancel command:

scancel [job id [ job id] ...]

This will send a signal to the job specified (via unique identifier) to stop. If running, the job will be terminated; if queued, the job will be removed.

Flexible filtering options also permit Job IDs to be automatically selected based on account, job name or user name, or any combination of those.

scancel --account=[account]
scancel --name=[job name]
scancel --user=[user]

Arbitrary signals may also be sent using the --signal=[signal name] option. Signals may be specified by either name or number.
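For example, to send the SIGUSR1 signal to a job:

scancel --signal=USR1 [job id]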

Holding Jobs

To hold a job manually in order to prevent the job being scheduled for execution, the scontrol subcommand hold can be used.

scontrol hold [job id]

It is not possible to hold a job that has already begun its execution.

Releasing Jobs

To release a job that was previously held manually, the subcommand release is used.

scontrol release [job id]

SLURM Directives


Essential directives

The following SLURM directives are a minimum requirement for specifying the resources needed by a job.

#SBATCH --nodes=nnodes

Request nnodes nodes for the job. On Magnus and Galaxy, this allocates all of the cores on each node, whereas on Zeus it allocates only one core per node by default.

#SBATCH --ntasks=ntasks

Request ntasks cores for the job. This should be large enough to accommodate the requirements passed to srun. On Magnus and Galaxy, it should be a multiple of the number of cores available on a node, for efficient use of resources.

#SBATCH --time=hh:mm:ss

Request a wall-clock time limit for the job in hours:minutes:seconds format. If the job exceeds this limit, it will be terminated.

#SBATCH --account=[your-project]

A valid project code must be specified by replacing [your-project].
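Putting these together, an illustrative minimal header for a two-node job on Magnus might read as follows (pawsey0001 stands in for your project code):

#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --time=01:00:00
#SBATCH --account=pawsey0001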

Controlling standard output and standard error

By default, output that would have appeared on the terminal in an interactive job (both standard output and standard error) is sent to a file

slurm-[jobid].out

in the working directory from which the job was submitted (with [jobid] being replaced by the appropriate numeric SLURM job id). The name of this standard output file may be controlled via:

#SBATCH --output=myjob.log

The unique job id may be included by using the special token "%j":

#SBATCH --output=myjob-%j.log

resulting in the creation of, e.g., the file myjob-012345.log in the working directory for a job with SLURM id 012345.

The destination of standard error may be controlled via:

#SBATCH --error=myjob-%j.err

In addition to these directives, the following can be used to request e-mail notification of job events:

#SBATCH --mail-type=ALL
#SBATCH --mail-user=myaddress@myorg.edu.au

Here is the list of the SLURM directives commonly required for jobs to run at Pawsey:

Option                    Purpose
------------------------  -------
--account=account         Set the account to which the job is charged. A default account is configured for each user.
--nodes=nnodes            Specify the total number of nodes.
--ntasks=number           Specify the total number of tasks (processes).
--ntasks-per-node=number  Specify the number of tasks per node.
--cpus-per-task=number    Specify the number of cores (physical or logical) per task.
--mem=size                Specify the memory required per node.
--mem-per-cpu=size        Specify the minimum memory required per CPU core.
--time=hh:mm:ss           Set the wall-clock time limit for the job.
--job-name=name           Set the job name (as it appears under squeue); defaults to the script name.
--output=filename         Set the standard output file name. Use the token "%j" to include the job ID.
--error=filename          Set the standard error file name.
--partition=partition     Request an allocation on the specified partition; if not given, jobs are submitted to the default partition.
--array=list              Specify an array job with the given indices.
--dependency=list         Specify a job dependency.
--mail-type=list          Request e-mail notification for the events in list, which may include BEGIN, END, FAIL, or a comma-separated combination of valid tokens.
--mail-user=address       Specify an e-mail address for notifications.
--export=variables        Control which environment variables are propagated to the batch job. The recommended value is NONE.

Specifying Resources


Note that when requesting memory, the available memory is less than the physical memory of a node, as several gigabytes are used by the operating system.

On Magnus/Galaxy

The basic unit of allocation on all Magnus and Galaxy queues is a node.  The default is

#SBATCH --nodes=1

which will grant all of the cores and all of the available memory on the node to the job.

Galaxy also has 64 NVIDIA K20X ‘Kepler’ GPU nodes. To request a single K20X Kepler GPU node:

#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=kepler
#SBATCH --partition=gpuq
#SBATCH --export=NONE

On Zeus

The basic unit of allocation in the work queue is a task, so users should request resources on the basis of nodes and tasks.  The default is

#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4096MB

which will grant one core and 4GB of memory to the job.

Additional cores can be requested via --ntasks, --cpus-per-task, or a combination of the two.

For instance, to obtain a whole node use:

#SBATCH --nodes=1
#SBATCH --ntasks=28

or

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28

As the system is not homogeneous, further options may be needed if more than one node with uniform hardware is required; otherwise, the scheduler will allocate nodes as they become available (subject to certain algorithmic constraints).

If a certain amount of memory per node is required, this should be specified in GB, e.g.:

#SBATCH --nodes=1
#SBATCH --mem=125GB

On Topaz

Topaz has two different types of GPU resources (P100 and V100). Compute nodes in the gpuq partition are configured as a shared resource, whereas nodes in the nvlinkq partition are configured as an exclusive (non-shared) resource. This means it is especially important to specify the number of GPUs, the number of tasks and the amount of memory required by the job. If these are not specified, the job will by default be allocated a single CPU core, no GPUs and around 10 GB of RAM.

It is recommended that all jobs request the following:

  • the number of nodes with --nodes,
  • the number of GPUs per node with --gres=gpu:N (this should always be used, except when compiling),
  • the number of processes with --ntasks-per-node and --ntasks-per-socket,
  • the number of threads per process with --cpus-per-task (for multithreaded jobs),
  • the amount of memory per node with --mem (note that if this option is not used, the scheduler will allocate approximately 10 GB of memory per process),
  • the walltime with --time,
  • the partition with --partition,
  • the Pawsey project ID with --account.

Example job scripts for Topaz are provided here. Users can request a specific type of GPU with the --constraint option, whose value can be either p100 or v100.

#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --constraint=p100

This requests a node with a single GPU, specifically a Tesla P100 card.

As on other Pawsey systems, the salloc command can be used to run interactive sessions. The #SBATCH options mentioned above can be passed as flags to specify interactive job parameters; for example, to run an MPI code using 2 GPUs, one can open an interactive session with the following command:

salloc --nodes=1 --gres=gpu:2 --ntasks-per-node=2 --ntasks-per-socket=1 --mem=180gb --time=00:05:00 --partition=gpuq --account=[your-project]

For all interactive sessions, after salloc has run and you are on a compute node, the srun command must be used to execute your commands. This applies to all commands; for instance, srun is needed to run nvidia-smi on the interactive node:

$ salloc -N 1 -pgpuq --gres=gpu:1 --ntasks-per-node=1
salloc: Granted job allocation 1861
salloc: Waiting for resource configuration
salloc: Nodes t016 are ready for job
bash-4.2$ nvidia-smi
No devices were found
bash-4.2$ srun -n1 nvidia-smi
Tue Nov 12 11:54:58 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

An Example Batch Script


A job script has a header section, which specifies the resources required to run the job, followed by the commands to be executed. Specifying as many job requirements as possible in the header is recommended, so that the user is notified of any incorrect oversubscription of resources at submission time rather than when the job executes. An example script is shown below.

01 #!/bin/bash -l
02 # 2 nodes, 24 MPI processes/node, 48 MPI processes total
03 #SBATCH --job-name="myjob"
04 #SBATCH --time=02:00:00
05 #SBATCH --nodes=2
06 #SBATCH --ntasks=48
07 #SBATCH --ntasks-per-node=24
08 #SBATCH --cpus-per-task=1
09 #SBATCH --output=myjob.%j.o
10 #SBATCH --error=myjob.%j.e
11 #SBATCH --account=pawsey0001
12 #SBATCH --export=NONE
13 #======START=====
14 echo "The current job ID is $SLURM_JOB_ID"
15 echo "Running on $SLURM_JOB_NUM_NODES nodes"
16 echo "Using $SLURM_NTASKS_PER_NODE tasks per node"
17 echo "A total of $SLURM_NTASKS tasks is used"
18 echo "Node list:"
19 sacct --format=JobID,NodeList%100 -j $SLURM_JOB_ID
20 srun --export=ALL -u ./a.out
21 #=====END====

The first line specifies that the bash shell will be used to interpret the contents of this script; invoking it with the -l option ensures that the standard Pawsey environment is loaded. The second line is a comment. The SLURM directives, prefixed with #SBATCH, begin at line 3 and state the resources required and the properties of the job. Line 3 gives the job a name. Line 4 requests 2 hours of walltime. Line 5 requests 2 nodes, line 6 requests 48 MPI processes, line 7 requests 24 processes per node, and line 8 requests one CPU core per task. Line 9 specifies the name of the output file (%j is the job number), and line 10 the file to which errors should be written. Line 11 gives the account to which this walltime should be charged. Line 13 is an optional separator between the directive preamble and the actions in the script. Lines 14-19 print useful (but optional) diagnostic information. Line 20 invokes srun to run the code (./a.out), and line 21 is an optional separator marking the end of the script.

  • SLURM copies the entire environment from the shell where the job was submitted. This may break batch scripts that require a different environment from, say, the login environment. To guard against this, "#SBATCH --export=NONE" should be specified in each batch script, so that every job starts in a fresh environment and is reproducible both by you and by Pawsey support staff.
  • SLURM does not set OMP_NUM_THREADS in the environment of a job. Users should set this manually in their batch scripts; it is normally the same value as that given to --cpus-per-task (see the sketch below).
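As a minimal sketch, the thread count can be tied to SLURM_CPUS_PER_TASK, which SLURM sets in the job environment when --cpus-per-task is specified:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --export=ALL ./a.out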

Environment for srun-launched Executables

By default, srun propagates the user environment to the launched executable. However, when srun is used within sbatch, the default behaviour may change: "#SBATCH --export=" sets the SLURM_EXPORT_ENV environment variable, which srun will use by default if it exists. Since we recommend setting "#SBATCH --export=NONE", this must be overridden on the srun command line, otherwise srun will not propagate any environment variables.

#!/bin/bash -l
#SBATCH --export=NONE

srun --export=ALL ./a.out


SLURM Environment Variables 


SLURM sets environment variables that your running jobscript can use:

Variable                 Description
-----------------------  -----------------------------------------------
SLURM_SUBMIT_DIR         The directory the job was submitted from
SLURM_JOB_NAME           The name of the job (as specified with --job-name)
SLURM_JOB_ID             The unique identifier (job ID) for this job
SLURM_JOB_NODELIST       The list of node names assigned to the job
SLURM_NTASKS             The number of tasks allocated to the job
SLURM_JOB_CPUS_PER_NODE  The number of CPUs per node available to the job
SLURM_JOB_NUM_NODES      The number of nodes allocated to the job
SLURM_ARRAY_TASK_ID      This task's ID in the job array
SLURM_ARRAY_JOB_ID       The master job ID for the job array
SLURM_PROCID             Uniquely identifies each task; ranges from 0 to the number of tasks minus 1
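For example, a job array script might use SLURM_ARRAY_TASK_ID to select a different input file per array element (the input_*.dat names below are placeholders):

#SBATCH --array=0-7
srun -n 1 ./a.out input_${SLURM_ARRAY_TASK_ID}.dat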

See the man page for sbatch for more environment variables.

Reservations 


In special cases, and with prior arrangement, users may request a resource reservation under SLURM. If successful, a named reservation is created, and jobs may be submitted to it during the allocated time period. See the Extraordinary Resource Requests policy.

sbatch --reservation=[name] job.script

You may also view reservations on the system with the scontrol command,

scontrol show reservations

or view jobs in the queue for a reservation.

squeue -R [name]

Project Accounting 


There are a couple of ways to check a group's or user's usage against their allocation, be it time or storage (currently only the /group file system carries a quota).

Time Allocation vs Usage

Users can use SLURM commands such as sacct and sreport to get their project's usage. For more details on that, please consult the general SLURM documentation.

Pawsey also provides a tailored suite of tools called pawseytools, which is loaded as a default module at login. The pawseyAccountBalance utility in pawseytools shows the current state of the default group's usage against its allocation. For example:

pawseyAccountBalance -p project123 -users

Compute Information
-------------------
Project ID   Allocation   Usage    % used
----------   ----------   ------   ------
project123   1000000      372842   37.3
--user1                   356218   35.6
--user2                     7699    0.8

gives the quarterly usage of the members of project123, in service units.

pawseyAccountBalance -p project123 -yearly

Compute Information
-------------------
Project ID   Period     Usage
----------   --------   ------
project123   2018Q1     372842
project123   2018Q2     250000
project123   2018Q1-2   622842

This prints the usage of project123 for the whole year, by quarter.

Accounting data for historical, queued and running jobs can be displayed with the sacct command.

~> sacct
JobID  JobName    Partition  Account     AllocCPUS  State       ExitCode
-----  ---------  ---------  ----------  ---------  ----------  --------
14276  bash       workq      director1+  16         COMPLETED   0:0
14277  script.sh  workq      director1+  16         FAILED      1:0
14354  bash       workq      director1+  16         CANCELLED+  0:0
14355  bash       workq      director1+  32         RUNNING     0:0

Accounting data for a specific job can be displayed with the -j option. The search window defaults to 00:00:00 of the current day; to find earlier jobs, widen the window with the -S option.

sacct -j [job id]
sacct -j [job id] -S [YYYY-MM-DD]

Additional filtering options are supported by sacct that can be used to limit the jobs that are displayed.

sacct --account=[account]
sacct --name=[job name]
sacct --user=[user]

The fields displayed can be fine-tuned with the --format option:

~> sacct --format=jobid,jobname,partition,user,account%16,alloccpus,nnodes,elapsed,cputime,state,exitcode

or by setting the SACCT_FORMAT environment variable:

~> export SACCT_FORMAT="jobid,jobname,partition,user,account%16,alloccpus,nnodes,elapsed,cputime,state,exitcode"
~> sacct
JobID  JobName    Partition  User   Account      AllocCPUS  NNodes  Elapsed   CPUTime   State       ExitCode
-----  ---------  ---------  -----  -----------  ---------  ------  --------  --------  ----------  --------
14276  bash       workq      pryan  director100  16         1       00:00:28  00:07:28  COMPLETED   0:0
14277  script.sh  workq      pryan  director100  16         1       00:00:00  00:00:00  FAILED      1:0
14278  bash       workq      pryan  director100  16         1       00:00:05  00:01:20  COMPLETED   0:0
14279  bash       workq      pryan  director100  16         1       00:00:04  00:01:04  COMPLETED   0:0
14354  bash       workq      pryan  director100  16         2       00:00:00  00:00:00  CANCELLED+  0:0
14355  bash       workq      pryan  director100  32         2       00:06:51  03:39:12  RUNNING     0:0

The sreport command can also be used to generate similar reports from the accounting data stored in the SLURM database. Its figures may differ from those of sacct on systems with hyper-threading enabled; the value reported by sreport may need to be divided by the number of hyper-threads.

Storage Allocation vs Usage

The only file system with a quota at the time of writing is the /group file system. It is a Lustre file system, so generic commands are available for users to check their usage; see the Lustre File System page for more information.

The pawseyAccountBalance utility also provides a way to check a group's usage against its quota. For example,

pawseyAccountBalance -p project123 -storage
...
Storage Information
-------------------
/group usage for project123, used = 899.54 GiB, quota = 1024.00 GiB

which shows the current usage and quota in GiB for project123.

For more information, see the help output of "pawseyAccountBalance -h".
