Supercomputers usually have two types of nodes: 1) a few login nodes on the front end, dedicated to interactive activities such as logging in, compiling, debugging, file management and job submission, and 2) hundreds or even thousands of compute nodes on the back end, which actually execute the jobs that users submit. Both types of nodes are shared by all users on the system. On the front end, this means users should not run compute- or memory-intensive jobs that prevent other users from working normally; on the back end, it means a scheduling mechanism is needed so that users take turns running their jobs. Popular scheduling systems include SLURM, PBS and Torque. Currently, all Pawsey systems use SLURM.
SLURM (Simple Linux Utility for Resource Management) is a batch queuing system and scheduler that is highly scalable, capable of operating a heterogeneous cluster with up to tens of millions of processors. It can sustain a throughput of more than 120,000 jobs per hour, with bursts of job submissions at several times that rate. It has built-in fault tolerance, making it highly resilient to system failures. Plug-ins can be added to support various interconnects, authentication methods, schedulers, and more. Most importantly, SLURM is open source, licensed under the GNU GPL, and can be ported to UNIX-like operating systems. Like any batch system, SLURM allows users to submit batch jobs, check on their status, and cancel them if necessary. Interactive jobs can also be submitted, and detailed estimates of expected run times can be viewed. The underlying concepts are the same as those of PBS-based systems such as PBS Pro, but the syntax differs.
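The batch workflow described above can be sketched with a minimal job script and the standard SLURM commands (sbatch, squeue, scancel and salloc are real SLURM commands; the script name, program name and resource values here are illustrative and not specific to any Pawsey system):

```shell
#!/bin/bash -l
# myjob.sh -- a minimal SLURM batch script (resource requests are illustrative)
#SBATCH --job-name=myjob
#SBATCH --nodes=1            # number of compute nodes
#SBATCH --ntasks=4           # number of MPI tasks
#SBATCH --time=00:10:00      # walltime limit (hh:mm:ss)

# Launch the program on the allocated resources.
srun ./my_program

# Typical interaction from the login node:
#   sbatch myjob.sh       # submit the job; prints the assigned job ID
#   squeue -u $USER       # check the status of your queued/running jobs
#   scancel <jobid>       # cancel a job if necessary
#   salloc --nodes=1      # request an interactive allocation instead
```

The `#SBATCH` lines are directives read by SLURM at submission time; everything after them is an ordinary shell script executed on the allocated compute nodes.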
Essential SLURM commands and functionality are documented in this section. A thorough comparison with PBS Pro is also provided at Migrating from PBS Pro to SLURM, to help users move their workflows from PBS Pro to SLURM.
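As a brief taste of that comparison, the most common PBS Pro commands have direct SLURM counterparts (this is the well-known correspondence between the two systems; exact options and output formats differ):

```shell
# PBS Pro                SLURM equivalent
qsub job.sh            # sbatch job.sh      -- submit a batch job
qstat                  # squeue             -- query the job queue
qstat -f <jobid>       # scontrol show job <jobid>  -- detailed job info
qdel <jobid>           # scancel <jobid>    -- cancel a job
# In job scripts, '#PBS' directives become '#SBATCH' directives,
# e.g. '#PBS -l walltime=00:10:00' becomes '#SBATCH --time=00:10:00'.
```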