...

  1. SLURM allocates resources to a job for a fixed amount of time.  This time limit is either the one specified in the job request or, if none is specified, the partition's default limit.  Every SLURM partition also has a maximum limit, so if you have not already requested the maximum, try increasing the time limit with the --time= flag to sbatch or salloc (an example follows the sinfo output below).
    To see the maximum and default time limits, use:

    Code Block
    > sinfo -o "%.10P %.5a %.10l %.15L %.6D %.6t" -p workq
     PARTITION AVAIL  TIMELIMIT     DEFAULTTIME  NODES  STATE
        workq*    up 1-00:00:00         1:00:00      1  drain
        workq*    up 1-00:00:00         1:00:00     20   resv
        workq*    up 1-00:00:00         1:00:00     34    mix
        workq*    up 1-00:00:00         1:00:00     25  alloc
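
    For example, to request a 12-hour limit (the 12:00:00 value and the script name myjob.slurm are only illustrations; use your own values, up to the partition's TIMELIMIT):

    Code Block
    > # myjob.slurm is a placeholder for your own batch script
    > sbatch --time=12:00:00 myjob.slurm
    > salloc --time=12:00:00 --nodes=1

    The same limit can also be set inside the batch script with the directive #SBATCH --time=12:00:00.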


  2. Usually, if your allocation is not sufficient for a job to run to completion, SLURM will not start the job.  However, if several jobs start at about the same time, each job on its own may fit within the remaining allocation while collectively they exceed it.  When this happens they will all start, but they are terminated once the allocation is used up.  You can tell this is the case if the End time does not match the Start time plus the job's Timelimit, i.e. the Elapsed time is shorter than the Timelimit (see the example output and the scripted check below).

    Code Block
    > sacct -j 2954681 -o jobid,start,end,elapsed,timelimit
           JobID               Start                 End    Elapsed  Timelimit
    ------------ ------------------- ------------------- ---------- ----------
    2954681      2019-03-18T19:51:18 2019-03-19T01:45:48   05:54:30 1-00:00:00
    2954681.bat+ 2019-03-18T19:51:18 2019-03-19T01:45:49   05:54:31
    2954681.ext+ 2019-03-18T19:51:18 2019-03-19T01:45:49   05:54:31
    2954681.0    2019-03-18T19:51:22 2019-03-19T01:45:52   05:54:30
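
    Here the job was terminated after about six hours even though a 24-hour limit was requested.  A minimal scripted version of the same check (using the example job ID above) prints just the two fields in parsable form so they can be compared at a glance:

    Code Block
    > # -X: allocation only, -n: no header, -P: pipe-delimited output
    > sacct -j 2954681 -X -n -P -o elapsed,timelimit
    05:54:30|1-00:00:00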

    If this is the case, check whether your allocation has been exhausted.  If it has, contact the Pawsey helpdesk.

    Code Block
    > pawseyAccountBalance


...