Slurm usage

Slurm partitions

To match your job requirements with the hardware, you choose among the available Slurm partitions (e.g. cpu-clx; see Slurm partition CPU CLX for an overview).
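A quick overview of the partitions available to you, including their walltime limits and node counts, can be obtained directly from Slurm; the partition name in the second command is only an example:

$ sinfo -s                   # one summary line per partition
$ sinfo -p cpu-clx:test -l   # detailed view of a single partition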

Important Slurm commands

The commands normally used for job control and management are:

  • Job submission:
    $ sbatch <jobscript>
    $ srun <arguments> <command>
  • Status of a specific job:
    $ squeue -j <jobID> 
    $ scontrol show job <jobID> (for full job information, even after the job finished)
  • Job cancellation:
    $ scancel <jobID>
    $ scancel -i -u $USER (cancel all your jobs, asking at each job)
    $ scancel -9 <jobID> (send SIGKILL instead of SIGTERM)
  • Job overview:
    $ squeue
    $ squeue -l --me
  • Estimated job start:
    $ squeue --start -j <jobID>
  • Workload overview of the whole system:
    $ sinfo
    $ sinfo --format="%25C %A"
    $ squeue -l
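
A typical sequence combining these commands looks as follows; the job script name and the job ID are placeholders:

$ sbatch my-jobscript.slurm
Submitted batch job 1234567
$ squeue -j 1234567             # check the job's state
$ scontrol show job 1234567     # full job information
$ scancel 1234567               # cancel the job if it is no longer needed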

Job scripts

A job script can be any script that contains special instructions for Slurm. The most commonly used forms are shell scripts, such as bash or plain sh, but other scripting languages (e.g. Python, Perl, R) are also possible.

Example Batch Script
#!/bin/bash

#SBATCH -p cpu-clx:test   # Slurm partition
#SBATCH -N 4              # number of nodes
#SBATCH -t 06:00:00       # walltime limit

module load impi
mpirun my-mpi-parallel-binary

Job scripts need to have a shebang line at the top, followed by the #SBATCH options. These #SBATCH comments must come before any other content, as Slurm stops scanning for them after the first non-comment, non-whitespace line (e.g. a command or a variable declaration), as illustrated below.
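
As a small illustration of this rule, the (deliberately flawed) sketch below reuses the partition and binary names from above; the first two #SBATCH lines are honored, while the one placed after a regular command is silently ignored:

#!/bin/bash
#SBATCH -N 4              # parsed by Slurm
#SBATCH -t 06:00:00       # parsed by Slurm

module load impi          # first regular command: Slurm stops scanning here
#SBATCH -p cpu-clx:test   # ignored, treated as an ordinary comment

mpirun my-mpi-parallel-binary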

More examples can be found at Examples and Recipes.

Parameters

Parameter             SBATCH flag            Comment
# nodes               -N <#>
# tasks               -n <#>
# tasks per node      --tasks-per-node <#>   different defaults between mpirun and srun
Slurm partition       -p <name>              e.g. cpu-clx, overview: Slurm partition CPU CLX
# CPU cores per task  -c <#>                 interesting for OpenMP/hybrid jobs
job walltime limit    -t hh:mm:ss            realistic estimates for best scheduling efficiency
e-mail notification   --mail-type=ALL        see the sbatch manpage for notification types
project account       -A <project-ID>        project account to be charged
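
As a sketch of how these flags combine, the following hybrid MPI/OpenMP job script requests 2 nodes with 4 tasks per node and 12 cores per task; the partition, walltime, project ID and binary name are placeholders to be replaced with your own values:

#!/bin/bash
#SBATCH -p cpu-clx              # Slurm partition
#SBATCH -N 2                    # number of nodes
#SBATCH --tasks-per-node 4      # MPI tasks per node
#SBATCH -c 12                   # CPU cores per task (OpenMP threads)
#SBATCH -t 12:00:00             # walltime limit
#SBATCH --mail-type=ALL         # e-mail notifications
#SBATCH -A my-project-ID        # project account to be charged

# one OpenMP thread per allocated core
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

module load impi
mpirun my-hybrid-binary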

Job runtime limits

Maximum runtimes of jobs, also known as walltime limits, are defined for each partition and can be viewed either on the command line using sinfo, or here. There is no minimum runtime, but a runtime of at least 1 hour is encouraged. Large numbers of small, short jobs can trigger problems in our accounting system. Occasional short jobs are fine, but if you submit large numbers of jobs that finish (or crash) quickly, we may have to intervene and temporarily suspend your account. If you have many smaller workloads, please consider combining them into a single job that runs for at least 1 hour, for example as sketched below.
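
A simple way to combine many small workloads is to loop over them inside a single job script; the partition, walltime and processing command below are hypothetical:

#!/bin/bash
#SBATCH -p cpu-clx
#SBATCH -N 1
#SBATCH -t 02:00:00

# process many small inputs sequentially within one job
for input in data/*.dat; do
    ./my-small-task "${input}"
done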

Project account selection

Each job runs under the name of the user who submitted it to the system. The user name is not to be confused with the project account that the job is charged to.

  • For users new to the system, the default project account is their Test Project, whose name is identical to the user name.
  • Users with membership in one or more compute projects need to decide which project account a job will be charged to.
  • The decision is made via the sbatch flag --account=my-project-ID or, in short, -A my-project-ID (see the example after this list).
  • If no explicit choice is made, the default applies, which can be changed by the user in the Portal (section User Data).
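
For example, the project account can be set on the command line or inside the job script; my-project-ID and my-jobscript.slurm are placeholders:

$ sbatch -A my-project-ID my-jobscript.slurm

or, equivalently, as a directive in the job script itself:

#SBATCH -A my-project-ID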

After job submission, Slurm checks the balance of the job's project account. When a negative balance is returned by the Slurm database, the job will not be scheduled. In this case, squeue displays a corresponding message (AccountOutOfComputeTime) to the user.

Example: Negative balance of the project account
$ squeue --me
JOBID PARTITION NAME USER ACCOUNT   STATE TIME NODES NODELIST(REASON)
  ...       ...  ...  ...     ... PENDING 0:00   ... (AccountOutOfComputeTime)

Interactive jobs

To use compute resources interactively, e.g. to follow the execution of MPI programs, the following steps are required. Note that non-interactive batch jobs via job scripts (see above) are the primary way of using the compute resources.

  1. A resource allocation for interactive usage has to be requested first with the salloc command, which should also include your resource requirements.
  2. When salloc has successfully allocated the requested resources, you have to issue an additional srun command if you want to work directly on one of the allocated nodes (see the example below).
  3. Afterwards, srun or MPI launcher commands (mpirun, mpiexec) can be used to start parallel programs.
blogin1:~ $ salloc -t 00:10:00 -p cpu-clx:test -N2 --tasks-per-node 24
salloc: Granted job allocation [...]
salloc: Waiting for resource configuration
salloc: Nodes bcn[1001,1003] are ready for job
# To get a shell on one of the allocated nodes
blogin1:~ $ srun --pty --interactive --preserve-env ${SHELL}
bcn1001:~ $ srun hostname | sort | uniq -c
     24 bcn1001
     24 bcn1003
bcn1001:~ $ exit
blogin1:~ $ exit
salloc: Relinquishing job allocation [...]

Using shared nodes

In some of our Slurm partitions, compute nodes are allocated in shared mode, so that multiple jobs requesting only part of a node can run concurrently on the same node. You can explicitly request fewer CPU cores (which implies less memory, too) or less memory per node than is available; take care that the memory limits are not exceeded.

Example of a post-processing job script using 10 cores.

This job's memory usage should not exceed 10/96 of the node's total memory.

#!/bin/bash
#SBATCH -p cpu-clx:huge
#SBATCH -t 1-0          # one day
#SBATCH -n 10           # 10 tasks (one core each)
#SBATCH -N 1            # all on a single node

python postprocessing.py
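
If the memory requirement is known, it can also be requested explicitly instead of relying on the core-based fraction; this sketch assumes the shared partition accepts per-node memory requests, and the value of 40G is purely illustrative:

#!/bin/bash
#SBATCH -p cpu-clx:huge
#SBATCH -t 1-0          # one day
#SBATCH -n 10
#SBATCH -N 1
#SBATCH --mem=40G       # explicit memory request for the node (illustrative value)

python postprocessing.py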

(Warning) Please remember that most of Lise's compute nodes belong to Slurm partitions where nodes are allocated exclusively to a single job, i.e. nodes are not shared between jobs (see Accounting).