Slurm Workload Manager
Slurm is the workload manager that the CRC uses to process jobs. Only a few components of Slurm are covered here; the full documentation is available on the official Slurm website.
Any and all compute-intensive processes must be run on the compute nodes through Slurm. Running compute-intensive processes on the head nodes is not permitted, and any compute-intensive tasks found running on the head nodes will be terminated immediately. Failure to comply may result in temporary or permanent revocation of your HPC access privileges.
Loading the Slurm module
In order to run Slurm commands, you must make sure the slurm module is loaded. The module should be loaded by default when you log in, but if it is not, you can load it with the following command:
module load slurm
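To check whether the module is already loaded, you can list your currently loaded modules with the standard module command:
module list
If slurm does not appear in the output, load it with the command above.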
Useful Slurm commands
Squeue
squeue
squeue reports the state of jobs in the queue. It will display a list of all active and pending jobs with the following information:
- Job ID number
- Partition
- Job name
- User
- State
- Run time
- Number of nodes requested
- Node list (or, for pending jobs, the reason the job is waiting)
Some state codes are as follows:
Code | State | Meaning |
---|---|---|
R | RUNNING | Job currently has an allocation. |
PD | PENDING | Job is awaiting resource allocation. |
F | FAILED | Job terminated with a non-zero exit code or other failure condition. |
S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
CA | CANCELLED | Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
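squeue also accepts filters to narrow the output. For example, to list only your own jobs or only pending jobs (both are standard squeue options):
squeue -u $USER
squeue -t PENDING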
Available Partitions
Hodor and Arya are now partitions under the Talon cluster. Walltimes are enforced on all partitions except for the private partitions. The default walltime is 2 hours. Below are the available partitions and their maximum walltimes:
- talon - Talon CPU nodes. This is the default queue. Maximum walltime is 28 days.
- talon-gpu - Talon GPU nodes. Maximum walltime is 28 days.
- hodor-cpu - Hodor CPU nodes. Maximum walltime is 1 day.
- hodor-gpu - Hodor GPU nodes. Maximum walltime is 3 days.
- manu - Private Arya partition. Authorized users only.
- hoffmann - Private Arya partition. Authorized users only.
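To request a specific partition and walltime, add the corresponding #SBATCH directives to your job script (see the template in the next section). A minimal sketch, with the values shown as examples only; --time takes the form days-hours:minutes:seconds:
#SBATCH --partition=talon-gpu
#SBATCH --time=2-00:00:00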
Sbatch
sbatch
sbatch is used to submit a job script to the queue. Normally, the job script will contain all of the arguments for sbatch. For a template job script, see the script below.
#!/bin/bash
##### Partition to use
#SBATCH --partition=<name>
##### Number of nodes
#SBATCH -N2
##### Number of tasks per node
#SBATCH --ntasks-per-node=8
#SBATCH --job-name=<YourJobName>
#SBATCH --chdir=./
##### Output file (%j in the filename expands to the job ID). This and the error file are the first two things we check when we are troubleshooting issues with your job.
#SBATCH -o slurm_run_%j_output.txt
##### Error file. This and the output file are the first two things we check when we are troubleshooting issues with your job.
#SBATCH -e slurm_run_%j_error.txt
# Change the working directory to the directory the job was submitted from
printf 'Changing to the working directory: %s\n\n' "$SLURM_SUBMIT_DIR"
cd "$SLURM_SUBMIT_DIR"
# Load Necessary Modules -- Add whatever modules you need to run your program
printf 'Loading modules\n'
module load slurm
# Determine the job's host names and write a hosts file
srun -n "$SLURM_NTASKS" hostname | sort -u > "${SLURM_JOB_ID}.hosts"
# Run the program using mpirun
mpirun -np "$SLURM_NTASKS" -machinefile "${SLURM_JOB_ID}.hosts" <PROGRAM>
# Remove the hosts file
rm "${SLURM_JOB_ID}.hosts"
To submit the job, use the following command:
sbatch <yourShellScript>.sh
When the job is submitted, sbatch will print its job ID number.
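The job ID can then be passed to squeue (or to scancel, described below). A sketch of a typical session, where the script name and job ID are illustrative placeholders:
sbatch myjob.sh
Submitted batch job 123456
squeue -j 123456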
Scancel
scancel
scancel is used to cancel a job using its job ID:
scancel <job_id>
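scancel also accepts a user filter (a standard scancel option), which is useful for cancelling all of your own jobs at once:
scancel -u $USER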
Srun
srun
srun is used to launch parallel tasks; inside a job script, it runs commands across the resources allocated to the job.
Adding the following line to your job script will create a file listing the nodes assigned to your job.
srun -n "$SLURM_NTASKS" hostname > "${SLURM_JOB_ID}.hosts"
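srun can also be run from a login session to get an interactive shell on a compute node for testing. A minimal sketch, with the partition and resource values shown as examples (--pty attaches your terminal to the task):
srun --partition=talon --ntasks=1 --time=1:00:00 --pty bash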