Slurm Workload Manager
Slurm is the workload manager that the CRC uses to process jobs. Only a few components of Slurm are covered here; the full documentation is available on the official Slurm website.
Any and all compute-intensive processes must be run on the compute nodes through Slurm. Running compute-intensive processes on the head nodes is not permitted, and any compute-intensive tasks found running on the head nodes will be terminated immediately. Failure to comply may result in temporary or permanent revocation of your HPC access privileges.
Loading the Slurm module
In order to run Slurm commands, you must make sure the slurm module is loaded. The module should be loaded by default when you log in, but if it is not, you can load it with the following command:
module load slurm
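To check whether the module is already loaded, you can list your currently loaded modules with the standard module command:
module list
If slurm does not appear in the output, load it with the command above.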
Useful Slurm commands
Squeue
squeue
squeue reports the state of jobs in the queue. It will display a list of all active and pending jobs with the following information:
- Job ID number
- Partition
- Job name
- User
- State
- Run time
- Number of nodes requested
- Node list (or, for pending jobs, the reason the job is waiting)
Some state codes are as follows:
Code | State | Meaning |
---|---|---|
R | RUNNING | Job currently has an allocation. |
PD | PENDING | Job is awaiting resource allocation. |
F | FAILED | Job terminated with a non-zero exit code or other failure condition. |
S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
CA | CANCELLED | Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
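squeue also accepts filters to narrow the output. For example, to list only your own jobs or only pending jobs (both are standard squeue options):
squeue -u $USER
squeue -t PENDING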
Available Partitions
Hodor and Arya are now partitions under the Talon cluster. Walltimes are enforced on all partitions except for the private partitions. The default walltime is 2 hours. Below are the available partitions and their maximum walltimes:
- talon - Talon CPU nodes. This is the default queue. Maximum walltime is 28 days.
- talon-gpu - Talon GPU nodes. Maximum walltime is 28 days.
- hodor-cpu - Hodor CPU nodes. Maximum walltime is 1 day.
- hodor-gpu - Hodor GPU nodes. Maximum walltime is 3 days.
- manu - Private Arya partition. Authorized users only.
- hoffmann - Private Arya partition. Authorized users only.
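To request a specific partition and walltime, add the corresponding #SBATCH directives to your job script (see the template in the next section). A minimal sketch, with the values shown as examples only; --time takes the form days-hours:minutes:seconds:
#SBATCH --partition=talon-gpu
#SBATCH --time=2-00:00:00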
Sbatch
sbatch
sbatch is used to submit a job script to the queue. Normally, the job script will contain all of the arguments for sbatch. For a template job script, see the script below.
#!/bin/bash
##### Partition to use
#SBATCH --partition=<name>
##### Number of nodes
#SBATCH -N2
##### Number of tasks per node
#SBATCH --ntasks-per-node=8
#SBATCH --job-name=<YourJobName>
#SBATCH --chdir=./
##### Output file (%j in the filename expands to the job ID). This and the error file are the first two things we check when we are troubleshooting issues with your job.
#SBATCH -o slurm_run_%j_output.txt
##### Error file. This and the output file are the first two things we check when we are troubleshooting issues with your job.
#SBATCH -e slurm_run_%j_error.txt
# Change the working directory to the directory the job was submitted from
printf 'Changing to the working directory: %s\n\n' "$SLURM_SUBMIT_DIR"
cd "$SLURM_SUBMIT_DIR"
# Load Necessary Modules -- Add whatever modules you need to run your program
printf 'Loading modules\n'
module load slurm
# Determine the job's host names and write a hosts file
srun -n "$SLURM_NTASKS" hostname | sort -u > "${SLURM_JOB_ID}.hosts"
# Run the program using mpirun
mpirun -np "$SLURM_NTASKS" -machinefile "${SLURM_JOB_ID}.hosts" <PROGRAM>
# Remove the hosts file
rm "${SLURM_JOB_ID}.hosts"
To submit the job, use the following command:
sbatch <yourShellScript>.sh
When the job is submitted, sbatch will print its job ID number.
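The job ID can then be passed to squeue (or to scancel, described below). A sketch of a typical session, where the script name and job ID are illustrative placeholders:
sbatch myjob.sh
Submitted batch job 123456
squeue -j 123456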
Scancel
scancel
scancel is used to cancel a job using its job ID:
scancel <job_id>
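scancel also accepts a user filter (a standard scancel option), which is useful for cancelling all of your own jobs at once:
scancel -u $USER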
Srun
srun
srun is used to launch parallel tasks; inside a job script, it runs commands across the resources allocated to the job.
Adding the following line to your job script will create a file listing the nodes assigned to your job.
srun -n "$SLURM_NTASKS" hostname > "${SLURM_JOB_ID}.hosts"
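srun can also be run from a login session to get an interactive shell on a compute node for testing. A minimal sketch, with the partition and resource values shown as examples (--pty attaches your terminal to the task):
srun --partition=talon --ntasks=1 --time=1:00:00 --pty bash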