The University of Arizona
    For questions, please open a UAService ticket and assign to the Tools Team.


Overview

SLURM

The new HPC system, Puma, uses SLURM as a job scheduler rather than PBS Pro. SLURM has several advantages:

  • It provides more robust support for a larger number of jobs in queue.
  • It is used by national HPC groups (XSEDE and TACC) making it easier for users to scale out to those systems.
  • It has more sustainable support.

Allocations and Job Partitions (Queues)

Using Puma with SLURM is similar to using ElGato and Ocelote with PBS. Users will still receive a monthly allocation of CPU hours associated with their PI's group, which will be deducted when they run jobs in the standard partition. Users will also still be able to use windfall to run jobs without consuming their monthly allocation. As on Ocelote and ElGato, jobs run using windfall will still be subject to preemption when resources are requested by higher-priority jobs.

To request a specific partition (standard, windfall, or high_priority), see Job Partition Requests below.


Modules and Software

The process of finding, loading, and using software as modules will not change on the new system. Users will still be able to utilize the standard commands described in the Software section in our User Guide. However, in a departure from our previous systems, modules will not be available to load and utilize on the login nodes. To load, use, and test software for job submissions, users will need to request an interactive session. Interactive sessions may be requested by simply using the command "interactive". 
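Because modules are only usable inside a job, a script can check where it is running before it tries to load anything. The following is a minimal sketch, assuming only that SLURM sets SLURM_JOB_ID inside batch jobs and interactive sessions; the function name session_hint is hypothetical and not part of the system.

```shell
#!/bin/bash
# Hypothetical helper: report whether this shell is running inside a SLURM job.
# SLURM sets SLURM_JOB_ID in batch jobs and in interactive sessions.
session_hint() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "In SLURM job ${SLURM_JOB_ID}: modules can be loaded here."
    else
        echo "No SLURM job detected: run 'interactive' first."
    fi
}

session_hint
```

On a login node this prints the second message; inside an interactive session it prints the first.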

Interactive command

When you are on a login node, you can enter interactive to get a default session on a compute node. This is useful for checking available modules, testing submission scripts, and compiling software.
The interactive command is shorthand for:

srun --nodes=1 --ntasks=1 --ntasks-per-node=1 --mem-per-cpu=4GB --time=01:00:00 --job-name=interactive --account=windfall --pty bash -i

If this session is insufficient, run the command above with the relevant values changed.
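As one way to vary the request without retyping it, the default command can be wrapped in a small function. This is only a sketch: interactive_cmd is a hypothetical name, it changes just the memory per CPU and the time limit, and it prints the command rather than running it.

```shell
#!/bin/bash
# Hypothetical helper: print the default interactive srun command with a
# different memory-per-CPU and time limit. It only prints; it does not run srun.
interactive_cmd() {
    local mem="${1:-4GB}"            # default matches the command above
    local walltime="${2:-01:00:00}"  # default matches the command above
    echo "srun --nodes=1 --ntasks=1 --ntasks-per-node=1" \
         "--mem-per-cpu=${mem} --time=${walltime}" \
         "--job-name=interactive --account=windfall --pty bash -i"
}

# Example: 8GB per CPU for two hours.
interactive_cmd 8GB 02:00:00
```

Copy the printed line to the prompt once the values look right.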

Do you want to get an interactive session faster? Choosing standard rather than windfall will help.

srun --job-name=compile --partition=standard --account=pi --nodes=1 --ntasks=2 --cpus-per-task=1 --mem=8GB --time=01:30:00 --pty bash -i

Are you using X forwarding?

srun --x11 --nodes=1 --ntasks=1 --ntasks-per-node=2 --time=04:00:00 --job-name=viz --account=pi --partition=standard --pty bash -i
netid@r1u03n1:~ $ xclock



GPU Jobs

To run a job on a GPU, request the generic resource (gres) name assigned to the GPU, as in this example interactive job:

srun --nodes=1 --ntasks-per-node=1 --mem-per-cpu=1GB --time=01:00:00 --job-name=interactive --account=windfall --gres=gpu:1  --pty bash -i

The resource is gpu and the quantity is 1.
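The same request can be made in a batch script. The sketch below reuses the directives from this page and assumes two things the page does not confirm: that SLURM exports CUDA_VISIBLE_DEVICES for jobs that request a GPU, and that nvidia-smi is installed on the GPU nodes. The job name and the gpu_report function are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=Sample_GPU_Job
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1gb
#SBATCH --time=01:00:00
#SBATCH --partition=standard
#SBATCH --account=<group_name>
#SBATCH --gres=gpu:1

# Hypothetical helper: report which GPU indices the job can see.
# Assumes SLURM sets CUDA_VISIBLE_DEVICES when a GPU is allocated.
gpu_report() {
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
}

gpu_report

# Show device details only when the NVIDIA tool exists on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || true
fi
```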

PBS → SLURM Rosetta Stone

In general, SLURM can translate and execute scripts written for PBS. This means that if you submit a PBS script written for Ocelote or ElGato on Puma (with the necessary resource request modifications), your script will likely run. However, there are a few caveats that should be noted:

  • You will need to submit your job with the new SLURM command, e.g. sbatch instead of qsub
  • Some PBS directives do not translate directly to SLURM and cannot be interpreted
  • The environment variables specific to PBS and SLURM are different. If your job relies on these, you will need to update them. Common examples are PBS_O_WORKDIR and PBS_ARRAY_INDEX
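For the environment variable caveat, one option is to read the SLURM variable first and fall back to its PBS counterpart, so the same script body works under either scheduler. This is a sketch rather than a required pattern; scheduler_env is a hypothetical name and the "none" placeholders are arbitrary.

```shell
#!/bin/bash
# Hypothetical helper: resolve scheduler variables, preferring SLURM and
# falling back to the PBS name, then to a plain default.
scheduler_env() {
    local submit_dir="${SLURM_SUBMIT_DIR:-${PBS_O_WORKDIR:-$PWD}}"
    local array_index="${SLURM_ARRAY_TASK_ID:-${PBS_ARRAY_INDEX:-none}}"
    local job_id="${SLURM_JOB_ID:-${PBS_JOBID:-none}}"
    echo "dir=${submit_dir} index=${array_index} job=${job_id}"
}

scheduler_env
```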

To help with the transition to SLURM, we've also installed pbs2slurm, software that automatically converts some basic PBS Pro commands into SLURM commands.

To get acquainted with the new scheduling system, refer to the following list of common PBS commands, directives, and environment variables and their SLURM counterparts.


PBS | SLURM | Purpose

Job Management
qsub <options> | sbatch <options> | Batch submission of jobs to run without user input
qsub -I <options> (note the upper case 'i') | srun <options> --pty bash -i or salloc <options> | Request an interactive job
N/A | srun <options> | Submit a job for real-time execution. Can also be used to submit an interactive session
qstat | squeue | Show all jobs
qstat <jobid> | squeue --job <jobid> | Check status of a specific job
qstat -u <netid> | squeue -u <netid> | Check status of jobs specific to user
tracejob <jobid> | sacct -j <jobid> | Check history of a completed job
qdel <jobid> | scancel <jobid> | Delete a specific job
qdel -u <netid> | scancel -u <netid> | Delete all user jobs
qstat -Q | sinfo | View information about nodes and queues
qhold <jobid> | scontrol hold <jobid> | Places a hold on a job to prevent it from being executed
qrls <jobid> | scontrol release <jobid> | Releases a hold placed on a job, allowing it to be executed

Job Directives
#PBS -W group_list=group_name | #SBATCH --account=group_name | Specify group name where hours are charged
#PBS -q standard | #SBATCH --partition=standard | Set job queue
#PBS -l walltime=HH:MM:SS | #SBATCH --time=HH:MM:SS | Set job walltime
#PBS -l select=<N> | #SBATCH --nodes=<N> | Select N nodes
#PBS -l ncpus=<N> | #SBATCH --ntasks=<N>; #SBATCH --cpus-per-task=<M> | PBS: Select N CPUs. SLURM: Each task is assumed to require one CPU. Optionally, include cpus-per-task if more are required; this requests NxM CPUs. Note: Puma has 94 CPUs available on each node
#PBS -l mem=<N>gb | #SBATCH --mem=<N>gb | Select N GB of memory
#PBS -l pcmem=<N>gb | #SBATCH --mem-per-cpu=<N>gb | Select N GB of memory per CPU. Note: Puma defaults to 5GB per CPU
#PBS -J N-M | #SBATCH --array=N-M | Array job submissions where N and M are integers
#PBS -l np100s=1 | #SBATCH --gres=gpu:1 | Optional: Request a GPU
#PBS -N JobName | #SBATCH --job-name=JobName | Optional: Set job name
#PBS -j oe | (default) | Optional: Combine stdout and stderr
(default) | #SBATCH -e <job_name>-%j.err; #SBATCH -o <job_name>-%j.out | Optional: Separate stdout and stderr (SLURM: %j is a stand-in for $SLURM_JOB_ID)
#PBS -o filename | #SBATCH -o filename | Optional: Standard output filename
#PBS -e filename | #SBATCH -e filename | Optional: Error filename
N/A | #SBATCH --open-mode=append | Optional: Combine all output into a single file. Note: If this is selected, each job run will append to that filename, including preexisting files with that name
#PBS -v var=<value> | #SBATCH --export=var | Optional: Export single environment variable var to job
#PBS -V | #SBATCH --export=all (default) | Optional: Export all environment variables to job
(default) | #SBATCH --export=none | Optional: Do not export working environment to job
#PBS -m be | #SBATCH --mail-type=<type> | Optional: Request email notifications, where <type> is BEGIN, END, FAIL, or ALL. Beware of mail bombing yourself
#PBS -M <netid>@email.arizona.edu | #SBATCH --mail-user=<netid>@email.arizona.edu | Optional: Email address used for notifications
#PBS -l place=excl | #SBATCH --exclusive | Optional: Request exclusive access to node

Environment Variables
$PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | Job submission directory
$PBS_JOBID | $SLURM_JOB_ID | Job ID
$PBS_JOBNAME | $SLURM_JOB_NAME | Job name
$PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID | Index to differentiate tasks in an array
$PBS_O_HOST | $SLURM_SUBMIT_HOST | Hostname where job was submitted
$PBS_NODEFILE | $SLURM_JOB_NODELIST | List of nodes allocated to current job

Terminology
Queue | Partition
Group List | Association
PI | Account
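To make the directive mapping concrete, here is a deliberately tiny translator covering five of the directives listed above. It is an illustration only and is not the pbs2slurm tool: pbs_to_sbatch is a hypothetical name, and anything outside these five patterns passes through unchanged.

```shell
#!/bin/bash
# Hypothetical, deliberately tiny translator for a few directives from the
# list above. It reads a PBS script on stdin and writes the result to stdout.
pbs_to_sbatch() {
    sed -e 's/^#PBS -N \(.*\)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS -q \(.*\)$/#SBATCH --partition=\1/' \
        -e 's/^#PBS -W group_list=\(.*\)$/#SBATCH --account=\1/' \
        -e 's/^#PBS -l walltime=\(.*\)$/#SBATCH --time=\1/' \
        -e 's/^#PBS -J \(.*\)$/#SBATCH --array=\1/'
}

# Example: translate three directive lines.
printf '%s\n' '#PBS -N Demo' '#PBS -q standard' '#PBS -J 1-5' | pbs_to_sbatch
```

Resource requests such as select, ncpus, and mem are left out on purpose, since they do not map one-to-one and need to be reviewed by hand.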




Job Partition Requests

SLURM partition requests are slightly different from PBS. Use the following table as a guide for how to use the partition that is relevant to you:

Partition | SLURM | Details
standard | #SBATCH --account=<PI GROUP>; #SBATCH --partition=standard | Consumes your group's standard allocation.
windfall | #SBATCH --partition=windfall | Does not consume your group's standard allocation. Jobs may be interrupted and restarted by higher-priority jobs. The --account flag needs to be omitted or an error will occur.
high_priority | #SBATCH --account=<PI GROUP>; #SBATCH --partition=standard; #SBATCH --qos=user_qos_<PI GROUP> | Available for groups who have purchased compute resources. The partition flag is left as standard and requires the additional qos flag. Replace <PI GROUP> with your group's name.
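The three cases above can be sketched as a function that prints the matching directive lines for a script header. The function name partition_directives and the group name mygroup are hypothetical; the directives themselves come from the table.

```shell
#!/bin/bash
# Hypothetical helper: print the #SBATCH lines for a partition choice.
# Usage: partition_directives <standard|windfall|high_priority> [group]
partition_directives() {
    local partition="$1" group="${2:-}"
    case "$partition" in
        standard)
            echo "#SBATCH --account=${group}"
            echo "#SBATCH --partition=standard"
            ;;
        windfall)
            # No --account line: it must be omitted for windfall.
            echo "#SBATCH --partition=windfall"
            ;;
        high_priority)
            # The partition stays "standard"; the qos line selects high priority.
            echo "#SBATCH --account=${group}"
            echo "#SBATCH --partition=standard"
            echo "#SBATCH --qos=user_qos_${group}"
            ;;
        *)
            echo "unknown partition: ${partition}" >&2
            return 1
            ;;
    esac
}

partition_directives high_priority mygroup
```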





SLURM Output Filename Patterns

Unlike PBS, SLURM offers ways to make your job's output filenames more customizable through the use of character replacements. A table is provided below as a guide with some examples. Variables may be used or combined as desired. Note: character replacements may also be used with other SBATCH directives such as error filename, input filename, and job name.

Variable | Meaning | Example Slurm Directive(s) | Output
%A | A job array's main job ID | #SBATCH --array=1-2; #SBATCH -o %A.out; #SBATCH --open-mode=append | 12345.out
%a | A job array's index number | #SBATCH --array=1-2; #SBATCH -o %A_%a.out | 12345_1.out; 12345_2.out
%J | Job ID plus stepid | #SBATCH -o %J.out | 12345.out
%j | Job ID | #SBATCH -o %j.out | 12345.out
%N | Hostname of the first compute node allocated to the job | #SBATCH -o %N.out | r1u11n1.out
%u | Username | #SBATCH -o %u.out | netid.out
%x | Jobname | #SBATCH --job-name=JobName; #SBATCH -o %x.out | JobName.out
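To preview what a pattern will produce before submitting, the substitutions can be imitated with sed. This is a simplified stand-in rather than SLURM's own expansion: preview_filename is a hypothetical name, and it handles only %A, %a, %u, and %x.

```shell
#!/bin/bash
# Hypothetical preview helper: substitute a subset of the replacement symbols
# (%A %a %u %x) as described above. This is not SLURM's own code.
# Usage: preview_filename <pattern> <array_job_id> <array_index> <job_name>
preview_filename() {
    local pattern="$1" array_job_id="$2" array_index="$3" job_name="$4"
    printf '%s\n' "$pattern" | sed \
        -e "s/%A/${array_job_id}/g" \
        -e "s/%a/${array_index}/g" \
        -e "s/%u/${USER:-netid}/g" \
        -e "s/%x/${job_name}/g"
}

preview_filename '%x-%A_%a.out' 12345 1 JobName
```

With the example values, the pattern %x-%A_%a.out previews as JobName-12345_1.out.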






Job Examples

Single serial job submission

PBS Script

#!/bin/bash
#PBS -N Sample_PBS_Job
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=00:01:00
#PBS -q standard
#PBS -W group_list=<group_name>


cd $PBS_O_WORKDIR
pwd; hostname; date

module load python
python --version

SLURM Script

#!/bin/bash
#SBATCH --job-name=Sample_Slurm_Job
#SBATCH --ntasks=1              
#SBATCH --mem=1gb                     
#SBATCH --time=00:01:00    
#SBATCH --partition=standard
#SBATCH --account=<group_name>    

# SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date

module load python
python --version

Array Submission

IMPORTANT:

When submitting jobs with named output files (i.e. with the line #SBATCH -o Job.out) as arrays, SLURM will write every array element to that filename, leaving you with only the output of the last completed job in the array. Use one of the following SLURM directives in your script to prevent this behavior:

  1. Differentiates output files using array indices. Similar to PBS default. See SLURM Output Filename Patterns above for more information.

    #SBATCH --output=Job-%a.out 
  2. Appends the output from all tasks in an array to the same output file. Warning: if a file exists with that name prior to running your job, the output will be appended to that file

    #SBATCH --open-mode=append


PBS Script

#!/bin/bash
#PBS -N Sample_PBS_Job
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=00:01:00
#PBS -q standard
#PBS -W group_list=<group_name>
#PBS -J 1-5


cd $PBS_O_WORKDIR
pwd; hostname; date

echo "./sample_command input_file_${PBS_ARRAY_INDEX}.in"
 

SLURM Script

#!/bin/bash
#SBATCH --output=Sample_SLURM_Job-%a.out
#SBATCH --ntasks=1              
#SBATCH --mem=1gb                     
#SBATCH --time=00:01:00    
#SBATCH --partition=standard
#SBATCH --account=<group_name>    
#SBATCH --array 1-5

# SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date

echo "./sample_command input_file_${SLURM_ARRAY_TASK_ID}.in"
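When input files do not follow a numbered naming scheme, a common alternative is to list them in a text file and let each array task pick the line matching its index. The sketch below assumes such a list exists; input_for_task and the sample file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print line N of a file list, where N defaults to the
# array index SLURM exports for this task (or 1 outside an array job).
input_for_task() {
    local list_file="$1" index="${2:-${SLURM_ARRAY_TASK_ID:-1}}"
    sed -n "${index}p" "$list_file"
}

# Demonstration with a throwaway list; in a real job the list already exists.
demo_list="$(mktemp)"
printf '%s\n' sample_a.in sample_b.in sample_c.in > "$demo_list"
echo "./sample_command $(input_for_task "$demo_list" 2)"
rm -f "$demo_list"
```

With #SBATCH --array=1-3, each task would call input_for_task with only the list path and receive a different line.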



MPI Example

For openmpi the important variables are set by default, so you do not need to include them in your scripts.

export SBATCH_GET_USER_ENV=1
export OMPI_MCA_btl_openib_cpc_include=rdmacm
export OMPI_MCA_btl_openib_if_include=bnxt_re1
export OMPI_MCA_btl_openib_rroce_enable=1
export OMPI_MCA_btl=vader,self,openib
export OMPI_MCA_oob_tcp_if_include=eth1

For Intel MPI, these variables are set for you:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
export FI_VERBS_IFACE=eth1
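To confirm what a session actually has set, the variables can be printed from inside a job. This is a sketch for inspection only: show_mpi_env is a hypothetical name, it checks a subset of the variables listed above, and it relies on bash indirect expansion.

```shell
#!/bin/bash
# Hypothetical helper: print each MPI-related variable named above,
# or "<unset>" when it is missing or empty.
show_mpi_env() {
    local name
    for name in OMPI_MCA_btl OMPI_MCA_btl_openib_if_include \
                I_MPI_FABRICS FI_PROVIDER FI_VERBS_IFACE; do
        # ${!name} is bash indirect expansion: the value of the variable
        # whose name is stored in $name.
        printf '%s=%s\n' "$name" "${!name:-<unset>}"
    done
}

show_mpi_env
```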


PBS Script

#!/bin/bash
#PBS -N Sample_MPI_Job
#PBS -l select=1:ncpus=16:mem=16gb
#PBS -l walltime=00:10:00
#PBS -W group_list=<group_name>
#PBS -q standard

cd $PBS_O_WORKDIR
pwd; hostname; date
module load openmpi   
#On ocelote, openmpi does not load by default
/usr/bin/time -o mpit_prog.timing mpirun -np 16 a.out

SLURM Script

#!/bin/bash
#SBATCH --job-name=Sample_MPI_Job
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=16
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=1gb
#SBATCH --time=00:10:00
#SBATCH --account=<group_name>
#SBATCH --partition=standard
#SBATCH --output=Sample_MPI_Job_%A.out
#SBATCH --error=Sample_MPI_Job_%A.err

# SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
# module load openmpi3 is not necessary. On Puma it is loaded by default
/usr/bin/time -o mpit_prog.timing mpirun -np 16 a.out
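Rather than repeating the task count in both the directive and the mpirun line, the count can be read from SLURM_NTASKS, which SLURM sets when --ntasks is given. A minimal sketch, with mpi_launch_cmd as a hypothetical name; it prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper: build the mpirun command from the task count SLURM
# exports, falling back to 1 outside a job. It prints the command only.
mpi_launch_cmd() {
    local program="$1"
    echo "mpirun -np ${SLURM_NTASKS:-1} ${program}"
}

mpi_launch_cmd a.out
```

Inside the job above, where --ntasks=16, this prints mpirun -np 16 a.out, so changing the directive is enough to change the launch size.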

