The University of Arizona
    For questions, please open a UAService ticket and assign to the Tools Team.
Overview

SLURM

The new HPC system, Puma, uses SLURM as a job scheduler rather than PBS Pro. SLURM has several advantages:

  • More robust support for a larger number of jobs in the queue.
  • Used by national HPC groups (XSEDE and TACC), making it easier to scale out to those systems.
  • Rejects jobs asking for impossible resource configurations.

Allocations and Job Queues

Using Puma with SLURM is similar to using ElGato and Ocelote with PBS. Users will still receive a monthly allocation of CPU hours associated with their PI's group, which will be deducted when they run jobs in the standard partition. Users will also still be able to use windfall to run jobs without consuming their monthly allocation. As on ElGato and Ocelote, jobs run using windfall will still be subject to preemption when resources are requested by higher-priority jobs.
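For example, before submitting to the standard partition you can check how many CPU hours your group has left by running the "va" (view allocation) command from a login node:

Code Block
languagebash
# Show your group memberships and remaining monthly CPU-hour allocation
va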

Modules and Software

The process of finding, loading, and using software as modules will not change on the new system. Users will still be able to use the standard commands described in the Software section of our User Guide. However, in a departure from our previous systems, modules will not be available to load and use on the login nodes. To load, use, and test software for job submissions, users will need to request an interactive session. An interactive session may be requested simply by running the command "interactive".
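For example, a typical workflow for testing software before writing a submission script might look like the following sketch (the module name and version are illustrative):

Code Block
languagebash
# From a login node, request an interactive session on a compute node
interactive
# Once the session starts, search for and load the software you need
module avail python
module load python/3.8    # illustrative module name and version
python --version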

PBS → SLURM Rosetta Stone

In general, SLURM can translate and execute scripts written for PBS. This means that if you submit a PBS script written for Ocelote or ElGato on Puma, your script will likely run. However, there are a few caveats that should be noted:

  • Some PBS directives do not have a direct SLURM equivalent and may fail.
  • The environment variables specific to PBS and SLURM are different. If your job relies on these, you will need to update them. Common examples are PBS_O_WORKDIR and PBS_ARRAY_INDEX (see the sketch below).
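As a minimal sketch of that kind of update, the PBS variables in a script would be swapped for their SLURM counterparts (the command and input filename are placeholders):

Code Block
languagebash
# PBS version:
#   cd $PBS_O_WORKDIR
#   ./sample_command input_file_${PBS_ARRAY_INDEX}.in
# SLURM version:
cd $SLURM_SUBMIT_DIR
./sample_command input_file_${SLURM_ARRAY_TASK_ID}.in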

To help with the transition to SLURM, we've installed software that converts some basic PBS Pro commands into SLURM commands automatically.

Below is a list of common PBS commands, directives, and environment variables and their SLURM counterparts.

PBS | SLURM | Purpose
Job Management
qsub <options> | sbatch <options> | Batch submission of jobs to run without user input
qsub -I <options> | salloc <options> | Request an interactive job
(none) | srun <options> | Submit a job for realtime execution. Can also be used to launch an interactive session
qstat | squeue | Show all jobs
qstat <jobid> | squeue --job <jobid> | Check the status of a specific job
qstat -u <netid> | squeue -u <netid> | Check the status of a specific user's jobs
qdel <jobid> | scancel <jobid> | Delete a specific job
qdel -u <netid> | scancel -u <netid> | Delete all of a user's jobs
qstat -Q | sinfo | View information about nodes and queues
qhold <jobid> | scontrol hold <jobid> | Place a hold on a job to prevent it from being executed
qrls <jobid> | scontrol release <jobid> | Release a hold placed on a job, allowing it to be executed
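For example, submitting a batch script and then checking on it translates as follows (the script name and netid are placeholders):

Code Block
languagebash
# PBS:   qsub myjob.pbs && qstat -u netid
# SLURM:
sbatch myjob.slurm
squeue -u netid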



Overview

All three clusters, Puma, Ocelote, and ElGato, use SLURM for resource management and job scheduling.

Additional SLURM Resources and Examples

Link | Description
Official SchedMD User Documentation | Official SchedMD user documentation. Includes detailed information on SLURM directives and commands.
PBS ⇔ SLURM Rosetta Stone | Table for converting some common PBS job directives to SLURM syntax.
Puma Quick Start | HPC Quick Start guide. If you have never submitted a batch job before, this is a great place to start.
Job Examples | Basic SLURM example scripts. Includes PBS scripts for comparison.
Even More Job Examples! | Growing repository of example SLURM submission scripts.
Intro to HPC | A recorded video presentation of our Intro to HPC workshop. Keep your eyes peeled for periodic announcements in the HPC listserv about upcoming live sessions!





SLURM and System Commands

Command | Purpose | Example(s)
Native SLURM Commands
sbatch | Submits a batch script for execution | sbatch script.slurm
srun | Runs parallel jobs. Can be used in place of mpirun/mpiexec. Can be used interactively as well as in batch scripts | srun -n 1 --mpi=pmi2 a.out
salloc | Requests a session to work on a compute node interactively | see the Interactive Jobs section below
squeue | Checks the status of pending and running jobs | squeue --job $JOBID or squeue --user $NETID
scancel | Cancels a running or pending job | scancel $JOBID or scancel -u $NETID
scontrol hold | Places a hold on a job to prevent it from being executed | scontrol hold $JOBID
scontrol release | Releases a hold placed on a job, allowing it to be executed | scontrol release $JOBID
System Commands
va | Displays your group membership, your account usage, and CPU allocation. Short for "view allocation" | va
interactive | Shortcut for quickly requesting an interactive job. Use "interactive --help" to see full usage | interactive -a $GROUP_NAME
job-history | Retrieves a running or completed job's history in a user-friendly format | job-history $JOBID
seff | Retrieves a completed job's memory and CPU efficiency | seff $JOBID
past-jobs | Retrieves past jobs run by the user. Can be used with the option "-d N" to search for jobs run in the past N days | past-jobs -d 5
job-limits | Views your group's job resource limits and current usage | job-limits $GROUP
nodes-busy | Displays a visualization of a cluster's nodes and their usage | nodes-busy --help
system-busy | Displays a text-based summary of a cluster's usage | system-busy
cluster-busy | Displays a visualization of all three clusters' overall usage | cluster-busy --help
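For instance, after a job completes you might check your allocation and the job's resource efficiency (the job ID is a placeholder):

Code Block
languagebash
# View your group's allocation, then inspect a completed job's CPU and memory efficiency
va
seff 1234567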




Batch Job Directives

Command | Purpose
#SBATCH --account=group_name | Specify the account where hours are charged. Don't know your group name? Run the command "va" to see which groups you belong to
#SBATCH --partition=partition_name | Set the job partition. This determines your job's priority and the hours charged. See Job Partition Requests below for additional information
#SBATCH --time=DD-HH:MM:SS | Set the job's runtime limit in days, hours, minutes, and seconds
#SBATCH --nodes=N | Allocate N nodes to your job. For non-MPI enabled jobs, this should be set to "--nodes=1" to ensure access to all requested resources and prevent memory errors
#SBATCH --ntasks=N | ntasks specifies the number of tasks (or processes) the job will run. For MPI jobs, this is the number of MPI processes. Most of the time, you can use ntasks to specify the number of CPUs your job needs; however, in some odd cases you might run into issues (for example, see: Using Matlab). By default, you will be allocated one CPU per task. This can be increased by including the additional directive --cpus-per-task
#SBATCH --cpus-per-task=M | Allocate M CPUs to each task. The total number of CPUs a job is allocated is cpus-per-task * ntasks, or M*N
#SBATCH --mem=Ngb | Select N GB of memory per node. If "gb" is not included, this value defaults to MB. The directives --mem and --mem-per-cpu are mutually exclusive
#SBATCH --mem-per-cpu=Ngb | Select N GB of memory per CPU. Valid values can be found in the Node Types/Example Resource Requests section below. If "gb" is not included, this value defaults to MB
#SBATCH --gres=gpu:N | Optional: Request N GPUs
#SBATCH --constraint=hi_mem | Optional: Request a high memory node (Ocelote and Puma only)
#SBATCH --array=N-M | Submit an array job from indices N to M, where N and M are integers
#SBATCH --job-name=JobName | Optional: Specify a name for your job. This will not automatically affect the output filename
#SBATCH -o output_filename.out, #SBATCH -e output_filename.err | Optional: Specify output filename(s). If -e is missing, stdout and stderr will be combined
#SBATCH --open-mode=append | Optional: Append your job's output to the specified output filename(s)
#SBATCH --mail-type=BEGIN|END|FAIL|ALL | Optional: Request email notifications. Beware of mail bombing yourself
#SBATCH --mail-user=email@address.xyz | Optional: Specify an email address. If this is missing, notifications will go to your UArizona email address by default
#SBATCH --exclusive | Optional: Request exclusive access to a node
#SBATCH --export=VAR | Optional: Export a comma-delimited list of environment variables to your job
#SBATCH --export=all (default) | Optional: Export your working environment to your job
#SBATCH --export=none | Optional: Do not export your working environment to your job
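Putting these directives together, a minimal batch script header might look like the following sketch (the group name, resource values, and module are placeholders):

Code Block
languagebash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --account=my_group        # placeholder: use your PI's group name (run "va" to check)
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=01:00:00

module load python                # illustrative module
python my_script.py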





SLURM Environment Variables

Variable | Purpose | Example Value
$SLURM_ARRAY_JOB_ID | Job array's parent ID | 399124
$SLURM_ARRAY_TASK_COUNT | Total number of subjobs in the array | 4
$SLURM_ARRAY_TASK_ID | Job index number (unique for each job in the array) | 1
$SLURM_ARRAY_TASK_MAX | Maximum index for the job array | 7
$SLURM_ARRAY_TASK_MIN | Minimum index for the job array | 1
$SLURM_ARRAY_TASK_STEP | Job array's index step size | 2
$SLURM_CLUSTER_NAME | Which cluster your job is running on | elgato
$SLURM_CONF | Points to the SLURM configuration file | /var/spool/slurm/d/conf-cache/slurm.conf
$SLURM_CPUS_ON_NODE | Number of CPUs allocated to the target node | 3
$SLURM_GPUS_ON_NODE | Number of GPUs allocated to the target node | 1
$SLURM_GPUS_PER_NODE | Number of GPUs per node. Only set if --gpus-per-node is specified | 1
$SLURM_JOB_ACCOUNT | Account being charged | groupname
$SLURM_JOB_GPUS | The global GPU IDs of the GPUs allocated to the job. Only set in batch and interactive jobs | 0
$SLURM_JOB_ID | Your SLURM job ID | 399072
$SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node. This can be a list if there is more than one node allocated to the job. The list has the same order as SLURM_JOB_NODELIST | 3,1
$SLURM_JOB_NAME | The job's name | interactive
$SLURM_JOB_NODELIST | The nodes that have been assigned to your job | gpu[73-74]
$SLURM_JOB_NUM_NODES | The number of nodes allocated to the job | 2
$SLURM_JOB_PARTITION | The job's partition | standard
$SLURM_JOB_QOS | The job's QOS/partition | qos_standard_part
$SLURM_JOB_USER | The username of the person who submitted the job | netid
$SLURM_JOBID | Same as SLURM_JOB_ID, your SLURM job ID | 399072
$SLURM_MEM_PER_CPU | The memory/CPU ratio allocated to the job | 4096
$SLURM_NNODES | Same as SLURM_JOB_NUM_NODES, the number of nodes allocated to the job | 2
$SLURM_NODELIST | Same as SLURM_JOB_NODELIST, the nodes that have been assigned to your job | gpu[73-74]
$SLURM_NPROCS | The number of tasks allocated to your job | 4
$SLURM_NTASKS | Same as SLURM_NPROCS, the number of tasks allocated to your job | 4
$SLURM_SUBMIT_DIR | The directory where sbatch was used to submit the job | /home/u00/netid
$SLURM_SUBMIT_HOST | The hostname where sbatch was used to submit the job | wentletrap.hpc.arizona.edu
$SLURM_TASKS_PER_NODE | The number of tasks to be initiated on each node. This can be a list if there is more than one node allocated to the job. The list has the same order as SLURM_JOB_NODELIST | 3,1
$SLURM_WORKING_CLUSTER | Valid for interactive jobs; set with the remote sibling cluster's IP address, port, and RPC version so that any sruns will know which cluster to communicate with | elgato:foo:0000:0000:000
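As a quick illustration, these variables can be echoed inside a batch script to record where and how a job ran:

Code Block
languagebash
# Log basic job information for easier debugging
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running on $SLURM_JOB_NODELIST"
echo "Submitted from $SLURM_SUBMIT_DIR using $SLURM_NTASKS task(s)"
cd $SLURM_SUBMIT_DIR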
Terminology

PBS Term | SLURM Equivalent
Queue | Partition
Group List | Association
PI | Account

Job Examples

Single serial job submission


PBS Script

Code Block
#!/bin/bash
#PBS -N Sample_PBS_Job
#PBS -l select=1:ncpus=1
#PBS -l mem=1gb
#PBS -l walltime=00:01:00
#PBS -q windfall
#PBS -W group_list=<group_name>

cd $PBS_O_WORKDIR
pwd; hostname; date

module load python
python --version


SLURM Script

Code Block
#!/bin/bash
#SBATCH --job-name=Sample_Slurm_Job
#SBATCH --ntasks=1              
#SBATCH --mem=1gb                     
#SBATCH --time=00:01:00    
#SBATCH --partition=windfall
#SBATCH --account=<group_name>    

cd $SLURM_SUBMIT_DIR
pwd; hostname; date

module load python
python --version

Array Submission

IMPORTANT:

When submitting named jobs as arrays, SLURM will overwrite the output file with the output of the last processed job in the array. There are two ways around this:

  • Use the option:

    Code Block
    #SBATCH --output=slurm-array-test-%a.out

    to differentiate each output file by subjob ID. This is the same behavior as seen in PBS.
  • Use the option:

    Code Block
    #SBATCH --open-mode=append

    To append the output from all job arrays to the same file


    PBS Script

    Code Block
    #!/bin/bash
    #PBS -N Sample_PBS_Job
    #PBS -l select=1:ncpus=1
    #PBS -l mem=1gb
    #PBS -l walltime=00:01:00
    #PBS -q windfall
    #PBS -W group_list=<group_name>
    #PBS -J 1-5
    
    cd $PBS_O_WORKDIR
    pwd; hostname; date
    
    echo "./sample_command input_file_${PBS_ARRAY_INDEX}.in"
     

    SLURM Script

    Code Block
    #!/bin/bash
    #SBATCH --output=Sample_SLURM_Job-%a.out
    #SBATCH --ntasks=1              
    #SBATCH --mem=1gb                     
    #SBATCH --time=00:01:00    
    #SBATCH --partition=windfall
    #SBATCH --account=<group_name>    
    #SBATCH --array=1-5
    
    cd $SLURM_SUBMIT_DIR
    pwd; hostname; date
    
    echo "./sample_command input_file_${SLURM_ARRAY_TASK_ID}.in"





    SLURM Reason Codes

    Sometimes, if you check a pending job using squeue, a message will show up under Reason indicating why your job is not running. Some of these codes are non-intuitive, so a human-readable translation is provided below:

    Reason | Explanation
    AssocGrpCpuLimit | This is a per-group limit on the number of CPUs that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>".
    AssocGrpMemLimit | This is a per-group limit on the amount of memory that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>".
    AssocGrpCPUMinutesLimit | Either your group is out of CPU hours or your job would exhaust your group's remaining CPU hours.
    AssocGrpGRES | This is a per-group limit on the number of GPUs that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>".
    Dependency | Your job depends on the completion of another job. It will wait in the queue until the target job completes.
    QOSMaxWallDurationPerJobLimit | Your job's time limit exceeds the maximum allowed and will never run. To see an individual job's limits, run "job-limits <group_name>".
    Nodes required for job are DOWN, DRAINED or reserved or jobs in higher priority partitions | This very long message simply means your job is waiting in the queue until there is enough space for it to run.
    Priority | Your job is waiting in the queue until there is enough space for it to run.
    QOSMaxCpuPerUserLimit | This is a per-user limit on the number of CPUs that you can use simultaneously across all of your jobs. Your job is not running because this limit has been reached. Check your limits using "job-limits <group_name>".
    ReqNodeNotAvail, Reserved for maintenance | Your job's time limit overlaps with an upcoming maintenance window. Run "uptime_remaining" to see when the system will go offline. If you remove and resubmit your job with a shorter walltime that does not overlap with maintenance, it will likely run. Otherwise, it will remain pending until after the maintenance window.
    Resources | Your job is waiting in the queue until the required resources are available.




    Job Partition Requests

    Partition: standard
    #SBATCH --account=<PI GROUP>
    #SBATCH --partition=standard
    Consumes your group's standard allocation. These jobs cannot be interrupted.

    Partition: windfall
    #SBATCH --partition=windfall
    Does not consume your group's standard allocation. Jobs may be interrupted and restarted by higher-priority jobs. The --account flag needs to be omitted or an error will occur.

    Partition: high_priority
    #SBATCH --account=<PI GROUP>
    #SBATCH --partition=high_priority
    #SBATCH --qos=user_qos_<PI GROUP>
    Available for groups who have purchased compute resources.

    Partition: qualified
    #SBATCH --account=<PI GROUP>
    #SBATCH --partition=standard
    #SBATCH --qos=qual_qos_<PI GROUP>
    Available for groups that have submitted a special project request.





    SLURM Output Filename Patterns

    SLURM offers ways to make your job's output filenames customizable through the use of character replacements. A table is provided below as a guide with some examples. Variables may be used or combined as desired. Note: character replacements may also be used with other SBATCH directives such as error filename, input filename, and job name.

    Variable | Meaning | Example SLURM Directive(s) | Output
    %A | A job array's main job ID | #SBATCH --array=1-2, #SBATCH -o %A.out, #SBATCH --open-mode=append | 12345.out
    %a | A job array's index number | #SBATCH --array=1-2, #SBATCH -o %A_%a.out | 12345_1.out, 12345_2.out
    %J | Job ID plus stepid | #SBATCH -o %J.out | 12345.out
    %j | Job ID | #SBATCH -o %j.out | 12345.out
    %N | Hostname of the first compute node allocated to the job | #SBATCH -o %N.out | r1u11n1.out
    %u | Username | #SBATCH -o %u.out | netid.out
    %x | Job name | #SBATCH --job-name=JobName, #SBATCH -o %x.out | JobName.out
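    For example, combining the job name and array index patterns keeps each subjob's output separate (the job name is a placeholder):

    Code Block
    languagebash
    #SBATCH --job-name=blast_run      # placeholder job name
    #SBATCH --array=1-3
    #SBATCH -o %x_%a.out              # produces blast_run_1.out, blast_run_2.out, blast_run_3.out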




    Node Types/Example Resource Requests

    Standard Nodes

    ElGato (Max CPUs: 16, Mem/CPU: 4gb, Max Mem: 62gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=16
    #SBATCH --mem-per-cpu=4gb

    Ocelote (Max CPUs: 28, Mem/CPU: 6gb, Max Mem: 168gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=28
    #SBATCH --mem-per-cpu=6gb

    Puma (Max CPUs: 94, Mem/CPU: 5gb, Max Mem: 470gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=94
    #SBATCH --mem-per-cpu=5gb

    GPU Nodes

    Note

    During the quarterly maintenance cycle on April 27, 2022, the ElGato K20s and Ocelote K80s were removed because they are no longer supported by Nvidia.

    GPU jobs are requested using the generic resource, or --gres, SLURM directive. In general, the directive to request N GPUs will be of the form: --gres=gpu:N


    Ocelote (Max CPUs: 28, Mem/CPU: 8gb, Max Mem: 224gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=28
    #SBATCH --mem-per-cpu=8gb
    #SBATCH --gres=gpu:1

    Puma (Max CPUs: 94, Mem/CPU: 5gb, Max Mem: 470gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=94
    #SBATCH --mem-per-cpu=5gb
    #SBATCH --gres=gpu:1

    Up to four GPUs may be requested on a single Puma GPU node using --gres=gpu:N, where N is 1, 2, 3, or 4.
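    For example, a sketch of a request for all four GPUs on a single Puma GPU node (the CPU and memory values mirror the sample above):

    Code Block
    languagebash
    #SBATCH --nodes=1
    #SBATCH --ntasks=94
    #SBATCH --mem-per-cpu=5gb
    #SBATCH --gres=gpu:4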

    High Memory Nodes

    When requesting a high memory node, include both the memory/CPU and constraint directives

    Ocelote (Max CPUs: 48, Mem/CPU: 41gb, Max Mem: 2015gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=48
    #SBATCH --mem-per-cpu=41gb
    #SBATCH --constraint=hi_mem

    Puma (Max CPUs: 94, Mem/CPU: 32gb, Max Mem: 3000gb)
    Sample request:
    #SBATCH --nodes=1
    #SBATCH --ntasks=94
    #SBATCH --mem-per-cpu=32gb
    #SBATCH --constraint=hi_mem




    Interactive Jobs

    When you are on a login node, you can request an interactive session on a compute node. This is useful for checking available modules, testing submission scripts, compiling software, and running programs directly from the command line. We have a built-in shortcut command that will allow you to quickly and easily request a session by simply entering: interactive

    When you request a session, the full salloc command being executed will be displayed for verification/copying/editing/pasting purposes. For example:

    Code Block
    languagebash
    themeMidnight
    (ocelote) [netid@junonia ~]$ interactive
    Run "interactive -h for help customizing interactive use"
    Submitting with /usr/local/bin/salloc --job-name=interactive --mem-per-cpu=4GB --nodes=1    --ntasks=1 --time=01:00:00 --account=windfall --partition=windfall
    salloc: Pending job allocation 531843
    salloc: job 531843 queued and waiting for resources
    salloc: job 531843 has been allocated resources
    salloc: Granted job allocation 531843
    salloc: Waiting for resource configuration
    salloc: Nodes i16n1 are ready for job
    [netid@i16n1 ~]$ 

    Notice in the example above how the command prompt changes once your session starts. When you're on a login node, your prompt will show "junonia" or "wentletrap". Once you're in an interactive session, you'll see the name of the compute node you're connected to. 

    If no options are supplied to the command interactive, your job will automatically run using the windfall partition for one hour using one CPU. To use the standard partition, include the flag "-a" followed by your group's name. To see all the customization options:

    Code Block
    languagebash
    themeMidnight
    (ocelote) [netid@junonia ~]$ interactive -h
    Usage: /usr/local/bin/interactive [-x] [-g] [-N nodes] [-m memory per core] [-n ncpus per node] [-Q optional qos] [-t hh::mm:ss] [-a account to charge]

    You may also create your own salloc commands using any desired SLURM directives for maximum customization.
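    As a sketch of such a custom request (the group name and resource values are placeholders):

    Code Block
    languagebash
    # Request 4 CPUs and 16gb of memory for two hours, charged to your group's standard allocation
    salloc --job-name=interactive --nodes=1 --ntasks=4 --mem=16gb --time=02:00:00 --partition=standard --account=my_group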




    MPI Jobs

    OpenMPI

    For OpenMPI, the important variables are set by default, so you do not need to include them in your scripts.

    Code Block
    languagebash
    themeMidnight
    titleDefault OpenMPI variables
    export SBATCH_GET_USER_ENV=1
    export OMPI_MCA_btl_openib_cpc_include=rdmacm
    export OMPI_MCA_btl_openib_if_include=bnxt_re1
    export OMPI_MCA_btl_openib_rroce_enable=1
    export OMPI_MCA_btl=vader,self,openib
    export OMPI_MCA_oob_tcp_if_include=eth1

    Intel MPI

    For Intel MPI, these variables are set for you:

    Code Block
    languagebash
    themeMidnight
    module unload openmpi3 gnu8

    If you're using Intel MPI with mpirun and are getting errors, try replacing mpirun -np $NPROCESSES with:

    Code Block
    languagebash
    themeMidnight
    srun -n $NPROCESSES --mpi=pmi2
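    Putting this together, a minimal MPI submission script might look like the following sketch (the executable, module, and resource values are placeholders; the srun flags mirror the example above):

    Code Block
    languagebash
    #!/bin/bash
    #SBATCH --job-name=mpi_sketch
    #SBATCH --account=my_group        # placeholder group name
    #SBATCH --partition=standard
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --mem-per-cpu=5gb
    #SBATCH --time=00:30:00

    module load openmpi3              # module name assumed from the section above
    srun -n 8 --mpi=pmi2 ./my_mpi_program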





    Parallel Work

    To make proper use of a supercomputer, you will likely want to take advantage of many cores. Puma has 94 cores in each node available to SLURM. The exception is running hundreds or thousands of independent jobs using High Throughput Computing.

    We have a training course, Introduction to Parallel Computing, which explains the concepts and terminology of parallel computing with some examples.

    The practical course Parallel Analysis in R is also useful.