Overview
SLURM
The new HPC system, Puma, uses SLURM as a job scheduler rather than PBS Pro. SLURM has several advantages:
- It provides more robust support for a larger number of jobs in queue.
- It is used by national HPC groups (XSEDE and TACC), making it easier for users to scale out to those systems.
- It has more sustainable support.
Allocations and Job Partitions (Queues)
Using Puma with SLURM is similar to using ElGato and Ocelote with PBS. Users will still receive a monthly allocation of CPU hours associated with their PI's group, which will be deducted when they run jobs in the standard partition. Users will also still be able to use windfall to run jobs without consuming their monthly allocations. As on Ocelote and ElGato, jobs run using windfall are subject to preemption when resources are requested by higher-priority jobs.
To request a specific partition (standard, windfall, or high_priority), see Job Partition Requests below.
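For example, here is a minimal sketch of a standard-partition batch script; the group name is a placeholder and the resource numbers are illustrative only:

```bash
#!/bin/bash
#SBATCH --job-name=partition_test
#SBATCH --account=<group_name>     # your PI's group; hours are charged to its allocation
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
hostname; date
```

To run the same job under windfall instead, remove the --account line and set --partition=windfall; the job will not consume your allocation but may be preempted.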
Modules and Software

The process of finding, loading, and using software as modules will not change on the new system. Users will still be able to use the standard commands described in the Software section of our User Guide. However, in a departure from our previous systems, modules will not be available to load and use on the login nodes. To load, use, and test software for job submissions, users will need to request an interactive session. Interactive sessions may be requested by simply using the command "interactive".
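As a rough sketch of that workflow, assuming a placeholder group name and a hypothetical python module version (check module avail for what is actually installed):

```bash
# Modules are not available on login nodes, so start an interactive session first
interactive -a <group_name>

# Once the session starts on a compute node:
module avail python          # list the python modules that are installed
module load python/3.8       # hypothetical version; pick one listed by "module avail"
python --version
```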
Interactive command
When you are on a login node, you can request an interactive session on a compute node. This is useful for checking available modules, testing submission scripts, compiling software, and running programs directly from the command line. To get an interactive session quickly and easily, simply enter the built-in command:
```bash
interactive
```
Running this command actually executes the following:
```bash
salloc --job-name=interactive --mem-per-cpu=4GB --nodes=1 --ntasks=1 --time=01:00:00 --account=windfall --partition=windfall
```
If you find that this session is insufficient, the interactive command has built-in customization flags. For example, to get a session faster, add your PI's account name to use the standard partition:
```bash
interactive -a account_name
```
If you are using X11 forwarding, add the -x flag:
```bash
interactive -a account_name -x
```
Full usage:
```bash
interactive [-x] [-N nodes] [-n ncpus per node] [-Q optional qos] [-t hh::mm:ss] [-a account to charge]
```
Any time you run the interactive command, it will print the full salloc command being executed for verification and copying/editing/pasting.

GPU Jobs

Each GPU node on Puma offers up to four GPUs that can be reserved for your job. To request a GPU, include the resource name using the --gres SLURM directive. For example, to request an interactive session with one GPU, you could run:
```bash
salloc --job-name=interactive --mem-per-cpu=4GB --nodes=1 --ntasks=1 --time=01:00:00 --account=windfall --partition=windfall --gres=gpu:1
```
In a batch script, you would include the number of GPUs as an SBATCH directive. For example:
```bash
#SBATCH --gres=gpu:1
```
In both cases above, the jobs are requesting 1 GPU. This number can be increased up to 4 depending on the number of GPUs you need for your workflow.
High Memory Nodes
Puma has two high memory nodes available with 3 TB of RAM each. These nodes have a ratio of 32 GB of RAM per CPU, so a job requesting N CPUs would be allocated N × 32 GB of RAM. To request one, either explicitly set --mem-per-cpu=32gb or include --constraint=hi_mem in your job script. For example, the following directives:
```bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=32gb
```
would run the job on one of the high memory nodes with 160GB of RAM. The following would request identical resources:
```bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --constraint=hi_mem
```
PBS → SLURM Rosetta Stone
In general, SLURM can translate and execute scripts written for PBS. This means that if you submit a PBS script written for Ocelote or ElGato on Puma (with the necessary resource request modifications), your script will likely run. However, there are a few caveats that should be noted:
- You will need to submit your job with the new SLURM command, e.g. sbatch instead of qsub
- Some PBS directives do not directly translate to SLURM and cannot be interpreted
- The environment variables specific to PBS and SLURM are different. If your job relies on these, you will need to update them. Common examples are PBS_O_WORKDIR and PBS_ARRAY_INDEX; a short conversion sketch follows this list
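As an illustration, converting a PBS script fragment that uses these variables might look like the following; the program and input-file names are hypothetical:

```bash
# PBS version:
#   cd $PBS_O_WORKDIR
#   ./my_program input_${PBS_ARRAY_INDEX}.dat

# SLURM version:
cd $SLURM_SUBMIT_DIR                            # optional: SLURM jobs already start in the submission directory
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```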
To get acquainted with the new scheduling system, refer to the following list of common PBS commands, directives, and environment variables and their SLURM counterparts.
Job Management

| PBS | SLURM | Purpose |
|---|---|---|
| qsub <options> | sbatch <options> | Batch submission of jobs to run without user input |
| qsub -I <options> (note the upper case 'i') | srun <options> --pty bash -i  or  salloc <options> | Request an interactive job |
| | srun <options> | Submit a job for realtime execution. Can also be used to submit an interactive session |
| qstat | squeue | Show all jobs |
| qstat <jobid> | squeue --job <jobid> | Check status of a specific job |
| qstat -u <netid> | squeue -u <netid> | Check status of jobs specific to user |
| tracejob <jobid> | sacct -j <jobid> | Check history of a completed job |
| qdel <jobid> | scancel <jobid> | Delete a specific job |
| qdel -u <netid> | scancel -u <netid> | Delete all user jobs |
| qstat -Q | sinfo | View information about nodes and queues |
| qhold <jobid> | scontrol hold <jobid> | Places a hold on a job to prevent it from being executed |
| qrls <jobid> | scontrol release <jobid> | Releases a hold placed on a job allowing it to be executed |

Job Directives

| PBS | SLURM | Purpose |
|---|---|---|
| #PBS -W group_list=group_name | #SBATCH --account=group_name | Specify group name where hours are charged |
| #PBS -q standard | #SBATCH --partition=standard | Set job queue |
| #PBS -l walltime=HH:MM:SS | #SBATCH --time HH:MM:SS | Set job walltime |
| #PBS -l select=<N> | #SBATCH --nodes=<N> | Select N nodes |
| #PBS -l ncpus=<N> | #SBATCH --ntasks=<N>  #SBATCH --cpus-per-task=<M> | PBS: select N CPUs. SLURM: each task is assumed to require one CPU; optionally include --cpus-per-task if more are required, which requests N×M CPUs. Note: Puma has 94 CPUs available on each node |
| #PBS -l mem=<N>gb | #SBATCH --mem=<N>gb | Select N GB of memory |
| #PBS -l pcmem=<N>gb | #SBATCH --mem-per-cpu=<N>gb | Select N GB of memory per CPU. Note: Puma defaults to 5 GB per CPU |
| #PBS -J N-M | #SBATCH --array=N-M | Array job submissions where N and M are integers |
| #PBS -l np100s=1 | #SBATCH --gres=gpu:1 | Optional: Request a GPU |
| #PBS -N JobName | #SBATCH --job-name=JobName | Optional: Set job name |
| #PBS -j oe | (default) | Optional: Combine stdout and stderr (SLURM default) |
| (default) | #SBATCH -e <job_name>-%j.err  #SBATCH -o <job_name>-%j.out | Optional: Separate stdout and stderr (SLURM: %j is a stand-in for $SLURM_JOB_ID) |
| #PBS -o filename | #SBATCH -o filename | Optional: Standard output filename |
| #PBS -e filename | #SBATCH -e filename | Optional: Error filename |
| N/A | #SBATCH --open-mode=append | Optional: Combine all output into a single file. Note: if this is selected, each job run will append to that filename, including preexisting files with that name |
| #PBS -v var=<value> | #SBATCH --export=var | Optional: Export single environment variable var to job |
| #PBS -V | #SBATCH --export=all (default) | Optional: Export all environment variables to job |
| (default) | #SBATCH --export=none | Optional: Do not export working environment to job |
| #PBS -m be | #SBATCH --mail-type=BEGIN\|END\|FAIL\|ALL | Optional: Request email notifications. Beware of mail bombing yourself |
| #PBS -M <netid>@email.arizona.edu | #SBATCH --mail-user=<netid>@email.arizona.edu | Optional: Email address used for notifications |
| #PBS -l place=excl | #SBATCH --exclusive | Optional: Request exclusive access to node |

Environment Variables

| PBS | SLURM | Purpose |
|---|---|---|
| $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | Job submission directory |
| $PBS_JOBID | $SLURM_JOB_ID | Job ID |
| $PBS_JOBNAME | $SLURM_JOB_NAME | Job name |
| $PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID | Index to differentiate tasks in an array |
| $PBS_O_HOST | $SLURM_SUBMIT_HOST | Hostname where job was submitted |
| $PBS_NODEFILE | $SLURM_JOB_NODELIST | List of nodes allocated to current job |

Terminology

| PBS | SLURM |
|---|---|
| Queue | Partition |
| Group List | Association |
| PI | Account |
Job Partition Requests

SLURM partition requests are slightly different from PBS. Use the following table as a guide for the partition that is relevant to you:

| Partition | SLURM | Details |
|---|---|---|
| standard | #SBATCH --account=<PI GROUP>  #SBATCH --partition=standard | Consumes your group's standard allocation. |
| windfall | #SBATCH --partition=windfall | Does not consume your group's standard allocation. Jobs may be interrupted and restarted by higher-priority jobs. The --account flag needs to be omitted or an error will occur. |
| high_priority | #SBATCH --account=<PI GROUP>  #SBATCH --partition=standard  #SBATCH --qos=user_qos_<PI GROUP> | Available for groups who have purchased compute resources. The partition flag is left as standard and requires the additional qos flag. Replace <PI GROUP> with your group's name. |
SLURM Output Filename Patterns

Unlike PBS, SLURM offers ways to make your job's output filenames more customizable through the use of character replacements. A table is provided below as a guide with some examples. Variables may be used or combined as desired. Note: character replacements may also be used with other SBATCH directives such as error filename, input filename, and job name.
| Variable | Meaning | Example SLURM Directive(s) | Output |
|---|---|---|---|
| %A | A job array's main job ID | #SBATCH --array=1-2  #SBATCH -o %A.out  #SBATCH --open-mode=append | 12345.out |
| %a | A job array's index number | #SBATCH --array=1-2  #SBATCH -o %A_%a.out | 12345_1.out, 12345_2.out |
| %J | Job ID plus stepid | #SBATCH -o %J.out | 12345.out |
| %j | Job ID | #SBATCH -o %j.out | 12345.out |
| %N | Hostname of the first compute node allocated to the job | #SBATCH -o %N.out | r1u11n1.out |
| %u | Username | #SBATCH -o %u.out | netid.out |
| %x | Jobname | #SBATCH --job-name=JobName  #SBATCH -o %x.out | JobName.out |
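For example, combining %x, %A, and %a, the following sketch (with a hypothetical job name) writes one output file per array task, labeled by job name, main job ID, and task index:

```bash
#!/bin/bash
#SBATCH --job-name=MyAnalysis
#SBATCH --array=1-3
#SBATCH -o %x_%A_%a.out          # e.g. MyAnalysis_12345_1.out ... MyAnalysis_12345_3.out
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --partition=windfall
echo "Running task ${SLURM_ARRAY_TASK_ID}"
```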
Job Examples

Single serial job submission
PBS Script
```bash
#!/bin/bash
#PBS -N Sample_PBS_Job
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=00:01:00
#PBS -q standard
#PBS -W group_list=<group_name>
cd $PBS_O_WORKDIR
pwd; hostname; date
module load python
python --version
```
SLURM Script
```bash
#!/bin/bash
#SBATCH --job-name=Sample_Slurm_Job
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:01:00
#SBATCH --partition=standard
#SBATCH --account=<group_name>
# SLURM inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
module load python
python --version
```
Array Submission
IMPORTANT:
When submitting jobs with named output files (i.e. with a line such as #SBATCH -o Job.out) as arrays, SLURM will write every array element to that filename, leaving you with only the output of the last completed job in the array. Use one of the following SLURM directives in your script to prevent this behavior:
Differentiates output files using array indices. Similar to PBS default. See SLURM Output Filename Patterns above for more information.
```bash
#SBATCH --output=Job-%a.out
```
Appends the output from all tasks in an array to the same output file. Warning: if a file with that name exists prior to running your job, the output will be appended to it.
```bash
#SBATCH --open-mode=append
```
PBS Script
```bash
#!/bin/bash
#PBS -N Sample_PBS_Job
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=00:01:00
#PBS -q standard
#PBS -W group_list=<group_name>
#PBS -J 1-5
cd $PBS_O_WORKDIR
pwd; hostname; date
echo "./sample_command input_file_${PBS_ARRAY_INDEX}.in"
```
SLURM Script
```bash
#!/bin/bash
#SBATCH --output=Sample_SLURM_Job-%a.out
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:01:00
#SBATCH --partition=standard
#SBATCH --account=<group_name>
#SBATCH --array=1-5
# SLURM inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
echo "./sample_command input_file_${SLURM_ARRAY_TASK_ID}.in"
```
MPI Example
For OpenMPI, the important variables are set by default, so you do not need to include them in your scripts:

```bash
export SBATCH_GET_USER_ENV=1
export OMPI_MCA_btl_openib_cpc_include=rdmacm
export OMPI_MCA_btl_openib_if_include=bnxt_re1
export OMPI_MCA_btl_openib_rroce_enable=1
export OMPI_MCA_btl=vader,self,openib
export OMPI_MCA_oob_tcp_if_include=eth1
```

For Intel MPI, these variables are set for you:

```bash
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
export FI_VERBS_IFACE=eth1
```
PBS Script
```bash
#!/bin/bash
#PBS -N Sample_MPI_Job
#PBS -l select=1:ncpus=16:mem=16gb
#PBS -l walltime=00:10:00
#PBS -W group_list=<group_name>
#PBS -q standard
cd $PBS_O_WORKDIR
pwd; hostname; date
module load openmpi
/usr/bin/time -o mpit_prog.timing mpirun -np 16 a.out
```
SLURM Script
```bash
#!/bin/bash
#SBATCH --job-name=Sample_MPI_Job
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=16
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=1gb
#SBATCH --time=00:10:00
#SBATCH --account=<group_name>
#SBATCH --partition=standard
#SBATCH --output=Sample_MPI_Job_%A.out
#SBATCH --error=Sample_MPI_Job_%A.err
# SLURM inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
module load openmpi3
/usr/bin/time -o mpit_prog.timing mpirun -np 16 a.out
```
Overview

All three clusters, Puma, Ocelote, and ElGato, use SLURM for resource management and job scheduling.

Additional SLURM Resources and Examples

| Link | Description |
|---|---|
| Official SchedMD User Documentation | Official SchedMD user documentation. Includes detailed information on SLURM directives and commands. |
| PBS ⇔ SLURM Rosetta Stone | Table for converting some common PBS job directives to SLURM syntax. |
| Puma Quick Start | HPC Quick Start guide. If you have never submitted a batch job before, this is a great place to start. |
| Job Examples | Basic SLURM example scripts. Includes PBS scripts for comparison. |
| Even More Job Examples! | Growing repository of example SLURM submission scripts. |
| Intro to HPC | A recorded video presentation of our Intro to HPC workshop. Keep your eyes peeled for periodic announcements in the HPC listserv on upcoming live sessions! |
SLURM and System Commands

Native SLURM Commands

| Command | Purpose | Example(s) |
|---|---|---|
| sbatch | Submits a batch script for execution | sbatch script.slurm |
| srun | Run parallel jobs. Can be used in place of mpirun/mpiexec. Can be used interactively as well as in batch scripts | srun -n 1 --mpi=pmi2 a.out |
| salloc | Requests a session to work on a compute node interactively | see: Interactive Jobs section below |
| squeue | Checks the status of pending and running jobs | squeue --job $JOBID  squeue --user $NETID |
| scancel | Cancel a running or pending job | scancel $JOBID  scancel -u $NETID |
| scontrol hold | Place a hold on a job to prevent it from being executed | scontrol hold $JOBID |
| scontrol release | Releases a hold placed on a job allowing it to be executed | scontrol release $JOBID |

System Commands

| Command | Purpose | Example(s) |
|---|---|---|
| va | Displays your group membership, your account usage, and CPU allocation. Short for "view allocation" | va |
| interactive | Shortcut for quickly requesting an interactive job. Use "interactive --help" to get full usage. | interactive -a $GROUP_NAME |
| job-history | Retrieves a running or completed job's history in a user-friendly format | job-history $JOBID |
| seff | Retrieves a completed job's memory and CPU efficiency | seff $JOBID |
| past-jobs | Retrieves past jobs run by user. Can be used with option "-d N" to search for jobs run in the past N days. | past-jobs -d 5 |
| job-limits | View your group's job resource limits and current usage. | job-limits $GROUP |
| nodes-busy | Display a visualization of nodes on a cluster and their usage | nodes-busy --help |
| system-busy | Display a text-based summary of a cluster's usage | system-busy |
| cluster-busy | Display a visualization of all three clusters' overall usage | cluster-busy --help |
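As a sketch of how these commands fit together in a typical session (the script name and job ID below are placeholders):

```bash
va                        # check which groups you belong to and your remaining allocation
sbatch my_job.slurm       # submit; prints a job ID, e.g. "Submitted batch job 123456"
squeue --user $USER       # watch your pending and running jobs
job-history 123456        # inspect a running or completed job (placeholder job ID)
seff 123456               # after completion, check CPU and memory efficiency
```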
Batch Job Directives

| Command | Purpose |
|---|---|
| #SBATCH --account=group_name | Specify the account where hours are charged. Don't know your group name? Run the command "va" to see which groups you belong to |
| #SBATCH --partition=partition_name | Set the job partition. This determines your job's priority and the hours charged. See Job Partition Requests below for additional information |
| #SBATCH --time=DD-HH:MM:SS | Set the job's runtime limit in days, hours, minutes, and seconds |
| #SBATCH --nodes=N | Allocate N nodes to your job. For non-MPI enabled jobs, this should be set to "--nodes=1" to ensure access to all requested resources and prevent memory errors |
| #SBATCH --ntasks=N  #SBATCH --cpus-per-task=M | ntasks specifies the number of tasks (or processes) the job will run. For MPI jobs, this is the number of MPI processes. Most of the time, you can use ntasks to specify the number of CPUs your job needs, though in some cases you may run into issues (for example, see Using Matlab). By default, you will be allocated one CPU per task. This can be increased by including the additional directive --cpus-per-task. The number of CPUs a job is allocated is cpus per task × ntasks, or M × N |
| #SBATCH --mem=Ngb | Select N GB of memory per node. If "gb" is not included, this value defaults to MB. Directives --mem and --mem-per-cpu are mutually exclusive |
| #SBATCH --mem-per-cpu=Ngb | Select N GB of memory per CPU. Valid values can be found in the Node Types/Example Resource Requests section below. If "gb" is not included, this value defaults to MB |
| #SBATCH --gres=gpu:N | Optional: Request N GPUs |
| #SBATCH --gres=gpu:ampere:N | Optional: Request N A100 GPUs |
| #SBATCH --gres=gpu:volta:N | Optional: Request N V100 GPUs |
| #SBATCH --constraint=hi_mem | Optional: Request a high memory node (Ocelote and Puma only) |
| #SBATCH --array=N-M | Submits an array job from indices N to M |
| #SBATCH --job-name=JobName | Optional: Specify a name for your job. This will not automatically affect the output filename |
| #SBATCH -e output_filename.err  #SBATCH -o output_filename.out | Optional: Specify output filename(s). If -e is missing, stdout and stderr will be combined |
| #SBATCH --open-mode=append | Optional: Append your job's output to the specified output filename(s) |
| #SBATCH --mail-type=BEGIN\|END\|FAIL\|ALL | Optional: Request email notifications. Beware of mail bombing yourself |
| #SBATCH --mail-user=email@address.xyz | Optional: Specify email address. If this is missing, notifications will go to your UArizona email address by default |
| #SBATCH --exclusive | Optional: Request exclusive access to node |
| #SBATCH --export=VAR | Optional: Export a comma-delimited list of environment variables to a job |
| #SBATCH --export=all (default) | Optional: Export your working environment to your job |
| #SBATCH --export=none | Optional: Do not export working environment to your job |
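Putting several of the directives above together, a typical batch script preamble might look like this sketch; the group name, resource numbers, and email address are placeholders to adapt to your own work:

```bash
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --account=<group_name>
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=01:00:00
#SBATCH -o example_job-%j.out
#SBATCH --mail-type=END
#SBATCH --mail-user=email@address.xyz

pwd; hostname; date
```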
SLURM Environment Variables

| Variable | Purpose | Example Value |
|---|---|---|
| $SLURM_ARRAY_JOB_ID | Job array's parent ID | 399124 |
| $SLURM_ARRAY_TASK_COUNT | Total number of subjobs in the array | 4 |
| $SLURM_ARRAY_TASK_ID | Job index number (unique for each job in the array) | 1 |
| $SLURM_ARRAY_TASK_MAX | Maximum index for the job array | 7 |
| $SLURM_ARRAY_TASK_MIN | Minimum index for the job array | 1 |
| $SLURM_ARRAY_TASK_STEP | Job array's index step size | 2 |
| $SLURM_CLUSTER_NAME | Which cluster your job is running on | elgato |
| $SLURM_CONF | Points to the SLURM configuration file | /var/spool/slurm/d/conf-cache/slurm.conf |
| $SLURM_CPUS_ON_NODE | Number of CPUs allocated to target node | 3 |
| $SLURM_GPUS_ON_NODE | Number of GPUs allocated to the target node | 1 |
| $SLURM_GPUS_PER_NODE | Number of GPUs per node. Only set if --gpus-per-node is specified | 1 |
| $SLURM_JOB_ACCOUNT | Account being charged | groupname |
| $SLURM_JOB_GPUS | The global GPU IDs of the GPUs allocated to the job. Only set in batch and interactive jobs. | 0 |
| $SLURM_JOB_ID | Your SLURM Job ID | 399072 |
| $SLURM_JOB_CPUS_PER_NODE | Number of CPUs per node. This can be a list if there is more than one node allocated to the job. The list has the same order as SLURM_JOB_NODELIST | 3,1 |
| $SLURM_JOB_NAME | The job's name | interactive |
| $SLURM_JOB_NODELIST | The nodes that have been assigned to your job | gpu[73-74] |
| $SLURM_JOB_NUM_NODES | The number of nodes allocated to the job | 2 |
| $SLURM_JOB_PARTITION | The job's partition | standard |
| $SLURM_JOB_QOS | The job's QOS/Partition | qos_standard_part |
| $SLURM_JOB_USER | The username of the person who submitted the job | netid |
| $SLURM_JOBID | Same as SLURM_JOB_ID, your SLURM Job ID | 399072 |
| $SLURM_MEM_PER_CPU | The memory/CPU ratio allocated to the job | 4096 |
| $SLURM_NNODES | Same as SLURM_JOB_NUM_NODES, the number of nodes allocated to the job | 2 |
| $SLURM_NODELIST | Same as SLURM_JOB_NODELIST, the nodes that have been assigned to your job | gpu[73-74] |
| $SLURM_NPROCS | The number of tasks allocated to your job | 4 |
| $SLURM_NTASKS | Same as SLURM_NPROCS, the number of tasks allocated to your job | 4 |
| $SLURM_SUBMIT_DIR | The directory where sbatch was used to submit the job | /home/u00/netid |
| $SLURM_SUBMIT_HOST | The hostname where sbatch was used to submit the job | wentletrap.hpc.arizona.edu |
| $SLURM_TASKS_PER_NODE | The number of tasks to be initiated on each node. This can be a list if there is more than one node allocated to the job. The list has the same order as SLURM_JOB_NODELIST | 3,1 |
| $SLURM_WORKING_CLUSTER | Valid for interactive jobs; set with the remote sibling cluster's IP address, port, and RPC version so that any sruns will know which cluster to communicate with. | elgato:foo:0000:0000:000 |
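A brief sketch of how a few of these variables might be used inside a batch script, for example to label output and create a per-job scratch directory (the /tmp path is purely illustrative):

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00
#SBATCH --partition=windfall

echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on ${SLURM_JOB_NODELIST}"
echo "Submitted from ${SLURM_SUBMIT_DIR} on ${SLURM_SUBMIT_HOST}"

# Per-job scratch space named after the job ID (illustrative location)
workdir=/tmp/${SLURM_JOB_USER}/${SLURM_JOB_ID}
mkdir -p "$workdir"
echo "Scratch directory: $workdir"
```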
SLURM Reason Codes

Sometimes, if you check a pending job using squeue, there are messages under the Reason column indicating why your job is not running. Some of these codes are non-intuitive, so a human-readable translation is provided below:

| Reason | Explanation |
|---|---|
| AssocGrpCpuLimit | This is a per-group limitation on the number of CPUs that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>". |
| AssocGrpMemLimit | This is a per-group limitation on the amount of memory that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>". |
| AssocGrpCPUMinutesLimit | Either your group is out of CPU hours or your job will exhaust your group's CPU hours. |
| AssocGrpGRES | This is a per-group limitation on the number of GPUs that can be used simultaneously by all group members. Your job is not running because this limit has been reached. Check your group's limits using "job-limits <group_name>". |
| Dependency | Your job depends on the completion of another job. It will wait in queue until the target job completes. |
| QOSGrpCPUMinutesLimit | This message indicates that your high priority or qualified hours allocation has been exhausted for the month. |
| QOSMaxWallDurationPerJobLimit | Your job's time limit exceeds the maximum allowable and will never run. To see an individual job's limits, run "job-limits <group_name>". |
| Nodes required for job are DOWN, DRAINED or reserved or jobs in higher priority partitions | This very long message simply means your job is waiting in queue until there is enough space for it to run. |
| Priority | Your job is waiting in queue until there is enough space for it to run. |
| QOSMaxCpuPerUserLimit | This is a per-user limitation on the number of CPUs that you can use simultaneously among all of your jobs. Your job is not running because this limit has been reached. Check your limits using "job-limits <group_name>". |
| ReqNodeNotAvail, Reserved for maintenance | Your job's time limit overlaps with an upcoming maintenance window. Run "uptime_remaining" to see when the system will go offline. If you remove and resubmit your job with a shorter walltime that does not overlap with maintenance, it will likely run. Otherwise, it will remain pending until after the maintenance window. |
| Resources | Your job is waiting in queue until the required resources are available. |
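To see the reason code for one of your pending jobs, the standard SLURM commands below are sufficient; the job ID is a placeholder. In squeue's default output, the reason appears in parentheses in the NODELIST(REASON) column.

```bash
squeue --user $USER          # pending jobs show their reason in the NODELIST(REASON) column
scontrol show job 123456     # detailed view; look for the Reason= field (placeholder job ID)
```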
Job Partition Requests

| Partition | SLURM | Details |
|---|---|---|
| standard | #SBATCH --account=<PI GROUP>  #SBATCH --partition=standard | Consumes your group's standard allocation. These jobs cannot be interrupted. |
| windfall | #SBATCH --partition=windfall | Does not consume your group's standard allocation. Jobs may be interrupted and restarted by higher-priority jobs. The --account flag needs to be omitted or an error will occur. |
| high_priority | #SBATCH --account=<PI GROUP>  #SBATCH --partition=high_priority  #SBATCH --qos=user_qos_<PI GROUP> | Available for groups who have purchased compute resources. |
| qualified | #SBATCH --account=<PI GROUP>  #SBATCH --partition=standard  #SBATCH --qos=qual_qos_<PI GROUP> | Available for groups that have submitted a special project request. |
SLURM Output Filename Patterns

SLURM offers ways to make your job's output filenames customizable through the use of character replacements. A table is provided below as a guide with some examples. Variables may be used or combined as desired. Note: character replacements may also be used with other SBATCH directives such as error filename, input filename, and job name.

| Variable | Meaning | Example SLURM Directive(s) | Output |
|---|---|---|---|
| %A | A job array's main job ID | #SBATCH --array=1-2  #SBATCH -o %A.out  #SBATCH --open-mode=append | 12345.out |
| %a | A job array's index number | #SBATCH --array=1-2  #SBATCH -o %A_%a.out | 12345_1.out, 12345_2.out |
| %J | Job ID plus stepid | #SBATCH -o %J.out | 12345.out |
| %j | Job ID | #SBATCH -o %j.out | 12345.out |
| %N | Hostname of the first compute node allocated to the job | #SBATCH -o %N.out | r1u11n1.out |
| %u | Username | #SBATCH -o %u.out | netid.out |
| %x | Jobname | #SBATCH --job-name=JobName  #SBATCH -o %x.out | JobName.out |
Node Types/Example Resource Requests

Standard Nodes

| Cluster | Max CPUs | Mem/CPU | Max Mem | Sample Request Statement |
|---|---|---|---|---|
| ElGato | 16 | 4gb | 62gb | #SBATCH --nodes=1  #SBATCH --ntasks=16  #SBATCH --mem-per-cpu=4gb |
| Ocelote | 28 | 6gb | 168gb | #SBATCH --nodes=1  #SBATCH --ntasks=28  #SBATCH --mem-per-cpu=6gb |
| Puma | 94 | 5gb | 470gb | #SBATCH --nodes=1  #SBATCH --ntasks=94  #SBATCH --mem-per-cpu=5gb |
GPU Nodes

Note: During the quarterly maintenance cycle on April 27, 2022, the ElGato K20s and Ocelote K80s were removed because they are no longer supported by Nvidia.
GPU jobs are requested using the generic resource (--gres) SLURM directive. In general, the directive to request N GPUs will be of the form --gres=gpu:N.
| Cluster | Max CPUs | Mem/CPU | Max Mem | Sample Request Statement |
|---|---|---|---|---|
| Ocelote | 28 | 8gb | 224gb | #SBATCH --nodes=1  #SBATCH --ntasks=28  #SBATCH --mem-per-cpu=8gb  #SBATCH --gres=gpu:1 |
| Puma | 94 | 5gb | 470gb | #SBATCH --nodes=1  #SBATCH --ntasks=94  #SBATCH --mem-per-cpu=5gb  #SBATCH --gres=gpu:1 |

Note: Up to four GPUs may be requested on Puma on a single GPU node with --gres=gpu:1, 2, 3, or 4.
High Memory Nodes

When requesting a high memory node, include both the memory-per-CPU and constraint directives.

| Cluster | Max CPUs | Mem/CPU | Max Mem | Sample Request Statement |
|---|---|---|---|---|
| Ocelote | 48 | 41gb | 2015gb | #SBATCH --nodes=1  #SBATCH --ntasks=48  #SBATCH --mem-per-cpu=41gb  #SBATCH --constraint=hi_mem |
| Puma | 94 | 32gb | 3000gb | #SBATCH --nodes=1  #SBATCH --ntasks=94  #SBATCH --mem-per-cpu=32gb  #SBATCH --constraint=hi_mem |
Total Job Memory vs. CPU Count
The memory your job is allocated depends on the number of CPUs you request. For example, on Puma standard nodes you get 5 GB for each CPU you request, so a standard job using 4 CPUs gets 5 GB/CPU × 4 CPUs = 20 GB of total memory. Each node type has its own memory ratio, equal to its total memory divided by its total number of CPUs. A reference for all the node types, their memory ratios, and how to request each can be found in the Node Types/Example Resource Requests section above.

What Happens if My Memory and CPU Requests Don't Match?

Our systems are configured to try to help when your memory request does not match your CPU count. For example, if you request 1 CPU and 470 GB of memory on Puma, the system will automatically scale up your CPU count to 94 to ensure that you get your full memory requirement. This does not go the other way: if you request less memory than your CPU count would provide, no adjustments are made. If you omit the --mem flag entirely, the system will use the memory ratio for the standard nodes on that cluster.

Possible Problems You Might Encounter

- Be careful when using the --mem-per-cpu flag. If you use a higher value than a standard node's ratio, you may inadvertently wind up in queue for a high memory node. On Puma there are three of these machines available for standard jobs and only one on Ocelote, so wait times are frequently longer than for standard nodes. If you notice your job is in queue much longer than you would expect, check it with job-history to ensure the memory ratio looks correct.
- Stick to using --ntasks=N and --cpus-per-task=M to request N × M CPUs (a short sketch follows this list). Using the flag -c N to request CPUs has been found to cause problems with memory requests and may inadvertently limit you to ~4MB of total memory.
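As a concrete sketch of the recommended pattern, the following request asks for 4 × 2 = 8 CPUs and, at Puma's standard 5 GB/CPU ratio, 40 GB of memory; the numbers are illustrative:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=00:10:00
#SBATCH --partition=windfall

echo "CPUs allocated on this node: ${SLURM_CPUS_ON_NODE}"
```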
Interactive Jobs
Tip: Want your session to start faster? Try one or both of the following:

- Switch to ElGato. This cluster shares the same operating system, software, and file system as Puma, so your workflows are often portable across clusters. Ocelote and ElGato standard nodes have 28 and 16 CPUs, respectively, and are often less utilized than Puma, meaning much shorter wait times. Before you run the interactive command, type elgato to switch.
- Use the account flag. By default, interactive requests a session using the windfall partition. Windfall is lower priority than standard, so these jobs take longer to get through the queue. Including the account flag switches your partition to standard. An example of this type of request:

```bash
$ interactive -a YOUR_GROUP
```
When you are on a login node, you can request an interactive session on a compute node. This is useful for checking available modules, testing submission scripts, compiling software, and running programs directly from the command line. We have a built-in shortcut command that will allow you to quickly and easily request a session by simply entering: interactive

When you request a session, the full salloc command being executed will be displayed for verification/copying/editing/pasting purposes. For example:

```bash
(ocelote) [netid@junonia ~]$ interactive
Run "interactive -h for help customizing interactive use"
Submitting with /usr/local/bin/salloc --job-name=interactive --mem-per-cpu=4GB --nodes=1 --ntasks=1 --time=01:00:00 --account=windfall --partition=windfall
salloc: Pending job allocation 531843
salloc: job 531843 queued and waiting for resources
salloc: job 531843 has been allocated resources
salloc: Granted job allocation 531843
salloc: Waiting for resource configuration
salloc: Nodes i16n1 are ready for job
[netid@i16n1 ~]$
```
Notice in the example above how the command prompt changes once your session starts. When you're on a login node, your prompt will show "junonia" or "wentletrap". Once you're in an interactive session, you'll see the name of the compute node you're connected to.

If no options are supplied to the interactive command, your job will automatically run using the windfall partition for one hour using one CPU. To use the standard partition, include the flag "-a" followed by your group's name. To see all the customization options:

```bash
(ocelote) [netid@junonia ~]$ interactive -h
Usage: /usr/local/bin/interactive [-x] [-g] [-N nodes] [-m memory per core] [-n ncpus per node] [-Q optional qos] [-t hh::mm:ss] [-a account to charge]
```

You may also create your own salloc commands using any desired SLURM directives for maximum customization.
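For example, a custom request for a four-CPU, two-hour session on the standard partition might look like the following sketch (the group name is a placeholder):

```bash
salloc --job-name=interactive --nodes=1 --ntasks=4 --mem-per-cpu=5GB \
       --time=02:00:00 --account=<group_name> --partition=standard
```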
MPI Jobs

OpenMPI

For OpenMPI, the important variables are set by default, so you do not need to include them in your scripts:

```bash
# Default OpenMPI variables
export SBATCH_GET_USER_ENV=1
export OMPI_MCA_btl_openib_cpc_include=rdmacm
export OMPI_MCA_btl_openib_if_include=bnxt_re1
export OMPI_MCA_btl_openib_rroce_enable=1
export OMPI_MCA_btl=vader,self,openib
export OMPI_MCA_oob_tcp_if_include=eth1
```
Intel MPI

For Intel MPI, the necessary variables are set for you. Before using it, unload the default OpenMPI modules:

```bash
module unload openmpi3 gnu8
```
If you're using Intel MPI with mpirun and are getting errors, try replacing mpirun -np $NPROCESSES with:

```bash
srun -n $NPROCESSES --mpi=pmi2
```
Parallel Work

To make proper use of a supercomputer, you will likely want to take advantage of many cores. Puma has 94 cores in each node available to SLURM. The exception is running hundreds or thousands of jobs using high-throughput computing. We have a training course, Introduction to Parallel Computing, which explains the concepts and terminology of parallel computing with some examples. The practical course Parallel Analysis in R is also useful.