Overview
SLURM
The new HPC system, Puma, uses SLURM as a job scheduler rather than PBS Pro. SLURM has several advantages:
- It provides more robust support for a larger number of jobs in the queue.
- It is used by national HPC groups such as XSEDE and TACC, making it easier for users to scale out to those systems.
- It will reject jobs that request impossible resource configurations.
Allocations and Job Queues
Using Puma with SLURM is similar to using ElGato and Ocelote with PBS. Users will still receive a monthly allocation of CPU hours associated with their PI's group, which will be deducted when they run jobs in the standard partition. Users will also still be able to use windfall to run jobs without consuming their monthly allocation. As on ElGato and Ocelote, jobs run using windfall will still be subject to preemption when resources are requested by higher-priority jobs.
Modules and Software
The process of finding, loading, and using software as modules will not change on the new system. Users will still be able to utilize the standard commands described in the Software section in our User Guide. However, in a departure from our previous systems, modules will not be available to load and utilize on the login nodes. To load, use, and test software for job submissions, users will need to request an interactive session. Interactive sessions may be requested by simply using the command "interactive".
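For example, a test workflow might look like the following minimal sketch. The salloc options, account name, and module name/version shown here are illustrative assumptions rather than a fixed site configuration.

```bash
# Request an interactive session from a login node using the site-provided wrapper
interactive

# Equivalent native SLURM request (account and resource values are placeholders)
salloc --account=YOUR_GROUP --partition=standard --ntasks=1 --time=01:00:00

# Once the session starts on a compute node, modules can be listed and loaded
module avail
module load python/3.8    # module name/version shown is a placeholder
python3 --version
```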
PBS → SLURM Rosetta Stone
In general, SLURM can translate and execute scripts written for PBS. This means that if you submit a PBS script written for Ocelote or ElGato on Puma (with the necessary resource request modifications), your script will likely run. However, there are a few caveats that should be noted:
- You will need to submit your job with the new SLURM commands, e.g. sbatch instead of qsub
- Some PBS directives do not translate directly to SLURM and cannot be interpreted
- The environment variables specific to PBS and SLURM are different. If your job relies on these, you will need to update them; common examples are PBS_O_WORKDIR and PBS_ARRAY_INDEX (see the sketch after this list)
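As an illustration of the kind of update needed (the program and input file names here are placeholders):

```bash
# PBS version:
#   cd $PBS_O_WORKDIR
#   ./my_program input.$PBS_ARRAY_INDEX

# SLURM equivalent, using the variables from the table below:
cd $SLURM_SUBMIT_DIR
./my_program input.$SLURM_ARRAY_TASK_ID
```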
To help with the transition to SLURM, we've also installed pbs2slurm, a utility that automatically converts some basic PBS Pro commands into their SLURM counterparts.
To get acquainted with the new scheduling system, refer to the following list of common PBS commands, directives, and environment variables and their SLURM counterparts.
PBS | SLURM | Purpose |
---|---|---|
Job Management | ||
qsub <options> | sbatch <options> | Batch submission of jobs to run without user input |
qsub -I <options> | salloc <options> | Request an interactive job |
(no direct PBS equivalent) | srun <options> | Submit a job for real-time execution. Can also be used to launch an interactive session. |
qstat | squeue | Show all jobs |
qstat <jobid> | squeue --job <jobid> | Check status of a specific job |
qstat -u <netid> | squeue -u <netid> | Check status of jobs specific to user |
qdel <jobid> | scancel <jobid> | Delete a specific job |
qdel -u <netid> | scancel -u <netid> | Delete all user jobs |
qstat -Q | sinfo | View information about nodes and queues. |
qhold <jobid> | scontrol hold <jobid> | Places a hold on a job to prevent it from being executed |
qrls <jobid> | scontrol release <jobid> | Releases a hold placed on a job allowing it to be executed |
Job Directives | ||
#PBS -W group_list=group_name | #SBATCH --account=group_name | Specify group name where hours are charged |
#PBS -q standard | #SBATCH --partition=standard | Set job queue |
#PBS -l walltime=HH:MM:SS | #SBATCH --time=HH:MM:SS | Set job walltime |
#PBS -l select=<N> | #SBATCH --nodes=<N> | Select N nodes. |
#PBS -l ncpus=<N> | #SBATCH --ntasks=<N> #SBATCH --cpus-per-task=<M> | PBS: select N cpus. SLURM: each task is assumed to require one cpu by default; optionally include --cpus-per-task=<M> if each task needs more. Together these request NxM cpus. |
#PBS -l mem=<N>gb | #SBATCH --mem=<N>gb | Select N gb of memory |
#PBS -l pcmem=<N>gb | #SBATCH --mem-per-cpu=<N>gb | Select N gb of memory per cpu |
#PBS -J N-M | #SBATCH --array=N-M | Array job submissions where N and M are integers |
#PBS -N JobName | #SBATCH --job-name=JobName | Optional: Set job name |
#PBS -j oe | | Optional: Combine stdout and stderr. This is the SLURM default, so no directive is needed |
#PBS -o filename | #SBATCH -o filename | Optional: Standard output filename |
#PBS -e filename | #SBATCH -e filename | Optional: Error filename |
#PBS -v var=<value> | #SBATCH --export=var | Optional: Export single environment variable var to job |
#PBS -V | #SBATCH --export=all (default) | Optional: Export all environment variables to job. |
#PBS -m be | #SBATCH --mail-type=<type> | Optional: Request email notifications. Valid types include BEGIN, END, FAIL, and ALL |
Environment Variables | ||
$PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | Job submission directory |
$PBS_JOBID | $SLURM_JOB_ID | Job ID |
$PBS_JOBNAME | $SLURM_JOB_NAME | Job name |
$PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID | Index to differentiate tasks in an array |
$PBS_O_HOST | $SLURM_SUBMIT_HOST | Hostname where job was submitted |
$PBS_NODEFILE | $SLURM_JOB_NODELIST | List of nodes allocated to current job |
Terminology | ||
Queue | Partition | |
Group List | Association | |
PI | Account |
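As a concrete illustration of the directive mappings above, the resource-request block of an Ocelote-style PBS script might be rewritten for Puma as follows. This is a sketch: the job name, group name, and resource values are placeholders, and the PBS lines use the combined select syntax rather than the individual rows from the table.

```bash
# PBS (Ocelote/ElGato) directives:
#   #PBS -N example
#   #PBS -W group_list=YOUR_GROUP
#   #PBS -q standard
#   #PBS -l select=1:ncpus=4:mem=16gb
#   #PBS -l walltime=02:00:00

# SLURM (Puma) equivalents:
#SBATCH --job-name=example
#SBATCH --account=YOUR_GROUP
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=16gb
#SBATCH --time=02:00:00
```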
Job Examples
Single serial job submission
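A minimal sketch of a complete single serial job script built from the directives in the table above; the account name, module, and script name are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=serial_job
#SBATCH --account=YOUR_GROUP        # group where hours are charged (placeholder)
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4gb

# Run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# Load required software (module name/version is a placeholder)
module load python/3.8
python3 my_script.py
```

Submit the script with sbatch, e.g. sbatch serial_job.slurm, and check its status with squeue -u <netid>.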
Array Submission
IMPORTANT:
When submitting named jobs (i.e. with the directive #SBATCH --job-name=JobName) as arrays, SLURM will overwrite the output file with the output of the last task in the array. Use one of the following SLURM directives in your script to prevent this behavior:
Differentiates output files by array index (similar to the PBS default):

```bash
#SBATCH --output=JobName-%a.out
```
Appends the output from all tasks in an array to the same output file. Warning: if a file with the output name exists prior to running your job, the output will be appended to that file:

```bash
#SBATCH --open-mode=append
```
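A minimal sketch of an array submission that applies the directives above; the account name, array range, and input file naming scheme are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --account=YOUR_GROUP        # placeholder group name
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --array=1-5                 # runs five tasks with indices 1 through 5
#SBATCH --output=array_job-%a.out   # separate output file per array index

cd $SLURM_SUBMIT_DIR

# Each array task processes a different input file (naming scheme is a placeholder)
./my_program input.$SLURM_ARRAY_TASK_ID
```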
MPI Example
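A minimal sketch of an MPI job; the account name, node and task counts, MPI module, and executable name are placeholders, and the MPI modules actually available on the system may differ:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --account=YOUR_GROUP        # placeholder group name
#SBATCH --partition=standard
#SBATCH --nodes=2
#SBATCH --ntasks=16                 # total number of MPI processes
#SBATCH --time=01:00:00

cd $SLURM_SUBMIT_DIR

# Load an MPI implementation (module name is a placeholder)
module load openmpi3

# srun launches the MPI processes across the allocated nodes
srun ./my_mpi_program
```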