Slurm@ICS
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We use Slurm at ICS to provide fair access to compute clusters for instruction and research.
External Documentation
Configuration
Compile and Installation
Module
Use the module command to add slurm to your environment.
module load slurm
Use the following command to update your profile rc files to load the module on login permanently:
module initadd slurm
Slurm Partitions
This list may not be current, so run 'sinfo' to see the partitions available to you.
Partition | Nodes | Cores per Node | CPU Core Types | RAM | Time Limit | MaxJobs | MPI Suitable? | GPU Capable |
---|---|---|---|---|---|---|---|---|
opengpu.p | 1 | 40 | Intel Xeon Silver 4114 @ 2.20GHz | 96GB | None | Yes | No | |
openlab.p | 48 | 24 | Intel Westmere | 96GB | None | Yes | No | |
openlab20.p | 48 | 24 | Intel Westmere | 96GB | None | Yes | No | |
Slurm Daemons
Slurm Controller Daemon (slurmctld)
This daemon runs exclusively on the control nodes. One control node is active at any time. If the first control node is down for any reason, another control node will take over. In this case it may be moot since the primary control node is also the only database node.
- addison-v9.ics.uci.edu
- broadchurch-v1.ics.uci.edu
Slurm Database Daemon (slurmdbd)
The database node runs mariadb and is the repository for accounting information.
- Hostname: addison-v9.ics.uci.edu
Slurm Daemon (slurmd)
This daemon runs on all worker nodes.
Commands
https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting
- sacct - displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database
- sacctmgr - Used to view and modify Slurm account information.
- salloc - Obtain a Slurm job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished.
- sattach - Attach to a Slurm job step
- sbatch - Submit a batch script to Slurm.
- sbcast - transmit a file to the nodes allocated to a Slurm job
- scancel - Used to signal jobs or job steps that are under the control of Slurm.
- scontrol - view or modify Slurm configuration and state.
- sdiag - Scheduling diagnostic tool for Slurm
- sinfo - view information about Slurm nodes and partitions.
- sprio - view the factors that comprise a job's scheduling priority
- squeue - view information about jobs located in the Slurm scheduling queue
- sreport - generate reports from the slurm accounting data
- srun - Run parallel jobs
- sshare - Tool for listing the shares of associations to a cluster
- sstat - Display various status information of a running job/step
- strigger - Used to set, get, or clear Slurm trigger information
- sview - GUI for interacting with Slurm.
Shortcuts
Description | Command |
---|---|
Run a single job | sbatch -p openlab.p hello_world.sh |
Run an interactive bash shell on 1 node with 1 task on openlab partition | srun -N1 -n1 -p openlab.p bash -i |
List running jobs | squeue |
Cancel job number | scancel job# |
Run 1 task for max 10 minutes | sbatch --ntasks=1 --time=10 pre_process.bash |
Run on 2 nodes, 16 tasks at a time, 1000 tasks | sbatch --nodes=2 -t 1-1000 --ntasks-per-node=16 -J sbatch-500 sleeper2A.sh 5 1 |
Allocated 2 nodes, 6 tasks per, srun hostname | salloc -N2 -n6 srun hostname |
Administrative Shortcuts
Description | Command |
---|---|
Resume Node | scontrol update NodeName=circinus-1 State=resume |
Writing a Slurm Script (Examples)
In the examples below, the #SBATCH lines are not comments; they are options read by the sbatch command when the job is submitted. For example, in the Hello World example, the job is submitted to the openlab.p partition.
Hello World
Copy and paste this code snippet into your terminal window:
cat << EOF > hello_world.sh
#!/bin/bash
#SBATCH -n 1              # Number of tasks to run (equal to 1 cpu/core per task)
#SBATCH -N 1              # Ensure that all cores are on one machine
#SBATCH -t 0-00:10        # Max Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p openlab.p      # Partition to submit to
#SBATCH -o myoutput.out   # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors.err   # File to which STDERR will be written, %j inserts jobid
perl -e 'print "Hello World.\n"'
EOF
Submit the job:
sbatch hello_world.sh
Look for the results in the local directory in the myoutput.out and myerrors.err files.
Note: Use the %j substitution to generate unique, job based file names, e.g. myoutput_%j.out or myerrors_%j.err
Note: To run on specific node(s), add to example above to run on circinus-1 only:
#SBATCH -w circinus-1 # node(s) to run it on, comma delimited
Sleepy time
Copy and paste this code:
cat << EOF > sleepy_time.sh
#!/bin/bash
#
#SBATCH --job-name=sleepy_time
#SBATCH --output=sleepy_time_%j.out
#SBATCH --error=sleepy_time_%j.out
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1
srun hostname
srun sleep 60
EOF
Submit the job:
sbatch sleepy_time.sh
MPI Fun with Pi
Grab the MPI_PI programs from github:
git clone https://github.com/kiwenlau/MPI_PI
The Monte Carlo routines are silly, so go into Trapezoid1:
cd MPI_PI/Trapezoid1
If running on CentOS, add the mpi directory to your path before starting:
PATH=/usr/lib64/openmpi/bin:$PATH
Compile Trapezoid mpi_pi (don't forget to include the math library):
mpicc -o pi mpi_pi.c -lm
Create a batch file named mpirun_batch.sh (make sure you include the PATH= line above if running on CentOS):
#!/bin/bash
# uncomment on CentOS
# PATH=/usr/lib64/openmpi/bin:$PATH
#SBATCH --job-name=you-are-kewl
#SBATCH --output=you-are-kewl.out
#SBATCH --partition=openlab.p
for i in $(seq 1 8)
do
    mpirun -np $i ./pi
done
Submit your batch job:
sbatch -N 5 mpirun_batch.sh
Distributed Stress Test
#!/bin/sh
#
# Script runs one process per node on ten nodes. The stress test runs for 30 minutes and exits.
#
#SBATCH -n 1
#SBATCH -N 10
#SBATCH -t 30
#SBATCH -p openlab.p
#SBATCH -o myoutput_%j.out
#SBATCH -e myoutput_%j.err
cd /extra/baldig11/hans
/home/hans/bin/stress -d 10
Execute it:
sbatch distributed_stress_test.sh
Array Job
An array job allows you to run many copies of the same job in parallel. This method does not let you pass different variables to each task, though. See "SRUN inside SBATCH" below for how to run parallel jobs while providing variables to each job.
Put the following into a file ~/bin/arrayjob.sh and make it executable `chmod 700 ~/bin/arrayjob.sh`
#!/bin/bash
#SBATCH --job-name=array_job_test                 # Job name
#SBATCH --mail-type=FAIL                          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ics.uci.edu             # Where to send mail
#SBATCH --ntasks=1                                # Run a single task
#SBATCH --mem=1gb                                 # Job Memory
#SBATCH --time=00:05:00                           # Time limit hrs:min:sec
#SBATCH --output=/home/hans/tmp/array_%A-%a.log   # Standard output and error log
#SBATCH --array=1-5000                            # Array range
pwd; hostname; date
echo This is task $SLURM_ARRAY_TASK_ID
date
Execute
At the command prompt, enter:
sbatch ~/bin/arrayjob.sh
SRUN inside SBATCH
You can ask for a specific number of cores and memory using an SBATCH script. Within the SBATCH script, you can assign out the resources that have been allocated via an SRUN command. What makes this different than an array job is that you can pass in variables to each SRUN command.
The following is a simple example where 50 cores are requested. Instead of the default 4 GB of memory per core, only 1 GB per core is requested. (This is necessary when the number of cores times the default memory exceeds the memory available on the server, which would cause the job to fail.) In this example, the job runs on only one specific node in the partition. If you wish to run it on all nodes in the partition, remove the line with the "-w" option.
#!/bin/bash
#SBATCH -n 1                 # Number of tasks to run (equal to 1 cpu/core per task)
#SBATCH -N 1                 # Ensure that all cores are on one machine
#SBATCH -t 0-05:00           # Max Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p openlab.p         # Partition to submit to
#SBATCH -w odin              # node(s) to run it on, comma delimited
#SBATCH --mem-per-cpu=1024   # amount of memory to allocate per core, default is 4096 MB
#SBATCH -o myoutput.out      # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors.err      # File to which STDERR will be written, %j inserts jobid

# i is passed in as an argument to the srun_script.sh script
for i in `seq 50`; do
    srun --exclusive -p openlab.p -w odin --nodes 1 --ntasks 1 /home/dutran/scripts/slurm/srun_script.sh ${i} &
done

# important to make sure the batch job won't exit before all the
# simultaneous runs are completed.
wait
The “wait” at the end of the script is important so do not delete it. In this example, “i” was passed as an argument to an srun_script.sh:
#!/bin/bash
echo "Hello World $1" >> /tmp/srun_output.txt
sleep $1
Before running the SBATCH script, it is beneficial to run the srun command in your script first to make sure there are no errors before trying out the sbatch script.
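For example, the srun line from the script above can be tested by hand with a single argument value:
srun --exclusive -p openlab.p -w odin --nodes 1 --ntasks 1 /home/dutran/scripts/slurm/srun_script.sh 1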
Single core / single GPU job
If you need only a single CPU core and one GPU:
#!/bin/bash
#SBATCH --ntasks=1       # Number of tasks to run (equal to 1 cpu/core per task)
#SBATCH --gres=gpu:1     # Number of GPUs (per node)
#SBATCH --mem=4000M      # memory (per node)
#SBATCH --time=0-03:00   # time (DD-HH:MM)
./program                # you can use 'nvidia-smi' for a test
Multi-threaded core / single GPU job
If you need 6 CPU cores and one GPU:
#!/bin/bash
#SBATCH --cpus-per-task=6   # CPU cores/threads
#SBATCH --gres=gpu:1        # Number of GPUs (per node)
#SBATCH --mem=4000M         # memory (per node)
#SBATCH --time=0-03:00      # time (DD-HH:MM)
./program                   # you can use 'nvidia-smi' for a test
View Available Resources
Please see https://slurm.schedmd.com/sinfo.html for further sinfo options.
This will list allocated resources on the openlab20.p partition:
sinfo -p openlab20.p --Format NodeList:30,StateCompact:10,FreeMem:15,AllocMem:10,Memory:10,CPUsState:15,CPUsLoad:10,GresUsed:35
NODELIST                      STATE   FREE_MEM      ALLOCMEM   MEMORY   CPUS(A/I/O/T)   CPU_LOAD    GRES_USED
circinus-[50,52-76,78-94,96]  down*   85503-89736   0          96000    0/0/1056/1056   0.00-1.12   gpu:0,mps:0,bandwidth:0
circinus-[49,51]              mix     83852-84050   8000       96000    4/44/0/48       0.22-0.34   gpu:0,mps:0,bandwidth:0
circinus-95                   mix     80303         92160      96000    2/22/0/24       1.29        gpu:0,mps:0,bandwidth:0
circinus-77                   down    93486         0          96000    0/0/24/24       0.00        gpu:0,mps:0,bandwidth:0
This will show you what is available and which nodes are down.
The circinus-[49,51] line needs a closer reading, though. There is 96000 MB of memory available per node, not 96000 MB combined, and 8000 MB is allocated on those two nodes. The CPU count, however, is the combined number of CPUs across both nodes: 4 are allocated between the two of them and 44 are left idle.
circinus-95 is also in a mixed state, and the remaining circinus machines in this example are down.
To further inspect what is available and allocated on circinus-68 CPU-wise, run:
sinfo -p openlab20.p -n circinus-68 -o "%11n %5a %10A %13C %10O %10e %m %15G %15h "
HOSTNAMES   AVAIL  NODES(A/I)  CPUS(A/I/O/T)  CPU_LOAD  FREE_MEM  MEMORY  GRES    OVERSUBSCRIBE
circinus-68 up     0/1         0/24/0/24      0.26      91897     96000   (null)  NO
This command will list CPU usage on all nodes in a partition:
sinfo -p openlab20.p --Node -o "%11n %5a %10A %13C %10O %10e %m %15G %15h "
PySlurm API
Compilation
Virtual Environment
https://github.com/PySlurm/pyslurm/wiki/Installing-PySlurm
Load a python 3 module with a working version of virtualenv. Version 3.7.1 will do:
module load python/3.7.1
Create virtual environment:
virtualenv pyslurmenv
source pyslurmenv/bin/activate
Make sure Cython is installed
pip install Cython
Clone the pyslurm git repo:
git clone https://github.com/PySlurm/pyslurm.git pyslurm.src
cd pyslurm.src
Compile (I've used the native gcc 4.8.5 on CentOS 7):
python setup.py build --slurm-lib=/pkg/slurm/19.05.3-2/lib/ --slurm-inc=/pkg/slurm/19.05.3-2/include
python setup.py install
Using the PySlurm API
Security
Jobs are transmitted to the running node via plaintext. Please do not incorporate sensitive data into your slurm scripts. Use Vault @ ICS to store and retrieve sensitive data.
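Rather than embedding a password or key directly in a job script, fetch it at runtime. A minimal sketch, assuming the ICS Vault service exposes the standard HashiCorp Vault CLI and that VAULT_ADDR and a token are already configured; the secret path below is a placeholder:
export DB_PASSWORD=$(vault kv get -field=password secret/myproject/db)   # hypothetical path; substitute your own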
Commands
sacct
This command will show allocated jobs and the number of GPUs assigned to them:
sacct -a -X --format=JobID,AllocCPUS,Reqgres
Show job account information for a specific job:
sacct -j <jobid> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
sacctmgr
This sets up the cluster record in the MariaDB accounting database and shows the current associations and configuration:
sacctmgr add cluster ics_general_use
sacctmgr show associations
sacctmgr show configuration
Create Accounts
Example:
sudo /pkg/slurm/19.05.3-2/bin/sacctmgr add account grad.a Description="ICS Grad Students" Organization=ics parent=ics.a
Accounts are similar to UNIX groups:
sacctmgr add account <account_name.a> Description="Bren School of Information and Computer Science" Organization=ics
sacctmgr add account support.a Description="Computing Support Group" Organization=sgroup parent=ics
sacctmgr add account baldig.a Description="Institute for Genomics and Bioinformatics" Organization=igb parent=ics
sacctmgr add account ugrad.a Description="Undergraduate Instruction" Organization=ics parent=inst
Create Users
sacctmgr create user name=hans DefaultAccount=sgroup.a
sacctmgr create user name=yuzok DefaultAccount=igb.a
[11:17:49 root@addison-v9]sacctmgr create user name=sources DefaultAccount=sgroup
 Adding User(s)
  sources
 Settings =
  Default Account = sgroup
 Associations =
  U = sources   A = sgroup   C = ics_genera
Would you like to commit changes? (You have 30 seconds to decide)
sacctmgr add user hans Account inst
List Users
List all users:
sacctmgr show user
sacctmgr show user -s
List information of a specific user:
sacctmgr show user <username>
List all accounts
sacctmgr show account
List all users in an account
sacctmgr show account -s <account>
List the associations for a specific user:
sacctmgr show assoc format=account,user,partition where user=rnail
salloc
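For example, this allocates two nodes and six tasks, runs hostname across the allocation, and releases the allocation when the command finishes (the same pattern as in the Shortcuts table above):
salloc -N2 -n6 srun hostname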
sbatch
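The script examples earlier on this page are all submitted with sbatch; the simplest form, as in the Shortcuts table, is:
sbatch -p openlab.p hello_world.sh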
sbcast
Use sbcast to copy data to allocated nodes.
The following allocates 100 nodes running bash, copies my_data to each of the 100 nodes and then executes a program, a.out.
salloc -N100 bash
sbcast --force my_data /tmp/luser/my_data
srun a.out
scancel
Cancel job # 77:
scancel 77
Cancel job 77 step 1.
scancel 77.1
Cancel all pending jobs for a user:
scancel --user=hans --state=pending
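To cancel all of a user's jobs regardless of state, drop the --state filter:
scancel --user=hans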
scontrol
scontrol show config
Release a held job:
scontrol release 77
Bring node back online:
scontrol update NodeName=circinus-1 State=resume
Set a node to the down state with reason hung_proc:
scontrol update NodeName=<node> State=down Reason=hung_proc
Lots of information about a job:
scontrol show job
scontrol show job 77
sdiag
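Run without arguments, sdiag prints scheduler statistics (jobs started, backfill activity, RPC counts, and so on):
sdiag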
sinfo
sinfo will show you the partitions your account is allowed to submit jobs to.
sinfo
sinfo --summarize
sinfo --list-reasons --long
sinfo --Node --long
This will show the number of GPUs available and on which hosts:
sinfo -o "%P %.10G %N"
sprio
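No site-specific example yet; as a generic sketch, the first command lists the priority factors for all pending jobs and the second for a single job (job ID 77 is a placeholder):
sprio -l
sprio -j 77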
squeue
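A couple of common invocations (the partition name comes from the partition table above):
squeue -u $USER          # your own jobs
squeue -p openlab.p -l   # long listing for one partition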
sreport
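No site-specific example yet; as a generic sketch, this reports cluster utilization over a date range (the dates are placeholders):
sreport cluster utilization start=2020-03-01 end=2020-03-31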
srun
Interactive Jobs
This will run an interactive session on a specific node in a specific partition with 4 GPUs allocated:
srun -p datalab.p --gres=gpu:4 -w datalab-gpu1 bash
Running with 1 GPU on a specific host:
srun -p biodatascience.p -w lucy --gres=gpu:1 --pty /bin/bash -i
- -p/--partition: Request a specific partition for the resource allocation.
- -w/--nodelist: Request a specific list of hosts.
- --gres: Specifies a comma-delimited list of generic consumable resources.
- --pty: Execute task zero in pseudo terminal mode.
sshare
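A generic sketch; the -a option lists the shares of all users rather than just your own associations:
sshare -a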
sstat
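A generic sketch; sstat only reports on running jobs/steps, and job ID 77 is a placeholder:
sstat -j 77 --format=JobID,MaxRSS,AveCPU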
strigger
strigger --set [OPTIONS...]
strigger --get [OPTIONS...]
strigger --clear
sview
SGE Equivalences
Job Submission
SGE | SLURM |
---|---|
qsub | sbatch |
qstat -f -q openlab.q | squeue |
SGE
qsub \
  -q openlab.q \
  -o /home/$USER/myjob.out \
  -e /home/$USER/myjob.err \
  /home/$USER/bin/myjob.sh arg1 arg2 arg3 ...
Slurm
sbatch \
  -p openlab.p \
  -o /home/$USER/myjob.%j.out \
  -e /home/$USER/myjob.%j.err \
  /home/$USER/bin/myjob.sh arg1 arg2 arg3
Example 2
SGE
qsub -l h_rt=72:00:00 -l s_rt=72:00:00 -q openlab.q \
  -o $LOGS/$CASE.$i.$EXP.$SET.$count.out \
  -e $LOGS/$CASE.$i.$EXP.$SET.$count.err \
  /bin/sh python $SCRIPTS/$PYEXEC $f $i
SLURM
sbatch -t 72:00:00 -p openlab.p \
  -o $LOGS/$CASE.$i.$EXP.$SET.$count.out \
  -e $LOGS/$CASE.$i.$EXP.$SET.$count.err \
  /bin/sh python $SCRIPTS/$PYEXEC $f $i
Job Reporting
SGE | SLURM |
---|---|
qstat -u luser | squeue -u sources |
qstat -j 77 | squeue -j 77 |
qstat -explain [acE] | squeue -l |
qstat -explain [acE] | scontrol show job |
Accounting
SGE | SLURM |
---|---|
qacct -j 77 | sacct -j 77 |
Alter Items in Queue
SGE | SLURM |
---|---|
qalter 77 | scontrol update JobId=77 |
Other Equivalencies
SGE Command: qsub
Slurm Command: sbatch
SGE | SLURM | |
---|---|---|
-V | --export=ALL | All of the user's environment will be loaded |
-q queue | -p partition | Request a specific partition for the resource allocation. |
-t 1:X | -a 1-X | Submit a job array, multiple jobs to be executed with identical parameters. |
-r y | --requeue | Specifies that the batch job should be eligible for requeueing. |
-sync | -W | Do not exit until the submitted job terminates. |
-cwd | --chdir | Set the working directory of the batch script to directory before it is executed. |
-N | -J | Job name |
-p | --nice | Run the job with an adjusted scheduling priority within Slurm. |
Old command in SGE:
qsub \
  -p -1000 \
  -cwd \
  -o ~/sge_logs/'$JOB_NAME-$HOSTNAME-$TASK_ID' \
  -e ~/sge_logs/'$JOB_NAME-$HOSTNAME-$TASK_ID' \
  -t 1:100 \
  -N "$JOB" \
  -r y \
  -sync y \
  -V \
  "$@"
Equivalent in slurm:
sbatch \
  --nice=1000 \
  --chdir=`pwd` \
  -o ~/slurm_logs/%j-%N-%t.out \
  -e ~/slurm_logs/%j-%N-%t.err \
  -a 1-100 \
  -J MyJob \
  --requeue \
  -W \
  --export=ALL \
  "$@"
Checkpointing
Coming soon.
States
Drain
Job exited with a code > 0. Use scontrol to resume the node. Use sacct -j job# to pull up the job data.
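For example, substituting the real job number and node name (the scontrol form matches the Administrative Shortcuts section above):
sacct -j <job#>
scontrol update NodeName=<node> State=resume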
Idle
Troubleshooting
Unable to register with slurm controller
OOM Killer killed my process
You receive an error indicating that your job was killed by the OOM killer.
slurmstepd: error: Detected 1 oom-kill event(s) in step 8337246.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Most obviously, you need to check your code for a memory leak.
However, if you are confident the code does not have a memory leak, make sure you are on a cluster with enough RAM. The OOM killer follows a scoring algorithm to select a process to kill; there may be memory pressure on the node you are running on, and your process may simply have had the highest score. In these cases:
- Select a cluster that has a sufficient amount of memory for your work.
- Make sure you reserve an appropriate amount of memory when submitting your job: #SBATCH --mem=1gb
Node has low cpu count
Nov 8 20:38:11 addison-v9 slurmctld[18105]: error: Node medusa has low cpu count (4 < 80)
Make sure the slurm.conf reflects the true core count. If it is too far off, then the slurmctld will not accept the slurmd node.
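A hypothetical slurm.conf node definition for this case is sketched below; the socket, core, thread, and memory values are placeholders and must match what 'slurmd -C' reports on the node itself:
NodeName=medusa CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN   # placeholder values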
Debugging
slurmctld
/pkg/slurm/19.05.3-2/sbin/slurmctld -v -D -f /etc/slurm/slurm.conf
Invalid account or account/partition combination specified
Error in logs
After a user submitted hello_world.sh (for example) with the sbatch command:
Mar 23 14:28:06 addison-v9 slurmctld[28692]: error: User 13430 not found
Mar 23 14:28:06 addison-v9 slurmctld[28692]: _job_create: invalid account or partition for user 13430, account '(null)', and partition 'openlab.p'
Mar 23 14:28:06 addison-v9 slurmctld[28692]: _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified
Error on cli
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
Solution
Log into the slurm master and run the following commands to see which account the user is a member of:
sacctmgr show assoc format=account,user,partition where user=<username>
sacctmgr show user <username> -s
If the result is a blank table, then the user will need to be created in Slurm. Create the user with the sacctmgr command and add them to the relevant accounts.
Account management is typically handled by the /usr/local/bin/slurm-ldap/convert-accounts.py script on the slurm master.
Also, make sure the user is in one of the POSIX groups allowed to log in to the host they are running their jobs on:
getent group <group_name>
Use the sinfo command to see which partitions the user can submit jobs to.
slurmctld: fatal: CLUSTER NAME MISMATCH.
slurmctld has been started with "ClusterName=ics.c", but read "ics_general_use" from the state files in StateSaveLocation. Running multiple clusters from a shared StateSaveLocation WILL CAUSE CORRUPTION. Remove /var/spool/slurm.state/clustername to override this safety check if this is intentional (e.g., the ClusterName has changed).

[3:49:13 root@broadchurch-v1]cat /var/spool/slurm.state/clustername
Solution
Do what the output tells you: rm /var/spool/slurm.state/clustername
fatal: Unable to determine this slurmd's NodeName
Solution
Confirm that this host exists in the /etc/slurm/slurm.conf file.
Partition in drain state
Solution
Resume the node on slurm. This also works for a down node.
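For example, using the same command as in the Administrative Shortcuts section (replace the node name as needed):
scontrol update NodeName=<node> State=resume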