Slurm@ICS

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We use Slurm at ICS to provide fair access to compute clusters for instruction and research.

External Documentation

Configuration

Compile and Installation

Module

Use the module command to add slurm to your environment.

module load slurm

Use the following command to update your profile rc files so that the module is loaded automatically at login:

module initadd slurm

Slurm Partitions

These may not be current, so run 'sinfo' to see the partitions available to you.

Partition   | Nodes | Cores per Node | CPU Core Types                   | RAM  | Time Limit | MaxJobs | MPI Suitable? | GPU Capable
opengpu.p   | 1     | 40             | Intel Xeon Silver 4114 @ 2.20GHz | 96GB | None       |         | Yes           | No
openlab.p   | 48    | 24             | Intel Westmere                   | 96GB | None       |         | Yes           | No
openlab20.p | 48    | 24             | Intel Westmere                   | 96GB | None       |         | Yes           | No

Slurm Daemons

Slurm Controller Daemon (slurmctld)

This daemon runs exclusively on the control nodes. Only one control node is active at any time; if the primary control node goes down for any reason, another control node will take over. In our case the failover is of limited value, since the primary control node is also the only database node.

  • addison-v9.ics.uci.edu
  • broadchurch-v1.ics.uci.edu
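
To check which controller is currently responding (a quick sanity check, assuming the slurm module is loaded):

scontrol ping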

Slurm Database Daemon (slurmdbd)

The database node runs mariadb and is the repository for accounting information.

  • Hostname: addison-v9.ics.uci.edu

Slurm Daemon (slurmd)

This daemon runs on all worker nodes.
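
A minimal health check, assuming the worker nodes run slurmd under systemd:

systemctl status slurmd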

Commands

See also: https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting

  • sacct - displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database
  • sacctmgr - used to view and modify Slurm account information
  • salloc - obtain a Slurm job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished
  • sattach - attach to a Slurm job step
  • sbatch - submit a batch script to Slurm
  • sbcast - transmit a file to the nodes allocated to a Slurm job
  • scancel - used to signal jobs or job steps that are under the control of Slurm
  • scontrol - view or modify Slurm configuration and state
  • sdiag - scheduling diagnostic tool for Slurm
  • sinfo - view information about Slurm nodes and partitions
  • sprio - view the factors that comprise a job's scheduling priority
  • squeue - view information about jobs located in the Slurm scheduling queue
  • sreport - generate reports from the Slurm accounting data
  • srun - run parallel jobs
  • sshare - tool for listing the shares of associations to a cluster
  • sstat - display various status information of a running job/step
  • strigger - used to set, get, or clear Slurm trigger information
  • sview - GUI for interacting with Slurm

Shortcuts

Description                                                                  | Command
Run a single job                                                             | sbatch -p openlab.p hello_world.sh
Run an interactive bash shell on 1 node with 1 task on the openlab partition | srun -N1 -n1 -p openlab.p bash -i
List running jobs                                                            | squeue
Cancel job number                                                            | scancel job#
Run 1 task for max 10 minutes                                                | sbatch --ntasks=1 --time=10 pre_process.bash
Run on 2 nodes, 16 tasks at a time, 1000 tasks                               | sbatch --nodes=2 -a 1-1000 --ntasks-per-node=16 -J sbatch-500 sleeper2A.sh 5 1
Allocate 2 nodes and 6 tasks, then run srun hostname                         | salloc -N2 -n6 srun hostname

Administrative Shortcuts

Description  | Command
Resume Node  | scontrol update NodeName=circinus-1 State=resume

Writing a Slurm Script (Examples)

In the examples below, the #SBATCH lines are not comments; they are options that are read when the sbatch command is run. For example, in the Hello World example, the job is submitted to the openlab.p partition.

Hello World

Copy and paste this code snippet into your terminal window:

cat << EOF > hello_world.sh
#!/bin/bash
#SBATCH -n 1                # Number of tasks to run (equal to 1 cpu/core per task)
#SBATCH -N 1                # Ensure that all cores are on one machine
#SBATCH -t 0-00:10          # Max Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p openlab.p   # Partition to submit to
#SBATCH -o myoutput.out  # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors.err  # File to which STDERR will be written, %j inserts jobid

perl -e 'print "Hello World.\n"'
EOF

Submit the job:

sbatch hello_world.sh

Look for the results in the local directory in the myoutput.out and myerrors.err files.

Note: Use the %j substitution to generate unique, job based file names, e.g. myoutput_%j.out or myerrors_%j.err

Note: To run on specific node(s), add to example above to run on circinus-1 only:

#SBATCH -w circinus-1  # node(s) to run it on, comma delimited

Sleepy time

Copy and paste this code:

cat << EOF > sleepy_time.sh
#!/bin/bash
#
#SBATCH --job-name=sleepy_time
#SBATCH --output=sleepy_time_%j.out
#SBATCH --error=sleepy_time_%j.out
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=1

srun hostname
srun sleep 60
EOF

Submit the job:

sbatch sleepy_time.sh

MPI Fun with Pi

Grab the MPI_PI programs from github:

git clone https://github.com/kiwenlau/MPI_PI

The Monte Carlo routines are silly, so go into Trapezoid1:

cd MPI_PI/Trapezoid1

If running on CentOS, add the mpi directory to your path before starting:

PATH=/usr/lib64/openmpi/bin:$PATH

Compile Trapezoid mpi_pi (don't forget to include the math library):

mpicc -o pi mpi_pi.c -lm 

Create a batch file, e.g. mpirun_batch.sh (make sure you include the PATH= line if running on CentOS):

#!/bin/bash

#SBATCH --job-name=you-are-kewl
#SBATCH --output=you-are-kewl.out
#SBATCH --partition=openlab.p

# uncomment on CentOS (keep it below the #SBATCH lines so they are still parsed)
# PATH=/usr/lib64/openmpi/bin:$PATH

for i in $(seq 1 8)
do
  mpirun -np $i ./pi
done

Submit your batch job:

sbatch -N 5 mpirun_batch.sh

Distributed Stress Test

#!/bin/sh
#
# Script runs one process per node on ten nodes.   The stress test runs for 30 minutes and exits.
#
#SBATCH -n 1
#SBATCH -N 10
#SBATCH -t 30
#SBATCH -p openlab.p
#SBATCH -o myoutput_%j.out
#SBATCH -e myoutput_%j.err

cd /extra/baldig11/hans
/home/hans/bin/stress -d 10

Execute it:

sbatch distributed_stress_test.sh

Array Job

An array job will allow you to run parallel jobs. This method does not take variables though. See “SRUN inside SBATCH” for how to run parallel jobs but provide variables to each job.

Put the following into a file ~/bin/arrayjob.sh and make it executable `chmod 700 ~/bin/arrayjob.sh`

#!/bin/bash
#SBATCH --job-name=array_job_test     # Job name
#SBATCH --mail-type=FAIL              # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ics.uci.edu # Where to send mail	
#SBATCH --ntasks=1                    # Run a single task
#SBATCH --mem=1gb                     # Job Memory
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=/home/hans/tmp/array_%A-%a.log      # Standard output and error log
#SBATCH --array=1-5000                # Array range
pwd; hostname; date

echo This is task $SLURM_ARRAY_TASK_ID

date

Execute

At the command prompt, enter:

sbatch ~/bin/arrayjob.sh

SRUN inside SBATCH

You can ask for a specific number of cores and memory using an SBATCH script. Within the SBATCH script, you can assign out the resources that have been allocated via an SRUN command. What makes this different than an array job is that you can pass in variables to each SRUN command.

The following is a simple example where 50 cores are requested. Instead of the default 4GB memory per core assignment, I only asked for 1GB per core. (This is necessary when the number of cores x default memory is higher than the available memory on the server, in which case the job would fail.) In this example, I am running the job on only one specific node in the partition. If you wish to run it on all nodes in the partition, just remove the line with the "-w" option.

#!/bin/bash
#SBATCH -n 50               # Number of tasks to run (equal to 1 cpu/core per task)
#SBATCH -N 1                # Ensure that all cores are on one machine
#SBATCH -t 0-05:00          # Max Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p openlab.p   # Partition to submit to
#SBATCH -w odin  # node(s) to run it on, comma delimited
#SBATCH --mem-per-cpu=1024  # amount of memory to allocate per core, default is 4096 MB
#SBATCH -o myoutput.out  # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e myerrors.err  # File to which STDERR will be written, %j inserts jobid

# i is passed in as an argument to the srun_script.sh script
for i in `seq 50`; do
  srun --exclusive -p openlab.p -w odin --nodes 1 --ntasks 1 /home/dutran/scripts/slurm/srun_script.sh ${i} &
done

# important to make sure the batch job won't exit before all the
# simultaneous runs are completed.
wait

The “wait” at the end of the script is important so do not delete it. In this example, “i” was passed as an argument to an srun_script.sh:

#!/bin/bash

echo "Hello World $1" >> /tmp/srun_output.txt
sleep $1

Before running the sbatch script, run the srun command from your script by itself first to make sure there are no errors.
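
For example, to test the srun command from the script above on its own (same script path and node as the example; adjust to your own paths):

srun --exclusive -p openlab.p -w odin --nodes 1 --ntasks 1 /home/dutran/scripts/slurm/srun_script.sh 1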

Single core / single GPU job

If you need only a single CPU core and one GPU:

#!/bin/bash
#SBATCH --ntasks=1                # Number of tasks to run (equal to 1 cpu/core per task)
#SBATCH --gres=gpu:1              # Number of GPUs (per node)
#SBATCH --mem=4000M               # memory (per node)
#SBATCH --time=0-03:00            # time (DD-HH:MM)
./program                         # you can use 'nvidia-smi' for a test
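
As a quick sketch of how to verify the GPU allocation interactively, assuming you have access to a GPU-capable partition such as datalab.p from the srun examples further down:

srun -p datalab.p --gres=gpu:1 --pty nvidia-smi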

Multi-threaded core / single GPU job

If you need 6 CPU cores and one GPU:

#!/bin/bash
#SBATCH --cpus-per-task=6         # CPU cores/threads
#SBATCH --gres=gpu:1              # Number of GPUs (per node)
#SBATCH --mem=4000M               # memory (per node)
#SBATCH --time=0-03:00            # time (DD-HH:MM)
./program                         # you can use 'nvidia-smi' for a test

View Available Resources

Please see https://slurm.schedmd.com/sinfo.html for further sinfo options.

This will list resource allocation information for the openlab20.p partition:

sinfo -p openlab20.p --Format NodeList:30,StateCompact:10,FreeMem:15,AllocMem:10,Memory:10,CPUsState:15,CPUsLoad:10,GresUsed:35
NODELIST                      STATE     FREE_MEM       ALLOCMEM  MEMORY    CPUS(A/I/O/T)  CPU_LOAD  GRES_USED                       
circinus-[50,52-76,78-94,96]  down*     85503-89736    0         96000     0/0/1056/1056  0.00-1.12 gpu:0,mps:0,bandwidth:0         
circinus-[49,51]              mix       83852-84050    8000      96000     4/44/0/48      0.22-0.34 gpu:0,mps:0,bandwidth:0         
circinus-95                   mix       80303          92160     96000     2/22/0/24      1.29      gpu:0,mps:0,bandwidth:0         
circinus-77                   down      93486          0         96000     0/0/24/24      0.00      gpu:0,mps:0,bandwidth:0         

This will show you what is available and which nodes are down.

It is less clear for circinus-[49,51], though. There is 96000 MB of memory available per node, not 96000 MB combined. There is 8000 MB allocated on 49 and 51. The CPU count is the combined CPUs available between both nodes, with 4 allocated between the two of them and 44 left idle.

The rest of the circinus machines are idle.
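
To list only the idle nodes in a partition (a sketch using standard sinfo filters, not tied to the output above):

sinfo -p openlab20.p --states=idle -o "%n %e %C"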

To further inspect what is available and allocated on circinus-68 CPU-wise, run:

sinfo -p openlab20.p -n circinus-68 -o "%11n %5a %10A %13C %10O %10e %m %15G %15h "
HOSTNAMES   AVAIL NODES(A/I) CPUS(A/I/O/T) CPU_LOAD   FREE_MEM   MEMORY GRES            OVERSUBSCRIBE
circinus-68 up    0/1        0/24/0/24     0.26       91897      96000 (null)          NO

This command will list CPU usage on all nodes in a partition:

sinfo -p openlab20.p --Node -o "%11n %5a %10A %13C %10O %10e %m %15G %15h "

PySlurm API

Compilation

Virtual Environment

https://github.com/PySlurm/pyslurm/wiki/Installing-PySlurm

Load a python 3 module with a working version of virtualenv. Version 3.7.1 will do:

module load python/3.7.1

Create virtual environment:

virtualenv pyslurmenv
source pyslurmenv/bin/activate

Make sure Cython is installed

pip install Cython

Clone the pyslurm git repo:

git clone https://github.com/PySlurm/pyslurm.git pyslurm.src
cd pyslurm.src

Compile (I've used the native gcc 4.8.5 on CentOS7)

python setup.py  build --slurm-lib=/pkg/slurm/19.05.3-2/lib/ --slurm-inc=/pkg/slurm/19.05.3-2/include
python setup.py  install
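
A quick way to confirm the build, while still inside the virtual environment, is to import the module (this only checks that the extension loads):

python -c "import pyslurm"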

Using the PySlurm API

Security

Jobs are transmitted to the running node via plaintext. Please do not incorporate sensitive data into your slurm scripts. Use Vault @ ICS to store and retrieve sensitive data.

Commands

sacct

This command will show allocated jobs and the number of GPUs assigned to each:

sacct -a -X --format=JobID,AllocCPUS,Reqgres

Show job account information for a specific job:

sacct -j <jobid> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

sacctmgr

These commands set up the accounting cluster in the MariaDB accounting database and display the associations and configuration:

sacctmgr add cluster ics_general_use
sacctmgr show associations
sacctmgr show configuration

Create Accounts

Example:

sudo /pkg/slurm/19.05.3-2/bin/sacctmgr add account grad.a Description="ICS Grad Students" Organization=ics parent=ics.a

Accounts are similar to UNIX groups:

sacctmgr add account <account_name.a> Description="Bren School of Information and Computer Science" Organization=ics 
sacctmgr add account support.a Description="Computing Support Group" Organization=sgroup parent=ics
sacctmgr add account baldig.a Description="Institute for Genomics and Bioinformatics" Organization=igb parent=ics
sacctmgr add account ugrad.a Description="Undergraduate Instruction" Organization=ics parent=inst

Create Users

sacctmgr create user name=hans DefaultAccount=sgroup.a
sacctmgr create user name=yuzok DefaultAccount=igb.a
[11:17:49 root@addison-v9]sacctmgr create user name=sources DefaultAccount=sgroup                                      
 Adding User(s)
  sources
 Settings =
  Default Account = sgroup
 Associations =
  U = sources   A = sgroup     C = ics_genera
Would you like to commit changes? (You have 30 seconds to decide)
sacctmgr add user hans Account=inst

List Users

List all users:

sacctmgr show user
sacctmgr show user -s

List information of a specific user:

sacctmgr show user <username>

List all accounts

sacctmgr show account

List all users in an account:

sacctmgr show account -s <account>

List the associations (account and partition) for a specific user:

sacctmgr show assoc format=account,user,partition where user=rnail

salloc

Needs an example
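
A minimal sketch in the meantime, assuming the openlab.p partition from the examples above: allocate one node with four tasks for 30 minutes, run a command inside the allocation, then release it by exiting the shell.

salloc -N1 -n4 -t 0-00:30 -p openlab.p
srun hostname
exit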

sbatch
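
See the Hello World example above for a full batch script. For a quick one-off command without a script file, sbatch also accepts --wrap (a minimal sketch):

sbatch -p openlab.p --wrap="hostname"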

sbcast

Use sbcast to copy data to allocated nodes.

The following allocates 100 nodes running bash, copies my_data to each of the 100 nodes and then executes a program, a.out.

salloc -N100 bash
sbcast --force my_data /tmp/luser/my_data
srun a.out

scancel

Cancel job # 77:

scancel 77

Cancel job 77 step 1.

scancel 77.1

Cancel all pending jobs for a specific user:

scancel --user=hans --state=pending

scontrol

scontrol show config

Release a held job:

scontrol release 77

Bring node back online:

scontrol update NodeName=circinus-1 State=resume

Set a node down with reason "hung_proc":

scontrol update NodeName=<node> State=down Reason=hung_proc

Lots of information about a job:

scontrol show job
scontrol show job 77

sdiag

Needs an example.
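
A minimal sketch in the meantime: print the scheduler statistics, or reset them (the reset likely requires administrative privileges):

sdiag
sdiag --reset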

sinfo

sinfo will show you the partitions your account is allowed to submit jobs to.

sinfo
sinfo --summarize
sinfo --list-reasons --long
sinfo --Node --long

This will show the number of GPUs available and on which hosts:

sinfo  -o "%P %.10G %N"

sprio
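
A minimal sketch (job 77 is a placeholder id): show the priority factors for all pending jobs, or for a single job:

sprio -l
sprio -j 77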

squeue
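
Some common invocations (a sketch): list your own jobs, or list jobs in a partition with long output:

squeue -u $USER
squeue -p openlab.p -l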

sreport
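
A minimal sketch, assuming you want per-account utilization from the accounting database for a date range (the dates are placeholders):

sreport cluster AccountUtilizationByUser start=2022-01-01 end=2022-02-01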

srun

Interactive Jobs

This will run an interactive session on a specific node in a specific partition with 4 GPUs allocated:

srun -p datalab.p --gres=gpu:4 -w datalab-gpu1 bash

Running with 1 GPU on a specific host:

srun -p biodatascience.p -w lucy --gres=gpu:1 --pty /bin/bash -i

  • -p/--partition: Request a specific partition for the resource allocation.
  • -w/--nodelist: Request a specific list of hosts.
  • --gres: Specifies a comma-delimited list of generic consumable resources.
  • --pty: Execute task zero in pseudo terminal mode.

sshare
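
A minimal sketch: list the shares of all users, or of a single user (hans is the example user from the accounting section above):

sshare -a
sshare -u hans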

sstat
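
A minimal sketch, showing memory and CPU usage for the batch step of a running job (replace 77 with your job id):

sstat -j 77.batch --format=JobID,MaxRSS,AveCPU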

strigger

strigger --set   [OPTIONS...]
strigger --get   [OPTIONS...]
strigger --clear [OPTIONS...]
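
For example (a sketch; the notification script path is hypothetical), set a trigger that runs a program whenever a node goes down:

strigger --set --node --down --program=/usr/local/sbin/slurm_node_down.sh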

sview

Run this command to invoke the GUI:
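
sview

sview is an X11 application, so it assumes your ssh session has X forwarding enabled (e.g. ssh -X).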

SGE Equivalences

Job Submission

SGE                 | SLURM
qsub                | sbatch
qstat -f -q OPENLAB | squeue

SGE

qsub \
-q openlab.q \
-o /home/$USER/myjob.out \
-e /home/$USER/myjob.err \
/home/$USER/bin/myjob.sh arg1 arg2 arg3 ...

Slurm

sbatch \
-p openlab.p \
-o /home/$USER/myjob.%j.out \
-e /home/$USER/myjob.%j.err \
/home/$USER/bin/myjob.sh arg1 arg2 arg3

Example 2

SGE

qsub -l h_rt=72:00:00 -l s_rt=72:00:00 -q openlab.q \
  -o $LOGS/$CASE.$i.$EXP.$SET.$count.out \
  -e $LOGS/$CASE.$i.$EXP.$SET.$count.err \
  /bin/sh python $SCRIPTS/$PYEXEC $f $i

SLURM

sbatch -t 72:00:00 -p openlab.p \
  -o $LOGS/$CASE.$i.$EXP.$SET.$count.out \
  -e $LOGS/$CASE.$i.$EXP.$SET.$count.err \
  /bin/sh python $SCRIPTS/$PYEXEC $f $i

Job Reporting

SGE                  | SLURM
qstat -u luser       | squeue -u sources
qstat -j 77          | squeue -j 77
qstat -explain [acE] | squeue -l
qstat -explain [acE] | scontrol show job

Accounting

SGE         | SLURM
qacct -j 77 | sacct -j 77

Alter Items in Queue

SGE         | SLURM
qstat -j 77 | sacct -j 77

Other Equivalencies

SGE Command: qsub

Slurm Command: sbatch

SGE      | SLURM        | Description
-V       | --export=ALL | All of the user's environment will be loaded.
-q queue | -p partition | Request a specific partition for the resource allocation.
-t 1:X   | -a 1-X       | Submit a job array, multiple jobs to be executed with identical parameters.
-r y     | --requeue    | Specifies that the batch job should be eligible for requeue.
-sync    | -W           | Do not exit until the submitted job terminates.
-cwd     | --chdir      | Set the working directory of the batch script to the given directory before it is executed.
-N       | -J           | Job name.
-p       | --nice       | Run the job with an adjusted scheduling priority within Slurm.

Old command in SGE:

qsub \
-p -1000 \
-cwd \
-o ~/sge_logs/'$JOB_NAME-$HOSTNAME-$TASK_ID' \
-e ~/sge_logs/'$JOB_NAME-$HOSTNAME-$TASK_ID' \
-t 1:100 \
-N "$JOB" \
-r y \
-sync y \
-V \
"$@"

Equivalent in slurm:

sbatch \
--nice=1000 \
--chdir=`pwd` \
-o ~/slurm_logs/%j-%N-%t.out \
-e ~/slurm_logs/%j-%N-%t.err \
-a 1-100 \
-J MyJob \
--requeue \
-W \
--export=ALL \
"$@"

Checkpointing

Coming soon.

States

Drain

Job exited with code > 0. Use scontrol to resume the node. Use sacct -j job# to pull up data about the job.
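
For example, using the node and job id placeholders from earlier sections:

scontrol update NodeName=circinus-1 State=resume
sacct -j 77 --format=User,JobID,State,Elapsed,MaxRSS,NodeList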

Idle

Troubleshooting

Unable to register with slurm controller

OOM Killer killed my process

You receive an error indicating that your job was killed by the OOM killer.

slurmstepd: error: Detected 1 oom-kill event(s) in step 8337246.batch
cgroup. Some of your processes may have been killed by the cgroup
out-of-memory handler.

First, check your code for a memory leak.

However, if you are confident the code does not have a memory leak, make sure you are on a cluster that has enough RAM. The OOM killer follows a scoring algorithm to select a process to kill; if there is memory pressure on the node you are running on, your process may simply have had the highest score. In these cases:

  1. Select a cluster that has a sufficient amount of memory for your work.
  2. Make sure you reserve an appropriate chunk of memory when submitting your job: #SBATCH --mem=1gb

Node has low cpu count

Nov 8 20:38:11 addison-v9 slurmctld[18105]: error: Node medusa has low cpu count (4 < 80)

Make sure the slurm.conf reflects the true core count. If it is too far off, then the slurmctld will not accept the slurmd node.
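
To compare the hardware that slurmd detects with what slurm.conf declares (a quick check; run the first command on the affected node):

slurmd -C
grep -i medusa /etc/slurm/slurm.conf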

Debugging

slurmctld

/pkg/slurm/19.05.3-2/sbin/slurmctld -v -D -f /etc/slurm/slurm.conf 

Invalid account or account/partition combination specified

Error in logs

After user submitted hello_world.sh (for example) with sbatch command

  Mar 23 14:28:06 addison-v9 slurmctld[28692]: error: User 13430 not found
  Mar 23 14:28:06 addison-v9 slurmctld[28692]: _job_create: invalid account or partition for user 13430, account '(null)', and partition 'openlab.p'
  Mar 23 14:28:06 addison-v9 slurmctld[28692]: _slurm_rpc_submit_batch_job: Invalid account or account/partition combination specified

Error on cli

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

Solution

Log into the slurm master and run the following commands to see which accounts the user is a member of:

sacctmgr show assoc format=account,user,partition where user=<username>
sacctmgr show user <username> -s 

If a blank table is the result, then the user will need to be created in Slurm. Create the user using the sacctmgr command and add them to the relevant accounts.

Account management is typically handled by the /usr/local/bin/slurm-ldap/convert-accounts.py script on the slurm-master.

Also, make sure the user is in one of the POSIX groups allowed to log in to the host they are running their jobs on:

getent group <group_name>

Use the sinfo command to see which partitions the user can submit jobs to.

slurmctld: fatal: CLUSTER NAME MISMATCH.

slurmctld has been started with "ClusterName=ics.c", but read "ics_general_use" from the state files in StateSaveLocation. Running multiple clusters from a shared StateSaveLocation WILL CAUSE CORRUPTION. Remove /var/spool/slurm.state/clustername to override this safety check if this is intentional (e.g., the ClusterName has changed). You can inspect the stored name with:

cat /var/spool/slurm.state/clustername

Solution

Do what the output tells you:

rm /var/spool/slurm.state/clustername

fatal: Unable to determine this slurmd's NodeName

Solution

Confirm that this host exists in the /etc/slurm/slurm.conf file.

Partition in drain state

Solution

Resume the node with scontrol (see Administrative Shortcuts above). This also works for a down node.
