Sun Grid Engine (SGE) Common Tasks

Submission and Administrative Hosts

Submission Hosts

Submission (or submit) hosts are used to dispatch jobs to the various SGE clusters. The list of submit hosts changes frequently.

You may log in to odin.ics.uci.edu to submit jobs. All users with an ICS shell account can log in to this host.

Administrative Hosts

Administrative hosts are used to run SGE monitoring tools, alter running jobs, and edit queues.

You may log in to odin.ics.uci.edu to perform administrative tasks. All users with an ICS shell account can log in to this host.

Setting up your Shell Environment for SGE

To set up your shell environment to run SGE commands, load the sge module:

module load sge

That module makes the following changes to your search paths:

append-path     PATH    /opt/sge/bin/lx-amd64   :
append-path     MANPATH /opt/sge/man    :
append-path     LD_LIBRARY_PATH /opt/sge/lib/lx-amd64   :

That module also sets the following environment variables:

setenv  MAILER                  /bin/mailx
setenv  XTERM                   /usr/bin/xterm
setenv  SGE_ROOT                /opt/sge
setenv  SDM_DIST                /opt/sge
setenv  SGE_CELL                uci-ics
setenv  SGE_QMASTER_PORT        6556
setenv  SGE_EXECD_PORT          6557
setenv  ARC                     lx-amd64

See the modules page for more information about modules @ ICS.
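
After loading the module, a quick check can confirm that the variables listed above are present in the current shell. The following is a minimal sketch: check_sge_env is a hypothetical helper name, not part of SGE or of the sge module, and the variable list is taken from the module output above.

```shell
#!/bin/sh
# check_sge_env is a hypothetical helper, not part of SGE or the sge module.
# It prints the name of each expected variable that is unset or empty and
# returns non-zero if any of them is missing.
check_sge_env() {
    missing=0
    for var in SGE_ROOT SGE_CELL SGE_QMASTER_PORT SGE_EXECD_PORT; do
        # Indirect lookup via eval, since plain sh has no ${!var}.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set"
            missing=1
        fi
    done
    return $missing
}
```

Running check_sge_env before the first qsub of a session prints nothing when the environment is complete.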

Common Commands (CLI)

Submission (qsub)

These directions assume that you are logged in to a host that is either a submit or an administrative host. Again, the host odin.ics.uci.edu is recommended.

Submitting Jobs

Typical qsub commands for submitting jobs to the SGE queues take the following form:

qsub -q <queuename> [-M youremail@provider] [-m beas] <scriptname>

Common options are explained below. For additional information please see the man page.

| Option | Explanation | Common Example |
|---|---|---|
| -q | Name of the queue that you are submitting your work to | 15day.q, 12hour.q |
| -M | Email address to send job email notifications to | |
| -m | Conditions on which to send email notifications | beas: begins, ends, aborts, suspends |
| <scriptname> | Name of script to run | |
| -o | File to use for standard out | ~/tmp/$JOB_ID.out |
| -e | File to use for standard errors | ~/tmp/$JOB_ID.err |
| -S | Shell to use | /bin/bash |
| -P | Name of project to associate with | |
| -pe wc_pe_name slot_range | Request slot range for parallel jobs | openmpi 3 |
| -t n | Submit an array of jobs | |

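Example (the single quotes keep the submitting shell from expanding $JOB_ID, so that SGE substitutes the job ID itself):

qsub -q 15day.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh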
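
The same options can also be embedded in the job script on lines beginning with #$, as the openmpi.sh example on this page does. The sketch below is an assumed equivalent of the command above; the job body (job_banner) is purely illustrative, and the queue, address, and file names are copied from the example.

```shell
#!/bin/bash
# Embedded qsub options: SGE reads lines starting with "#$". To the shell
# they are ordinary comments, so the script also runs outside SGE.
#$ -q 15day.q
#$ -M luser@ics.uci.edu
#$ -m beas
#$ -S /bin/bash
# No shell quoting is needed here, because SGE itself substitutes $JOB_ID.
#$ -o sge_script.$JOB_ID.out
#$ -e sge_script.$JOB_ID.err

# Illustrative job body: report the job ID and the host the job runs on.
# JOB_ID is set by SGE at run time; "none" is a fallback for local tests.
job_banner() {
    echo "job ${JOB_ID:-none} on $(uname -n)"
}
job_banner
```

With the options embedded, the submission reduces to qsub followed by the script name. Options given on the command line take precedence over embedded ones.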

Submitting to a Specific Project

Include -P projectname.p on the command line:

qsub -P sana.p -q 15day.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh

Submitting Jobs in Exclusive Mode

The exclusive scheduling feature allows a user to request that a job be given exclusive access to a specific host. This helps to guarantee predictable performance and to avoid interference when a job is not using all of the slots that are available on a host.

Example:

qsub -l excl=true -q 15day.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh

See SGE Queues for information about which queues offer exclusive mode.

Submitting Jobs with Memory Restrictions

Possible Limits

  • h_vmem: hard limit; the job is terminated when it crosses this threshold
  • s_vmem: soft limit; a signal is sent to the job when it crosses this threshold
  • mem_free (mf): the amount of free memory needed for the job
  • mem_used (mu): TBD

From the queue_conf man page:

The resource limit parameters s_vmem and h_vmem are implemented by
Sun Grid Engine as a job limit. They impose a limit on the amount of
combined virtual memory consumed by all the processes in the job. If
h_vmem is exceeded by a job running in the queue, it is aborted via a
SIGKILL signal (see kill(1)). If s_vmem is exceeded, the job is sent a
SIGXCPU signal which can be caught by the job. If you wish to allow a
job to be "warned" so it can exit gracefully before it is killed then
you should set the s_vmem limit to a lower value than h_vmem. For
parallel processes, the limit is applied per slot which means that the
limit is multiplied by the number of slots being used by the job
before being applied.

Examples:

Use the -l option with h_vmem, s_vmem, or mem_free to set memory limits and requirements. The following example requests a host with at least 100G of free memory. The job is sent a SIGXCPU signal when it crosses the 190G threshold and is killed if it reaches 200G:

qsub -l h_vmem=200G -l s_vmem=190G -l mem_free=100G -q openlab.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh
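
Because the man page excerpt states that the limit is applied per slot for parallel jobs, the effective ceiling for the whole job is the per-slot value multiplied by the slot count. A small worked sketch, assuming whole-gigabyte values; total_vmem_gb is a hypothetical helper name:

```shell
#!/bin/sh
# total_vmem_gb PER_SLOT_GB SLOTS
# Hypothetical helper: prints the combined limit in gigabytes that results
# when a per-slot h_vmem (or s_vmem) request is multiplied by the slots.
total_vmem_gb() {
    echo $(( $1 * $2 ))
}

# Example: requesting h_vmem=4G with 8 slots caps the whole job at 32G.
total_vmem_gb 4 8
```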

Submitting Jobs with Time Restrictions

Use the -l option with h_rt and s_rt to limit runtime. When a job is submitted with h_rt, it is killed when it reaches that hard limit.

qsub -l h_rt=02:00:00 -q openlab.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh

Using the s_rt option instead sends the job a SIGUSR1 signal, which the job can catch, when it overruns this limit:

qsub -l s_rt=02:00:00 -q openlab.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh
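
A job can catch the soft-limit signal and shut down cleanly before a hard limit arrives. The following is a minimal sketch, assuming that the job runs under bash and that s_rt delivers SIGUSR1 as described in queue_conf. The handler name and the work loop are hypothetical stand-ins for a real job. The same pattern with XCPU in place of USR1 would apply to s_vmem.

```shell
#!/bin/bash
# Flag set by the signal handler; the work loop checks it between steps.
stop_requested=0

# Hypothetical handler: SGE sends SIGUSR1 when s_rt is exceeded.
on_soft_limit() {
    echo "soft runtime limit reached, finishing current step"
    stop_requested=1
}
trap on_soft_limit USR1

# Stand-in for real work: run up to $1 steps, stopping early once the
# handler has fired. Prints the number of steps completed.
run_steps() {
    step=0
    while [ "$step" -lt "$1" ] && [ "$stop_requested" -eq 0 ]; do
        step=$((step + 1))
        # ... one unit of real work, then save a checkpoint ...
    done
    echo "completed $step steps"
}
```

Setting s_rt somewhat lower than h_rt in the same submission leaves the handler time to finish before the job is killed.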

Submitting Parallel Jobs

To run parallel jobs on CentOS 6 hosts, use /pkg/openmpi/2.0.2/bin/mpirun. For CentOS 7 hosts, use /pkg/openmpi/2.0.2-centos7/bin/mpirun.

Example script: script.sh

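#!/bin/sh -x
# Print the host and start time, sum the integers 1..100 (one per second),
# then print the sum and the end time.
hostname
date
sum=0
count=0
while [ "$count" -lt 100 ]; do
    count=$((count + 1))
    sum=$((sum + count))
    sleep 1
done
echo "$sum"
date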

Example openmpi script: openmpi.sh

#!/bin/sh
#$ -o /home/<user>/openmpi.out
#$ -e /home/<user>/openmpi.err
/pkg/openmpi/2.0.2-centos7/bin/mpirun /home/<user>/script.sh

To start the parallel job:

qsub -pe openmpi <slot range> -q <queue> openmpi.sh
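
Inside a parallel job, SGE exports NSLOTS (the number of granted slots) and PE_HOSTFILE (a file with one granted host per line and the slot count in the second column). The sketch below summarizes that file. The function name summarize_pe_hosts is hypothetical, and the sketch assumes the standard four-column host file layout.

```shell
#!/bin/sh
# summarize_pe_hosts [FILE]
# Hypothetical helper: prints "host slots" for each line of a PE host file
# and a final "total N" line. FILE defaults to $PE_HOSTFILE, which SGE sets
# inside parallel-environment jobs.
summarize_pe_hosts() {
    awk '{ print $1, $2; total += $2 } END { print "total", total + 0 }' \
        "${1:-$PE_HOSTFILE}"
}
```

Calling summarize_pe_hosts near the top of openmpi.sh records which hosts the job was given; the total should match $NSLOTS.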

Submitting an Array of Jobs

Use the -t option to submit an array of jobs.

Sample scripts below were found at this site: http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto

To run a simple shell script 100 times (a single number such as -t 100 would submit only the one task with index 100):

qsub -t 1-100 /opt/sge/examples/jobs/simple.sh

The following script runs a program 10000 times, writing one output file per task:

#!/bin/sh
# Tell the SGE that this is an array job, with "tasks" to be numbered 1 to 10000
#$ -t 1-10000
# When a single command in the array job is sent to a compute node,
# its task number is stored in the variable SGE_TASK_ID,
# so we can use the value of that variable to get the results we want:
~/programs/program -i ~/data/input.$SGE_TASK_ID -o ~/results/output.$SGE_TASK_ID

The following script submits an array of 10000 tasks, passing each task a different line of the seed file as an argument:

#!/bin/sh
#$ -t 1-10000
SEEDFILE=~/data/seeds
# Pick the line of the seed file whose number matches this task's index.
SEED=$(awk "NR==$SGE_TASK_ID" "$SEEDFILE")
~/programs/simulation -s "$SEED" -o ~/results/output.$SGE_TASK_ID
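
When each task is very short, the -t option also accepts a step size (for example -t 1-10000:100), so that each task handles a block of inputs. SGE exports SGE_TASK_ID and SGE_TASK_STEPSIZE to every task. The sketch below computes the block for the current task; task_range is a hypothetical helper, and the total of 10000 inputs is an assumption carried over from the scripts above.

```shell
#!/bin/sh
# task_range ID STEP TOTAL
# Hypothetical helper: prints "first last" for the block of inputs handled
# by the task with index ID, clamping the end of the block to TOTAL.
task_range() {
    first=$1
    last=$(( $1 + $2 - 1 ))
    if [ "$last" -gt "$3" ]; then
        last=$3
    fi
    echo "$first $last"
}

# Inside an array job these values come from SGE; the defaults of 1 allow
# a local test run. The total number of inputs (10000) is assumed.
task_range "${SGE_TASK_ID:-1}" "${SGE_TASK_STEPSIZE:-1}" 10000
```

A job script could loop from the first to the last number printed and run the program once per input, which cuts the number of scheduled tasks by the step size.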

Submitting Jobs, Excluding Hosts

The following command will exclude the host tristram when submitting jobs to the openlab cluster:

qsub -l 'h=!tristram' -q openlab.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh

The following command will exclude the hosts odin and tristram when submitting jobs to the openlab cluster:

qsub -l 'h=!(tristram|odin)' -q openlab.q -M luser@ics.uci.edu -m beas -o 'sge_script.$JOB_ID.out' -e 'sge_script.$JOB_ID.err' /home/luser/bin/sge_script.sh
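
For longer exclusion lists, the resource expression can be assembled from a list of host names. The following sketch produces the same two forms as the commands above; exclude_hosts is a hypothetical helper name.

```shell
#!/bin/sh
# exclude_hosts HOST...
# Hypothetical helper: prints a resource request that excludes the given
# hosts, in the same form as the two commands above.
exclude_hosts() {
    if [ "$#" -eq 1 ]; then
        printf 'h=!%s\n' "$1"
    else
        # Join all arguments with "|" inside the parentheses.
        joined=$(IFS='|'; echo "$*")
        printf 'h=!(%s)\n' "$joined"
    fi
}

# Possible use on the command line:
#   qsub -l "$(exclude_hosts tristram odin)" -q openlab.q <scriptname>
```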

Queue Status (qstat)

| Command | Action |
|---|---|
| qstat -f -q 15day_cluster.q | Status of hosts in a queue |
| qstat -g c -q syngene\* | Get summary status of queues that begin with syngene |
| qstat -explain [acE] | Explanation for errors of type [acE] |
| qstat | List your own jobs in queue |
| qstat -u '*' | List jobs of all users in queue |
| qstat -j <jobid> | Display errors associated with your job |
| qstat -j | Display summary errors associated with all jobs |
| qstat -f -qs E | Check all queues with status E {acdosuACDES} |
| qstat -f -q syngene_lisp_cluster.q -qs aE | Get status of single cluster |

Explanation for the Error State

Use the -explain switch in order to get an explanation regarding the nature of the error being reported.

qstat -explain c
 benzene.q@c1ccccc1-45.ics.uci. BIP   0/0/16         0.04     lx24-amd64    c
   warning: "slots" has ambiguous value ("@benzene", "@benzene24core")
   warning: "complex_values" has ambiguous value ("@benzene", "@benzene24core")

qstat -explain E
 baldig_guest.q@chem-11.ics.uci BIP   0/0/2          0.03     lx24-amd64    E
   queue baldig_guest.q marked QERROR as result of job 6346986's failure at host chem-11.ics.uci.edu

qstat -explain a
 bayonet_cluster.q@bayonet-05.i BIP   0/8/8          -NA-     lx24-amd64    au
   error: no value for "np_load_avg" because execd is in unknown state

List Your Jobs

qstat -u <username>
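
For a quick overview of a long job list, the qstat output can be summarized by job state. The sketch below is a hypothetical helper, count_job_states, which assumes the default qstat layout: two header lines, then one line per job with the state (for example r or qw) in the fifth column.

```shell
#!/bin/sh
# count_job_states: hypothetical helper that reads plain "qstat" output on
# standard input, skips the two header lines, and prints one "state count"
# line per job state, sorted by state. Assumes the state is column five.
count_job_states() {
    awk 'NR > 2 && NF { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Possible use:
#   qstat -u "$USER" | count_job_states
```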


Queue Configuration (qconf)

| Command | Action |
|---|---|
| qconf -sql | List available queues |
| qconf -as syngene14.ics.uci.edu | Set host as a submit host |
| qconf -ds syngene14.ics.uci.edu | Unset host as a submit host |
| qconf -ah syngene14.ics.uci.edu | Set host as an administrative host |
| qconf -dh syngene14.ics.uci.edu | Unset host as an administrative host |
| qconf -ahgrp @resources-test.h | Add a hostgroup (via your editor) |
| qconf -dhgrp @resources-test.h | Delete a hostgroup |
| qconf -mhgrp @resource-test.h | Modify a hostgroup (via your editor) |
| qconf -shgrp @andromeda.h | Display hosts in hostgroup |
| qconf -aq resource-test.q | Create a new queue (via your editor) |
| qconf -mq resource-test.q | Modify an existing queue (via your editor) |
| qconf -dq resource-test.q | Delete queue resource-test.q |
| qconf -su userset_name | List the members of a userset |

Setting Consumable Resources

Memory: Memory (h_vmem) has been set up as a consumable resource in the openlab queue.

Queue Accounting

| Command | Explanation |
|---|---|
| qacct -j <jobid> | Get accounting information for the job with the given job ID |
| qacct -j <jobid>* | Get accounting information for all jobs whose ID matches the pattern |
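
The qacct -j output is a list of name and value pairs, one per line, so a single field can be picked out with a short filter. The sketch below is a hypothetical helper, qacct_field. It assumes that the field name is the first column and the value the second, as with exit_status and maxvmem.

```shell
#!/bin/sh
# qacct_field NAME
# Hypothetical helper: reads "qacct -j" output on standard input and prints
# the second column of every line whose first column equals NAME.
qacct_field() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Possible use:
#   qacct -j <jobid> | qacct_field maxvmem
```

Comparing maxvmem from a finished job against the h_vmem request is one way to choose a tighter memory limit for the next submission.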

Altering items in Queue

Moving Pending Jobs to Another Queue with `qalter`

| Command | Action |
|---|---|
| qalter -q <queue> -u <username> | Move all jobs submitted by the specified user to another queue |
| qalter -q <queue> <jobid> | Move the specified job to another queue |
| qalter -h u <jobid> | Put a pending job “on hold” |
| qalter -h U <jobid> | Remove the user hold from a pending job |

Setting Complex Values

Complex values are host-specific settings, at least for h_vmem and s_vmem. Use qconf -mattr exechost to set them:

# Print each command, then apply it to every host in the netgroup.
for h in $(netg -h andromeda_cluster); do
    echo qconf -mattr exechost complex_values h_vmem=30G,s_vmem=29.9G,excl=true "$h"
    qconf -mattr exechost complex_values h_vmem=30G,s_vmem=29.9G,excl=true "$h"
done

Display complex values for hosts in a netgroup:

for h in $(netg -h archer_cluster); do
    qconf -se "$h"
done
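
To pull a single setting out of the qconf -se output, the complex_values line can be split on commas. The sketch below is a hypothetical helper, complex_value. It assumes that the complex_values entry fits on one line; long entries that qconf wraps with a trailing backslash are not handled.

```shell
#!/bin/sh
# complex_value NAME
# Hypothetical helper: reads "qconf -se <host>" output on standard input and
# prints the value assigned to NAME on the complex_values line, if any.
complex_value() {
    awk -v name="$1" '
        $1 == "complex_values" {
            n = split($2, pairs, ",")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                if (kv[1] == name) print kv[2]
            }
        }'
}

# Possible use inside the loop above:
#   qconf -se "$h" | complex_value h_vmem
```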
services/sun_grid_engine/sge_common_tasks.txt · Last modified: 2019/12/05 09:08 by Hans
CC Attribution-Noncommercial-Share Alike 4.0 International