SGE 6

1. SGE 6 vs SGE 5

1.1. Cluster Queues and Instance Queues

SGE 6 is significantly different from SGE 5.3. SGE 6 introduces cluster-queues and host-groups; SGE 5.3's queues still exist, but are now called queue-instances.

In SGE 5.3 a queue lived on a particular machine/node. In SGE 6, by contrast, a cluster-queue can be thought of in two ways: as a front-end to a cluster of nodes, to which jobs are submitted for execution on one of those nodes; and as a type/class whose instances (the queue-instances) live on particular machines/nodes.
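To make the naming concrete: SGE 6 refers to each queue-instance as cluster-queue@host, which is the form `qstat -f` lists and `qmod -cq` accepts. The sketch below only builds such names from a cluster-queue name and some hosts; the function name queue_instances and the host names node01/node02 are hypothetical, and nothing here queries SGE.

```shell
#!/bin/sh
# Sketch: print the queue-instance names a cluster-queue would have on the
# given hosts. SGE 6 names each instance <cluster-queue>@<host>.
# queue_instances and the host names used below are hypothetical.
queue_instances() {
    cq=$1
    shift
    for host in "$@"; do
        printf '%s@%s\n' "$cq" "$host"
    done
}

queue_instances simonh.q node01 node02
# prints:
# simonh.q@node01
# simonh.q@node02
```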

1.2. Cluster-Queues vs Parallel-Environments

With SGE 6, unless you are actually running parallel jobs (MPI, PVM, etc.), there is no need to deal with parallel-environments at all. Good.

1.3. Host-Groups

To simplify setting up cluster-queues, SGE 6 provides host-groups, which are simply named groups (lists) of machines/nodes. Host-group names start with @, as in the @allhosts used below.
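A host-group definition, as shown by `qconf -shgrp @allhosts`, is just a group_name line and a hostlist line. A minimal sketch that prints such a definition, assuming that two-line format; the function make_hostgroup, the group simonh_hosts and the node names are hypothetical.

```shell
#!/bin/sh
# Sketch: print a host-group definition in the two-line format shown by
# `qconf -shgrp`: "group_name @name" then "hostlist host1 host2 ...".
# make_hostgroup and all names passed to it below are hypothetical.
make_hostgroup() {
    name=$1
    shift
    case $name in
        @*) ;;                  # host-group names start with @
        *)  name="@$name" ;;
    esac
    printf 'group_name %s\n' "$name"
    if [ "$#" -eq 0 ]; then
        printf 'hostlist NONE\n'    # empty group
    else
        printf 'hostlist %s\n' "$*" # hosts, space-separated
    fi
}

make_hostgroup simonh_hosts node01 node02 node03
```

The output could be saved to a file and loaded with `qconf -Ahgrp file`, after which @simonh_hosts can be used in a cluster-queue's hostlist.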

2. Example

2.1. What We Want

...for a user to be able to submit a bunch (e.g., a dozen) of single-processor jobs to a "queue" that is a front-end to a number of machines/nodes, each with, say, 1 or 2 CPUs; for each node to run, say, 3 or 4 jobs at once (think hyperthreading, multicore), so that 3 × nodes (or 4 × nodes) jobs run at once; and for the remaining jobs to wait their turn.
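The arithmetic behind this: with N nodes and S slots (concurrent jobs) per node, at most N × S jobs run at once and the rest wait. A small idealised sketch (the function concurrency is hypothetical) that ignores load thresholds, other users' jobs and scheduler limits:

```shell
#!/bin/sh
# Sketch: running/waiting counts for njobs jobs on nodes x per_node slots.
# Idealised: ignores load thresholds, other users and scheduler limits.
# concurrency is a hypothetical helper name.
concurrency() {
    njobs=$1 nodes=$2 per_node=$3
    capacity=$((nodes * per_node))
    if [ "$njobs" -lt "$capacity" ]; then
        running=$njobs
    else
        running=$capacity
    fi
    echo "running=$running waiting=$((njobs - running))"
}

concurrency 12 4 3   # a dozen jobs, 4 nodes, 3 slots each: running=12 waiting=0
concurrency 12 2 4   # a dozen jobs, 2 nodes, 4 slots each: running=8 waiting=4
```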

2.2. Solution — A Cluster Queue, Some Instance Queues and the Scheduler

N.B. Ignore parallel-environments à la SGE 5.3; use a cluster-queue and its instances instead.

2.2.1. Cluster Queue

qconf -sq simonh.q:

  qname                 simonh.q
  hostlist              @allhosts           ## instances of this queue will
                                            ## exist on these hosts
  seq_no                0
  load_thresholds       np_load_avg=1.75
  suspend_thresholds    NONE
  nsuspend              1
  suspend_interval      00:05:00
  priority              0
  min_cpu_interval      00:05:00
  processors            1
  qtype                 BATCH INTERACTIVE   ## accepts both qsub (batch)
                                            ## and qrsh (interactive) jobs
  ckpt_list             NONE
  pe_list               NONE
  rerun                 FALSE
  slots                 1                   ## number of jobs expected to
                                            ## run at once in an instance
                                            ## (i.e., on a compute node);
                                            ## usually the number of CPUs or
                                            ## cores on a node
  tmpdir                /tmp
  shell                 /bin/bash
  .
  .
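For the set-up in 2.1, slots would be 3 or 4 rather than 1. One way to change it is to dump the queue config, edit it and load it back. A minimal sketch of the edit step, assuming one "attribute value" pair per line; the function set_attr is hypothetical, and it does not handle values that qconf may continue over several lines with a trailing backslash.

```shell
#!/bin/sh
# Sketch: set one attribute in a saved `qconf -sq` (or `qconf -ssconf`) dump.
# Assumes one "<attribute> <value>" pair per line; other lines are copied
# unchanged. Fails, leaving the file untouched, if the attribute is absent.
# set_attr is a hypothetical helper name.
set_attr() {
    file=$1 attr=$2 value=$3
    tmp="$file.tmp.$$"
    awk -v a="$attr" -v v="$value" '
        $1 == a { print a, v; found = 1; next }
        { print }
        END { if (!found) exit 1 }
    ' "$file" > "$tmp" && mv "$tmp" "$file" || { rm -f "$tmp"; return 1; }
}
```

Usage, e.g.: `qconf -sq simonh.q > simonh.q.conf`, then `set_attr simonh.q.conf slots 3`, then `qconf -Mq simonh.q.conf`. The same round trip works for the scheduler with `-ssconf` / `-Msconf`; for a single attribute, `qconf -mattr queue slots 3 simonh.q` does it in one step.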

2.2.2. Scheduler

qconf -ssconf:

  algorithm                         default
  schedule_interval                 0:0:5
  maxujobs                          10               ## IMPORTANT! Max number
                                                     ## of jobs one user may have
                                                     ## running at once, across
                                                     ## all queues (0 = no limit)
  queue_sort_method                 load
  job_load_adjustments              np_load_avg=0.50
  load_adjustment_decay_time        0:7:30
  load_formula                      np_load_avg
  schedd_job_info                   true
  flush_submit_sec                  1                ## default is 0
  flush_finish_sec                  1                ## default is 0
  params                            none 
  reprioritize_interval             0:1:0
  halftime                          168
  usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
  .
  .
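Why maxujobs matters for 2.1: it caps how many jobs a single user may have running at the same time, whatever the number of free slots (0 disables the cap). So with 4 nodes × 3 slots = 12 slots, a user submitting a dozen jobs would still get only 10 running. An idealised sketch (hypothetical function user_running) assuming the user has all the slots to themselves:

```shell
#!/bin/sh
# Sketch: how many of one user's njobs can run at once, given total_slots
# free slots and the scheduler's maxujobs (0 = no per-user limit).
# Idealised: assumes no other users' jobs. user_running is hypothetical.
user_running() {
    njobs=$1 total_slots=$2 maxujobs=$3
    n=$njobs
    [ "$total_slots" -lt "$n" ] && n=$total_slots
    [ "$maxujobs" -gt 0 ] && [ "$maxujobs" -lt "$n" ] && n=$maxujobs
    echo "$n"
}

user_running 12 12 10   # 4 nodes x 3 slots, maxujobs 10: prints 10
user_running 12 12 0    # no per-user limit: prints 12
```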

3. Qconf, Qmod and Qmon

3.1. qmon

Configure, modify the state of, and monitor queues using a GUI. The most significant windows:

3.2. qconf

Show or modify configuration, e.g., of a given queue or of the scheduler.

Run it with no arguments for command-line help.

  -mq <queuename>     modify a queue config

  -sq <queuename>     show queue config

  -sql                list all queues

  -ssconf             show scheduler config
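The -sq and -ssconf output is one "attribute value" pair per line, so single values are easy to pull out in scripts. A minimal sketch (hypothetical function get_attr) assuming that layout; it does not handle values that qconf may continue over several lines with a trailing backslash.

```shell
#!/bin/sh
# Sketch: print the value of attribute $1 from `qconf -sq` / `qconf -ssconf`
# style output on stdin. Exit status 1 if the attribute is not found.
# get_attr is a hypothetical helper name.
get_attr() {
    awk -v a="$1" '
        $1 == a { $1 = ""; sub(/^ +/, ""); print; found = 1 }
        END { if (!found) exit 1 }
    '
}
```

Usage, e.g.: `qconf -sq simonh.q | get_attr slots` or `qconf -ssconf | get_attr maxujobs`; combined with -sql: `for q in $(qconf -sql); do echo "$q $(qconf -sq "$q" | get_attr slots)"; done`.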

3.3. qmod

Modify the state of the given queue, e.g., disable it, clear an error state...

Run it with no arguments for command-line help.

  -cq <queuename>    clear error state (E) of given queue (instance)
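To clear every queue instance currently in error state, the instances can be picked out of `qstat -f`. A minimal sketch (hypothetical function error_queues) assuming the SGE 6 `qstat -f` layout, where queue lines start with queue@host and the states column is the 6th whitespace-separated field; check this against your own output first.

```shell
#!/bin/sh
# Sketch: list queue instances whose states column contains E (error),
# reading `qstat -f` output on stdin. Assumes queue lines start with
# <queue>@<host> and that states is the 6th field (an assumption about
# the qstat -f column layout). error_queues is a hypothetical name.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}
```

Usage, e.g.: `for q in $(qstat -f | error_queues); do qmod -cq "$q"; done`.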

4. Scheduler

4.1. Limitations — Other Schedulers

It seems that the scheduler that comes with SGE is quite limited:

However, other schedulers can be used, e.g., Maui, which allows, for example, multi-dimensional fairness policies such as MAXJOB[Class,User], which can limit the number of jobs a given user can run in a particular queue.

4.2. Configuration

Use either qmon -> Scheduler Configuration or

  qconf  | grep -i schedul

    [-k{m|s}]                  shutdown master|scheduling daemon
    [-msconf]                  modify scheduler configuration
    [-Msconf fname]            modify scheduler configuration from file
    [-sss]                     show scheduler state
    [-ssconf]                  show scheduler configuration
    [-tsm]                     trigger scheduler monitoring

5. Cluster Queues and Instance Queues

5.1. Cluster Queues

5.2. Instance Queues

5.3. Configuration

5.3.1. User configuration

5.3.2. General configuration

slots
Queue-instance job slots are usually set equal to the number of CPUs (or cores) available on the system.
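When the nodes in a cluster-queue have different CPU counts (on Linux, `nproc` reports a node's count), the slots attribute can carry a default plus host-specific values, in the form `slots 1,[node01=2],[node02=4]`. A minimal sketch (hypothetical function slots_line, hypothetical host names) that builds such a line from host=cpus pairs:

```shell
#!/bin/sh
# Sketch: build a cluster-queue "slots" line with a default value and
# host-specific overrides: slots <default>,[<host>=<n>],...
# slots_line and the host names below are hypothetical.
slots_line() {
    value=$1
    shift
    for pair in "$@"; do        # each pair is host=cpus
        value="$value,[$pair]"
    done
    printf 'slots %s\n' "$value"
}

slots_line 1 node01=2 node02=4   # prints: slots 1,[node01=2],[node02=4]
```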


processors


type (batch/interactive)
A cluster-queue's qtype can include neither, either, or both of BATCH and INTERACTIVE: for qrsh, INTERACTIVE must be set on the relevant queue (unless -now no is used); for qsub, BATCH must be set.
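The rule above, for plain qsub and qrsh, can be written as a tiny check. A sketch only: qtype_accepts is a hypothetical name, it models the rule as stated here rather than calling SGE, and it does not model qrsh -now no.

```shell
#!/bin/sh
# Sketch: can a queue whose qtype is $1 (e.g. "BATCH INTERACTIVE", "NONE")
# take a job submitted with $2 (qsub or qrsh)? Returns 0 for yes, 1 for no,
# 2 for an unknown command. qrsh -now no is not modelled.
# qtype_accepts is a hypothetical helper name.
qtype_accepts() {
    qtype=$1
    case $2 in
        qsub) need=BATCH ;;
        qrsh) need=INTERACTIVE ;;
        *)    return 2 ;;
    esac
    case " $qtype " in
        *" $need "*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

The qtype of an existing queue is the qtype line of `qconf -sq <queuename>`.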


6. Messages — What went wrong?

Have a look in

    $SGE_ROOT/default/spool/qmaster/messages
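The messages file can be long; a minimal sketch (hypothetical function sge_problems) that keeps only error and warning entries, assuming lines of the form date time|daemon|host|level|message with a one-letter level (E, W, I, ...) in the 4th "|"-separated field. Check that assumption against your own file.

```shell
#!/bin/sh
# Sketch: show only error (E) and warning (W) entries from a qmaster
# messages file (default: the path above). Assumes "|"-separated lines
# with the one-letter level in field 4. sge_problems is hypothetical.
sge_problems() {
    awk -F'|' '$4 == "E" || $4 == "W"' \
        "${1:-$SGE_ROOT/default/spool/qmaster/messages}"
}
```

Usage, e.g.: `sge_problems | tail -n 20` for the most recent problems.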

7. Complexes

8. Calendars

9. Parallel Environments