Page Group

How can a user influence job priority?

 -- deadline jobs
 -- posix priority
 -- resource reservation
 -- advance reservation

Bugs/Features

Troubleshooting

Job Scheduling







Name and Address Resolution and Troubleshooting

1. Messages

Look in:

2. Debug Mode

Run the daemons and utilities in debug mode to get extra messages (cf. ssh -v). Example:

    source $SGE_ROOT/util/dl.sh
    dl <level>
        # ...choose debug level...

    /etc/init.d/sge_qmaster start
        # ...starts, does not background, spits out messages...

where <level> is a number from 1 to 10, as described on DanT's blog. In a second terminal:

    source $SGE_ROOT/util/dl.sh
    dl <level>

    qstat
        # ...spits out messages in addition to the usual info (or not!)...
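
When comparing runs it can help to keep the debug output on disk. The following is only a sketch, assuming a POSIX shell; run_logged is a hypothetical helper, not part of SGE. Since it is not stated here which stream the debug messages are written to, it captures both stdout and stderr.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper (not part of SGE): runs COMMAND, shows its output,
# and appends both stdout and stderr to LOGFILE.
run_logged() {
    log=$1
    shift
    # Record which command produced the output that follows.
    printf '### %s\n' "$*" >> "$log"
    "$@" 2>&1 | tee -a "$log"
}

# Usage, after "dl <level>" in the same shell (log path is an example):
#   run_logged /tmp/qstat-debug.log qstat
```

Note that the exit status seen by the caller is that of tee, not of the command being logged.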

3. FAQ

/etc/init.d/sge_qmaster simply locks up; qstat likewise
/etc/init.d/sge_qmaster locks up and never completes. ps shows that an instance of sge_qmaster is apparently running, but qping hangs. qstat also hangs and returns nothing.

Check that traffic from all of the qmaster host's network interfaces can get through the local loopback interface, lo/127.0.0.1. Bizarrely, SGE requires that packets with each of the host's own source addresses (in this example 10.99.203.190, 10.2.2.250, 10.3.3.250 and 10.2.49.100) can traverse lo. Do you have a pinhole firewall in operation?
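
One quick way to see whether a packet filter might be in the way is to look for a blanket accept rule on the loopback interface. The sketch below assumes an iptables-based firewall and reads rules in the "iptables -S" format from standard input; has_lo_accept is a hypothetical helper name. It recognises only the simplest form of the rule, so a miss does not prove that the traffic is blocked.

```shell
# has_lo_accept: read rules in "iptables -S" format on stdin and succeed
# if there is an INPUT rule accepting everything that arrives on lo.
# Hypothetical helper; recognises only the plain "-i lo -j ACCEPT" form.
has_lo_accept() {
    grep -Eq '^-A INPUT -i lo -j ACCEPT[[:space:]]*$'
}

# Usage (as root):
#   iptables -S INPUT | has_lo_accept || echo "no blanket accept on lo"
```

If the rule is missing, something like "iptables -I INPUT 1 -i lo -j ACCEPT" would add it at the top of the chain. Whether that is appropriate depends on the local firewall policy.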

commlib error (client IP resolved to host ""
 -- ensure /etc/hosts on the qmaster is correct for ALL interfaces on the qmaster host
 -- ensure $SGE_ROOT/default/common/host_aliases has all required entries
 -- and, with the above in place, ensure that
        hostname
        hostname -f
    and, for ALL network interface names (long and short) and IPs, the
    appropriate one of
        $SGE_ROOT/utilbin/<arch>/gethostbyname -aname <qmastername>
        $SGE_ROOT/utilbin/<arch>/gethostbyaddr -aname <qmasterip>
    all return the same name

Example:

In /etc/hosts:

    127.0.0.1       localhost.localdomain localhost
    10.99.203.190   test.manchester.ac.uk  test
    #
    10.2.49.100     login-stg.test.manchester.ac.uk  login-stg
    #
    10.2.2.250      login.test.manchester.ac.uk login
    10.3.3.250      login-3.test.manchester.ac.uk login-3

In host_aliases:

    login.test.manchester.ac.uk login login-3.test.manchester.ac.uk login-3 \
        login-stg.test.manchester.ac.uk login-stg test.manchester.ac.uk test
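
To cross-check the two files, the following sketch lists any name that /etc/hosts gives for the qmaster's addresses but that is absent from host_aliases. missing_aliases is a hypothetical helper, not an SGE tool, and it assumes both files are whitespace-separated as in the example above.

```shell
# missing_aliases HOSTS_FILE ALIASES_FILE IP...
# Hypothetical helper: prints every name that HOSTS_FILE gives for the listed
# IPs but that does not appear as a word in ALIASES_FILE.
missing_aliases() {
    hosts=$1
    aliases=$2
    shift 2
    for ip in "$@"; do
        # Drop comments, then print all names on the line for this IP.
        sed 's/#.*//' "$hosts" |
            awk -v ip="$ip" '$1 == ip { for (i = 2; i <= NF; i++) print $i }'
    done | while read -r name; do
        # Split the aliases file into one word per line and match exactly.
        tr -s ' \t\\' '\n\n\n' < "$aliases" | grep -Fxq -- "$name" ||
            printf '%s\n' "$name"
    done
}

# Usage, with the addresses from the example:
#   missing_aliases /etc/hosts $SGE_ROOT/default/common/host_aliases \
#       10.99.203.190 10.2.49.100 10.2.2.250 10.3.3.250
```

No output means every name was found.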

Some tests:

    hostname
    login.test.manchester.ac.uk

    hostname -f
    login.test.manchester.ac.uk

    ./gethostbyname -aname login-3
    login.test.manchester.ac.uk
    ./gethostbyname -aname login-3.test.manchester.ac.uk
    login.test.manchester.ac.uk
    ./gethostbyname -aname login.test.manchester.ac.uk
    login.test.manchester.ac.uk
    .    .
    .    .

    ./gethostbyaddr -aname 10.99.203.190
    login.test.manchester.ac.uk
    ./gethostbyaddr -aname 10.2.2.250
    login.test.manchester.ac.uk
    .    .
    .    .
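
Checking the results by eye gets tedious with many interfaces. As a sketch, the outputs can be piped through a small filter that succeeds only if every line is identical; all_same is a hypothetical helper, not part of SGE.

```shell
# all_same: succeed if stdin has at least one line and all lines are identical.
# Hypothetical helper, not part of SGE.
all_same() {
    awk 'NR == 1 { first = $0 }
         $0 != first { bad = 1 }
         END { exit (NR == 0 || bad) }'
}

# Usage sketch, run from $SGE_ROOT/utilbin/<arch> on the qmaster host; the
# names and addresses are the ones from the example above.
#   {
#       hostname
#       hostname -f
#       for n in login login-3 login-stg test; do ./gethostbyname -aname "$n"; done
#       for a in 10.99.203.190 10.2.2.250; do ./gethostbyaddr -aname "$a"; done
#   } | all_same && echo consistent || echo MISMATCH
```

A lookup that fails without printing anything on stdout would go unnoticed by this filter, so it is worth running any suspect lookup by hand as well.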