Show the list of execution hosts and the list of queues

qconf -sel
qconf -sql

Show the configuration of a queue

qconf -sq all.q

Show the states of a queue

Suspend and resubmit stalled jobs

# as user:
qstat | grep neteler | tr -s ' ' ' ' | cut -d' ' -f2 > /tmp/to_suspend.sge
cat /tmp/to_suspend.sge

# as root (?):
su -
for i in `cat /tmp/to_suspend.sge` ; do qmod -sj $i ; done
qstat

# remove crashed blade from list of execution hosts:
qconf -de blade14

# remove the host from the host group (opens an editor):
qconf -mhgrp "@allhosts"

# show the updated host group:
qconf -shgrp "@allhosts"

# verify queue stats:
qstat -f

# resubmit jobs to other nodes (as job user!!):
exit
for i in `cat /tmp/to_suspend.sge` ; do qresub $i ; done
qstat
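
The `grep neteler` step above also matches jobs whose name merely contains the user name. A minimal sketch of a stricter filter follows, assuming the default `qstat` column layout (job-ID, prior, name, user, state, ...) with a two-line header; the helper name `job_ids_for_user` is made up for this illustration.

```shell
# Print the IDs of all jobs owned by one user.
# Reads qstat-style text on stdin. Assumption: default column layout,
# i.e. column 1 is the job ID and column 4 is the owner, and the first
# two lines are the header and its separator.
job_ids_for_user() {
    awk -v user="$1" 'NR > 2 && $4 == user { print $1 }'
}

# Possible use on a live cluster (not executed here):
#   qstat | job_ids_for_user neteler > /tmp/to_suspend.sge
```

Matching on the owner column rather than on the whole line means a job called `neteler_sim` that belongs to somebody else is not picked up by accident.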

Reactivate broken blade

todo

[root@head ~]# qconf -mhgrp "@allhosts"
root@head modified "@allhosts" in host group list

[root@head ~]# qconf -shgrp "@allhosts"
group_name @allhosts
hostlist c00 c01 c02

[root@head ~]# ssh c01 "/etc/init.d/sgeexecd stop ; /etc/init.d/sgeexecd start"

Shutting down Grid Engine execution daemon
starting sge_execd

[root@head ~]# qhost
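
To check whether a blade is in (or has been removed from) the host group, the `qconf -shgrp` output can be split into one host per line. This is only a sketch: the helper name `hostgroup_members` is hypothetical, and it assumes the `group_name` / `hostlist` output format shown above, with long host lists possibly continued over several lines using a trailing backslash.

```shell
# Print the members of a host group, one per line.
# Reads `qconf -shgrp` style text on stdin.
hostgroup_members() {
    awk '
        { gsub(/\\/, "") }                          # drop line-continuation backslashes
        $1 == "group_name" { in_list = 0; next }    # header line, not part of the list
        $1 == "hostlist"   { in_list = 1; $1 = "" } # list starts on this line
        in_list { for (i = 1; i <= NF; i++) if ($i != "") print $i }
    '
}

# Possible use on a live cluster (not executed here):
#   qconf -shgrp "@allhosts" | hostgroup_members | grep -x c01
```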

Qstat errors

State Eqw: If your Grid Engine job is hanging with an Eqw state, try running:
qstat -j <job_id>
This prints the job's full details and usually reveals the root cause of the problem, which is most often a path error.
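
When several jobs are stuck, it can help to collect their IDs first. A minimal sketch, again assuming the default `qstat` column layout (state in column 5, two header lines); the helper name `eqw_job_ids` is invented for this example.

```shell
# Print the IDs of all jobs whose state column is exactly "Eqw".
# Reads qstat-style text on stdin.
eqw_job_ids() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}

# Possible use on a live cluster (not executed here):
#   qstat | eqw_job_ids | while read -r id; do qstat -j "$id"; done
```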

Multi-queue management: Suspend and resume queues

Suspend a queue (add -f in case sge_execd is not reachable):

qmod -s q_name1

Suspend two queues (add -f in case sge_execd is not reachable):

qmod -s q_name1,q_name2

Resume (unsuspend) a queue:

qmod -us -f q_name1

Disable/Enable a particular queue for some reason

For example, for maintenance. List the queues, then disable a particular one:
qconf -sql
# add -f in case sge_execd is not reachable
qmod -d q_name

To re-enable the queue:

qmod -e q_name

Wildcards can be used to specify a range of queues (quote the pattern so that the shell does not expand it first):

qmod -e "q_name*"
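
The shell expands an unquoted wildcard itself whenever a file in the current directory matches it, so the command may never see the pattern. The sketch below shows the difference using `echo` in a scratch directory; the file name `q_name_notes.txt` is made up for the demonstration.

```shell
# Demonstrate shell globbing versus a quoted pattern.
# Runs in a subshell so the caller's working directory is untouched.
glob_demo() (
    demo_dir=$(mktemp -d) || exit 1
    cd "$demo_dir" || exit 1
    touch q_name_notes.txt                 # hypothetical file matching the pattern
    echo "unquoted: $(echo q_name*)"       # shell expands it to the file name
    echo "quoted:   $(echo "q_name*")"     # pattern is passed through literally
    cd / && rm -rf "$demo_dir"
)

glob_demo
```

With the pattern quoted, the wildcard is left for the receiving command to interpret.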
