IBM Books

Using and Administering


Chapter 8. Routing Jobs to NQS Machines

Users can submit NQS scripts to LoadLeveler and have them routed to a machine outside of the LoadLeveler cluster that runs NQS. LoadLeveler supports COSMIC NQS version 2.0 and other versions of NQS that support the same commands and options and produce similar output for those commands.

The following diagram illustrates a typical environment that allows users to have their jobs routed to machines outside of LoadLeveler for processing:

Figure 31. Environment illustrating jobs being routed to NQS machines.

As the diagram illustrates, machines A, B, and C are members of the LoadLeveler cluster. Machine A runs the central manager, and machine B runs both LoadLeveler and NQS. Machine C is a third member of the cluster. Machine D is outside the cluster and runs NQS.

When a user submits a job to LoadLeveler, machine A, which runs the central manager, schedules the job to machine B. LoadLeveler running on machine B then routes the job to machine D using NQS. Keep this diagram in mind as you read the rest of this chapter.


Setting Up the NQS Environment

Setting up the NQS environment involves the following:


Designating Machines to Which Jobs Will be Routed

To designate a machine to which your jobs will be routed, follow these steps:

  1. Set up a special class in the LoadL_admin file by adding the following class definitions to the file:

    NQS_class =  true  | false

    When this flag is set to true, any job submitted to this class will be routed to an NQS machine.

    NQS_submit = name

    The name of the NQS pipe queue to which the job will be routed. When LoadLeveler dispatches the job, it invokes the qsub command using the name of this queue.

    NQS_query = queue names

    A blank-delimited list of queue names (including host names if necessary) to be used with the qstat command to monitor the job and with the qdel command to cancel the job.

    You can set up multiple classes to access different machines.

  2. Modify the local configuration file on the machines that you want to accept this class of jobs.

  3. Add the NQS_DIR keyword to the LoadL_config file:

    NQS_DIR = NQS directory

    Defines the directory where the NQS commands qsub, qstat, and qdel reside. The default is /usr/bin.
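
As a minimal sketch of how the NQS_DIR keyword and its default could be applied, the following Python resolves the three NQS command paths from a configuration dictionary. The function name, the dictionary form of the configuration, and the directory /usr/local/nqs/bin are assumptions made for illustration; they are not part of LoadLeveler.

```python
import posixpath

# The three NQS commands named in the NQS_DIR description.
NQS_COMMANDS = ("qsub", "qstat", "qdel")


def nqs_command_paths(config):
    """Return the full path of each NQS command.

    `config` is a hypothetical dict of configuration keywords.
    If NQS_DIR is absent, the documented default /usr/bin is used.
    """
    nqs_dir = config.get("NQS_DIR", "/usr/bin")
    return {name: posixpath.join(nqs_dir, name) for name in NQS_COMMANDS}


# No NQS_DIR keyword: the default directory applies.
print(nqs_command_paths({}))
# NQS_DIR set to a hypothetical installation directory.
print(nqs_command_paths({"NQS_DIR": "/usr/local/nqs/bin"}))
```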

Sample Routing Jobs to NQS Machines Scenario

The following example walks you through the process of setting up your environment to route jobs to machines that run NQS.

Assume Figure 31 depicts your environment. You have three machines in the cluster named A, B, and C. Outside of the cluster, you have machine D running NQS.

Task 1: Modify the Administration File

After setting up your NQS environment, modify the LoadL_admin file by defining a class named NQS with the following stanza:

NQS:
type = class
NQS_class = true
NQS_submit = pipe_a
NQS_query = queue@chevy.kgn.ibm.com
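
To make the role of each keyword in this stanza concrete, here is a small Python sketch that reads the stanza and extracts the routing information. It is an illustration only: the parser handles just the simple "label:" and "keyword = value" lines shown above, and the function names are hypothetical.

```python
def parse_class_stanza(text):
    """Parse one 'label:' stanza of 'keyword = value' lines.

    Returns (label, fields). Handles only the simple layout
    used in the sample stanza.
    """
    label = None
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if label is None and line.endswith(":"):
            label = line[:-1].strip()
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'keyword = value', got: {line!r}")
        fields[key.strip()] = value.strip()
    return label, fields


def nqs_routing(fields):
    """Return (submit_queue, query_queues) for an NQS class, else None."""
    if fields.get("NQS_class", "false").lower() != "true":
        return None
    # NQS_query is a blank-delimited list of queue names.
    return fields.get("NQS_submit"), fields.get("NQS_query", "").split()


stanza = """\
NQS:
type = class
NQS_class = true
NQS_submit = pipe_a
NQS_query = queue@chevy.kgn.ibm.com
"""
label, fields = parse_class_stanza(stanza)
print(label, nqs_routing(fields))
```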

Task 2: Modify the Configuration File

Modify the LoadL_config.local on the machine(s) that you want to accept this class of jobs. In this example, you would modify machine B's LoadL_config.local file. To do this, add a class statement similar to:

CLASS = {"NQS" "a" "b" ....}

where NQS is the name of the class of jobs that will be routed to the machines that run NQS, and a and b are names of additional classes.

Task 3: Submit the Jobs

After you perform the previous tasks, users can route their jobs to machines running NQS using the llsubmit command. The job command file must specify the class keyword. For example:

class = NQS

The job command file must also contain the shell script to be submitted to the NQS node. NQS accepts only shell scripts; binaries are not allowed. LoadLeveler uses all options in the command file that pertain to scheduling to schedule the job. When the job is dispatched to the node running the specified NQS class, the LoadLeveler options pertaining to the runtime environment are converted to NQS options and the job is submitted to the specified NQS queue.
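
The following Python sketch shows what such a job command file might look like by generating one as text. It assumes the usual "# @ keyword = value" directive form of LoadLeveler job command files; the function name, the file names job.out and job.err, and the script body are hypothetical.

```python
def render_job_command_file(script_body, job_class="NQS", **keywords):
    """Build the text of a job command file for an NQS class.

    Directives come first, followed by the shell script that
    NQS will run. Only shell scripts are accepted by NQS.
    """
    lines = ["#!/bin/sh", f"# @ class = {job_class}"]
    for key, value in keywords.items():
        lines.append(f"# @ {key} = {value}")
    # The queue directive places one copy of the job in the queue.
    lines.append("# @ queue")
    lines.append(script_body.rstrip("\n"))
    return "\n".join(lines) + "\n"


job_text = render_job_command_file(
    "echo running on the NQS machine",
    output="job.out",         # hypothetical file name
    error="job.err",          # hypothetical file name
    notification="complete",
)
print(job_text)
```

The resulting text would be saved to a file and passed to llsubmit.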

LoadLeveler command file options are used as follows:

arguments
error message generated and job not submitted

checkpoint
error message generated and job not submitted

class
used only for LoadLeveler scheduling

core_limit
converted to -lc option

cpu_limit
converted to -lt option

data_limit
converted to -ld option

environment
if COPY_ALL is specified, the option is converted to -x, otherwise error message generated and job not submitted

error
converted to -e

executable
error message generated and job not submitted

file_limit
converted to -lf option

hold
used only for LoadLeveler scheduling

image_size
error message generated and job not submitted

initialdir
error message generated and job not submitted

input
error message generated and job not submitted

notification
If the option specified is

always
converted to -mb and -me options

error
converted to -me option

start
converted to -mb option

never
ignored

complete
converted to -me option

notify_user
converted to -mu option

output
converted to -o option

preferences
used only for LoadLeveler scheduling

queue
places one copy of job in the LoadLeveler queue

requirements
used only for LoadLeveler scheduling

restart
If the option specified is

yes
ignored

no
converted to -nr option

rss_limit
converted to -lw option

shell
converted to -s option

stack_limit
converted to -ls option

start_date
used only for LoadLeveler scheduling

user_priority
used only for LoadLeveler scheduling
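
The conversion rules in this list can be summarized in a short Python sketch. It is not LoadLeveler's implementation: the function name is hypothetical, and passing each value as a separate argument after its flag is an assumption about how the qsub command line is built.

```python
# Keywords that affect only LoadLeveler scheduling; no qsub flag results.
SCHEDULING_ONLY = {"class", "hold", "preferences", "queue",
                   "requirements", "start_date", "user_priority"}

# Keywords that produce an error; the job is not submitted.
REJECTED = {"arguments", "checkpoint", "executable",
            "image_size", "initialdir", "input"}

# Keywords converted to a single qsub flag.
VALUE_FLAGS = {
    "core_limit": "-lc", "cpu_limit": "-lt", "data_limit": "-ld",
    "error": "-e", "file_limit": "-lf", "notify_user": "-mu",
    "output": "-o", "rss_limit": "-lw", "shell": "-s",
    "stack_limit": "-ls",
}

NOTIFICATION_FLAGS = {
    "always": ["-mb", "-me"], "error": ["-me"], "start": ["-mb"],
    "never": [], "complete": ["-me"],
}


def to_qsub_flags(options):
    """Translate a dict of job command file keywords to qsub flags."""
    flags = []
    for key, value in options.items():
        if key in REJECTED:
            raise ValueError(f"{key} is not allowed for an NQS class")
        if key in SCHEDULING_ONLY:
            continue
        if key in VALUE_FLAGS:
            # Assumption: the value follows the flag as its own argument.
            flags += [VALUE_FLAGS[key], str(value)]
        elif key == "environment":
            if value != "COPY_ALL":
                raise ValueError("only environment = COPY_ALL is allowed")
            flags.append("-x")
        elif key == "notification":
            flags += NOTIFICATION_FLAGS[value]
        elif key == "restart":
            if value == "no":
                flags.append("-nr")   # restart = yes is ignored
        else:
            raise KeyError(f"keyword not covered by the list: {key}")
    return flags


print(to_qsub_flags({"class": "NQS", "cpu_limit": "600",
                     "notification": "always", "restart": "no"}))
```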

Users can also submit an NQS script. In this case, any NQS options in the script are used to schedule the job, and once LoadLeveler dispatches the job, the file is sent to NQS unmodified.

LoadLeveler schedules these jobs the same way it schedules other jobs. When the job is dispatched, LoadLeveler determines whether or not it is running in an NQS class. If it is, the NQS qsub command is issued.

LoadLeveler monitors the job by periodically invoking a qstat command. A qstat command is first issued for the pipe queue on the local host. If the request id is not found, a qstat is issued for each queue listed in the NQS_query class keyword. If the request id is still not found, starter marks the job as complete.
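
The lookup order described above can be sketched in Python as follows. Because the format of qstat output is not specified here, the sketch takes a caller-supplied function in place of running qstat and parsing its output. That function, the request id "42.chevy", and the queue contents are hypothetical.

```python
def job_still_queued(request_id, pipe_queue, query_queues,
                     request_ids_in_queue):
    """Return True if the request id is found in any monitored queue.

    `request_ids_in_queue(queue)` is a hypothetical stand-in for
    issuing qstat against one queue and collecting request ids.
    """
    # First check the pipe queue on the local host.
    if request_id in request_ids_in_queue(pipe_queue):
        return True
    # Then check each queue listed in the NQS_query keyword.
    for queue in query_queues:
        if request_id in request_ids_in_queue(queue):
            return True
    # Not found anywhere: the job is treated as complete.
    return False


# Hypothetical queue contents standing in for qstat results.
fake_state = {"pipe_a": set(), "queue@chevy.kgn.ibm.com": {"42.chevy"}}


def lookup(queue):
    return fake_state.get(queue, set())


print(job_still_queued("42.chevy", "pipe_a",
                       ["queue@chevy.kgn.ibm.com"], lookup))
```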

When a job is sent to an NQS class, llsubmit saves the following environment variables:

When LoadLeveler dispatches the job, these environment variables are installed so that they are available to qsub. llsubmit also saves the name of the current directory (pwd) and the current value of the user file create mask (umask).

Task 4: Obtain Status of NQS Jobs

Users can obtain the status of NQS jobs in the same way as they obtain the status of LoadLeveler jobs, either by using the llq command or by viewing the Jobs window on the graphical user interface. Users can identify NQS jobs by the class field on the Jobs window.

LoadLeveler monitors the job until qstat shows the job is no longer in any specified queue.

NQS does not provide job accounting. Therefore, the only accounting information LoadLeveler will have is the total time for the job.

LoadLeveler does not send mail when the job completes. Instead, the LoadLeveler notification option is translated to the appropriate NQS flag (-me or -mb) and NQS sends the mail.

Task 5: Cancel NQS Jobs

Users can cancel NQS jobs using the LoadLeveler llcancel command. All they need to know is the LoadLeveler job id for the NQS job. Once they submit their request to cancel the job, LoadLeveler forwards the request to the appropriate node, and a qdel is issued for the job for the queues listed in the NQS_submit and NQS_query keywords.


NQS Scripts

Scripts originally written for NQS that contain NQS options are acceptable to LoadLeveler. The options are mapped as closely as possible to the features provided by LoadLeveler, but the exact function is not always available. NQS options map to LoadLeveler as follows:

a
startdate

e
error

ke
ignored

ko
ignored

lc
core_limit

ld
data_limit

lf
file_limit

lm
rss_limit

lM
ignored

ln
ignored

ls
stack_limit

lt
cpu_limit

lT
ignored

lv
ignored

lw
ignored

mb
notification (always)

me
notification (complete)

mu
notify_user

nr
restart = no

o
output

p
user_priority

q
class

r
ignored

re
ignored

ro
ignored

s
shell

x
environment = copyall

z
suppresses messages but not mail
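
As a rough illustration of this mapping, the Python sketch below converts a list of NQS options into LoadLeveler keywords. The function name and the (option, value) input form are assumptions, keyword values are copied as written in the list above, and later options simply overwrite earlier ones for the same keyword.

```python
# NQS options whose value is carried over to a LoadLeveler keyword.
TAKES_VALUE = {
    "a": "startdate", "e": "error", "lc": "core_limit",
    "ld": "data_limit", "lf": "file_limit", "lm": "rss_limit",
    "ls": "stack_limit", "lt": "cpu_limit", "mu": "notify_user",
    "o": "output", "p": "user_priority", "q": "class", "s": "shell",
}

# NQS options that map to a keyword with a fixed value.
FIXED_VALUE = {
    "mb": ("notification", "always"),
    "me": ("notification", "complete"),
    "nr": ("restart", "no"),
    "x": ("environment", "copyall"),
}

# NQS options that are ignored.
IGNORED = {"ke", "ko", "lM", "ln", "lT", "lv", "lw", "r", "re", "ro"}


def to_loadleveler_keywords(nqs_options):
    """Map (option, value) pairs from an NQS script to keywords."""
    keywords = {}
    for option, value in nqs_options:
        name = option.lstrip("-")
        if name in IGNORED:
            continue
        if name == "z":
            # Affects messages only; no keyword results.
            continue
        if name in FIXED_VALUE:
            key, fixed = FIXED_VALUE[name]
            keywords[key] = fixed
        elif name in TAKES_VALUE:
            keywords[TAKES_VALUE[name]] = value
        else:
            raise KeyError(f"option not covered by the list: {option}")
    return keywords


print(to_loadleveler_keywords([("-q", "NQS"), ("-lt", "600"),
                               ("-lw", "100"), ("-nr", None)]))
```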

