Using and Administering
Purpose
Controls LoadLeveler daemons on all members of the LoadLeveler
cluster.
Syntax
llctl [-?] [-H] [-v] [-q] [-g | -h host] [keyword]
Flags
- -?
- Provides a short usage message.
- -H
- Provides extended help information.
- -v
- Outputs the name of the command, release number, service level, service
level date, and operating system used to build the command.
- -q
- Specifies quiet mode: print no messages other than error
messages.
- -g
- Indicates that the command applies globally to all machines in the
administration file.
- -h host
- Indicates that the command applies to only the host machine in
the LoadLeveler cluster. If neither -h nor -g is
specified, the default is the machine on which the llctl command is
issued.
- keyword
- Must be specified after all flags and can be one of the following:
- purge list_of_machines
- Forces a schedd to delete any queued transactions to the machines in
list_of_machines. If all jobs on the listed machines have
completed, and there are no messages pending to those machines, this option is
not necessary.
This option is intended for recovery and cleanup after a machine has
permanently crashed or was inadvertently removed from the LoadLeveler cluster
before all activity on it was quiesced. Do not use this option unless
the machines in list_of_machines are guaranteed not to return to the
LoadLeveler cluster.
If you need to return the machine to the cluster later, you must clear all
files from the spool and execute directories of the machine that was
removed.
- capture eventname
- Captures accounting data for all jobs running on the designated
machines. eventname is the name you associate with the data,
and must be a character string containing no blanks. For more
information, see Collecting Job Resource Data Based on Events.
- drain [schedd|startd [classlist |allclasses]]
- When you issue drain with no options, the following
happens: (1) no more LoadLeveler jobs can begin running on this machine,
and (2) no more LoadLeveler jobs can be submitted through this machine.
When you issue drain schedd, the following happens: (1) the
schedd machine accepts no more LoadLeveler jobs for submission, (2) jobs in
the Starting or Running state in the schedd queue are allowed to continue
running, and (3) jobs in the Idle state in the schedd queue are drained,
meaning they will not get dispatched. When you issue drain
startd, the following happens: (1) the startd machine accepts no
more LoadLeveler jobs to be run, and (2) jobs already running on the startd
machine are allowed to complete. When you issue drain startd
classlist, the classes you specify which are available on the startd
machine are drained (made unavailable). When you issue drain
startd allclasses, all available classes on the startd machine are
drained.
- flush
- Terminates running jobs on this machine and sends them back, in the Idle
state, to the negotiator to await redispatch (provided restart=yes
in the job command file). No new jobs are sent to this machine until
resume is issued. Forces a checkpoint if jobs are enabled
for checkpointing. However, the checkpoint is canceled if it does
not complete within five minutes.
- purgeschedd
- Requests that all jobs scheduled by the specified host machine
be purged (removed). To use this keyword, you must first specify
schedd_fenced=true in the machine stanza for this
host. The -g option cannot be specified with this
keyword. For more information, see "How Do I Recover Resources
Allocated by a schedd Machine?" in the IBM LoadLeveler for AIX:
Diagnosis and Messages Guide.
- reconfig
- Forces all daemons to reread the configuration files.
- recycle
- Stops all LoadLeveler daemons and restarts them.
- resume [schedd|startd [classlist |allclasses]]
- When you issue resume with no options, job submission and job
execution on this machine are resumed. When you issue resume
schedd, the schedd machine resumes the submission of jobs. When
you issue resume startd, the startd machine resumes the execution
of jobs. When you issue resume startd classlist,
the startd machine resumes the execution of those job classes you specify
which are also configured (defined on the machine). When you issue
resume startd allclasses, the startd machine resumes the execution
of all configured classes.
- start
- Starts the LoadLeveler daemons on the specified machine. You must
have rsh privileges to start LoadLeveler on a remote machine.
- stop
- Stops the LoadLeveler daemons on the specified machine.
- suspend
- Suspends all jobs on this machine. This is not supported for
parallel jobs.
- version
- Displays version and release data on the screen.
Description
This command sends a message to the master daemon on the target machine
requesting that action be taken on the members of the LoadLeveler
cluster. Note the following when using this command:
- After you make changes to the configuration files for a running cluster,
be sure to issue llctl reconfig. This command causes the
LoadLeveler daemons to reread the configuration files, and prevents problems
that can occur when the LoadLeveler commands are using a new configuration
while the daemons are using an old configuration.
- The llctl drain startd classlist command drains
classes on the startd machine, and the startd daemon remains
operational. If you reconfigure the daemon, the draining of classes
remains in effect. However, if the startd goes down and is brought up
again (either by the master daemon or by a LoadLeveler administrator), the
startd daemon is configured according to the global or local configuration
file in effect, and therefore the draining of classes is lost.
Draining all the classes on a startd machine is not equivalent
to draining the startd machine. When you drain all the classes, the
startd enters the Idle state. When you drain the startd, the startd
enters the Drained state. Similarly, resuming all the classes on a
startd machine is not equivalent to resuming the startd
machine.
- If a parallel job is running on a machine that receives the llctl
recycle command, or the llctl stop and llctl start
commands, the running job is terminated. You can restart the job by
resubmitting the job or by specifying the restart=yes option in the
job command file.
If a serial job is running on a machine that receives the llctl
recycle command, or the llctl stop and llctl start
commands, the running job is terminated. You can restart the job by
resubmitting the job or by enabling checkpointing and specifying the
restart=yes option in the job command file.
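As a minimal sketch of the restart=yes setting mentioned above, a job command file might look like the following. The job name and the echo body are placeholders rather than anything taken from this guide. For a serial job, checkpointing would also have to be enabled, which is not shown here.

```shell
#!/bin/sh
# Hypothetical job command file. LoadLeveler reads the "# @" lines as
# keywords, while the shell treats them as ordinary comments.
# @ job_name = restartable_example
# @ restart = yes
# @ queue
# Placeholder job body (assumption, for illustration only).
echo "job step running"
```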
- If you find that the llctl -g command (even if it is specified
with additional options) is taking a long time to complete, you should
consider using the SP dsh command to send llctl commands
(omitting the -g flag) to multiple nodes in a parallel
fashion. For more information on dsh, see IBM RS/6000 Scalable
POWERparallel Systems: Administration Guide (SH26-2486).
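Where dsh is not available, a similar effect can be approximated with a plain shell loop that issues one llctl -h command per host in the background. The sketch below is not dsh and is not taken from this guide: the function name llctl_parallel and the RUN variable are hypothetical, and RUN defaults to echo so that the commands are only printed.

```shell
#!/bin/sh
# Sketch: send one llctl keyword to several hosts concurrently, without -g.
# RUN is a hypothetical dry-run switch: it defaults to "echo" so commands are
# only printed. Set RUN to an empty value to execute llctl for real.
RUN=${RUN-echo}

# Usage: llctl_parallel "host1 host2 ..." keyword [keyword options...]
llctl_parallel() {
    hosts=$1; shift
    for host in $hosts; do
        # One background llctl per host; -h limits the command to that host.
        $RUN llctl -h "$host" "$@" &
    done
    wait   # block until every background command has finished
}

llctl_parallel "iron earth mars" reconfig
```

Because the commands run concurrently, the order in which their messages appear is not fixed.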
- When a node running a schedd daemon fails, resources that have been
allocated to any of the jobs scheduled by that schedd are unavailable until
the schedd is restarted. Administrators can, however, recover these
resources by using the llctl command's purgeschedd keyword to
purge (remove) all of the jobs scheduled by the schedd on the down
node. The purgeschedd keyword can only work in conjunction with the
schedd_fenced keyword, which causes the central manager to ignore
(fence) the target schedd node. You must reconfigure the central
manager so it can recognize this fence. To use the purgeschedd
keyword:
- Recognize that a node running a schedd daemon is down, and that the node
will be down long enough to necessitate that you recover the resources
allocated to jobs scheduled by that schedd.
- Add the statement "schedd_fenced = true" to the failed node's
administration file machine stanza.
- Reconfigure the central manager node, so that the central manager
recognizes the fenced node.
- Invoke "llctl -h host purgeschedd" to purge all of the jobs
scheduled by the schedd on the failed node.
- Remove all of the files in the LoadLeveler spool directory for that
node. Once the failed node is working again, remove the "schedd_fenced
= true" statement from the administration file, then reconfigure the central
manager node.
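The command-line part of this procedure can be sketched as a small dry-run script. The host names earth and cm01, the variables FAILED_HOST, CM_HOST, and RUN, and the function name are all hypothetical. The script assumes that "schedd_fenced = true" has already been added to the machine stanza by hand, and that the central manager can be reconfigured with llctl -h host reconfig. Spool cleanup is left out because the directory location is site-specific.

```shell
#!/bin/sh
# Dry-run sketch of the purgeschedd recovery sequence.
# RUN defaults to "echo", so the commands are printed, not executed.
# Set RUN to an empty value to run them for real.
RUN=${RUN-echo}
FAILED_HOST=${FAILED_HOST:-earth}   # hypothetical name of the failed schedd node
CM_HOST=${CM_HOST:-cm01}            # hypothetical name of the central manager

purge_failed_schedd() {
    # Make the central manager reread its configuration so that it
    # recognizes the fenced node (assumes the stanza was edited already).
    $RUN llctl -h "$CM_HOST" reconfig || return 1
    # Purge every job scheduled by the schedd on the failed node.
    $RUN llctl -h "$FAILED_HOST" purgeschedd || return 1
}

purge_failed_schedd
```

Stopping at the first failing command is a deliberate choice here: purging before the central manager recognizes the fence would not follow the order of the procedure.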
Examples
This example stops LoadLeveler on the machine named iron:
llctl -h iron stop
This example starts the LoadLeveler daemons on all members of the
LoadLeveler cluster, starting with the central manager, as defined in the
machine stanzas of the administration file:
llctl -g start
This example causes the LoadLeveler daemons on machine iron to
reread the configuration files, which may contain new configuration
information for the iron machine:
llctl -h iron reconfig
For the next three examples, suppose the classes small,
medium, and large are available on the machine called
iron.
This example drains the classes medium and large on
the machine named iron.
llctl -h iron drain startd medium large
This example drains the classes medium and large on
all machines.
llctl -g drain startd medium large
This example stops all the jobs on the system, then allows only jobs of a
certain class (medium) to run.
llctl -g drain startd allclasses
llctl -g flush
llctl -g resume
llctl -g resume startd medium
This example resumes the classes medium and large on
the machine named iron.
llctl -h iron resume startd medium large
This example illustrates how to capture accounting information on a work
shift called day on the machine iron:
llctl -h iron capture day
You can capture accounting information on all the machines in the
LoadLeveler cluster by using the -g option, or you can collect
accounting information on the local machine by simply issuing the
following:
llctl capture day
Capturing information on the local machine is the default. For more
information, see Collecting Job Resource Data Based on Events.
Assume the machine earth has crashed while running jobs and
its hard disk needs to be replaced. You try to cancel the jobs that
were running on that machine. The schedd marks the jobs Remove Pending
until it gets confirmation from earth that the jobs were
removed. Since earth will be reinstalled, you need to inform
the schedd that it should not wait for confirmation.
Assume the schedd is named mars, and the running jobs are named
mars.1.0 and mars.1.1. First, tell the
negotiator to remove the jobs:
llcancel mars.1.0
llcancel mars.1.1
Next, tell the schedd not to wait for confirmation from earth
before marking the jobs removed:
llctl -h mars purge earth
Results
The following shows the result of the llctl -h mars purge earth
command:
llctl: Sent purge command to host mars