Using and Administering

llctl - Control LoadLeveler Daemons

Purpose

Controls LoadLeveler daemons on all members of the LoadLeveler cluster.

Syntax

llctl [-?] [-H] [-v] [-q] [-g | -h host] [keyword]

Flags

-?: Provides a short usage message.
-H: Provides extended help information.
-v: Outputs the name of the command, release number, service level, service level date, and operating system used to build the command.
-q: Specifies quiet mode: print no messages other than error messages.
-g: Indicates that the command applies globally to all machines in the administration file.
-h host: Indicates that the command applies to only the host machine in the LoadLeveler cluster. If neither -h nor -g is specified, the default is the machine on which the llctl command is issued.
keyword: Must be specified after all flags and can be the following:

purge list_of_machines

Forces a schedd to delete any queued transactions to the machines in list_of_machines. If all jobs on the listed machines have completed, and there are no messages pending to those machines, this option is not necessary.

This option is intended for recovery and cleanup after a machine has permanently crashed or was inadvertently removed from the LoadLeveler cluster before all activity on it was quiesced. Do not use this option unless the machines in list_of_machines are guaranteed not to return to the LoadLeveler cluster.

If you need to return a purged machine to the cluster later, you must clear all files from the spool and execute directories of that machine.

capture eventname

Captures accounting data for all jobs running on the designated machines. eventname is the name you associate with the data, and must be a character string containing no blanks. For more information, see Collecting Job Resource Data Based on Events.
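Because eventname must contain no blanks, a wrapper script can check the name before passing it to llctl. The following is a minimal sketch, not part of LoadLeveler: the function name capture_event and the LLCTL variable are hypothetical, and treating "blank" as any whitespace character is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: reject event names that are empty or contain
# whitespace before handing them to "llctl capture".
# Set LLCTL="echo llctl" to print the command instead of running it.
LLCTL=${LLCTL:-llctl}

capture_event() {
    eventname=$1
    case $eventname in
        ''|*[[:space:]]*)
            echo "capture_event: event name must be non-empty with no blanks" >&2
            return 1
            ;;
    esac
    # $LLCTL is left unquoted on purpose so "echo llctl" splits into two words.
    $LLCTL capture "$eventname"
}
```

For example, capture_event day passes the check, while capture_event "day shift" is rejected before llctl is ever called.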

drain [schedd|startd [classlist |allclasses]]

When you issue drain with no options, the following happens: (1) no more LoadLeveler jobs can begin running on this machine, and (2) no more LoadLeveler jobs can be submitted through this machine.

When you issue drain schedd, the following happens: (1) the schedd machine accepts no more LoadLeveler jobs for submission, (2) jobs in the Starting or Running state in the schedd queue are allowed to continue running, and (3) jobs in the Idle state in the schedd queue are drained, meaning they will not get dispatched.

When you issue drain startd, the following happens: (1) the startd machine accepts no more LoadLeveler jobs to be run, and (2) jobs already running on the startd machine are allowed to complete.

When you issue drain startd classlist, the classes you specify which are available on the startd machine are drained (made unavailable).

When you issue drain startd allclasses, all available classes on the startd machine are drained.
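The drain forms above can be summarized as a small validity check. The following sketch only prints a command line and never runs llctl; the function name drain_cmd is hypothetical, and the rule that a class list is accepted only after startd is read directly from the syntax line for this keyword.

```shell
#!/bin/sh
# Hypothetical helper: print the llctl command line for a drain request.
# Accepted forms, following "drain [schedd|startd [classlist|allclasses]]":
#   drain_cmd
#   drain_cmd schedd
#   drain_cmd startd
#   drain_cmd startd class1 class2 ...
#   drain_cmd startd allclasses
drain_cmd() {
    if [ $# -eq 0 ]; then
        echo "llctl drain"
        return 0
    fi
    daemon=$1
    shift
    case $daemon in
        schedd)
            # The syntax allows a class list only after startd.
            if [ $# -gt 0 ]; then
                echo "drain_cmd: schedd takes no class list" >&2
                return 1
            fi
            echo "llctl drain schedd"
            ;;
        startd)
            if [ $# -gt 0 ]; then
                # "$*" joins the class names (or allclasses) with spaces.
                echo "llctl drain startd $*"
            else
                echo "llctl drain startd"
            fi
            ;;
        *)
            echo "drain_cmd: expected schedd or startd, got '$daemon'" >&2
            return 1
            ;;
    esac
}
```

The same structure would apply to the resume keyword, which takes the same arguments.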

flush

Terminates running jobs on this machine and sends them back, in the Idle state, to the negotiator to await redispatch (provided restart=yes in the job command file). No new jobs are sent to this machine until resume is issued. Forces a checkpoint if jobs are enabled for checkpointing. However, the checkpoint is cancelled if it does not complete within five minutes.

purgeschedd

Requests that all jobs scheduled by the specified host machine be purged (removed). To use this keyword, you must first specify schedd_fenced=true in the machine stanza for this host. The -g option cannot be specified with this keyword. For more information, see "How Do I Recover Resources Allocated by a schedd Machine?" in the IBM LoadLeveler for AIX: Diagnosis and Messages Guide.

reconfig

Forces all daemons to reread the configuration files.

recycle

Stops all LoadLeveler daemons and restarts them.

resume [schedd|startd [classlist |allclasses]]

When you issue resume with no options, job submission and job execution on this machine are resumed. When you issue resume schedd, the schedd machine resumes the submission of jobs. When you issue resume startd, the startd machine resumes the execution of jobs. When you issue resume startd classlist, the startd machine resumes the execution of those job classes you specify which are also configured (defined on the machine). When you issue resume startd allclasses, the startd machine resumes the execution of all configured classes.

start

Starts the LoadLeveler daemons on the specified machine. You must have rsh privileges to start LoadLeveler on a remote machine.

stop

Stops the LoadLeveler daemons on the specified machine.

suspend

Suspends all jobs on this machine. This is not supported for parallel jobs.

version

Displays version and release data on the screen.
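The targeting rules in this section (-g and -h are alternatives, the keyword follows all flags, and purgeschedd cannot be combined with -g) can be expressed as a short checking function. This is an illustrative sketch only: the function name llctl_line and the tokens @local and @all are hypothetical conventions of the example, and the function prints a command line rather than running llctl.

```shell
#!/bin/sh
# Hypothetical helper: assemble an llctl command line from a target and keyword.
#   target "@local"  -> no -g or -h flag (the issuing machine, the default)
#   target "@all"    -> -g (all machines in the administration file)
#   any other target -> treated as a host name and passed with -h
llctl_line() {
    if [ $# -lt 2 ]; then
        echo "usage: llctl_line @local|@all|host keyword [args...]" >&2
        return 1
    fi
    target=$1
    shift
    case $target in
        @local)
            flags=""
            ;;
        @all)
            # The purgeschedd keyword cannot be specified with -g.
            if [ "$1" = purgeschedd ]; then
                echo "llctl_line: purgeschedd cannot be combined with -g" >&2
                return 1
            fi
            flags="-g "
            ;;
        *)
            flags="-h $target "
            ;;
    esac
    # The keyword and its arguments always come after the flags.
    echo "llctl ${flags}$*"
}
```

For instance, llctl_line iron stop prints llctl -h iron stop, and llctl_line @all purgeschedd is refused.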

Description

This command sends a message to the master daemon on the target machine requesting that action be taken on the members of the LoadLeveler cluster. Note the following when using this command:

After you make changes to the configuration files for a running cluster, be sure to issue llctl reconfig. This command causes the LoadLeveler daemons to reread the configuration files, and prevents problems that can occur when the LoadLeveler commands are using a new configuration while the daemons are using an old configuration.
The llctl drain startd classlist command drains classes on the startd machine, and the startd daemon remains operational. If you reconfigure the daemon, the draining of classes remains in effect. However, if the startd goes down and is brought up again (either by the master daemon or by a LoadLeveler administrator), the startd daemon is configured according to the global or local configuration file in effect, and therefore the draining of classes is lost.
Draining all the classes on a startd machine is not equivalent to draining the startd machine. When you drain all the classes, the startd enters the Idle state. When you drain the startd, the startd enters the Drained state. Similarly, resuming all the classes on a startd machine is not equivalent to resuming the startd machine.
If a parallel job is running on a machine that receives the llctl recycle command, or the llctl stop and llctl start commands, the running job is terminated. You can restart the job by resubmitting the job or by specifying the restart=yes option in the job command file.
If a serial job is running on a machine that receives the llctl recycle command, or the llctl stop and llctl start commands, the running job is terminated. You can restart the job by resubmitting the job or by enabling checkpointing and specifying the restart=yes option in the job command file.
If you find that the llctl -g command (even if it is specified with additional options) is taking a long time to complete, consider using the SP dsh command to send llctl commands (omitting the -g flag) to multiple nodes in parallel. For more information on dsh, see IBM RS/6000 Scalable POWERparallel Systems: Administration Guide (SH26-2486).
When a node running a schedd daemon fails, resources that have been allocated to any of the jobs scheduled by that schedd are unavailable until the schedd is restarted. Administrators can, however, recover these resources by using the llctl command's purgeschedd keyword to purge (remove) all of the jobs scheduled by the schedd on the down node. The purgeschedd keyword can only work in conjunction with the schedd_fenced keyword, which causes the central manager to ignore (fence) the target schedd node. You must reconfigure the central manager so it can recognize this fence. To use the purgeschedd keyword:
1. Recognize that a node running a schedd daemon is down, and that the node will be down long enough to necessitate that you recover the resources allocated to jobs scheduled by that schedd.
2. Add the statement "schedd_fenced = true" to the failed node's administration file machine stanza.
3. Reconfigure the central manager node, so that the central manager recognizes the fenced node.
4. Invoke "llctl -h host purgeschedd" to purge all of the jobs scheduled by the schedd on the failed node.
5. Remove all of the files in the LoadLeveler spool directory for that node. Once the failed node is working again, remove the "schedd_fenced = true" statement from the administration file, then reconfigure the central manager node.
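The procedure above can be turned into a checklist that an administrator reviews before acting. The sketch below executes nothing; it only prints the steps with host names filled in. The function name purgeschedd_plan is hypothetical, and the sketch assumes that "reconfigure the central manager node" means issuing llctl reconfig against the central manager host, which the text does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: print the purgeschedd recovery steps for a failed
# schedd node. Nothing is executed; the output is a checklist to review.
# Assumption: the central manager is reconfigured with
# "llctl -h <central_manager> reconfig".
purgeschedd_plan() {
    failed=$1
    cm=$2
    if [ -z "$failed" ] || [ -z "$cm" ]; then
        echo "usage: purgeschedd_plan failed_host central_manager_host" >&2
        return 1
    fi
    cat <<EOF
1. Confirm that $failed will be down long enough to justify recovering its resources.
2. Add "schedd_fenced = true" to the machine stanza for $failed in the administration file.
3. llctl -h $cm reconfig
4. llctl -h $failed purgeschedd
5. Remove the files in the LoadLeveler spool directory for $failed.
6. When $failed is working again, remove "schedd_fenced = true", then: llctl -h $cm reconfig
EOF
}
```

Steps 2 and 5 are left as text because the location of the administration file and the spool directory depends on the installation.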

Examples

This example stops LoadLeveler on the machine named iron:

llctl -h iron stop

This example starts the LoadLeveler daemons on all members of the LoadLeveler cluster, starting with the central manager, as defined in the machine stanzas of the administration file:

llctl -g start

This example causes the LoadLeveler daemons on machine iron to reread the configuration files, which may contain new configuration information for the iron machine:

llctl -h iron reconfig

For the next three examples, suppose the classes small, medium, and large are available on the machine called iron.

This example drains the classes medium and large on the machine named iron:

llctl -h iron drain startd medium large

This example drains the classes medium and large on all machines:

llctl -g drain startd medium large

This example stops all the jobs on the system, then allows only jobs of a certain class (medium) to run:

llctl -g drain startd allclasses
llctl -g flush
llctl -g resume
llctl -g resume startd medium
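The four-command sequence above can be wrapped so that the class name is a parameter. This is a sketch under stated assumptions: the function name only_class and the LLCTL variable are hypothetical, and chaining with && assumes llctl returns a nonzero exit status on failure, which this section does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: stop all jobs cluster-wide, then allow one class to run.
# Set LLCTL="echo llctl" to print the commands instead of running them.
# Assumption: llctl exits nonzero on failure, so && stops at the first error.
LLCTL=${LLCTL:-llctl}

only_class() {
    class=$1
    if [ -z "$class" ]; then
        echo "usage: only_class class_name" >&2
        return 1
    fi
    # 1. Make every class unavailable on all startd machines.
    $LLCTL -g drain startd allclasses &&
    # 2. Terminate running jobs and return them to the negotiator.
    $LLCTL -g flush &&
    # 3. After a flush, no new jobs are sent until resume is issued.
    $LLCTL -g resume &&
    # 4. Make only the requested class available again.
    $LLCTL -g resume startd "$class"
}
```

Running only_class medium with LLCTL="echo llctl" prints the same four commands shown in the example.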

This example resumes the classes medium and large on the machine named iron.

llctl -h iron resume startd medium large

This example illustrates how to capture accounting information on a work shift called day on the machine iron:

llctl -h iron capture day

You can capture accounting information on all the machines in the LoadLeveler cluster by using the -g option, or you can collect accounting information on the local machine by simply issuing the following:

llctl capture day

Capturing information on the local machine is the default. For more information, see Collecting Job Resource Data Based on Events.

Assume the machine earth has crashed while running jobs. Its hard disk needs to be replaced. You try to cancel the jobs that were running on that machine. The schedd marks the jobs Remove Pending until it gets confirmation from earth that the jobs were removed. Since earth will be reinstalled, you need to inform schedd that it should not wait for confirmation.

Assume the schedd is named mars, and the running jobs are named mars.1.0 and mars.1.1. First you want to tell the negotiator to remove the jobs:

llcancel mars.1.0
llcancel mars.1.1

Next, tell the schedd not to wait for confirmation from earth before marking the jobs removed:

llctl -h mars purge earth
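The cancel-then-purge sequence can be scripted when several jobs are involved. The following is a minimal sketch: the function name cancel_and_purge and the LLCTL and LLCANCEL variables are hypothetical, and stopping at the first failed llcancel assumes that llcancel reports failure through its exit status.

```shell
#!/bin/sh
# Hypothetical helper: cancel the jobs that were running on a dead machine,
# then tell the schedd not to wait for confirmation from that machine.
# Usage: cancel_and_purge schedd_host dead_host job_id [job_id ...]
# Set LLCTL="echo llctl" and LLCANCEL="echo llcancel" for a dry run.
LLCTL=${LLCTL:-llctl}
LLCANCEL=${LLCANCEL:-llcancel}

cancel_and_purge() {
    if [ $# -lt 3 ]; then
        echo "usage: cancel_and_purge schedd_host dead_host job_id..." >&2
        return 1
    fi
    schedd=$1
    dead=$2
    shift 2
    # First ask the negotiator to remove each job.
    for job in "$@"; do
        $LLCANCEL "$job" || return 1
    done
    # Then purge queued transactions from the schedd to the dead machine.
    $LLCTL -h "$schedd" purge "$dead"
}
```

With the names from this example, the call would be cancel_and_purge mars earth mars.1.0 mars.1.1. As noted for the purge keyword, use it only when the purged machine is guaranteed not to return to the cluster in its current state.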

Results

The following shows the result of the llctl -h mars purge earth command:

llctl: Sent purge command to host mars
