The information in both the LoadL_config and the LoadL_config.local files is in the form of a statement. These statements are made up of keywords and values. There are three types of configuration file keywords:
Configuration file statements take one of the following formats:
keyword=value
keyword:value
Statements in the form keyword=value are used primarily to customize an environment. Statements in the form keyword:value are used by LoadLeveler to characterize the machine and are known as part of the machine description. Every machine in LoadLeveler has its own machine description which is read by the central manager when LoadLeveler is started.
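For instance, the two forms might appear together as follows (the keywords shown are taken from later sections of this chapter; note that comments must appear on their own lines):

```
# Customizes the environment
LOADL_ADMIN = bob mary
# Part of the machine description
START: T
```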
To continue a configuration file statement onto the next line, end the line with the backslash character (\).
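For example, a long administrator list could be continued like this (the user names are illustrative):

```
LOADL_ADMIN = bob mary \
              joe
```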
In the configuration file, comments must be on a separate line from keyword statements.
You can use the following types of constants and operators in the configuration file.
Constants may be represented as:
You can use the following C operators. The operators are listed in order of precedence. All of these operators are evaluated from left to right:
This section presents a step-by-step approach to configuring LoadLeveler. You do not have to perform the steps in the order that they appear here. Other keywords which are not specifically mentioned in any of these steps are discussed in Step 17: Specify Additional Configuration File Keywords.
Specify the following keyword:
LoadLeveler administrators on this list also receive mail describing problems that are encountered by the master daemon. When DCE is enabled, the LOADL_ADMIN list is used only as a mailing list. For more information, see Step 16: Configuring LoadLeveler to use DCE Security Services.
An administrator on a machine is granted administrative privileges on that machine only; it does not grant administrative privileges on other machines. To be an administrator on all machines in the LoadLeveler cluster, either specify your user ID in the global configuration file with no entries in the local configuration files, or specify your user ID in every local configuration file in the LoadLeveler cluster.
For example, to grant administrative authority to users bob and mary, enter the following in the configuration file:
LOADL_ADMIN = bob mary
You can use the following keywords to define the characteristics of the LoadLeveler cluster:
When set to true, every communication between LoadLeveler processes will verify that the sending process is running on a machine which is identified via a machine stanza in the administration file. The validation is done by capturing the address of the sending machine when the accept function call is issued to accept a connection. The gethostbyaddr function is called to translate the address to a name, and the name is matched with the list derived from the administration file.
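For example, to enable this validation (assuming the keyword described here is MACHINE_AUTHENTICATE, as in standard LoadLeveler configurations), you would specify:

```
MACHINE_AUTHENTICATE = true
```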
This section discusses the types of schedulers that are available under LoadLeveler, and the keywords you use to define these schedulers.
See Keyword Considerations for Parallel Jobs for information on which keywords associated with parallel jobs are supported by the default scheduler.
For example: on a rack with 10 nodes, 8 of the nodes are being used by Job A. Job B has the highest priority in the queue, and requires 10 nodes. Job C has the next highest priority in the queue, and requires only two nodes. Job B has to wait for Job A to finish so that it can use the freed nodes. Because Job A is only using 8 of the 10 nodes, the Backfill scheduler can schedule Job C (which only needs the two available nodes) to run as long as it finishes before Job A finishes (and Job B starts). To determine whether or not Job C has time to run, the Backfill scheduler uses Job C's wall_clock_limit value to determine whether or not it will finish before Job A ends. If Job C has a wall_clock_limit of unlimited, it may not finish before Job B's start time, and it won't be dispatched.
The Backfill scheduler supports:
The above functions are not supported by the default LoadLeveler scheduler.
Note the following when using the Backfill scheduler:
See Keyword Considerations for Parallel Jobs for information on which keywords associated with parallel jobs are supported by the Backfill scheduler.
Use the following keywords to define your scheduler:
You can use the following keywords to define the characteristics of machines in the LoadLeveler cluster:
For example, to define a machine as a RISC System/6000, the keyword would look like:
ARCH = R6000
You can specify a default_class in the default user stanza of the administration file to set a default class. If you don't, jobs will be assigned the class called No_Class.
In order for a LoadLeveler job to run on a machine, the machine must have a vacancy for the class of that job. If the machine is configured for only one No_Class job and a LoadLeveler job is already running there, then no further LoadLeveler jobs are started on that machine until the current job completes.
You can have a maximum of 1024 characters in the class statement. You cannot use allclasses as a class name, since this is a reserved LoadLeveler keyword.
You can assign multiple classes to the same machine by specifying the classes in the LoadLeveler configuration file (called LoadL_config) or in the local configuration file (called LoadL_config.local). The classes, themselves, should be defined in the administration file. See Setting Up a Single Machine To Have Multiple Job Classes and Step 3: Specify Class Stanzas for more information on classes.
This example defines the default class:
Class = { "No_Class" }
This is the default. The machine will run only one LoadLeveler job at a time, and that job must have either defaulted to, or explicitly requested, class No_Class. A LoadLeveler job with class CPU_bound, for example, would not be eligible to run here.
This example specifies multiple classes:
Class = { "No_Class" "No_Class" }
The machine will only run jobs that have either defaulted to or explicitly requested class No_Class. A maximum of two LoadLeveler jobs are permitted to run simultaneously on the machine if the MAX_STARTERS keyword is not specified. See Step 5: Specify How Many Jobs a Machine Can Run for more information on MAX_STARTERS.
This example specifies multiple classes:
Class = { "No_Class" "Small" "Medium" "Large" }
The machine will only run a maximum of four LoadLeveler jobs that have either defaulted to, or explicitly requested No_Class, Small, Medium, or Large class. A LoadLeveler job with class IO_bound, for example, would not be eligible to run here.
This example specifies multiple classes:
Class = { "B" "B" "D" }
The machine will run only LoadLeveler jobs that have explicitly requested class B or D. Up to three LoadLeveler jobs may run simultaneously: two of class B and one of class D. A LoadLeveler job with class No_Class, for example, would not be eligible to run here.
You can specify unique characteristics for any machine using this keyword. When evaluating job submissions, LoadLeveler compares any required features specified in the job command file to those specified using this keyword. You can have a maximum of 1024 characters in the feature statement.
For example, if a machine has licenses for installed products ABC and XYZ, in the local configuration file you can enter the following:
Feature = {"abc" "xyz"}
When submitting a job that requires both of these products, you should enter the following in your job command file:
requirements = (Feature == "abc") && (Feature == "xyz")
In most cases, you will probably want to set this keyword to true. You might set it to false if, for example, you want to run the daemons on most of the machines in the cluster, but some individual users with their own local configuration files do not want their machines to run the daemons. Those users would set this keyword to false in their local configuration files. Because the global configuration file has the keyword set to true, their individual machines would still be able to participate in the LoadLeveler cluster.
Also, to define the machine as strictly a submit-only machine, set this keyword to false. For more information, see the submit-only keyword.
To define the machine as an executing machine only, set this keyword to false. For more information, see the submit-only keyword.
The LoadLeveler scheduler can schedule jobs based on the availability of consumable resources. You can use the following keywords to use Consumable Resources:
To specify how many jobs a machine can run, you need to take into consideration both the MAX_STARTERS keyword, which is described in this section, and the Class statement, which is mentioned here and described in more detail in Step 3: Define LoadLeveler Machine Characteristics.
The syntax for MAX_STARTERS is:
For example, if the configuration file contains these statements:
Class = { "A" "B" "B" "C" }
MAX_STARTERS = 2
the machine can run a maximum of two LoadLeveler jobs simultaneously. The possible combinations of LoadLeveler jobs are: one job of class A and one of class B, one job of class A and one of class C, one job of class B and one of class C, or two jobs of class B.
If this keyword is specified in conjunction with a Class statement, the maximum number of jobs that can be run is equal to the lower of the two numbers. For example, if:
MAX_STARTERS = 2
Class = { "class_a" }
then the maximum number of job steps that can be run is one (the Class statement above defines one class).
If you specify MAX_STARTERS keyword without specifying a Class statement, by default one class still exists (called No_Class). Therefore, the maximum number of jobs that can be run when you do not specify a Class statement is one.
If this keyword is not defined in either the global configuration file or the local configuration file, the maximum number of jobs that the machine can run is equal to the number of classes in the Class statement.
Each job submitted to LoadLeveler is assigned a system priority number, based on the evaluation of the SYSPRIO keyword expression in the configuration file of the central manager. The LoadLeveler system priority number is assigned when the central manager adds the new job to the queue of jobs eligible for dispatch. Once assigned, the system priority number for a job does not change, unless a user swaps the SYSPRIO values of two of their jobs or NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL is set to a nonzero value. Jobs assigned higher SYSPRIO numbers are considered for dispatch before jobs with lower numbers. See How Does a Job's Priority Affect Dispatching Order? for more information on job priorities.
You can use the following LoadLeveler variables to define the SYSPRIO expression:
This example creates a FIFO job queue based on submission time:
SYSPRIO : 0 - (QDate)
This example accounts for Class, User, and Group system priorities:
SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)
This example orders the queue based on the number of jobs a user is currently running. The user who has the fewest jobs running is first in the queue. You should set NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL in conjunction with this SYSPRIO expression.
SYSPRIO : 0 - UserRunningJobs
Each executing machine is assigned a machine priority number, based on the evaluation of the MACHPRIO keyword expression in the configuration file of the central manager. The LoadLeveler machine priority number is updated every time the central manager updates its machine data. Machines assigned higher MACHPRIO numbers are considered to run jobs before machines with lower numbers. For example, a machine with a MACHPRIO of 10 is considered to run a job before a machine with a MACHPRIO of 5. Similarly, a machine with a MACHPRIO of -2 would be considered to run a job before a machine with a MACHPRIO of -3.
Note that the MACHPRIO keyword is valid only on the machine where the central manager is running. Using this keyword in a local configuration file has no effect.
When you use a MACHPRIO expression that is based on load average, the machine may be temporarily ordered later in the list immediately after a job is scheduled to that machine. This is because the negotiator adds a compensating factor to the startd machine's load average every time the negotiator assigns a job. For more information, see the NEGOTIATOR_INTERVAL keyword.
You can use the following LoadLeveler variables in the MACHPRIO expression:
This example orders machines by the Berkeley one-minute load average.
MACHPRIO : 0 - (LoadAvg)
Therefore, if LoadAvg equals .7, this example would read:
MACHPRIO : 0 - (.7)
The MACHPRIO would evaluate to -.7.
This example orders machines by the Berkeley one-minute load average normalized for machine speed:
MACHPRIO : 0 - (1000 * (LoadAvg / (Cpus * Speed)))
Therefore, if LoadAvg equals .7, Cpus equals 1, and Speed equals 2, this example would read:
MACHPRIO : 0 - (1000 * (.7 / (1 * 2)))
This example further evaluates to:
MACHPRIO : 0 - (350)
The MACHPRIO would evaluate to -350.
Notice that if the speed of the machine were increased to 3, the equation would read:
MACHPRIO : 0 - (1000 * (.7 / (1 * 3)))
The MACHPRIO would evaluate to approximately -233. Therefore, as the speed of the machine increases, the MACHPRIO also increases.
This example orders machines accounting for real memory and available swap space (remembering that Memory is in Mbytes and VirtualMemory is in Kbytes):
MACHPRIO : 0 - (10000 * (LoadAvg / (Cpus * Speed))) + (10 * Memory) + (VirtualMemory / 1000)
This example sets a relative machine priority based on the value of the CUSTOM_METRIC keyword.
MACHPRIO : CustomMetric
To do this, you must specify a value for the CUSTOM_METRIC keyword or the CUSTOM_METRIC_COMMAND keyword in either the LoadL_config.local file of a machine or in the global LoadL_config file. To assign the same relative priority to all machines, specify the CUSTOM_METRIC keyword in the global configuration file. For example:
CUSTOM_METRIC = 5
You can override this value for an individual machine by specifying a different value in that machine's LoadL_config.local file.
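Alternatively, the CUSTOM_METRIC_COMMAND keyword names an executable whose exit code supplies the machine's CustomMetric value; the path below is hypothetical:

```
CUSTOM_METRIC_COMMAND = /usr/local/bin/compute_metric
```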
This example gives master nodes the highest priority:
MACHPRIO : (MasterMachPriority * 10000)
You can control running jobs by using five control functions as Boolean expressions in the configuration file. These functions are useful primarily for serial jobs. You define the expressions, using normal C conventions, with the following functions:
The expressions are evaluated for each job running on a machine using both the job and machine attributes. Some jobs running on a machine may be suspended while others are allowed to continue.
The START expression is evaluated twice: once to see whether the machine can accept jobs at all, and a second time to see whether the specific job can be run on the machine. The other expressions are evaluated after the jobs have been dispatched and, in some cases, are already running.
When evaluating the START expression to determine if the machine can accept jobs, Class != { "Z" } evaluates to true only if Z is not in the class definition. This means that if two different classes are defined on a machine, Class != { "Z" } (where Z is one of the defined classes) always evaluates to false when specified in the START expression and, therefore, the machine will not be considered to start jobs.
When you use a START expression that is based on the CPU load average, the negotiator may evaluate the expression as false even though the load average indicates that the machine is Idle. This is because the negotiator adds a compensating factor to the startd machine's load average every time it assigns a job. For more information, see the NEGOTIATOR_INTERVAL keyword.
Typically, machine load average, keyboard activity, time intervals, and job class are used within these various expressions to dynamically control job execution.
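For example, a configuration intended to keep batch work off a busy interactive workstation might use expressions like the following (the threshold values are illustrative, not recommendations):

```
START: (LoadAvg <= 0.3) && (KeyboardIdle > 120)
SUSPEND: (LoadAvg > 1.0) || (KeyboardIdle < 60)
CONTINUE: (LoadAvg <= 0.5) && (KeyboardIdle > 120)
```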
After LoadLeveler selects a job for execution, the job can be in any of several states. Figure 30 shows how the control expressions can affect the state a job is in. The rectangles represent job or daemon states, and the diamonds represent the control expressions.
Figure 30. How Control Expressions Affect Jobs
View figure.
Criteria used to determine when a LoadLeveler job will enter Start, Suspend, Continue, Vacate, and Kill states are defined in the LoadLeveler configuration files and may be different for each machine in the cluster. They may be modified to meet local requirements.
LoadLeveler provides accounting information on completed LoadLeveler jobs. For detailed information on this function, refer to Chapter 7, Gathering Job Accounting Data.
The following keywords allow you to control accounting functions:
For example:
ACCT = A_ON A_DETAIL
This example specifies that accounting should be turned on, that extended accounting data should be collected, and that the -x flag of the llq command be enabled.
For example, the following section of the configuration file specifies that the accounting function is turned on. It also identifies the module used to perform account validation and the directory containing the global history files:
ACCT = A_ON A_VALIDATE
ACCT_VALIDATION = $(BIN)/llacctval
GLOBAL_HISTORY = $(SPOOL)
In one of the machine stanzas in the administration file, you specified the machine that serves as the central manager. A problem such as a network communication, software, or hardware failure can make this central manager unusable; in such cases, the other machines in the LoadLeveler cluster believe that the central manager machine is no longer operating. To remedy this situation, you can assign one or more alternate central managers in the machine stanza to take control.
The following machine stanza example defines the machine deep_blue as an alternate central manager:
deep_blue: type = machine
central_manager = alt
If the primary central manager fails, the alternate central manager then becomes the central manager. The alternate central manager is chosen based upon the order in which its respective machine stanza appears in the administration file.
When an alternate becomes the central manager, jobs will not be lost, but it may take a few minutes for all of the machines in the cluster to check in with the new central manager. As a result, job status queries may be incorrect for a short time.
When you define alternate central managers, you should set the following keywords in the configuration file:
In the following example, the alternate central manager will wait for 30 intervals, where each interval is 45 seconds:
# Set a 45 second interval
CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 45
# Set the number of intervals to wait
CENTRAL_MANAGER_TIMEOUT = 30
For more information on central manager backup, refer to What Happens if the Central Manager Isn't Operating?.
The configuration file provided with LoadLeveler specifies default locations for all of the files and directories. You can modify their locations using the following keywords. Keep in mind that the LoadLeveler installation process installs files in these directories and these files may be periodically cleaned up. Therefore, you should not keep any files that do not belong to LoadLeveler in these directories.
| To specify the location of the: | Specify these keywords: |
|---|---|
| Administration File | |
| Local Configuration File | |
| Local Directory | The following subdirectories reside in the local directory. It is possible that the local directory and LoadLeveler's home directory are the same. |
| Release Directory | |
The LoadLeveler daemons and processes keep log files according to the specifications in the configuration file. A number of keywords are used to describe where LoadLeveler maintains the logs and how much information is recorded in each log. These keywords, shown in Table 13, are repeated in similar form to specify the pathname of the log file, its maximum length, and the debug flags to be used.
Controlling Debugging Output describes the events that can be reported through logging controls.
Table 13. Log Control Statements
| Daemon/Process | Log File (required) (See note (PAT)) | Max Length (required) (See note (MXL)) | Debug Control (required) (See note (FLA)) |
|---|---|---|---|
| Master | MASTER_LOG = path | MAX_MASTER_LOG = bytes | MASTER_DEBUG = flags |
| Schedd | SCHEDD_LOG = path | MAX_SCHEDD_LOG = bytes | SCHEDD_DEBUG = flags |
| Startd | STARTD_LOG = path | MAX_STARTD_LOG = bytes | STARTD_DEBUG = flags |
| Starter | STARTER_LOG = path | MAX_STARTER_LOG = bytes | STARTER_DEBUG = flags |
| Negotiator | NEGOTIATOR_LOG = path | MAX_NEGOTIATOR_LOG = bytes | NEGOTIATOR_DEBUG = flags |
| Kbdd | KBDD_LOG = path | MAX_KBDD_LOG = bytes | KBDD_DEBUG = flags |
| GSmonitor | GSMONITOR_LOG = path | MAX_GSMONITOR_LOG = bytes | GSMONITOR_DEBUG = flags |
Notes:
You can also specify that the log file be started anew with every invocation of the daemon by setting the TRUNC statement to true as follows:
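For example, assuming the truncation statements follow the same naming pattern as the other per-daemon log keywords (TRUNC_<daemon>_LOG_ON_OPEN), truncating the Schedd log at each daemon start would look like:

```
TRUNC_SCHEDD_LOG_ON_OPEN = true
```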
You can control the level of debugging output logged by LoadLeveler programs. The following flags are presented here for your information, though they are used primarily by IBM personnel for debugging purposes:
For example,
SCHEDD_DEBUG = D_CKPT D_XDR
causes the scheduler to log information about checkpointing user jobs (D_CKPT) and about XDR message exchanges with other LoadLeveler daemons (D_XDR). These flags are primarily of interest to LoadLeveler implementers and debuggers.
By default, LoadLeveler stores only the two most recent iterations of a daemon's log file (<daemon name>_Log and <daemon name>_Log.old). Occasionally, to diagnose a problem, users will need to capture LoadLeveler logs over an extended period. Users can specify that all log files be saved to a particular directory by using the SAVELOGS keyword in a local or global configuration file. Be aware that LoadLeveler does not provide any way to manage and clean out all of those log files, so users must be sure to specify a directory in a file system with enough space to accommodate them. This file system should be separate from the one used for the LoadLeveler log, spool, and execute directories. The syntax is:
SAVELOGS = <directory>
where <directory> is the directory in which log files will be archived.
Each log file is represented by the name of the daemon that generated it, the exact time the file was generated, and the name of the machine on which the daemon is running. When you list the contents of the SAVELOGS directory, the list of log file names looks like this:
NegotiatorLogNov02.16:10:39c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:42c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:46c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:48c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:51c163n10.ppd.pok.ibm.com
NegotiatorLogNov02.16:10:53c163n10.ppd.pok.ibm.com
StarterLogNov02.16:09:19c163n10.ppd.pok.ibm.com
StarterLogNov02.16:09:51c163n10.ppd.pok.ibm.com
StarterLogNov02.16:10:30c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:05c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:26c163n10.ppd.pok.ibm.com
SchedLogNov02.16:09:47c163n10.ppd.pok.ibm.com
SchedLogNov02.16:10:12c163n10.ppd.pok.ibm.com
SchedLogNov02.16:10:37c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:05c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:26c163n10.ppd.pok.ibm.com
StartLogNov02.16:09:47c163n10.ppd.pok.ibm.com
StartLogNov02.16:10:12c163n10.ppd.pok.ibm.com
StartLogNov02.16:10:37c163n10.ppd.pok.ibm.com
A port number is an integer that identifies the port used to connect to the specified daemon. You can define these port numbers in the configuration file or in the /etc/services file, or you can accept the defaults. LoadLeveler first looks in the configuration file for these port numbers. If the port number is in the configuration file and is valid, that value is used; if the value is invalid, the default value is used.
If LoadLeveler does not find the value in the configuration file, it looks in the /etc/services file. If the value is not found in this file, the default is used.
The configuration file keywords associated with port numbers are the following:
As stated earlier, if LoadLeveler does not find the value in the configuration file, it looks in the /etc/services file; if the value is not found there, the default is used. The first field on each line in the example that follows is the name of a "service". In most cases, these services are also the names of daemons, because few daemons need more than one UDP and one TCP connection. There are two exceptions: LoadL_negotiator_collector is the service name for a second stream port used by the LoadL_negotiator daemon, and LoadL_schedd_status is the service name for a second stream port used by the LoadL_schedd daemon.
LoadL_master               9616/tcp   # Master port number for stream port
LoadL_negotiator           9614/tcp   # Negotiator port number
LoadL_negotiator_collector 9612/tcp   # Second negotiator stream port
LoadL_schedd               9605/tcp   # Schedd port number for stream port
LoadL_schedd_status        9606/tcp   # Schedd stream port for job status data
LoadL_startd               9611/tcp   # Startd port number for stream port
LoadL_master               9617/udp   # Master port number for dgram port
LoadL_startd               9615/udp   # Startd port number for dgram port
This section tells you how to set up checkpointing for jobs. For more information on the job command file keywords mentioned here, see Job Command File Keywords. To enable checkpointing for parallel jobs, you must use the APIs provided with the Parallel Environment (PE) program. For information on parallel checkpointing, see IBM Parallel Environment for AIX: Operation and Use, Volume 1.
Checkpointing is a method of periodically saving the state of a job so that if the job does not complete it can be restarted from the saved state. You can checkpoint both serial and parallel jobs.
You can specify the following types of checkpointing:
At checkpoint time, a checkpoint file is created, by default, on the executing machine and stored on the scheduling machine. You can control where the file is created and stored by using the CHKPT_FILE and CHKPT_DIR environment variables, which are described in Set the Appropriate Environment Variables. The checkpoint file contains the program's data segment, stack, heap, register contents, signal state and the states of the open files at the time of the checkpoint. The checkpoint file is often much larger in size than the executable.
When a job is vacated, the most recent checkpoint file taken before the job was vacated is used to restart the job when it is scheduled to run on a new machine. Note that a vacating job may be killed by LoadLeveler if the job takes too long to write its checkpoint file. This occurs only when a job is vacated by the executing machine after the job's VACATE expression evaluates to TRUE. See Step 8: Manage a Job's Status Using Control Expressions for more information on the VACATE and KILL expressions.
If the executing machine fails, then when the machine restarts LoadLeveler reschedules the job, which restores its state from the most recent checkpoint file. LoadLeveler waits for the original executing machine to restart before scheduling the job to run on another machine in order to ensure that only one copy of the job will run.
Review the following guidelines before you submit a checkpointing job:
This section discusses the CHKPT_STATE, CHKPT_FILE, and CHKPT_DIR environment variables.
The CHKPT_STATE environment variable allows you to enable and disable checkpointing. CHKPT_STATE can be set to the following:
If you set checkpoint=no in your job command file, no checkpoints are taken, regardless of the value of the CHKPT_STATE environment variable. See checkpoint for more information.
The CHKPT_FILE and CHKPT_DIR environment variables help you manage your checkpoint files. For parallel jobs, you must specify at least one of these variables in order to designate the location of the checkpoint file. For serial jobs, if you do not specify either of these variables, LoadLeveler manages your checkpoint files. LoadLeveler stores the checkpoint file in its working directories and deletes the file as soon as the job terminates (that is, when the job exits the LoadLeveler system.) If your job terminates abnormally, there is no checkpoint file from which LoadLeveler can restart the job. When you resubmit the job, it will start running from the beginning.
To avoid this problem, use CHKPT_FILE and CHKPT_DIR to control where your checkpoint file is stored. CHKPT_DIR specifies the directory where it is stored, and CHKPT_FILE specifies the checkpoint file name. (You can use just CHKPT_FILE provided you specify a full path name. You can also use just CHKPT_DIR; in this case the checkpoint file is copied to the directory you specify with a file name of executable.chkpt.) You can use these variables to have your checkpoint file written to the file system of your choice. This allows you to resubmit your job and have it restart from the last checkpoint file, since the file will not be erased if your job is terminated. If your job completes normally, the checkpoint library deletes all checkpoint files associated with the job.
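For example, a job command file could set both variables through its environment keyword as follows (the directory and file name are hypothetical):

```
# @ environment = CHKPT_DIR=/ckptfs/myuser; CHKPT_FILE=myprog.chkpt
```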
Note that two or more job steps running at the same time cannot both write to the same checkpoint file, since the file will be corrupted.
See How to Checkpoint a Job for more information.
If you plan to migrate jobs (restart jobs on a different node or set of nodes), you should understand the difference between writing checkpoint files to a local file system (such as JFS) versus a global file system (such as AFS or GPFS). The CHKPT_DIR and CHKPT_FILE environment variables allow you to write to either type of file system. If you are using a local file system, you must first move the checkpoint file(s) to the target node(s) before resubmitting the job. Then you must ensure that the job runs on those specific nodes. If you are using a global file system, the checkpointing may take longer, but there is no additional work required to migrate the job.
A checkpoint file requires a significant amount of disk space. Your job may fail if the directory where the checkpoint file is written does not have adequate space. For serial jobs, the directory must be able to contain two checkpoint files. For parallel jobs, the directory must be able to contain 2*n checkpoint files, where n is the number of tasks. You can make an accurate size estimate only after you've run your job and noticed the size of the checkpoint file that is created. LoadLeveler attempts to reserve enough disk space for the checkpoint file when the job is started. However, only you can ensure that enough space is available.
To make sure that your job is not prevented from writing a checkpoint file due to system limits, assign your job to a job class that has its file creation limit set to the maximum (unlimited). In the administration file, set up a class stanza for checkpointing jobs with the following entry:
file_limit = unlimited,unlimited
This statement specifies that there is no limit on the maximum size of a file that your program can create.
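A class stanza carrying this entry might look like the following sketch in the administration file (the class name ckpt is illustrative):

```
ckpt: type = class
      file_limit = unlimited,unlimited
```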
For some processes, it is impossible to obtain or recreate the state of the process. For this reason, you should only checkpoint programs whose states are simple to checkpoint and recreate. A program that is long-running, computation-intensive, and does not fork any processes is an example of a job well suited for checkpointing.
In order to prevent unpredictable results from occurring, checkpointing jobs should not use the following system services:
Another limitation of checkpointing jobs is file I/O. Since individual write calls are not traced, the file recovery scheme requires that all I/O operations, when repeated, must yield the same result. A job that opens all files as read only can be checkpointed. A job that writes to a file and then reads the data back may also be checkpointed. An example of I/O that could cause unpredictable results is reading, writing, and then reading again the same area of a file.
A checkpointed serial job must be restarted on a machine with the same processor and the same operating system level, including service fixes, as the machine on which the checkpoint was taken.
A checkpointed parallel job must be restarted on a machine with the same processor, the same operating system level, including service fixes, and the same SP switch adapter(s) as the machine on which the checkpoint was taken.
Compile your program with one of the following supported compilers:
All serial checkpointing programs must be linked with the LoadLeveler libraries libchkrst.a and chkrst_wrap.o. To ensure your checkpointing jobs are linked correctly, compile your programs using the compile scripts found in the bin subdirectory of the LoadLeveler release directory. These compile scripts are as follows:
In all these scripts, be sure to substitute all occurrences of "RELEASEDIR" with the location of the LoadLeveler release directory.
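The substitution can be done with a one-line sed command. This is a sketch only: the release-directory path and file names below are assumptions, not values from the manual, and the sample input stands in for one line of a shipped compile script.

```shell
# Assumed location of the LoadLeveler release directory:
RELEASEDIR=/usr/lpp/LoadL/full

# Stand-in for one line of a shipped compile script:
printf 'CC=RELEASEDIR/bin/xlc\n' > /tmp/crxlc.orig

# Replace every occurrence of RELEASEDIR with the real path:
sed "s|RELEASEDIR|$RELEASEDIR|g" /tmp/crxlc.orig > /tmp/crxlc.local
cat /tmp/crxlc.local
```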
C Syntax
crxlc executable [args] source_file
Where:
Some examples are:
crxlc myprog myprog.c
crxlc myprog -qlanglvl=extended myprog.c
C++ Syntax
crxlC executable [args] source_file
Where:
Some examples are:
crxlC myprog myprog.C
crxlC myprog -qlanglvl=extended myprog.C
FORTRAN Syntax
crxlf executable [args] source_file
Where:
Some examples are:
crxlf myprog myprog.f
crxlf myprog -qintlog -qfullpath myprog.f
There are several ways to checkpoint a job. To determine which type of checkpointing is appropriate for your situation, refer to the following table:
To specify that: | Do this:
---|---
Your serial job determines when the checkpoint occurs | Add checkpoint = user_initiated to your job command file. You can also select this option on the Build a Job window of the GUI. User-initiated checkpointing is available to FORTRAN, C, and C++ programs that call the ckpt serial checkpointing API. See Serial Checkpointing API for more information.
LoadLeveler automatically checkpoints your serial job | Add checkpoint = system_initiated to your job command file. You can also select this option on the Build a Job window of the GUI. For this type of checkpointing to work, system administrators must set two configuration file keywords that specify how often LoadLeveler takes a checkpoint of the job. You can set these keyword values globally in the global configuration file so that all machines in the cluster use the same value, or specify a different value for each machine in the local configuration files. To enable both user-initiated and system-initiated checkpointing for a job, specify checkpoint = system_initiated in your job command file and code the ckpt API call in your program. System-initiated checkpointing is not available to parallel jobs.
LoadLeveler restarts your executable from an existing checkpoint file when you submit the job | Pass the CHKPT_STATE environment variable using the LoadLeveler environment keyword in your job command file. For more information, see environment. You must also set the CHKPT_DIR and/or CHKPT_FILE environment variables.
Your job is not checkpointed | Add checkpoint = no to your job command file. You can also select this option on the Build a Job window of the GUI. This is the default.
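For example, a serial job command file that requests user-initiated checkpointing might look like the following sketch; the executable path and class name here are assumptions for illustration, not values from the manual:

```
# @ job_type   = serial
# @ class      = ckpt
# @ executable = /u/loadl/myprog
# @ checkpoint = user_initiated
# @ queue
```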
When a job terminates, its orphaned processes may continue to consume or hold resources, degrading system performance or causing jobs to hang or fail. Process tracking allows LoadLeveler to cancel any processes (throughout the entire cluster) left behind when a job terminates. Using process tracking is optional. Two keywords are used to specify process tracking:
When LoadLeveler is configured to exploit DCE security, it uses PSSP and DCE security services to:
You can skip this section if you do not plan to use these security features or if you plan to continue to use only the limited support for DCE available in LoadLeveler 2.1. Please consult Usage Notes for additional information.
When LoadLeveler is configured to exploit DCE security, most of its interactions with DCE are through the PSSP security services API. For this reason, it is important that you configure PSSP security services before you configure LoadLeveler for DCE. For more information on PSSP security services, please refer to: RS/6000 SP Planning Volume 2, Control Workstation and Software Environment (GA22-7281-05), Parallel System Support Programs for AIX Installation and Migration Guide Version 3 Release 2 (GA22-7347-02), and Parallel System Support Programs for AIX Administration Guide Version 3 Release 2 (SA22-7348-02).
DCE maintains a registry of all DCE principals which have been authorized to login to the DCE cell. In order for LoadLeveler daemons to login to DCE, DCE accounts must be set up, and DCE key files must be created for these daemons. In LoadLeveler 2.2 each LoadLeveler daemon on each node is associated with a different DCE principal. The DCE principal of the Schedd daemon running on node A is distinct from the DCE principal of the Schedd daemon running on node B. Since it is possible for up to seven LoadLeveler daemons to run on any particular node (Master, Negotiator, Schedd, Startd, Kbdd, Starter, and GSmonitor), the number of DCE principal accounts and key files that must be created could reach as high as 7x(number of nodes). Since it is not always possible to know in advance on which node a particular daemon will run, a conservative approach would be to create accounts and key files for all seven daemons on all nodes in a given LoadLeveler cluster. However, it is only necessary to create accounts and keyfiles for DCE principals which will actually be instantiated and run in the cluster.
These are the steps used for configuring LoadLeveler for DCE. We recommend that you use SMIT and the lldcegrpmaint command to perform this task. The manual steps are also described in Manual Configuration, and may be useful should you need to create a highly customized LoadLeveler environment. Some of the names used in this section are the default names as defined in the file /usr/lpp/ssp/config/spsec_defaults and can be overridden with appropriate specifications in the file /spdata/sys1/spsec/spsec_overrides. Also, the term "LoadLeveler node" is used to refer to a node on an SP system that will be part of a LoadLeveler cluster.
DCE_ENABLEMENT = TRUE
DCE_ADMIN_GROUP = LoadL-admin
DCE_SERVICES_GROUP = LoadL-services

DCE_ENABLEMENT must be set to TRUE to activate the DCE security features of LoadLeveler Version 2.2. The LoadL-admin group should be populated with the DCE principals of users who are to be given LoadLeveler administrative privileges. For more information on populating the LoadL-admin group, see step 9. The LoadL-services group should be populated with the DCE principals of all the LoadLeveler daemons that will be running in the current cluster. You can use the lldcegrpmaint command to automate this process. For more information on populating the LoadL-services group, see step 8. Note that these daemons are already members of the spsec-services group. If there is more than one DCE-enabled LoadLeveler cluster within the same DCE cell, the name assigned to DCE_SERVICES_GROUP for each cluster must be distinct; this avoids any potential operational conflict.
dce_host_name = DCE hostname

Execute either "SDRGetObjects Node dcehostname" or "llextSDR" to obtain a listing of the DCE hostnames of nodes on an SP system.
lldcegrpmaint config_pathname admin_pathname

where config_pathname is the pathname of the LoadLeveler global configuration file and admin_pathname is the pathname of the LoadLeveler administration file. The lldcegrpmaint command will:
For more information about the lldcegrpmaint command, see lldcegrpmaint - LoadLeveler DCE group Maintenance Utility.
dcecp -c group add LoadL-admin -member loadl
Here is an example of the steps you must take to configure LoadLeveler for DCE.
In this example, the LoadLeveler cluster consists of 3 nodes of an SP system which belong to the same DCE cell. Their hostnames and DCE hostnames are the same: c163n01.pok.ibm.com, c163n02.pok.ibm.com, and c163n03.pok.ibm.com. Assume that the basic PSSP security setup steps have been performed, and that the DCE group spsec-services and the DCE organization spsec-services have been created.
dcecp -c cdsli /.:/subsys
This command lists the contents of the /.:/subsys directory in DCE. LoadLeveler's product name within DCE is LoadL, so its product directory is /.:/subsys/LoadL. If this directory already exists, continue to the next step. If it does not exist, issue the following command to create it:
dcecp -c directory create /.:/subsys/LoadL
product_name/dce_host_name/dce_daemon_name
where:
SERVICE:LoadL/Master:kw:root:system
The relevant portion of this record is Master; this is the DCE daemon name of LoadL_master. The DCE daemon names of other daemons can be identified in a similar manner.
For the c163n01.pok.ibm.com node, the following commands will create the desired principal names:
dcecp -c principal create LoadL/c163n01.pok.ibm.com/Master
dcecp -c principal create LoadL/c163n01.pok.ibm.com/Negotiator
dcecp -c principal create LoadL/c163n01.pok.ibm.com/Schedd
dcecp -c principal create LoadL/c163n01.pok.ibm.com/Kbdd
dcecp -c principal create LoadL/c163n01.pok.ibm.com/Startd
dcecp -c principal create LoadL/c163n01.pok.ibm.com/Starter
dcecp -c principal create LoadL/c163n01.pok.ibm.com/GSmonitor
These commands must then be repeated for each node in the LoadLeveler cluster, replacing the dce_host_name with the DCE hostname of each respective node.
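This repetition can be scripted. The following sketch only prints the dcecp commands for every node/daemon pair (node names are taken from this example; daemon names from the list above) so they can be reviewed and then run, or piped to a shell, with cell_admin credentials:

```shell
# Emit one "dcecp -c principal create" command per daemon per node.
DAEMONS="Master Negotiator Schedd Kbdd Startd Starter GSmonitor"
NODES="c163n01.pok.ibm.com c163n02.pok.ibm.com c163n03.pok.ibm.com"
for host in $NODES; do
  for daemon in $DAEMONS; do
    echo "dcecp -c principal create LoadL/$host/$daemon"
  done
done
```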
dcecp -c group add spsec-services -member LoadL/c163n01.pok.ibm.com/Master
This operation must be repeated for all of the other LoadLeveler daemons on c163n01, and the complete set of operations must be repeated for all of the nodes in the LoadLeveler cluster.
dcecp -c organization add spsec-services -member LoadL/c163n01.pok.ibm.com/Master
This operation must be repeated for all of the other LoadLeveler daemons on c163n01, and the complete set of operations must be repeated for all of the nodes in the LoadLeveler cluster.
dcecp <Enter>
dcecp> account create LoadL/c163n01.pok.ibm.com/Master \
  -group spsec-services -organization spsec-services \
  -password service-password -mypwd cell_admin's-password
dcecp> quit
The service-password passed to DCE in this command can be any valid DCE password. Please take note of it since you will need it when you create the key file for this daemon in step 8. The continuation character "\" is not supported by dcecp, but appears in the example merely for clarity. This operation must be repeated for the other LoadLeveler daemons on c163n01, and the complete set of operations must be repeated for all of the nodes in the LoadLeveler cluster.
mkdir -p /spdata/sys1/keyfiles/LoadL/dce_host_name
You must login to the appropriate node to perform this operation. This operation must be repeated for every node in the LoadLeveler cluster.
NOTE: The directory /spdata/sys1/keyfiles should already exist on each node in the cluster which has been installed with a level of PSSP software that supports DCE Security exploitation. If this directory does not exist, then the node cannot support DCE Security and LoadLeveler 2.2 in DCE mode will not run on it. If this configuration seems to be in error, contact your system administrator to determine which nodes in the cluster should support DCE Security.
dcecp <Enter>
dcecp> keytab create LoadL/c163n01.pok.ibm.com/Master \
  -storage /spdata/sys1/keyfiles/LoadL/c163n01.pok.ibm.com/Master \
  -data { LoadL/c163n01.pok.ibm.com/Master plain 1 service-password }
dcecp> quit
You must login to node c163n01 to perform this operation. DCE must be able to locate the key file locally; otherwise, the daemon's login to DCE on startup will fail. The principal name passed to DCE in the preceding example is the same principal name defined in step 3. The AIX path passed with the "-storage" flag should point to the same directory created in step 7. The principal name passed with the "-data" flag should match the principal name used at the beginning of the command. The password used in the service-password field must be the same as the service password defined when this principal's account was created in step 6.
This operation must be repeated for all of the other LoadLeveler daemons on node c163n01, and the complete set of operations must be repeated for all of the nodes in the LoadLeveler cluster.
dcecp -c group create LoadL-admin
dcecp -c group add LoadL-admin -member loadl
dcecp -c group add LoadL-services -member LoadL/c163n01.pok.ibm.com/Master
This operation must be repeated for all of the other LoadLeveler daemons on node c163n01, and the complete set of operations must be repeated for all of the nodes in the LoadLeveler cluster.
In LoadLeveler 2.2, this limited form of DCE support is still available. If the DCE_ENABLEMENT keyword is not defined, the DCE_AUTHENTICATION_PAIR keyword can still be used to activate this legacy feature. If this level of DCE support meets your requirements, you can ignore the setup steps in this section. However, setting the DCE_ENABLEMENT configuration keyword to TRUE activates a more comprehensive level of DCE support. In this case, LoadLeveler uses the PSSP security services API to perform mutual authentication of all appropriate transactions, in addition to using llgetdce and llsetdce (or the pair of programs specified by DCE_AUTHENTICATION_PAIR) to obtain the opaque credentials object and to authenticate to DCE before starting the job. Unless you want to specify a pair of programs other than the default llgetdce and llsetdce binaries, the DCE_AUTHENTICATION_PAIR keyword is optional in the configuration file when DCE_ENABLEMENT = TRUE.
This section describes keywords that were not mentioned in the previous configuration steps. Unless your installation has special requirements for any of these keywords, you can use them with their default settings.
Note: For the keywords listed below that take a number as the value on the right side of the equal sign, that number must be a numerical value and cannot be an arithmetic expression.
For more information, see Handling an AFS Token.
You must specify this keyword in order to enable DCE authentication. To use LoadLeveler's default DCE authentication method, specify:
DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce
To use your own DCE authentication method, substitute your own programs into the keyword definition. For more information on DCE security credentials, see Handling DCE Security Credentials.
You can use this keyword to base the order in which jobs are run on the current number of running, queued, or total jobs for a user or a group. For more information, see Step 6: Prioritize the Queue Maintained by the Negotiator.
This type of variable, which is generally created and defined by the user, can be named using any combination of letters and numbers. A user-defined variable is set equal to a value that defines a condition, names a file, or sets a numeric value. For example, you can create a variable named MY_MACHINE and set it equal to the name of your machine named iron as follows:
MY_MACHINE = iron.ore.met.com
You can then reference the variable using a dollar sign ($) and parentheses. For example, the literal $(MY_MACHINE) following the definition in the previous example results in the automatic substitution of iron.ore.met.com in place of $(MY_MACHINE).
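As an illustration, a user-defined variable can be referenced wherever a literal value would appear; assigning it to the CENTRAL_MANAGER keyword here is an assumed usage, not an example from the manual:

```
MY_MACHINE      = iron.ore.met.com
CENTRAL_MANAGER = $(MY_MACHINE)
```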
User-defined definitions may contain references, enclosed in parentheses, to previously defined keywords. Therefore:
A = xxx
C = $(A)
is a valid expression and the resulting value of C is xxx. Note that C is actually bound to A, not to its value, so that
A = xxx
C = $(A)
A = yyy
is also legal and the resulting value of C is yyy.
The sample configuration file shipped with the product defines and uses some "user-defined" variables.
The LoadLeveler product includes variables that you can use in the configuration file. LoadLeveler variables are evaluated by the LoadLeveler daemons at various stages. They do not require you to use any special characters (such as a parenthesis or a dollar sign) to identify them.
LoadLeveler provides the following variables that you can use in your configuration file statements.
variable : $(value)
You can use the following time variables in the START, SUSPEND, CONTINUE, VACATE, and KILL expressions. If you use these variables in the START expression and you are operating across multiple time zones, unexpected results may occur, because the negotiator daemon evaluates the START expressions in the time zone in which the negotiator resides. Your executing machine also evaluates the START expression, and if it is in a different time zone, the results may be inconsistent. To prevent this inconsistency, ensure that both your negotiator daemon and your executing machine are in the same time zone.
START: (tm_mon == 9) && ((tm_hour < 8) || (tm_hour > 17)) && (tm_isdst == 1)
tm_year == 100
denotes the year 2000.
tm4_year == 2010
denotes the year 2010.
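As an illustration (not an example from the manual), tm4_year can be combined with other time variables in a START expression; this sketch assumes tm_wday runs 0 through 6 with 0 denoting Sunday, and would allow a machine to start jobs only on weekends during 2010:

```
START : ((tm_wday == 0) || (tm_wday == 6)) && (tm4_year == 2010)
```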