IBM Books

Using and Administering


Chapter 5. Administering and Configuring LoadLeveler

This chapter tells you how to administer and configure LoadLeveler. In general, the information in this chapter applies to both serial and parallel jobs. For more specific information on parallel jobs, see Chapter 6. "Administration Tasks for Parallel Jobs".


Overview

After installing LoadLeveler, you need to customize it by modifying both the administration file and the configuration file. The administration file optionally lists and defines the machines in the LoadLeveler cluster and the characteristics of classes, users, and groups. The configuration file contains many parameters that you can set or modify that will control how LoadLeveler operates.

In order to easily manage LoadLeveler, you should have only one administration file and one global configuration file, centrally located on a machine in the LoadLeveler cluster. Every other machine in the cluster must be able to read the administration and configuration files that are located on the central machine. LoadLeveler does not prevent you from having multiple copies of the administration file, but you must update all the copies whenever you change one of them. Having only one administration file prevents any confusion.

You can, however, have multiple local configuration files that specify information specific to individual machines. For more information on the global and local configuration files, refer to "Configuring LoadLeveler".

Before working with these two files, you should read the following planning considerations to help you decide how to modify the files.


Planning Considerations

Node availability
Some workstation owners might agree to accept LoadLeveler jobs only when they are not using the workstation themselves. Using LoadLeveler keywords, these workstations can be configured to be available at designated times only.

Common name space
To run jobs on any machine in the LoadLeveler cluster, a user needs the same uid (the system ID number for a user) and gid (the system ID number for a group) on every machine in the cluster. The term cluster refers to all machines mentioned in the configuration file.

For example, if there are two machines in your LoadLeveler cluster, machine_1 and machine_2, user john must have the same user ID and login group ID in the /etc/passwd file on both machines. If user john has user ID 1234 and login group ID 100 on machine_1, then user john must have the same user ID and login group ID in /etc/passwd on machine_2. This ensures that the getuid system call returns the same user ID on both systems. (This allows a job to run with the same group ID and user ID of the person who submitted the job.)

If you do not have a user ID on one machine, your jobs will not run on that machine. Also, many commands, such as llq, will not work correctly if a user does not have a user ID on the central manager machine.

However, there are cases where you may choose to not give a user a login ID on a particular machine. For example, a user does not need an ID on every submit-only machine; the user only needs to be able to submit jobs from at least one such machine. Also, you may choose to restrict a user's access to a schedd machine that is not a public scheduler; again, the user only needs access to at least one schedd machine.
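The uid and gid requirement described above can be checked before users start submitting jobs. The following Python sketch is an illustration only and is not part of LoadLeveler. It compares /etc/passwd-style text from two machines and reports users whose IDs differ. The function names and the sample entries are hypothetical. A user who is missing from one machine is reported with None, which may be intentional in the cases described above.

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from /etc/passwd-style text."""
    ids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        ids[fields[0]] = (int(fields[2]), int(fields[3]))
    return ids


def find_id_mismatches(passwd_a, passwd_b):
    """Return users whose (uid, gid) differ or who exist on only one machine."""
    a, b = parse_passwd(passwd_a), parse_passwd(passwd_b)
    problems = {}
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            problems[name] = (a.get(name), b.get(name))
    return problems


# Hypothetical sample data standing in for /etc/passwd on each machine.
machine_1 = ("john:x:1234:100:John:/home/john:/bin/ksh\n"
             "mary:x:1300:100::/home/mary:/bin/ksh")
machine_2 = ("john:x:1234:100:John:/home/john:/bin/ksh\n"
             "mary:x:1301:100::/home/mary:/bin/ksh")

print(find_id_mismatches(machine_1, machine_2))
```

In this sample, john matches on both machines and mary is reported because her user ID differs.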

Performance
You should keep the log, spool, and execute directories in a local file system in order to maximize performance. Also, to measure the performance of your network, consider using one of the available products, such as Toolbox/6000.

Management
Managing distributed software systems is a primary concern for all system administrators. Allowing users to share file systems to obtain a single, network-wide image is one way to make managing LoadLeveler easier.

Resource Handling
Some nodes in the LoadLeveler cluster might have special software installed that users might need to run their jobs successfully. You should configure LoadLeveler to distinguish those nodes from other nodes using, for example, machine features.

Where to Begin?

Setting up LoadLeveler involves defining machines, users, and how they interact, in such a way that LoadLeveler is able to run jobs quickly and efficiently. If you have a good deal of experience in system administration and job scheduling, you should begin by reading "Expert". If you are relatively new to job scheduling tasks, begin by reading "Intermediate or Beginner".

No matter what your level of experience, it will prove worthwhile to read all the information in this chapter at some point to help you optimize LoadLeveler's performance.

Intermediate or Beginner

If you are experienced in UNIX system administration but are unfamiliar with job scheduling systems or your experience is limited, you may want to start with the section "Administration File Structure and Syntax" and read to the end of this chapter. This section provides a relatively slow, step-by-step approach to administering LoadLeveler. If you would rather start up LoadLeveler quickly using mostly default characteristics, follow the procedures in "Quick Set Up".

Expert

If you are very familiar with UNIX system administration and job scheduling, and have some idea how you want to distribute your workload, go to "Quick Set Up". Each step in this short procedure refers you to a detailed discussion of the task at hand. The sample configuration and administration files included in the samples subdirectory (and shown in Appendix C. "Sample Files") also provide assistance.

If you plan to run interactive jobs using the Parallel Operating Environment (POE) running under LoadLeveler, see "Setting Up to Allow Users to Submit Interactive POE Jobs".


Quick Set Up

If you are very familiar with UNIX system administration and job scheduling, follow the steps listed in this section to get LoadLeveler up and running on your network quickly in a default configuration. For this setup, it is recommended that you use loadl as the LoadLeveler user ID. Afterward, you can fine-tune your configuration for greater efficiency as you become more familiar with the details of LoadLeveler.

  1. Ensure that the installation procedure has completed successfully and that the configuration file, LoadL_config, exists in LoadLeveler's home directory or in the directory specified in /etc/LoadL.cfg (if this file exists). See "Configuring LoadLeveler" for more information.

  2. Identify yourself as the LoadLeveler administrator in the LoadL_config file using the LOADL_ADMIN keyword. The syntax of this keyword follows:

    LOADL_ADMIN = list of user names (required)

    where list of user names is a blank-delimited list of those individuals who will have administrative authority.

    Refer to "Step 1: Define LoadLeveler Administrators" for more information.

  3. Define a machine to act as the LoadLeveler central manager by coding one machine stanza as follows in the administration file, which is called LoadL_admin. (Replace machinename with the actual name of the machine.)
    machinename: type = machine
    central_manager = true
    

    Do not specify more than one machine as the central manager. Also, if you ran llinit with the -cm flag during installation, the central manager is already defined in the LoadL_admin file, because the llinit command takes the parameters you entered and updates the administration and configuration files. See "Step 1: Specify Machine Stanzas" for more information.

  4. Issue the following command for each machine to be included in the LoadLeveler cluster. (Replace hostname with the actual name of the machine.)
    llctl -h hostname start
    

    Issue this command for the central manager machine first. See "llctl - Control LoadLeveler Daemons" for more information.

    You can also issue the following command to start LoadLeveler on all machines beginning with the central manager. Before you issue this command, make sure all the machines are listed in the administration file. This command only affects machines that are defined in the administration file.

    llctl -g start
    

llctl uses rsh or remsh to start LoadLeveler on the target machine. Therefore, the administrator using llctl must have rsh authority on the target machine.


Administering LoadLeveler

This section explains how to perform administration tasks, and includes a step-by-step approach to administering LoadLeveler in "Customizing the Administration File".

Administration File Structure and Syntax

The administration file is called LoadL_admin and it lists and defines the machine, user, class, group, and adapter stanzas.

Machine stanza
Defines the roles that the machines in the LoadLeveler cluster play. See "Step 1: Specify Machine Stanzas" for more information.

User stanza
Defines LoadLeveler users and their characteristics. See "Step 2: Specify User Stanzas" for more information.

Class stanza
Defines the characteristics of the job classes. See "Step 3: Specify Class Stanzas" for more information.

Group stanza
Defines the characteristics of a collection of users that form a LoadLeveler group. See "Step 4: Specify Group Stanzas" for more information.

Adapter stanza
Defines the network adapters available on the machines in the LoadLeveler cluster. See "Step 5: Specify Adapter Stanzas" for more information.

Stanzas have the following general format:

Figure 23. Format of Administration File Stanzas

label: type = type_of_stanza
keyword1 = value1
keyword2 = value2
  ...

The following is a simple example of an administration file illustrating several stanzas:

Figure 24. Sample Administration File Stanzas

machine_a: type = machine
      central_manager = true    # defines this machine as the central manager
      adapter_stanzas = adapter_a  # identifies an adapter stanza
 
class_a: type = class
      priority = 50    # priority of this class
 
user_a: type  = user
      priority  = 50   # priority of this user
 
group_a: type = group
      priority  = 50   # priority of this group
 
adapter_a: type = adapter
      adapter_name = en0  #defines an adapter
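
The stanza layout shown in Figures 23 and 24 is regular enough to read with a short script, for example to review a file before distributing it. The following Python sketch is an illustration only, not LoadLeveler's own parser. The function name parse_stanzas is hypothetical, and the sketch assumes that # starts a comment and that a line of the form label: type = kind opens a new stanza.

```python
def parse_stanzas(text):
    """Parse 'label: type = kind' stanzas followed by 'keyword = value' lines."""
    stanzas = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        head, sep, rest = line.partition(":")
        if sep and "=" not in head:
            # A label before the first ':' opens a new stanza.
            current = {}
            stanzas[head.strip()] = current
            line = rest.strip()
            if not line:
                continue
        if current is None:
            raise ValueError("keyword outside a stanza: " + raw)
        key, eq, value = line.partition("=")
        if not eq:
            raise ValueError("expected 'keyword = value': " + raw)
        current[key.strip()] = value.strip()
    return stanzas


sample = """
machine_a: type = machine
      central_manager = true    # defines this machine as the central manager
      adapter_stanzas = adapter_a

adapter_a: type = adapter
      adapter_name = en0
"""
stanzas = parse_stanzas(sample)
print(stanzas["machine_a"])
```

Values that contain a colon, such as a wall clock limit of 08:00:00, are kept intact because the text before the colon contains an equal sign and so is not treated as a label.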

The characteristics of a stanza are:

Customizing the Administration File

You can add as many stanzas as you would like to the administration file. This section tells you how to modify this file in a step-by-step manner. You do not have to perform the steps in the order that they appear here.

Step 1: Specify Machine Stanzas

The information in a machine stanza defines the characteristics of that machine. You do not have to specify a machine stanza for every machine in the LoadLeveler cluster but you must have one machine stanza for the machine that will serve as the central manager.

If you do not specify a machine stanza for a machine in the cluster, the machine and the central manager still communicate and jobs are scheduled on the machine but the machine is assigned the default values specified in the default machine stanza. If there is no default stanza, the machine is assigned default values set by LoadLeveler.

Any machine name used in the stanza must be a name which can be resolved to an IP address. This name is referred to as an interface name because the name can be used for a program to interface with the machine. Generally, interface names match the machine name, but they do not have to.

By default, LoadLeveler will append the DNS domain name to the end of any machine name without a domain name appended before resolving its address. If you specify a machine name without a domain name appended to it and you do not want LoadLeveler to append the DNS domain name to it, specify the name using a trailing period. You may need to specify machine names in this way if you are running a cluster with more than one nameserving technique. For example, if you are using a DNS nameserver and running NIS, you may have some machine names that are resolved by NIS and to which you do not want LoadLeveler to append DNS names. In situations such as this, you also want to specify the name_server keyword in your machine stanzas.

Under the following conditions, you must have a machine stanza for the machine in question:

Machine stanzas take the following format. Default values for keywords appear in bold:

Figure 25. Format of a Machine Stanza

label: type = machine
adapter_stanzas = stanza_list
alias = machine_name
central_manager = true | false | alt
cpu_speed_scale = true | false
machine_mode = batch | interactive | general
master_node_exclusive = true | false
max_jobs_scheduled = number
name_server = list
pvm_root = pathname
pool_list = pool_numbers
schedd_host = true | false
spacct_excluse_enable = true | false
speed = number
submit_only = true | false

You can specify the following keywords in a machine stanza:

adapter_stanzas = stanza_list

where stanza_list is a blank-delimited list of one or more adapter stanza names which specify adapters available on this machine. All adapter stanzas you define must be specified on this keyword.

alias = machine_name

where machine_name is a blank-delimited list of one or more machine names. Depending upon your network configurations, you may need to add alias keywords for machines that have multiple interfaces.

Note: In general, if your cluster is configured with machine hostnames which match the hostnames corresponding to the IP address configured for the LAN adapters which LoadLeveler is expected to use, you will not have to specify the alias keyword. For example, if all of the machines in your cluster are configured like this sample machine, you should not have to specify the alias keyword.

Machine porsche.kgn.ibm.com

However, if any machine in your cluster is configured like either of the following two sample machines, then you will have to specify the alias keyword for those machines:

  1. Machine yugo.kgn.ibm.com

    • The hostname command returns yugo.kgn.ibm.com.

    • The Ethernet adapter address 129.40.8.21 resolves to hostname chevy.kgn.ibm.com.

    • No adapter address resolves to yugo.

    You need to code the machine stanza as:

    chevy: type = machine
    alias = yugo
    

  2. Machine rover.kgn.ibm.com

    • The hostname command returns rover.kgn.ibm.com.

    • The FDDI adapter address 129.40.9.22 resolves to hostname rover.kgn.ibm.com.

    • The Ethernet adapter address 129.40.8.22 resolves to hostname bmw.kgn.ibm.com.

    • No route exists via the FDDI adapter to the cluster's central manager machine.

    • A route exists from this machine to the central manager via the Ethernet adapter.

    You need to code the machine stanza as:

    bmw:   type = machine
    alias = rover
    

central_manager = true | false | alt

where true designates this machine as the LoadLeveler central manager host, where the negotiator daemon runs. You must specify one and only one machine stanza identifying the central manager. For example:
machine_a: type = machine
central_manager = true

false specifies that this machine is not the central manager.

alt specifies that this machine can serve as an alternate central manager in the event that the primary central manager is not functioning. For more information on recovering if the primary central manager is not operating, refer to "What Happens if the Central Manager Isn't Operating?". Submit-only machines cannot have their machine stanzas set to this value.

If you are going to select machines to serve as alternate central managers, you should look at the following keywords in the configuration file:

For information on setting these keywords, see "Step 9: Specify Alternate Central Managers".

cpu_speed_scale = true | false

where true specifies that CPU time (which is used, for example, in setting limits and in accounting information, and is reported by the llq -x command) is in normalized units for each machine. false specifies that CPU time is in native units for each machine. For an example of using this keyword to normalize accounting information, see "Task 5: Specifying Machines and Their Weights".

machine_mode = batch | interactive | general

Specifies the type of job this machine can run. Where:

batch
Specifies this machine can run only batch jobs.

interactive
Specifies this machine can run only interactive jobs. Only POE is currently enabled to run interactively.

general
Specifies this machine can run both batch jobs and interactive jobs.

master_node_exclusive = true | false

where true specifies that this machine is used only as a master node for parallel jobs.

max_jobs_scheduled = number

where number is the maximum number of jobs submitted from this scheduling (schedd) machine that can run (or start running) in the LoadLeveler cluster at one time. If number of jobs are already running, no other jobs submitted from this machine will run, even if resources are available in the LoadLeveler cluster. When one of the running jobs completes, any waiting jobs then become eligible to be run. The default is -1, which means there is no maximum.

name_server = list

where list is a blank-delimited list of character strings that is used to specify which nameserver(s) are used for the machine. Valid strings are DNS, NIS, and LOCAL. LoadLeveler uses the list to determine when to append a DNS domain name for machine names specified in LoadLeveler commands issued from the machine described in this stanza.

If DNS is specified alone, LoadLeveler will always append the DNS domain name to machine names specified in LoadLeveler commands. If NIS or LOCAL is specified, LoadLeveler will never append a DNS domain name to machine names specified in LoadLeveler commands. If DNS is specified with either NIS or LOCAL, LoadLeveler will always look up the name in the administration file to determine whether to append a DNS domain name. If the name is specified with a trailing period, LoadLeveler does not append the domain name.
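
The three cases described for name_server can be summarized as a small decision function. The following Python sketch is an illustration only and does not reproduce LoadLeveler's code. The function names are hypothetical, and the sketch assumes that the strings DNS, NIS, and LOCAL are matched exactly as written.

```python
VALID_NAME_SERVERS = {"DNS", "NIS", "LOCAL"}


def domain_policy(name_server_list):
    """Classify a name_server value as 'always', 'never', or 'lookup'."""
    servers = set(name_server_list.split())
    if not servers or servers - VALID_NAME_SERVERS:
        raise ValueError("name_server must list DNS, NIS, and/or LOCAL")
    if servers == {"DNS"}:
        return "always"   # DNS alone: always append the DNS domain name
    if "DNS" not in servers:
        return "never"    # NIS and/or LOCAL only: never append
    return "lookup"       # DNS with NIS or LOCAL: consult the administration file


def has_trailing_period(machine_name):
    """A trailing period marks a name that must not get the domain appended."""
    return machine_name.endswith(".")


for value in ("DNS", "NIS", "DNS NIS"):
    print(value, "->", domain_policy(value))
```

The lookup case is deliberately left as a label, because the details of how the administration file is consulted are not described here.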

pvm_root = pathname

Where pathname specifies the location of the directory in which PVM is installed. The default pathname is /u/loadl/pvm3.

pool_list = pool_numbers

Where pool_numbers is a blank-delimited list of numbers identifying pools to which the machine belongs. This keyword provides compatibility with function that was previously part of the Resource Manager.

schedd_host = true | false

where true designates this as a public scheduling machine, used to receive job submissions from submit-only machines. Submit-only machines do not run LoadLeveler jobs.

spacct_excluse_enable = true | false

Where true specifies that the accounting function on an SP system is informed that a job step has exclusive use of this machine. Note that your SP system must have exclusive user accounting enabled in order for this keyword to have an effect. For more information on SP accounting, see Parallel System Support Programs for AIX: Administration Guide, GC23-3899.

speed = number

where number is a floating point number that is used for machine scheduling purposes in the MACHPRIO expression. For more information on machine scheduling and the MACHPRIO expression, see "Step 6: Prioritize the Order of Executing Machines Maintained by the Negotiator". The speed keyword is also used to define the weight associated with the machine. This weight is used when gathering accounting data on a machine basis. The default is 1.0.

The following example illustrates how the speed keyword can be used for assigning weights to machines.

If your cluster consisted of five RISC System/6000 machines that you want to have the same weight, you would not have to specify this keyword in the administration file. By default, all machines would have a weight of 1.0. If, however, you add an SP system to your cluster for parallel job processing, you may want to update the local configuration file for each node of the SP system to charge differently for resource consumption on those nodes. You would need to set the speed keyword to something other than 1.0 to make the SP nodes have a different weight.

For information on how the speed keyword can be used to schedule machines, refer to "Step 6: Prioritize the Order of Executing Machines Maintained by the Negotiator".

submit_only = true | false

where true designates this as a submit-only machine. If you set this keyword to true, you must also set central_manager and schedd_host to false in the same machine stanza of the administration file.
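
Several of the machine stanza rules given above can be checked mechanically: exactly one stanza may specify central_manager = true, and a submit-only machine can be neither a central manager (or alternate) nor a public scheduler. The following Python sketch is an illustration only. The function name and the stanza dictionaries are hypothetical, and the sketch assumes that an omitted central_manager or schedd_host keyword behaves like false.

```python
def check_machine_stanzas(machines):
    """Return a list of rule violations for {label: {keyword: value}} stanzas."""
    errors = []
    managers = [label for label, kw in machines.items()
                if kw.get("central_manager") == "true"]
    if len(managers) != 1:
        errors.append("expected exactly one central_manager = true, found %d"
                      % len(managers))
    for label, kw in machines.items():
        if kw.get("submit_only") != "true":
            continue
        # Assumption: an omitted keyword is treated as false.
        if kw.get("central_manager", "false") != "false":
            errors.append(label + ": submit-only machine cannot be a central"
                          " manager or alternate")
        if kw.get("schedd_host", "false") != "false":
            errors.append(label + ": submit-only machine cannot be a schedd_host")
    return errors


good = {
    "machine_a": {"central_manager": "true"},
    "machine_b": {"central_manager": "false", "schedd_host": "false",
                  "submit_only": "true"},
}
bad = {
    "machine_a": {"central_manager": "true"},
    "machine_x": {"central_manager": "alt", "submit_only": "true"},
}
print(check_machine_stanzas(good))
print(check_machine_stanzas(bad))
```

The first call reports no violations. The second reports machine_x, which is both submit-only and an alternate central manager.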

Examples of Machine Stanzas

Example 1

In this example, the machine is being defined as the central manager.

#
machine_a: type = machine
central_manager = true    # central manager runs here

Example 2

This example sets up a submit-only node. Note that the submit_only keyword is set to true:

#
machine_b: type = machine
central_manager = false   # not the central manager
schedd_host = false       # not a scheduling machine
submit_only = true        # submit only machine
alias = machineb          # interface name

Example 3

In the following example, machine_c is the central manager, has an alias associated with it, and can run parallel PVM jobs:

#
machine_c: type = machine
central_manager = true    # central manager runs here
schedd_host = true        # defines a public scheduler
alias = brianne
pvm_root = /u/brianne/loadl/1.2.0/aix32/pvm3

Step 2: Specify User Stanzas

The information specified in a user stanza defines the characteristics of that user. You can have one user stanza for each user but this is not necessary. If an individual user does not have their own user stanza, that user uses the defaults defined in the default user stanza.

User stanzas take the following format:

Figure 26. Format of a User Stanza

label: type = user
account = list
default_class = list
default_group = group_name
default_interactive_class = class_name
maxidle = number
maxjobs = number
maxqueued = number
max_node = number
max_processors = number
priority = number
total_tasks = number

You can specify the following keywords in a user stanza:

account = list

where list is a blank-delimited list of account numbers that identifies the account numbers a user may use when submitting jobs. The default is a null list.

default_class = list

where list is a blank-delimited list of class names used for jobs which do not include a class statement in the job command file. If you specify only one default class name, this class is assigned to the job. If you specify a list of default class names, LoadLeveler searches the list to find a class which satisfies the resource limit requirements. If no class satisfies these requirements, LoadLeveler rejects the job.

Suppose a job requests a CPU limit of 10 minutes. Also, suppose the default class list is default_class = short long, where short is a class for jobs up to five minutes in length and long is a class for jobs up to one hour in length. LoadLeveler will select the long class for this job because the five-minute limit of the short class is too small for the 10-minute request.
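
The class search in this example can be sketched in a few lines of Python. This is an illustration only, not LoadLeveler's implementation. The function name and the cpu_limits table are hypothetical, only a CPU limit is compared, and the sketch assumes that the list is searched in the order in which the classes are written.

```python
def pick_default_class(requested_cpu_seconds, default_classes, cpu_limits):
    """Return the first listed class whose CPU limit covers the request."""
    for name in default_classes.split():
        if cpu_limits[name] >= requested_cpu_seconds:
            return name
    return None  # no class satisfies the request: the job would be rejected


# Hypothetical limits matching the example: short = 5 minutes, long = 1 hour.
cpu_limits = {"short": 5 * 60, "long": 60 * 60}

print(pick_default_class(10 * 60, "short long", cpu_limits))
```

With these limits, the 10-minute request skips short and is assigned to long. A request for more than one hour returns None, which corresponds to the job being rejected.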

If no default_class is specified in the user stanza, or if there is no user stanza at all, then jobs submitted without a class statement are assigned to the default_class that appears in the default user stanza. If you do not define a default_class, jobs are assigned to the class called No_Class.

default_group = group_name

where group_name is the default group assigned to jobs submitted by the user. If a default_group statement does not appear in the user stanza, or if there is no user stanza at all, then jobs submitted by the user without a group statement are assigned to the default_group that appears in the default user stanza. If you do not define a default_group, jobs are assigned to the group called No_Group.

If you specify default_group = Unix_Group, LoadLeveler sets the user's LoadLeveler group to his or her primary UNIX group (as defined in the /etc/passwd file).

default_interactive_class = class_name

where class_name is the class to which an interactive job submitted by this user is assigned if the user does not specify a class using the LOADL_INTERACTIVE_CLASS environment variable. You can specify only one default interactive class name.

If you do not set a default_interactive_class value in the user stanza, or if there is no user stanza at all, then interactive jobs submitted without a class statement are assigned to the default_interactive_class that appears in the default user stanza. If you do not define a default_interactive_class, interactive jobs are assigned to the class called No_Class.

See "Example 2" for more information on how LoadLeveler assigns a default interactive class to jobs.

maxidle = number

where number is the maximum number of idle jobs this user can have in queue. That is, number is the maximum number of jobs which the negotiator will consider for dispatch for the user. Jobs above this maximum are placed in the NotQueued state. This prevents individual users from dominating the number of jobs that are either running or are being considered to run. If the user stanza does not specify maxidle or if there is no user stanza at all, the maximum number of jobs that can be simultaneously in queue for the user is defined in the default stanza. If no value is found, or the limit found is -1, then no limit is placed on the number of jobs that can be simultaneously idle for the user.

For more information, see "Controlling the Mix of Idle and Running Jobs".

maxjobs = number

where number is the maximum number of jobs this user can run at any time. If the user stanza does not specify maxjobs or if there is no user stanza at all, the maximum number of jobs that the user can run simultaneously is defined in the default stanza. The default is -1, which means no limit is placed on the number of jobs that can simultaneously run for the user. Regardless of this limit, there is no limit to the number of jobs a user can submit.

For more information, see "Controlling the Mix of Idle and Running Jobs".

maxqueued = number

where number is the maximum number of jobs allowed in the queue for this user. This is the maximum number of jobs which can be either running or being considered to be dispatched by the negotiator for that user. Jobs above this maximum are placed in the NotQueued state. This prevents individual users from dominating the number of jobs that are either running or are being considered to run. If no maxqueued is specified in the user stanza, or if there is no user stanza, the maximum number of jobs that can simultaneously be in the queue is defined in the default stanza. The default is -1, which means that no limit is placed on the number of jobs that can simultaneously be in the job queue for that user. Regardless of this limit, there is no limit to the number of jobs a user can submit.

For more information, see "Controlling the Mix of Idle and Running Jobs".
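
The way maxidle, maxjobs, and maxqueued interact can be pictured with a simplified model. The following Python sketch is an illustration only and is far simpler than the negotiator's real behavior. The function names are hypothetical, the label Idle stands for a job that the negotiator will consider for dispatch, and a limit of -1 means no limit, as described above.

```python
def state_for_new_job(running, idle, maxidle=-1, maxqueued=-1):
    """Return 'Idle' if a newly submitted job is considered, else 'NotQueued'."""
    if maxidle != -1 and idle >= maxidle:
        return "NotQueued"   # already at the idle-job limit
    if maxqueued != -1 and running + idle >= maxqueued:
        return "NotQueued"   # already at the running-plus-considered limit
    return "Idle"


def may_start_running(running, maxjobs=-1):
    """Return True if one more of the user's jobs is allowed to start."""
    return maxjobs == -1 or running < maxjobs


# A user with 3 running and 2 idle jobs, limited to maxidle = 2.
print(state_for_new_job(running=3, idle=2, maxidle=2))
print(may_start_running(running=3, maxjobs=15))
```

In this model the new job is placed in the NotQueued state because the user already has two idle jobs, while the maxjobs limit of 15 would still allow another job to start.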

max_node = number

where number specifies the maximum number of nodes this user can request for a parallel job in a job command file using the node keyword. The default is -1, which means there is no limit.

max_processors = number

where number specifies the maximum number of processors this user can request for a parallel job in a job command file using the max_processors keyword. The default is -1, which means there is no limit.

priority = number

where number is an integer that specifies the priority for jobs submitted by the user. The default is 0. The number specified for priority is referenced as UserSysprio in the configuration file. UserSysprio can be used in the assignment of job priorities. If the variable UserSysprio does not appear in the SYSPRIO expression in the configuration file, the priority numbers for users specified here in the administration file have no effect. See "Step 5: Prioritize the Queue Maintained by the Negotiator" for more information about the UserSysprio keyword.

total_tasks = number

where number specifies the maximum number of tasks this user can request for a parallel job in a job command file using the total_tasks keyword. The default is -1, which means there is no limit.

Examples of User Stanzas

Example 1

In this example, user fred is being provided with a user stanza. His jobs will have a user priority of 100. If he does not specify a job class in his job command file, the default job class class_a will be used. In addition, he can have a maximum of 15 jobs running at the same time.

# Define user stanzas
fred:  type = user
priority = 100
default_class = class_a
maxjobs = 15

Example 2

This example explains how a default interactive class for a parallel job is set by presenting a series of user stanzas and class stanzas. This example assumes that users do not specify the LOADL_INTERACTIVE_CLASS environment variable.

default: type = user
         default_interactive_class = red
         default_class = blue 
 
carol:   type = user 
         default_class = single double 
         default_interactive_class = ijobs
 
steve:   type = user
         default_class = single double
 
ijobs:   type = class
         wall_clock_limit = 08:00:00 
 
red:     type = class
         wall_clock_limit = 30:00

If the user Carol submits an interactive job, the job is assigned to the default interactive class called ijobs. The job is assigned a wall clock limit of 8 hours. If the user Steve submits an interactive job, the job is assigned to the red class from the default user stanza. The job is assigned a wall clock limit of 30 minutes.
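
The lookup order used in this example (the LOADL_INTERACTIVE_CLASS environment variable, then the user stanza, then the default user stanza, then No_Class) can be written as a short Python sketch. This is an illustration only. The function name and the dictionary layout are hypothetical, and the stanza values are copied from the example above.

```python
def interactive_class(user, user_stanzas, env_class=None):
    """Resolve the class assigned to an interactive job submitted by user."""
    if env_class:
        return env_class  # value of LOADL_INTERACTIVE_CLASS, if the user set it
    for label in (user, "default"):
        stanza = user_stanzas.get(label, {})
        if "default_interactive_class" in stanza:
            return stanza["default_interactive_class"]
    return "No_Class"


user_stanzas = {
    "default": {"default_interactive_class": "red", "default_class": "blue"},
    "carol": {"default_class": "single double",
              "default_interactive_class": "ijobs"},
    "steve": {"default_class": "single double"},
}

for name in ("carol", "steve"):
    print(name, "->", interactive_class(name, user_stanzas))
```

Carol resolves to ijobs from her own stanza, and Steve falls back to red from the default user stanza.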

Example 3

In this example, Jane's jobs have a user priority of 50, and if she does not specify a job class in her job command file the default job class small_jobs is used. This user stanza does not specify the maximum number of jobs that Jane can run at the same time so this value defaults to the value defined in the default stanza. Also, suppose Jane is a member of the primary UNIX group "staff." Jobs submitted by Jane will use the default LoadLeveler group "staff." Lastly, Jane can use three different account numbers.

# Define user stanzas
jane:  type = user
priority = 50
default_class = small_jobs
default_group = Unix_Group
account = dept10 user3 user4

Step 3: Specify Class Stanzas

The information in a class stanza defines characteristics for that class. Class stanzas are optional. Class stanzas take the following format. Default values for keywords appear in bold.

Figure 27. Format of a Class Stanza

label: type = class
admin = list
class_comment = "string"
exclude_groups = list
exclude_users = list
include_groups = list
include_users = list
master_node_requirement = true | false
maxjobs = number
max_node = number
max_processors = number
nice = value
NQS_class = true | false
NQS_submit = name
NQS_query = queue names
priority = number
total_tasks = number
 
core_limit = hardlimit,softlimit
cpu_limit = hardlimit,softlimit
data_limit = hardlimit,softlimit
file_limit = hardlimit,softlimit
job_cpu_limit = hardlimit,softlimit
rss_limit = hardlimit,softlimit
stack_limit = hardlimit,softlimit
wall_clock_limit = hardlimit,softlimit

You can specify the following keywords in a class stanza:

admin = list

where list is a blank-delimited list of administrators for this class. These administrators can hold, release, and cancel jobs in this class.

class_comment = "string"

where string is text characterizing the class. This information appears when the user is building a job command file using the GUI and requests Choice information on the classes to which he or she is authorized to submit jobs. The length of the string cannot exceed 1024 characters.

exclude_groups = list

where list is a blank-delimited list of groups who are not allowed to submit jobs of that class name. Do not specify both a list of included groups and a list of excluded groups. Only one of these may be used for any class. The default is that no groups are excluded.

exclude_users = list

where list is a blank-delimited list of users who are not permitted to submit jobs of that class name. Do not specify both a list of included users and a list of excluded users. Only one of these may be used for any class. The default is that no users are excluded.

include_groups = list

where list is a blank-delimited list of groups who are allowed to submit jobs of that class name. If provided, this list limits groups of that class to those on the list. Do not specify both a list of included groups and a list of excluded groups. Only one of these may be used for any class. The default is to include all groups.

include_users = list

where list is a blank-delimited list of users who are permitted to submit jobs of that class name. If provided, this list limits users of that class to those on the list. Do not specify both a list of included users and a list of excluded users. Only one of these may be used for any class. The default is to include all users.

master_node_requirement = true | false

where true specifies that parallel jobs in this class require the master node feature. For these jobs, LoadLeveler allocates the first node (called the "master") on a machine having the master_node_exclusive = true setting in its machine stanza. If most or all of your parallel jobs require this feature, you should consider placing the statement master_node_requirement = true in your default class stanza. Then, for classes that do not require this feature, you can use the statement master_node_requirement = false in their class stanzas to override the default setting. One machine per class should have the true setting; if more than one machine has this setting, normal scheduling selection is performed.

maxjobs = number

where number is the maximum number of jobs that can run in this class. If the class stanza does not specify maxjobs, or if there is no class stanza at all, the maximum number of jobs that can be simultaneously run in this class is defined in the default stanza. The default is -1, which means that no limit is placed on the number of jobs a user can submit.

max_processors = number

where number specifies the maximum number of processors a user submitting jobs to this class can request for a parallel job in a job command file using the max_processors keyword. The default is -1 which means that there is no limit.

max_node = number

where number specifies the maximum number of nodes a user submitting jobs in this class can request for a parallel job in a job command file using the node keyword. The default is -1, which means there is no limit.

nice = value

where value is the amount by which the current UNIX nice value is incremented. The nice value is one factor in a job's run priority. The lower the number, the higher the run priority. If two jobs are running on a machine, the nice value determines the percentage of the CPU allocated to each job.

This value ranges from -20 to 20. Values out of this range are placed at the top (or bottom) of the range. For example, if your current nice value is 15, and you specify nice = 10, the resulting value is 20 (the upper limit) rather than 25. The default is 0.
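The range handling described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name effective_nice is hypothetical, and the range constants are the ones quoted in the text.

```python
NICE_MIN = -20  # bottom of the range quoted in the text
NICE_MAX = 20   # top of the range quoted in the text


def effective_nice(current, increment):
    """Add the class stanza's nice increment to the current nice value,
    then clamp the result to the documented range."""
    return max(NICE_MIN, min(NICE_MAX, current + increment))


# The example from the text: current value 15, nice = 10 in the class stanza.
print(effective_nice(15, 10))  # 20, not 25
```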

For more information, consult the appropriate UNIX documentation.

NQS_class = true | false

When true, any job submitted to this class will be routed to an NQS machine.

NQS_submit = name

where name is the name of the NQS pipe queue to which the job will be routed. When the job is dispatched to LoadLeveler, LoadLeveler will invoke the qsub command using the name of this queue. There is no default.

NQS_query = queue names

where queue names is a blank-delimited list of queue names (including host names if necessary) to be used with the qstat command to monitor the job and with the qdel command to cancel the job. There is no default.

For more information on routing jobs to machines running NQS, refer to Figure 31.

priority = number

where number is an integer that specifies the priority for jobs in this class. The default is 0. The number specified for priority is referenced as ClassSysprio in the configuration file. You can use ClassSysprio when assigning job priorities. If the variable ClassSysprio does not appear in the SYSPRIO expression, then the priority specified here in the administration file is ignored. See "Step 5: Prioritize the Queue Maintained by the Negotiator" for more information about the ClassSysprio keyword.

total_tasks = number

where number specifies the maximum number of tasks a user submitting jobs in this class can request for a parallel job in a job command file using the total_tasks keyword. The default is -1, which means there is no limit.

Limit Keywords

The class stanza includes the following limit keywords, which allow you to control the amount of resources used by a job step or a job process.

Table 6. Types of Limit Keywords
Limit             | How It Is Enforced
core_limit        | Per process
cpu_limit         | Per process
data_limit        | Per process
file_limit        | Per process
job_cpu_limit     | Per job step
rss_limit         | Per process
stack_limit       | Per process
wall_clock_limit  | Per job step

Individual keywords are described in "Specifying Limits in the Class Stanza". The following section gives you a general overview of limits.

Overview of Limits

A limit is the amount of a resource that a job step or a process is allowed to use. (A process is a dispatchable unit of work.) A job step may be made up of several processes.

Limits include both a hard limit and a soft limit. When a hard limit is exceeded, the job is usually terminated. When a soft limit is exceeded, the job is usually given a chance to perform some recovery actions. For more information, see "Exceeding Limits".

Limits are enforced either per process or per job step, depending on the type of limit. For parallel job steps, which consist of multiple tasks running on multiple machines, limits are enforced on a per task basis.

For example, a common limit is the cpu_limit, which limits the amount of CPU time a single process can use. If you set cpu_limit to five hours and you have a job step that forks five processes, each process can use up to five hours of CPU time, for a total of 25 CPU hours. Another limit that controls the amount of CPU used is job_cpu_limit. This is the total amount of CPU that the entire serial job step can use. If you impose a job_cpu_limit of five hours, the entire job step (made up of all five processes) cannot consume more than five CPU hours.
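The arithmetic in this example can be written out as a short Python sketch. The function name max_cpu_hours and its parameters are hypothetical; the sketch only restates the relationship between the per process cpu_limit and the per job step job_cpu_limit for a serial job step.

```python
def max_cpu_hours(cpu_limit_hours, processes, job_cpu_limit_hours=None):
    """Upper bound on the CPU hours a serial job step can consume.

    cpu_limit applies to each process separately, so without a
    job_cpu_limit the bound is cpu_limit * number of processes.
    job_cpu_limit, when set, caps the sum over all processes.
    """
    per_process_total = cpu_limit_hours * processes
    if job_cpu_limit_hours is None:
        return per_process_total
    return min(per_process_total, job_cpu_limit_hours)


print(max_cpu_hours(5, 5))      # 25: five processes, five hours each
print(max_cpu_hours(5, 5, 5))   # 5: job_cpu_limit caps the whole job step
```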

You can specify limits in either the class stanza of the administration file or in the job command file. For a per process limit, the limit you set in the administration file overrides the system limit (also called the machine limit).

Exceeding Limits

Process limits are enforced by the operating system. Job step limits are enforced by LoadLeveler.

Exceeding Job Step Limits

When a hard limit is exceeded, LoadLeveler sends a non-trappable signal to the process (except in the case of a parallel job). When a soft limit is exceeded, LoadLeveler sends a trappable signal to the process. The following chart summarizes the actions that occur when a job step limit is exceeded:

Table 7. Exceeding Job Step Limits
Type of Job        | When a Soft Limit is Exceeded                                       | When a Hard Limit is Exceeded
Serial             | SIGXCPU or SIGKILL issued                                           | SIGKILL issued
Parallel (non-PVM) | SIGXCPU issued to both the user program and to the parallel daemon | SIGTERM issued
PVM                | SIGXCPU issued to the user program                                  | pvm_halt invoked to shut down PVM

On systems that do not support SIGXCPU, LoadLeveler does not distinguish between hard and soft limits. When a soft limit is reached on these platforms, LoadLeveler issues a SIGKILL.

Exceeding Per Process Limits

For per process limits, what happens when your job reaches and exceeds either the soft limit or the hard limit depends on the operating system you are using.

Note that when a job forks a process which exceeds a per process limit, such as the CPU limit, the operating system (and not LoadLeveler) terminates the process by issuing a SIGXCPU. As a result, you will not see an entry in the LoadLeveler logs indicating that the process exceeded the limit. The job will complete with a 0 return code. LoadLeveler can only report the status of any processes it has started.

If you need more specific information, refer to your operating system documentation.

Syntax

The syntax for setting a limit is

limit_type = hardlimit,softlimit

For example:

core_limit = 120kb,100kb

To specify only a hard limit, you can enter, for example:

core_limit = 120kb

To specify only a soft limit, you can enter, for example:

core_limit = ,100kb

In a keyword statement, you cannot have any blanks between the numerical value (100 in the above example) and the units (kb). Also, you cannot have any blanks to the left or right of the comma when you define a limit in a job command file.

For limit keywords that refer to a data limit -- such as data_limit, core_limit, file_limit, stack_limit, and rss_limit -- the hard limit and the soft limit are expressed as:

integer[.fraction][units]

where integer and fraction represent numerical strings of up to eight characters. units can be:

b
bytes
w
words
kb
kilobytes (2^10 bytes)
kw
kilowords (2^10 words)
mb
megabytes (2^20 bytes)
mw
megawords (2^20 words)
gb
gigabytes (2^30 bytes)
gw
gigawords (2^30 words)

If no units are specified, bytes are assumed.
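As a rough illustration of the integer[.fraction][units] format, the following Python sketch converts a data limit string to bytes. It is not LoadLeveler's own parser. The function name parse_data_limit is hypothetical, the size of a word is not stated in this chapter and is passed in as an assumed word_size parameter (4 bytes by default), and fractional byte counts are truncated, which is also an assumption. The special strings rlim_infinity, unlimited, and copy are not handled.

```python
import re
from fractions import Fraction

# Multipliers from the units list; word-based units are scaled by
# word_size afterwards.
_MULTIPLIER = {
    "b": 1, "w": 1,
    "kb": 2**10, "kw": 2**10,
    "mb": 2**20, "mw": 2**20,
    "gb": 2**30, "gw": 2**30,
}

# integer[.fraction][units]; integer and fraction are at most eight
# characters each, and no blanks are allowed before the units.
_PATTERN = re.compile(r"^(\d{1,8})(?:\.(\d{1,8}))?(b|w|kb|kw|mb|mw|gb|gw)?$")


def parse_data_limit(text, word_size=4):
    """Return the limit in bytes. word_size is an assumption, not a
    value taken from the documentation."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError("not a valid data limit: %r" % text)
    integer, fraction, units = match.groups()
    value = Fraction(int(integer))
    if fraction:
        value += Fraction(int(fraction), 10 ** len(fraction))
    units = units or "b"            # bytes are assumed when units are omitted
    scale = _MULTIPLIER[units]
    if units.endswith("w"):
        scale *= word_size
    return int(value * scale)       # truncates any fractional byte


print(parse_data_limit("5621kb"))   # 5755904
print(parse_data_limit("2mb"))      # 2097152
```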

For limit keywords that refer to a time limit -- such as cpu_limit, job_cpu_limit, and wall_clock_limit -- the hard limit and the soft limit are expressed as:

[[hours:]minutes:]seconds[.fraction]

Fractions are rounded to seconds.
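The time format can be illustrated with a small Python sketch that converts a limit string to whole seconds. The function name parse_time_limit is hypothetical, and the sketch assumes that a fraction of one half or more rounds up, which the text does not state. The special strings rlim_infinity, unlimited, and copy are not handled.

```python
import re

# [[hours:]minutes:]seconds[.fraction]
_TIME = re.compile(r"^(?:(?:(\d+):)?(\d+):)?(\d+)(?:\.(\d+))?$")


def parse_time_limit(text):
    """Return the limit in whole seconds."""
    match = _TIME.match(text)
    if match is None:
        raise ValueError("not a valid time limit: %r" % text)
    hours, minutes, seconds, fraction = match.groups()
    total = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds)
    # Assumption: fractions of .5 and above round up to the next second.
    if fraction and fraction[0] >= "5":
        total += 1
    return total


print(parse_time_limit("12:56:21"))  # 46581
print(parse_time_limit("56:00"))     # 3360
print(parse_time_limit("1:03"))      # 63
```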

You can use the following character strings with all limit keywords:

rlim_infinity
Represents the largest positive number.
unlimited
Has same effect as rlim_infinity.
copy
Uses the limit currently active when the job is submitted.

See Table 8 for more information on specifying limits.

Table 8. Setting limits
If the hard limit: | Then the:
Is set in both the class stanza and the job command file | Smaller of the two limits is taken into consideration. If the smaller limit is the job limit, the job limit is then compared with the user limit set on the machine that runs the job. The smaller of these two values is used. If the limit used is the class limit, the class limit is used without being compared to the machine limit.
Is not set in either the class stanza or the job command file | User per process limit set on the machine that runs the job is used.
Is set in the job command file and is less than its respective job soft limit | The job is not submitted.
Is set in the class stanza and is less than its respective class stanza soft limit | Soft limit is adjusted downward to equal the hard limit.
Is specified in the job command file | Hard limit must be greater than or equal to the specified soft limit and less than or equal to the limit set by the administrator in the class stanza of the administration file.

Note: If the per process limit is not defined in the administration file and the hard limit defined by the user in the job command file is greater than the limit on the executing machine, then the hard limit is set to the machine limit.
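One possible reading of these rules for a per process hard limit is sketched below in Python. It is an interpretation, not LoadLeveler's implementation: the function name effective_hard_limit is hypothetical, None stands for "not set", and the treatment of a job limit exactly equal to the class limit (handled here as the class limit) is an assumption, since the table does not cover that case.

```python
def effective_hard_limit(class_limit, job_limit, machine_limit):
    """Pick the hard limit that applies to a process.

    class_limit   -- limit from the class stanza, or None if not set
    job_limit     -- limit from the job command file, or None if not set
    machine_limit -- user per process limit on the executing machine
    """
    if class_limit is None and job_limit is None:
        # Not set anywhere: the machine limit is used.
        return machine_limit
    if class_limit is None:
        # Only the job sets a limit: it cannot exceed the machine limit.
        return min(job_limit, machine_limit)
    if job_limit is None:
        # Only the class sets a limit: it overrides the machine limit.
        return class_limit
    if job_limit < class_limit:
        # The job limit is the smaller one: compare it with the machine limit.
        return min(job_limit, machine_limit)
    # The class limit is used without comparison to the machine limit.
    # (Assumption: a tie is treated as the class limit.)
    return class_limit


print(effective_hard_limit(80, 50, 40))   # 40: job limit, then machine limit
print(effective_hard_limit(80, 120, 40))  # 80: class limit, machine ignored
```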

Specifying Limits in the Class Stanza

You can specify the following limit keywords:

core_limit = hardlimit,softlimit

Specifies the hard limit and/or soft limit for the size of a core file.

Examples:

core_limit = unlimited
core_limit = 30mb

For more information, see "Overview of Limits".

cpu_limit = hardlimit,softlimit

Specifies hard limit and/or soft limit for the CPU time to be used by each individual process of a job step. For example, if you impose a cpu_limit of five hours and you have a job step composed of five processes, each process can consume five CPU hours; the entire job step can therefore consume 25 total hours of CPU.

Examples:

cpu_limit = 12:56:21       # hardlimit = 12 hours 56 minutes 21 seconds
cpu_limit = 56:00,50:00    # hardlimit = 56 minutes 0 seconds
                           # softlimit = 50 minutes 0 seconds
cpu_limit = 1:03           # hardlimit = 1 minute 3 seconds
cpu_limit = unlimited      # hardlimit = 2,147,483,647 seconds
                           # (X'7FFFFFFF')
cpu_limit = rlim_infinity  # hardlimit = 2,147,483,647 seconds
                           # (X'7FFFFFFF')
cpu_limit = copy           # current CPU hardlimit value on the
                           # submitting machine.

For more information, see "Overview of Limits".

data_limit = hardlimit,softlimit

Specifies hard limit and/or soft limit for the data segment to be used by each process of the submitted job.

Examples:

data_limit = 125621         # hardlimit = 125621 bytes
data_limit = 5621kb         # hardlimit = 5621 kilobytes
data_limit = 2mb            # hardlimit = 2 megabytes
data_limit = 2.5mw          # hardlimit = 2.5 megawords
data_limit = unlimited      # hardlimit = 2,147,483,647 bytes
                            # (X'7FFFFFFF')
data_limit = rlim_infinity  # hardlimit = 2,147,483,647 bytes
                            # (X'7FFFFFFF')
data_limit = copy           # copy data hardlimit value from submitting
                            # machine.

For more information, see "Overview of Limits".

file_limit = hardlimit,softlimit

Specifies the hard limit and/or soft limit for the size of a file. For more information, see "Overview of Limits".

job_cpu_limit = hardlimit,softlimit

Specifies the maximum total CPU time to be used by all processes of a job step. That is, if a job step forks to produce multiple processes, the sum total of CPU consumed by all of the processes is added and controlled by this limit.

For example:

job_cpu_limit = 10000

For more information on this keyword, see the JOB_LIMIT_POLICY keyword in Chapter 7. "Gathering Job Accounting Data". For more general information on limits, see "Overview of Limits".

rss_limit = hardlimit,softlimit

Specifies the hard limit and/or soft limit for the resident size. For more information, see "Overview of Limits".

stack_limit = hardlimit,softlimit

Specifies the hard limit and/or soft limit for the size of a stack. For more information, see "Overview of Limits".

wall_clock_limit = hardlimit,softlimit

Specifies the hard limit and/or soft limit for the elapsed time for which a job can run. Note that LoadLeveler uses the time the negotiator daemon dispatches the job as the start time of the job. When a job is checkpointed, vacated, and then restarted, the wall_clock_limit is not adjusted to account for the amount of time that elapsed before the checkpoint occurred. This keyword is not supported for NQS jobs. Also, if the startd daemon terminates abnormally with running jobs, any wall clock limits are not supported when the daemon is restarted.

If you are running the Backfill scheduler, you must set a wall clock limit either in the job command file or in a class stanza (for the class associated with the job you submit). LoadLeveler administrators should consider setting a default wall clock limit in a default class stanza. For more information on setting a wall clock limit when using the Backfill scheduler, see "Choosing a Scheduler".

For more general information on limits, see "Overview of Limits".

Examples of Class Stanzas

Example 1: Creating a Class that Excludes Certain Users

class_a: type=class                # class that excludes users
priority=10                        # ClassSysprio
exclude_users=green judy           # Excluded users
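The effect of the include and exclude lists can be sketched in Python as follows. This is an illustration of the rules stated for include_users, exclude_users, include_groups, and exclude_groups, not LoadLeveler code; the function name may_submit is hypothetical, and rejecting a stanza that sets both lists is this sketch's way of modelling "do not specify both".

```python
def may_submit(name, include=None, exclude=None):
    """Return True if the user (or group) called name may submit jobs
    to a class with the given include or exclude list."""
    if include and exclude:
        raise ValueError("specify an include list or an exclude list, not both")
    if include:
        return name in include       # an include list limits access to its members
    if exclude:
        return name not in exclude   # an exclude list blocks only its members
    return True                      # default: everyone is included


# Lists taken from Example 1 and Example 2 in this section.
print(may_submit("judy", exclude=["green", "judy"]))   # False
print(may_submit("bob", include=["bob", "sally"]))     # True
```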

Example 2: Creating a Class for Small-Size Jobs

small:  type=class                 # class for small jobs
priority=80                        # ClassSysprio (max=100)
cpu_limit=00:02:00                 # 2 minute limit
data_limit=30mb                    # max 30 MB data segment
core_limit=10mb                    # max 10 MB core file
file_limit=50mb                    # max file size 50 MB
stack_limit=10mb                   # max stack size 10 MB
rss_limit=35mb                     # max resident set size 35 MB
include_users = bob sally          # authorized users

Example 3: Creating a Class for Medium-Size Jobs

medium: type=class             # class for medium jobs
priority=70                    # ClassSysprio
cpu_limit=00:10:00             # 10 minute run time limit
data_limit=80mb,60mb           # hard limit 80 MB data segment
                               # soft limit 60 MB data segment
core_limit=30mb                # max 30 MB core file
file_limit=80mb                # max file size 80 MB
stack_limit=30mb               # max stack size 30 MB
rss_limit=100mb                # max resident set size 100 MB
job_cpu_limit=1800,1200        # hard limit is 30 minutes,
                               # soft limit is 20 minutes

Example 4: Creating a Class for Large-Size Jobs

large:  type=class             # class for large jobs
priority=60                    # ClassSysprio
cpu_limit=00:10:00             # 10 minute run time limit
data_limit=120mb               # max 120 MB data segment
core_limit=30mb                # max 30 MB core file
file_limit=120mb               # max file size 120 MB
stack_limit=unlimited          # unlimited stack size
rss_limit=150mb                # max resident set size 150 MB
job_cpu_limit = 3600,2700      # hard limit 60 minutes
                               # soft limit 45 minutes
wall_clock_limit=12:00:00,11:59:55 # hard limit is 12 hours

Example 5: Creating a Class to Route Jobs to NQS Machines

nqs:   type=class               # class for NQS jobs
NQS_class=true
NQS_submit=pipe_queue           # NQS pipe queue name
NQS_query=one two three         # list of queue names

You can use the class names in control expressions in both the global and local configuration files.

Example 6: Creating a Class for PVM Jobs

PVM3:  type=class             # class for PVM jobs
priority=60                   # ClassSysprio (max=100)
max_processors=15             # maximum number of processors

Example 7: Creating a Class for Master Node Machines

sp-6hr-sp:  type=class               # class for master node machines
priority=50                          # ClassSysprio (max=100)
cpu_limit = 06:00:00                 # 6 hour limit
job_cpu_limit = 06:00:00             # hard limit is 6 hours
core_limit = 1mb                     # max 1MB core file
master_node_requirement = true       # master node definition

Step 4: Specify Group Stanzas

LoadLeveler groups are another way of granting control to the system administrator. Although a LoadLeveler group is independent from a UNIX group, you can configure a LoadLeveler group to have the same users as a UNIX group by using the include_users keyword, which is explained in this section.

The information specified in a group stanza defines the characteristics of that group. Group stanzas are optional and take the following format:

Figure 28. Format of a Group Stanza

label: type = group
admin = list
exclude_users = list
include_users = list
maxidle = number
maxjobs = number
maxqueued = number
max_node = number
max_processors = number
priority = number
total_tasks = number

You can specify the following keywords in a group stanza:

admin = list

where list is a blank-delimited list of administrators for this group. These administrators can hold, release, and cancel jobs submitted by users in the group.

exclude_users = list

where list is a blank-delimited list of users that do not belong to the group. Do not specify both a list of included users and a list of excluded users. Only one of these may be used for any group. The default is that no users will be excluded.

include_users = list

where list is a blank-delimited list of users that belong to the group. If provided, this list limits users of that group to those on the list. Do not specify both a list of included users and a list of excluded users. Only one of these can be used for any group. The default is that all users are included.

maxidle = number

where number is the maximum number of idle jobs this group can have in queue. That is, number is the maximum number of jobs which the negotiator will consider for dispatch for this group. Jobs above this maximum are placed in the NotQueued state. This prevents groups from flooding the job queue. If the group stanza does not specify maxidle or if there is no group stanza at all, the maximum number of jobs that can be simultaneously in queue for the group is defined in the default stanza. The default is -1, which means that no limit is placed on the number of jobs that can be simultaneously idle for the group.

For more information, see "Controlling the Mix of Idle and Running Jobs".

maxjobs = number

where number is the maximum number of jobs this group can run at any time. If the group stanza does not specify maxjobs or if there is no group stanza at all, the maximum number of jobs that can be simultaneously run for the group is defined in the default stanza. The default is -1, which means that no limit is placed on the number of jobs that can be simultaneously run for the group. Regardless of the limit set to running jobs, there is no limit to the number of jobs that a group can submit.

For more information, see "Controlling the Mix of Idle and Running Jobs".

maxqueued = number

where number is the maximum number of jobs allowed in the queue for this group. This prevents groups from flooding the job queue. Jobs above this maximum are placed in the NotQueued state. If no maxqueued is specified in the group stanza, or if there is no group stanza, the maximum number of jobs that can simultaneously be in the queue is defined in the default stanza. The default is -1, which means that no limit is placed on the number of jobs that can simultaneously be in the job queue for that group. Regardless of the limit set to the number of jobs queued, there is no limit to the number of jobs a group can submit.

For more information, see "Controlling the Mix of Idle and Running Jobs".

max_node = number

where number specifies the maximum number of nodes a user can request for a parallel job in a job command file using the node keyword. The default is -1, which means there is no limit.

max_processors = number

where number specifies the maximum number of processors a user can request for a parallel job in a job command file using the max_processors keyword. The default is -1, which means there is no limit.

priority = number

where number is an integer that specifies the job priority for jobs associated with this group. Higher priority numbers result in a better job dispatch order. If the group stanza does not specify a priority or if there is no group stanza at all, the priority is defined in the default group stanza. The default priority is 0. The number specified for priority is referenced as GroupSysprio in the configuration file. GroupSysprio can be used in the assignment of job priorities. If the variable GroupSysprio does not appear in the SYSPRIO expression in the configuration file, the priority numbers for groups specified in the administration file have no effect. See "Step 5: Prioritize the Queue Maintained by the Negotiator" for more information about the GroupSysprio keyword.

total_tasks = number

where number specifies the maximum number of tasks a user specifying this group can request for a parallel job in a job command file using the total_tasks keyword. The default is -1, which means there is no limit.

Examples of Group Stanzas

Example 1

In this example, the group name is department_a. The jobs issued by users belonging to this group will have a priority of 80. There are three members in this group.

# Define group stanzas
department_a:  type = group
priority = 80
include_users = susann holly fran

Example 2

In this example, the group called great_lakes has five members and these users' jobs have a priority of 100:

# Define group stanzas
great_lakes:  type = group
priority = 100
include_users = huron ontario michigan erie superior

Step 5: Specify Adapter Stanzas

An adapter stanza identifies network adapters that are available on the machines in the LoadLeveler cluster. Adapter stanzas are optional. You need to specify an adapter stanza when you want LoadLeveler jobs to be able to request a specific adapter. You do not need to specify an adapter stanza when you want LoadLeveler jobs to access a shared, default adapter via TCP/IP.

Note the following when using an adapter stanza:

For information on creating adapter stanzas for an SP system, see llextSDR - Extract adapter information from the SDR.

An adapter stanza has the following format:

Figure 29. Format of an Adapter Stanza

label: type = adapter
adapter_name = name
interface_address = IP_address
interface_name = name
network_type = type
switch_node_number = integer

You can specify the following keywords in an adapter stanza:

adapter_name = string

Where string is the name used to refer to a particular interface card installed on the node. Some examples are en0, tk1, and css0. This keyword defines the adapters a user can specify in a job command file using the network keyword. This keyword is required.

interface_address = string

Where string is the IP address by which the adapter is known to other nodes in the network. For example: 7.14.21.28. This keyword is required.

interface_name = string

Where string is the name by which the adapter is known by other nodes in the network. This keyword is required.

network_type = string

Where string specifies the type of network that the adapter supports (for example, Ethernet). This is an administrator defined name. This keyword defines the types of networks a user can specify in a job command file using the network keyword.

switch_node_number = integer

Where integer specifies the node on which the SP switch adapter is installed. This keyword is required for SP switch adapters. Its value is defined in the switch_node_number field in the Node class in the SDR. This value must match the value in the /spdata/sys1/st/switch_node_number file of the Parallel System Support Programs (PSSP).

Example of an Adapter Stanza

Example 1: Specifying an SP Switch Adapter

In the following example, the adapter stanza called "sp01sw.ibm.com" specifies an SP switch adapter. Note that sp01sw.ibm.com is also specified on the adapter_stanzas keyword of the machine stanza for the "yugo" machine.

          yugo:  type=machine
                 adapter_stanzas = sp01sw.ibm.com
                 ...
 
sp01sw.ibm.com:  type = adapter
                 adapter_name = css0
                 interface_address = 12.148.44.218
                 interface_name = sp01sw.ibm.com
                 network_type = switch
                 switch_node_number = 7

Configuring LoadLeveler

One of your main tasks as system administrator is to configure LoadLeveler. To configure LoadLeveler, you need to know what the configuration information is and where it is located. Configuration information includes the following:

LoadLeveler sets up the following default values for the configuration information:

You can run your installation with these default values, or you can change any or all of them. To override the defaults, you must update the following keywords in the /etc/LoadL.cfg file:

LoadLUserid
Specifies the LoadLeveler user ID.
LoadLGroupid
Specifies the LoadLeveler group ID.
LoadLConfig
Specifies the full path name of the configuration file.

Note that if you change the LoadLeveler user ID to something other than loadl, you will have to make sure your configuration files are owned by this ID.

You can also override the /etc/LoadL.cfg file. For an example of when you might want to do this, see "Querying Multiple LoadLeveler Clusters".

The Configuration Files

By taking a look at the configuration files that come with LoadLeveler, you will find that there are many parameters that you can set. In most cases, you will only have to modify a few of these parameters. In some cases, though, depending upon the LoadLeveler nodes, network connection, and hardware availability, you may need to modify additional parameters. This chapter describes these configuration files and the parameters you can set.

Configuring LoadLeveler involves modifying the configuration files that specify the terms under which LoadLeveler can use machines. There are two types of configuration files:

Configuration File Structure and Syntax

The information in both the LoadL_config and the LoadL_config.local files is in the form of a statement. These statements are made up of keywords and values. There are three types of configuration file keywords:

Configuration file statements take one of the following formats:

keyword=value
keyword:value

Statements in the form keyword=value are used primarily to customize an environment. Statements in the form keyword:value are used by LoadLeveler to characterize the machine and are known as part of the machine description. Every machine in LoadLeveler has its own machine description which is read by the central manager when LoadLeveler is started.

To continue a configuration file statement on the next line, use the backslash character (\).

In the configuration file, comments must be on a separate line from keyword statements.

You can use the following types of constants and operators in the configuration file.

Numerical and Alphabetical Constants

Constants may be represented as:

Mathematical Operators

You can use the following C operators. The operators are listed in order of precedence. All of these operators are evaluated from left to right:

!
* /
- +
< <= > >=
== !=
&&
||

Customizing the Global and Local Configuration Files

This section presents a step-by-step approach to configuring LoadLeveler. You do not have to perform the steps in the order that they appear here. Other keywords which are not specifically mentioned in any of these steps are discussed in "Step 14: Specify Additional Configuration File Keywords".

Step 1: Define LoadLeveler Administrators

Specify the following keyword:

LOADL_ADMIN = list of user names (required)

where list of user names is a blank-delimited list of those individuals who will have administrative authority. These users are able to invoke the administrator-only commands such as llctl, llfavorjob, and llfavoruser. They can also invoke the administrator-only GUI functions. For more information, see "Administrative Uses for the Graphical User Interface".

LoadLeveler administrators also receive mail describing problems that are encountered by the master daemon.

A user who is defined as an administrator on a machine has administrative privileges on that machine only, not on other machines. To be an administrator on all machines in the LoadLeveler cluster, either specify your user ID in the global configuration file with no entries in the local configuration files, or specify your user ID in every local configuration file that exists in the LoadLeveler cluster.

For example, to grant administrative authority to users bob and mary, enter the following in the configuration file:

LOADL_ADMIN = bob mary

Step 2: Define LoadLeveler Cluster Characteristics

You can use the following keywords to define the characteristics of the LoadLeveler cluster:

CUSTOM_METRIC = number

Specifies a machine's relative priority to run jobs. This is an arbitrary number which you can use in the MACHPRIO expression. If you specify neither CUSTOM_METRIC nor CUSTOM_METRIC_COMMAND, CUSTOM_METRIC = 1 is assumed. For more information, see "Step 6: Prioritize the Order of Executing Machines Maintained by the Negotiator".

CUSTOM_METRIC_COMMAND = command

Specifies an executable and any required arguments. The exit code of this command is assigned to CUSTOM_METRIC. If this command does not exit normally, CUSTOM_METRIC is assigned a value of 1. This command is forked every (POLLING_FREQUENCY * POLLS_PER_UPDATE) period.
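The way an exit code becomes the metric can be mimicked with the following Python sketch. It is not how LoadLeveler runs the command; the function name custom_metric is hypothetical, and treating "does not exit normally" as "could not be started or was ended by a signal" is an assumption of this sketch.

```python
import subprocess
import sys


def custom_metric(command):
    """Run command (a list of program and arguments) and return the
    value that would be assigned to CUSTOM_METRIC in this sketch."""
    try:
        completed = subprocess.run(command)
    except OSError:
        return 1                     # the command could not be started
    if completed.returncode < 0:
        return 1                     # on POSIX, negative means ended by a signal
    return completed.returncode      # a normal exit: the exit code is the metric


if __name__ == "__main__":
    # A stand-in command that exits with code 7.
    print(custom_metric([sys.executable, "-c", "import sys; sys.exit(7)"]))
```

Because exit codes on UNIX systems are limited to the range 0 to 255, a command used this way can only report a metric within that range.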

MACHINE_AUTHENTICATE = true | false

Specifies whether machine validation is performed. When set to true, LoadLeveler only accepts connections from machines specified in the administration file. When set to false, LoadLeveler accepts connections from any machine.

When set to true, every communication between LoadLeveler processes will verify that the sending process is running on a machine which is identified via a machine stanza in the administration file. The validation is done by capturing the address of the sending machine when the accept function call is issued to accept a connection. The gethostbyaddr function is called to translate the address to a name, and the name is matched with the list derived from the administration file.
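The validation steps described above (capture the peer address, translate it to a name, match the name against the administration file) can be sketched in Python. This is an illustration only: the function name is_known_machine is hypothetical, the list of known names stands in for the machine stanzas of the administration file, and matching aliases and ignoring case are assumptions of this sketch. The resolver is passed in as a parameter so that the logic can be exercised without a name server.

```python
import socket


def is_known_machine(peer_address, known_names, resolve=socket.gethostbyaddr):
    """Return True if the name that peer_address resolves to appears
    in known_names."""
    try:
        name, aliases, _addresses = resolve(peer_address)
    except OSError:
        return False                 # the address could not be translated
    candidates = {name.lower()}
    candidates.update(alias.lower() for alias in aliases)
    return any(known.lower() in candidates for known in known_names)


# In a server, peer_address would come from accept():
#     connection, (peer_address, peer_port) = listening_socket.accept()
```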

Choosing a Scheduler

This section discusses the types of schedulers that are available under LoadLeveler, and the keywords you use to define these schedulers.

Use the following keywords to define your scheduler:

SCHEDULER_API = YES | NO

where YES disables the default LoadLeveler scheduling algorithm. Specifying YES implies that you will use the job control API to communicate scheduling decisions made by an external scheduler to LoadLeveler. For more information, see "Job Control API".

Specify NO to run the default LoadLeveler scheduler.

SCHEDULER_TYPE = BACKFILL

where BACKFILL specifies the LoadLeveler Backfill scheduler. Note that when you specify this keyword:

Step 3: Define LoadLeveler Machine Characteristics

You can use the following keywords to define the characteristics of machines in the LoadLeveler cluster:

ARCH = string (required)

Indicates the standard architecture of the system. The architecture you specify here must be specified in the same format in the requirements and preferences statements in job command files. The administrator defines the character string for each architecture.

For example, to define a machine as a RISC System/6000, the keyword would look like:

  ARCH = RS6000

Class = { "class1" "class2" ... } | { "No_Class" }

where "class1" "class2" ... is a blank delimited list of class names. This keyword determines whether a machine will accept jobs of a certain job class. For parallel jobs, you must define a class for each task you want to run on a node.

You can specify a default_class in the default user stanza of the administration file to set a default class. If you don't, jobs will be assigned the class called No_Class.

In order for a LoadLeveler job to run on a machine, the machine must have a vacancy for the class of that job. If the machine is configured for only one No_Class job and a LoadLeveler job is already running there, then no further LoadLeveler jobs are started on that machine until the current job completes.

You can have a maximum of 1024 characters in the class statement. You cannot use allclasses as a class name, since this is a reserved LoadLeveler keyword.

You can assign multiple classes to the same machine by specifying the classes in the LoadLeveler configuration file (called LoadL_config) or in the local configuration file (called LoadL_config.local). The classes, themselves, should be defined in the administration file. See "Setting Up a Single Machine To Have Multiple Job Classes" and "Step 3: Specify Class Stanzas" for more information on classes.

Defining Classes - Examples

Example 1

This example defines the default class:

Class = { "No_Class" }

This is the default. The machine runs only one LoadLeveler job at a time, and that job must have either defaulted to or explicitly requested class No_Class. A LoadLeveler job with class CPU_bound, for example, would not be eligible to run here.

Example 2

This example specifies multiple classes. The machine will only run jobs that have either defaulted to or explicitly requested class No_Class. A maximum of two LoadLeveler jobs are permitted to run simultaneously on the machine if the MAX_STARTERS keyword is not specified. See "Step 4: Specify How Many Jobs a Machine Can Run" for more information on MAX_STARTERS.

Class = { "No_Class" "No_Class" }

Example 3

This example specifies multiple classes. The machine will only run a maximum of four LoadLeveler jobs that have either defaulted to, or explicitly requested No_Class, Small, Medium, or Large class. A LoadLeveler job with class IO_bound, for example, would not be eligible to run here.

Class = { "No_Class" "Small" "Medium" "Large" }

Example 4

This example specifies multiple classes. The machine will run only LoadLeveler jobs that have explicitly requested class B or D. Up to three LoadLeveler jobs may run simultaneously: two of class B and one of class D. A LoadLeveler job with class No_Class, for example, would not be eligible to run here.

Class = { "B" "B" "D" }
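The four examples above all follow one rule: each entry in the Class statement is one slot for that class, and a job can start only if a slot for its class is free. The following Python sketch models that rule. The function and variable names are hypothetical, and the sketch ignores MAX_STARTERS.

```python
from collections import Counter


def has_vacancy(class_statement, running_classes, job_class):
    """Return True if a job of job_class can start on the machine.

    class_statement: class names as listed in the Class statement.
    running_classes: classes of the jobs already running on the machine.
    """
    slots = Counter(class_statement)
    in_use = Counter(running_classes)
    return in_use[job_class] < slots[job_class]


example_4 = ["B", "B", "D"]
second_b_allowed = has_vacancy(example_4, ["B"], "B")
third_b_allowed = has_vacancy(example_4, ["B", "B"], "B")
no_class_allowed = has_vacancy(example_4, [], "No_Class")
```

With the Example 4 statement, a second class B job is accepted, a third is not, and a No_Class job is never accepted.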

Feature = {"string" ...}

where string is the (optional) characteristic to use to match jobs with machines.

You can specify unique characteristics for any machine using this keyword. When evaluating job submissions, LoadLeveler compares any required features specified in the job command file to those specified using this keyword. You can have a maximum of 1024 characters in the feature statement.

For example, if a machine has licenses for installed products ABC and XYZ, in the local configuration file you can enter the following:

Feature = {"abc" "xyz"}

When submitting a job that requires both of these products, you should enter the following in your job command file:

requirements = (Feature == "abc") && (Feature == "xyz")
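The matching described above can be pictured as a subset test: every feature the job requires must appear in the machine's Feature list. The Python sketch below models only requirements that are a conjunction (&&) of Feature equality tests, as in the example. The function name is hypothetical, and exact, case-sensitive string comparison is an assumption.

```python
def machine_satisfies(machine_features, required_features):
    """Return True if every required feature is defined on the machine."""
    return set(required_features) <= set(machine_features)


machine = ["abc", "xyz"]
job_ok = machine_satisfies(machine, ["abc", "xyz"])
job_rejected = machine_satisfies(machine, ["abc", "pqr"])
```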

START_DAEMONS = true | false

Specifies whether to start the LoadLeveler daemons on the node. When true, the daemons are started.

In most cases, you will probably want to set this keyword to true. An example of why this keyword would be set to false is if you want to run the daemons on most of the machines in the cluster but some individual users with their own local configuration files do not want their machines to run the daemons. The individual users would modify their local configuration files and set this keyword to false. Because the global configuration file has the keyword set to true, their individual machines would still be able to participate in the LoadLeveler cluster.

Also, to define the machine as strictly a submit-only machine, set this keyword to false. For more information, see the submit-only section.

SCHEDD_RUNS_HERE = true | false

Specifies whether the schedd daemon runs on the host. If you do not want to run the schedd daemon, specify false.

To define the machine as an executing machine only, set this keyword to false. For more information, see the submit-only section.

STARTD_RUNS_HERE = true | false

Specifies whether the startd daemon runs on the host. If you do not want to run the startd daemon, specify false.

X_RUNS_HERE = true | false

Set X_RUNS_HERE to true if you want to start the keyboard daemon.

Step 4: Specify How Many Jobs a Machine Can Run

To specify how many jobs a machine can run, you need to take into consideration both the MAX_STARTERS keyword, which is described in this section, and the Class statement, which is mentioned here and described in more detail in "Step 3: Define LoadLeveler Machine Characteristics".

The syntax for MAX_STARTERS is:

MAX_STARTERS = number

where number specifies the maximum number of tasks that can run simultaneously on a machine. In this case, a task can be a serial job step, a parallel task, or an instance of the PVM daemon (PVMD). If not specified, the default is the number of elements in the Class statement. MAX_STARTERS defines the number of initiators on the machine (the number of tasks that can be initiated from a startd).

For example, if the configuration file contains these statements:

Class = { "A" "B" "B" "C"}
MAX_STARTERS = 2

the machine can run a maximum of two LoadLeveler jobs simultaneously. The possible combinations of LoadLeveler jobs are:

  • A and B
  • A and C
  • B and B
  • B and C

If this keyword is specified in conjunction with a Class statement, the maximum number of jobs that can be run is equal to the lower of the two numbers. For example, if:

MAX_STARTERS = 2
Class = { "class_a" }

then the maximum number of job steps that can be run is one (the Class statement above defines one class).

If you specify the MAX_STARTERS keyword without specifying a Class statement, by default one class still exists (called No_Class). Therefore, the maximum number of jobs that can be run when you do not specify a Class statement is one.

If this keyword is not defined in either the global configuration file or the local configuration file, the maximum number of jobs that the machine can run is equal to the number of classes in the Class statement.
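The rules in this step reduce to taking the lower of MAX_STARTERS and the number of entries in the Class statement, with a single No_Class entry assumed when no Class statement is given. The Python sketch below expresses that and enumerates which class combinations can fill a machine. The function names are hypothetical and the sketch treats every job as one task.

```python
from itertools import combinations


def max_simultaneous_jobs(class_statement=None, max_starters=None):
    """Return the job limit implied by the Class statement and MAX_STARTERS."""
    classes = class_statement if class_statement else ["No_Class"]
    if max_starters is None:
        # MAX_STARTERS defaults to the number of Class entries.
        return len(classes)
    return min(max_starters, len(classes))


def full_load_combinations(class_statement, max_starters):
    """List the distinct sets of job classes that can run at the limit."""
    limit = max_simultaneous_jobs(class_statement, max_starters)
    picks = combinations(class_statement, limit)
    return sorted({tuple(sorted(pick)) for pick in picks})


combos = full_load_combinations(["A", "B", "B", "C"], 2)
one_class_limit = max_simultaneous_jobs(["class_a"], 2)
no_class_statement_limit = max_simultaneous_jobs(None, 5)
```

For the statement Class = { "A" "B" "B" "C"} with MAX_STARTERS = 2, the enumeration gives A with B, A with C, B with B, and B with C.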

Step 5: Prioritize the Queue Maintained by the Negotiator

Each job submitted to LoadLeveler is assigned a system priority number, based on the evaluation of the SYSPRIO keyword expression in the configuration file of the central manager. The LoadLeveler system priority number is assigned when the central manager adds the new job to the queue of jobs eligible for dispatch. Once assigned, the system priority number for a job is never changed (unless jobs for a user swap their SYSPRIO, or NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL is not zero). Jobs assigned higher SYSPRIO numbers are considered for dispatch before jobs with lower numbers. See "How Does a Job's Priority Affect Dispatching Order?" for more information on job priorities.

You can use the following keywords to define the SYSPRIO expression:

ClassSysprio
The priority for the class of the job step, defined in the class stanza in the administration file. The default is 0.

GroupQueuedJobs
The number of job steps associated with a LoadLeveler group which are either running or queued. (That is, job steps which are in one of these states: Running, Starting, Pending, or Idle.)

GroupRunningJobs
The number of job steps for the LoadLeveler group which are in one of these states: Running, Starting, or Pending.

GroupSysprio
The priority for the group of the job step, defined in the group stanza in the administration file. The default is 0.

GroupTotalJobs
The total number of job steps associated with this LoadLeveler group. Total job steps are all job steps reported by the llq command.

QDate
The difference between the UNIX date when the job step enters the queue and the UNIX date when the negotiator starts up.

UserPrio
The user-defined priority of the job step, specified in the job command file with the user_priority keyword. The default is 0.

UserQueuedJobs
The number of job steps either running or queued for the user. (That is, job steps which are in one of these states: Running, Starting, Pending, or Idle.)

UserRunningJobs
The number of job steps for the user which are in one of these states: Running, Starting, or Pending.

UserSysprio
The priority of the user who submitted the job step, defined in the user stanza in the administration file. The default is 0.

UserTotalJobs
The total number of job steps associated with this user. Total job steps are all job steps reported by the llq command.

Usage Notes for the SYSPRIO Keyword

Using the SYSPRIO Keyword - Examples

Example 1

This example creates a FIFO job queue based on submission time:

SYSPRIO : 0 - (QDate)

Example 2

This example accounts for Class, User, and Group system priorities:

SYSPRIO : (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)

Example 3

This example orders the queue based on the number of jobs a user is currently running. The user who has the fewest jobs running is first in the queue. You should set NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL in conjunction with this SYSPRIO expression.

SYSPRIO : 0 - UserRunningJobs
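To see how these expressions order a queue, the Python sketch below evaluates the Example 1 and Example 2 expressions for two hypothetical job steps and sorts them so that higher SYSPRIO values come first. The job names and values are invented for illustration, and QDate is assumed to be a number of seconds.

```python
def sysprio_fifo(job):
    """Example 1: SYSPRIO : 0 - (QDate)"""
    return 0 - job["QDate"]


def sysprio_weighted(job):
    """Example 2: class, user, and group priorities minus QDate."""
    return (
        (job["ClassSysprio"] * 100)
        + (job["UserSysprio"] * 10)
        + (job["GroupSysprio"] * 1)
        - job["QDate"]
    )


def dispatch_order(jobs, sysprio):
    """Return job names with the highest SYSPRIO first."""
    return [job["name"] for job in sorted(jobs, key=sysprio, reverse=True)]


jobs = [
    {"name": "early", "QDate": 100, "ClassSysprio": 0, "UserSysprio": 0, "GroupSysprio": 0},
    {"name": "late", "QDate": 160, "ClassSysprio": 1, "UserSysprio": 0, "GroupSysprio": 0},
]

fifo_order = dispatch_order(jobs, sysprio_fifo)
weighted_order = dispatch_order(jobs, sysprio_weighted)
```

Under the FIFO expression the earlier job step is first. Under the weighted expression, a ClassSysprio of 1 contributes 100, which outweighs the 60 extra units of QDate, so the later job step moves ahead.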

Step 6: Prioritize the Order of Executing Machines Maintained by the Negotiator

Each executing machine is assigned a machine priority number, based on the evaluation of the MACHPRIO keyword expression in the configuration file of the central manager. The LoadLeveler machine priority number is updated every time the central manager updates its machine data. Machines assigned higher MACHPRIO numbers are considered to run jobs before machines with lower numbers. For example, a machine with a MACHPRIO of 10 is considered to run a job before a machine with a MACHPRIO of 5. Similarly, a machine with a MACHPRIO of -2 would be considered to run a job before a machine with a MACHPRIO of -3.

Note that the MACHPRIO keyword is valid only on the machine where the central manager is running. Using this keyword in a local configuration file has no effect.

When you use a MACHPRIO expression that is based on load average, the machine may be temporarily ordered later in the list immediately after a job is scheduled to that machine. This is because the negotiator adds a compensating factor to the startd machine's load average every time the negotiator assigns a job. For more information, see the NEGOTIATOR_LOADAVG_INCREMENT keyword.

You can use the following keywords in the MACHPRIO expression:

LoadAvg
The Berkeley one-minute load average of the machine, reported by startd.

Cpus
The number of processors of the machine, reported by startd.

Speed
The relative speed of the machine, defined in a machine stanza in the administration file. The default is 1.

Memory
The size of real memory in megabytes of the machine, reported by startd.

VirtualMemory
The size of available swap space in kilobytes of the machine, reported by startd.

Disk
The size of free disk space in kilobytes on the filesystem where the executables reside.

CustomMetric
Allows you to set a relative priority number for one or more machines, based on the value of the CUSTOM_METRIC keyword. (See "Example 4" for more information.)

MasterMachPriority
A value that is equal to 1 for nodes which are master nodes (those with master_node_exclusive = true); this value is equal to 0 for nodes which are not master nodes. Assigning a high priority to master nodes may help job scheduling performance for parallel jobs which require master node features.

Using the MACHPRIO Keyword - Examples

Example 1

This example orders machines by the Berkeley one-minute load average.

MACHPRIO : 0 - (LoadAvg)

Therefore, if LoadAvg equals .7, this example would read:

MACHPRIO : 0 - (.7)

The MACHPRIO would evaluate to -.7.

Example 2

This example orders machines by the Berkeley one-minute load average normalized for machine speed:

MACHPRIO : 0 - (1000 * (LoadAvg / (Cpus * Speed)))

Therefore, if LoadAvg equals .7, Cpus equals 1, and Speed equals 2, this example would read:

MACHPRIO : 0 - (1000 * (.7 / (1 * 2)))

This example further evaluates to:

MACHPRIO : 0 - (350)

The MACHPRIO would evaluate to -350.

Notice that if the speed of the machine were increased to 3, the equation would read:

MACHPRIO : 0 - (1000 * (.7 / (1 * 3)))

The MACHPRIO would evaluate to approximately -233. Therefore, as the speed of the machine increases, the MACHPRIO also increases.

Example 3

This example orders machines accounting for real memory and available swap space (remembering that Memory is in Mbytes and VirtualMemory is in Kbytes):

MACHPRIO : 0 - (10000 * (LoadAvg / (Cpus * Speed))) +
(10 * Memory) + (VirtualMemory / 1000)
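As a worked illustration of Example 3, the Python sketch below evaluates the expression for two hypothetical machines and ranks them. The machine names and values are invented, and the function is only a transcription of the expression above.

```python
def machprio_example_3(load_avg, cpus, speed, memory_mb, virtual_memory_kb):
    """Evaluate the Example 3 MACHPRIO expression for one machine."""
    return (
        0
        - (10000 * (load_avg / (cpus * speed)))
        + (10 * memory_mb)
        + (virtual_memory_kb / 1000)
    )


machines = {
    "node_a": dict(load_avg=0.7, cpus=1, speed=2, memory_mb=512, virtual_memory_kb=1_000_000),
    "node_b": dict(load_avg=0.1, cpus=1, speed=1, memory_mb=128, virtual_memory_kb=500_000),
}

# Machines with higher MACHPRIO values are considered first.
ranking = sorted(
    machines, key=lambda name: machprio_example_3(**machines[name]), reverse=True
)
```

Here node_a evaluates to about 2620 (-3500 + 5120 + 1000) and node_b to about 780 (-1000 + 1280 + 500), so node_a is considered first even though it is more heavily loaded. The weights in the expression decide how much memory can compensate for load.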

Example 4

This example sets a relative machine priority based on the value of the CUSTOM_METRIC keyword.

MACHPRIO : CustomMetric

To do this, you must specify a value for the CUSTOM_METRIC keyword or the CUSTOM_METRIC_COMMAND keyword in either the LoadL_config.local file of a machine or in the global LoadL_config file. To assign the same relative priority to all machines, specify the CUSTOM_METRIC keyword in the global configuration file. For example:

CUSTOM_METRIC = 5

You can override this value for an individual machine by specifying a different value in that machine's LoadL_config.local file.

Example 5

This example gives master nodes the highest priority:

MACHPRIO : (MasterMachPriority * 10000)

Step 7: Manage a Job's Status Using Control Expressions

You can control running jobs by using five control functions as Boolean expressions in the configuration file. These functions are useful primarily for serial jobs. You define the expressions, using normal C conventions, with the following functions:

START
SUSPEND
CONTINUE
VACATE
KILL

The expressions are evaluated for each job running on a machine using both the job and machine attributes. Some jobs running on a machine may be suspended while others are allowed to continue.

The START expression is evaluated twice: once to see if the machine can accept jobs to run, and a second time to see if the specific job can be run on the machine. The other expressions are evaluated after the jobs have been dispatched and, in some cases, are already running.

When evaluating the START expression to determine if the machine can accept jobs, Class != { "Z" } evaluates to true only if Z is not in the class definition. This means that if two different classes are defined on a machine, Class != { "Z" } (where Z is one of the defined classes) always evaluates to false when specified in the START expression and, therefore, the machine will not be considered to start jobs.

START: expression that evaluates to T or F (true or false)

Determines whether a machine can run a LoadLeveler job. When the expression evaluates to T, LoadLeveler considers dispatching a job to the machine.

When you use a START expression that is based on the CPU load average, the negotiator may evaluate the expression as F even though the load average indicates the machine is Idle. This is because the negotiator adds a compensating factor to the startd machine's load average every time the negotiator assigns a job. For more information, see the NEGOTIATOR_LOADAVG_INCREMENT keyword.

SUSPEND: expression that evaluates to T or F (true or false)

Determines whether running jobs should be suspended. When T, LoadLeveler temporarily suspends jobs currently running on the machine. Suspended LoadLeveler jobs will either be continued or vacated. This keyword is not supported for parallel jobs.

CONTINUE: expression that evaluates to T or F (true or false)

Determines whether suspended jobs should continue execution. When T, suspended LoadLeveler jobs resume execution on the machine.

VACATE: expression that evaluates to T or F (true or false)

Determines whether suspended jobs should be vacated. When T, suspended LoadLeveler jobs are removed from the machine and placed back into the queue (provided you specify restart=yes in the job command file). If a checkpoint was taken, the job restarts from the checkpoint. Otherwise, the job restarts from the beginning.

KILL: expression that evaluates to T or F (true or false)

Determines whether or not vacated jobs should be killed and replaced in the queue. It is used to remove a job that is taking too long to vacate. When T, vacated LoadLeveler jobs are removed from the machine with no attempt to take checkpoints.

Typically, machine load average, keyboard activity, time intervals, and job class are used within these various expressions to dynamically control job execution.

How Control Expressions Affect Jobs

After LoadLeveler selects a job for execution, the job can be in any of several states. Figure 30 shows how the control expressions can affect the state a job is in. The rectangles represent job or daemon states, and the diamonds represent the control expressions.

Figure 30. How Control Expressions Affect Jobs


Criteria used to determine when a LoadLeveler job will enter Start, Suspend, Continue, Vacate, and Kill states are defined in the LoadLeveler configuration files and may be different for each machine in the cluster. They may be modified to meet local requirements.
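The keyword descriptions in this step suggest a small state machine, sketched below in Python. This is an interpretation of the text only, since the figure is not reproduced here. The state names are hypothetical, and checking CONTINUE before VACATE for a suspended job is an assumption.

```python
def next_state(state, exprs):
    """Return the next job state given the current control expression values.

    exprs maps START, SUSPEND, CONTINUE, VACATE, and KILL to booleans.
    """
    if state == "idle":
        return "running" if exprs["START"] else "idle"
    if state == "running":
        return "suspended" if exprs["SUSPEND"] else "running"
    if state == "suspended":
        if exprs["CONTINUE"]:
            return "running"
        if exprs["VACATE"]:
            return "vacating"
        return "suspended"
    if state == "vacating":
        # KILL removes a job that is taking too long to vacate.
        return "killed" if exprs["KILL"] else "vacating"
    return state


all_false = dict(START=False, SUSPEND=False, CONTINUE=False, VACATE=False, KILL=False)

# Walk one job through start, suspend, vacate, and kill.
trace = ["idle"]
for change in ({"START": True}, {"SUSPEND": True}, {"VACATE": True}, {"KILL": True}):
    trace.append(next_state(trace[-1], {**all_false, **change}))
```

Because the expressions are evaluated per job, two jobs on the same machine can be in different states at the same time.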

Step 8: Define Job Accounting

LoadLeveler provides accounting information on completed LoadLeveler jobs. For detailed information on this function, refer to Chapter 7. "Gathering Job Accounting Data".

The following keywords allow you to control accounting functions:

ACCT = flag

The available flags are:

A_ON
Turns accounting data recording on. If specified without the A_DETAIL flag, the following is recorded:
  • The total amount of CPU time consumed by the entire job
  • The maximum memory consumption of all tasks (or nodes).

A_OFF
Turns accounting data recording off. This is the default.

A_VALIDATE
Turns account validation on.

A_DETAIL
Enables extended accounting. Using this flag causes LoadLeveler to record detailed resource consumption by machine and by events for each job step. This flag also enables the -x flag of the llq command, permitting users to view resource consumption for active jobs.

For example:

ACCT = A_ON A_DETAIL

This example specifies that accounting is turned on, that extended accounting data is collected, and that the -x flag of the llq command is enabled.

ACCT_VALIDATION = $(BIN)/llacctval (optional)

Keyword used to identify the executable that is called to perform account validation. You can replace the llacctval executable with your own validation program by specifying your program in this keyword.

GLOBAL_HISTORY = $(SPOOL) (optional)

Keyword used to identify the directory that will contain the global history files produced by the llacctmrg command when no directory is specified as a command argument.

For example, the following section of the configuration file specifies that the accounting function is turned on. It also identifies the module used to perform account validation and the directory containing the global history files:

ACCT                = A_ON A_VALIDATE
ACCT_VALIDATION     = $(BIN)/llacctval
GLOBAL_HISTORY      = $(SPOOL)

Step 9: Specify Alternate Central Managers

In one of the machine stanzas in the administration file, you specified that the machine would serve as the central manager. A problem such as a network communication failure or a software or hardware failure can make this central manager unusable. In such cases, the other machines in the LoadLeveler cluster believe that the central manager machine is no longer operating. To remedy this situation, you can assign one or more alternate central managers in the machine stanza to take control.

The following machine stanza example defines the machine deep_blue as an alternate central manager:

#
deep_blue:  type=machine
central_manager = alt

If the primary central manager fails, the alternate central manager then becomes the central manager. The alternate central manager is chosen based upon the order in which its respective machine stanza appears in the administration file.

When an alternate becomes the central manager, jobs will not be lost, but it may take a few minutes for all of the machines in the cluster to check in with the new central manager. As a result, job status queries may be incorrect for a short time.

When you define alternate central managers, you should set the following keywords in the configuration file:

CENTRAL_MANAGER_HEARTBEAT_INTERVAL = number

where number is the amount of time in seconds that defines how frequently primary and alternate central managers communicate with each other.

The default is 300 seconds or 5 minutes.

CENTRAL_MANAGER_TIMEOUT = number

where number is the number of heartbeat intervals that an alternate central manager will wait without hearing from the primary central manager before declaring that the primary central manager is not operating.

The default is 6.

In the following example, the alternate central manager will wait for 30 intervals, where each interval is 45 seconds:

# Set a 45 second interval
CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 45
# Set the number of intervals to wait
CENTRAL_MANAGER_TIMEOUT = 30
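The two keywords multiply to give the time an alternate central manager waits before declaring the primary not operating. The short Python sketch below works through the arithmetic for the defaults and for the example above. The function name is hypothetical.

```python
def failover_delay_seconds(heartbeat_interval=300, timeout_intervals=6):
    """Seconds an alternate waits without hearing from the primary.

    heartbeat_interval models CENTRAL_MANAGER_HEARTBEAT_INTERVAL and
    timeout_intervals models CENTRAL_MANAGER_TIMEOUT.
    """
    return heartbeat_interval * timeout_intervals


default_delay = failover_delay_seconds()
example_delay = failover_delay_seconds(heartbeat_interval=45, timeout_intervals=30)
```

With the defaults, the wait is 1800 seconds (30 minutes). With the example settings it is 1350 seconds (22.5 minutes), but the managers exchange heartbeats more often.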

For more information on central manager backup, refer to "What Happens if the Central Manager Isn't Operating?"

Step 10: Specify Where Files and Directories are Located

The configuration file provided with LoadLeveler specifies default locations for all of the files and directories. You can modify their locations using the following keywords. Keep in mind that the LoadLeveler installation process installs files in these directories and these files may be periodically cleaned up. Therefore, you should not keep any files that do not belong to LoadLeveler in these directories.

To specify the location of each of the following, specify the keywords listed under it:

Administration File

ADMIN_FILE = pathname (required)

points to the administration file containing user, class, group, machine, and adapter stanzas. For example,
ADMIN_FILE = $(tilde)/admin_file

Local Configuration File

LOCAL_CONFIG = pathname

defines the pathname of the optional local configuration file containing information specific to a node in the LoadLeveler network. If you are using a distributed file system like NFS, some examples are:
LOCAL_CONFIG = $(tilde)/$(host).LoadL_config.local
LOCAL_CONFIG = $(tilde)/LoadL_config.$(host).$(domain)
LOCAL_CONFIG = $(tilde)/LoadL_config.local.$(hostname)

If you are using a local file system, an example is:

LOCAL_CONFIG = /var/LoadL/LoadL_config.local

See "LoadLeveler Variables" for information about the tilde, host, and domain variables.
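The LOCAL_CONFIG examples rely on $(name) references being replaced by variable values. The Python sketch below shows one way such substitution could work. It is an illustration only: the function name is hypothetical, the variable values are placeholders, and raising an error for an undefined variable is an assumption rather than documented LoadLeveler behavior.

```python
import re

VARIABLE_REFERENCE = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(value, variables):
    """Replace each $(name) in value with variables[name]."""

    def lookup(match):
        # Raises KeyError when the variable is not defined.
        return variables[match.group(1)]

    return VARIABLE_REFERENCE.sub(lookup, value)


# Placeholder values chosen for illustration.
variables = {"tilde": "/u/loadl", "host": "node05", "domain": "example.com"}

per_host = expand("$(tilde)/$(host).LoadL_config.local", variables)
per_domain = expand("$(tilde)/LoadL_config.$(host).$(domain)", variables)
```

With a shared file system, a pattern that includes $(host) lets one global configuration file point every machine at its own local configuration file.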


Local Directory

The following subdirectories reside in the local directory. It is possible that the local directory and LoadLeveler's home directory are the same.

EXECUTE = local directory/execute (required)

defines the local directory to store the executables of jobs submitted by other machines.

LOG = local directory/log (required)

defines the local directory to store log files. It is not necessary to keep all the log files created by the various LoadLeveler daemons and programs in one directory but you will probably find it convenient.

SPOOL = local directory/spool (required)

Defines the local directory where LoadLeveler keeps the local job queue and checkpoint files, as well as:

HISTORY = $(SPOOL)/history (required)

defines the pathname where a file containing the history of local LoadLeveler jobs is kept.

Release Directory

RELEASEDIR = release directory (required)

defines the directory where all the LoadLeveler software resides. The following subdirectories are created during installation and they reside in the release directory. You can change their locations.

BIN = $(RELEASEDIR)/bin (required)

defines the directory where LoadLeveler binaries are kept.

LIB = $(RELEASEDIR)/lib (required)

defines the directory where LoadLeveler libraries are kept.

NQS_DIR = NQS directory (optional)

defines the directory where NQS commands qsub, qstat, and qdel reside. The default is /usr/bin.

Step 11: Record and Control Log Files

The LoadLeveler daemons and processes keep log files according to the specifications in the configuration file. A number of keywords are used to describe where LoadLeveler maintains the logs and how much information is recorded in each log. These keywords, shown in Table 9, are repeated in similar form to specify the pathname of the log file, its maximum length, and the debug flags to be used.

"Controlling Debugging Output" describes the events that can be reported through logging controls.

Table 9. Log Control Statements
Daemon/Process  Log File (required)     Max Length (required)        Debug Control (required)
                (See note (PAT))        (See note (MXL))             (See note (FLA))

Master          MASTER_LOG = path       MAX_MASTER_LOG = bytes       MASTER_DEBUG = flags
Schedd          SCHEDD_LOG = path       MAX_SCHEDD_LOG = bytes       SCHEDD_DEBUG = flags
Startd          STARTD_LOG = path       MAX_STARTD_LOG = bytes       STARTD_DEBUG = flags
Starter         STARTER_LOG = path      MAX_STARTER_LOG = bytes      STARTER_DEBUG = flags
Negotiator      NEGOTIATOR_LOG = path   MAX_NEGOTIATOR_LOG = bytes   NEGOTIATOR_DEBUG = flags
Kbdd            KBDD_LOG = path         MAX_KBDD_LOG = bytes         KBDD_DEBUG = flags

Notes:

  1. (PAT) When coding the path for the log files, it is not necessary that all LoadLeveler daemons keep their log files in the same directory. However, you will probably find it a convenient arrangement.

  2. (MXL) There is a maximum length, in bytes, beyond which the various log files cannot grow. Each file is allowed to grow to the specified length and is then saved to an .old file. The .old files are overwritten each time the log is saved; thus, the maximum space devoted to logging for any one program will be twice the maximum length of its log file. The default length is 64KB.

    You can also specify that the log file be started anew with every invocation of the daemon by setting the TRUNC statement to true as follows:

    TRUNC_MASTER_LOG_ON_OPEN = true|false
    TRUNC_STARTD_LOG_ON_OPEN = true|false
    TRUNC_SCHEDD_LOG_ON_OPEN = true|false
    TRUNC_KBDD_LOG_ON_OPEN = true|false
    TRUNC_STARTER_LOG_ON_OPEN = true|false
    TRUNC_NEGOTIATOR_LOG_ON_OPEN = true|false

  3. LoadLeveler creates temporary log files used by the starter daemon. These files are used for synchronization purposes. When a job starts, a StarterLog.pid file is created. When the job ends, this file is appended to the StarterLog file.

  4. (FLA) Normally, only those who are installing or debugging LoadLeveler will need to use the debug flags, described in "Controlling Debugging Output". The default error logging, obtained by leaving the right side of the debug control statement null, will be sufficient for most installations.
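The maximum-length behavior in the notes above (grow to the limit, save to an .old file, overwrite the previous .old file) can be sketched in Python as follows. This is a minimal illustration, not the LoadLeveler implementation. The function name, the .old naming as the log path plus a suffix, and rolling over before a write that would exceed the limit are assumptions.

```python
import os


def append_log(path, message, max_bytes=64 * 1024):
    """Append message to the log at path, rolling over to path + '.old'.

    max_bytes models the MAX_*_LOG setting; 64KB is the documented default.
    """
    size = os.path.getsize(path) if os.path.exists(path) else 0
    if size > 0 and size + len(message.encode("utf-8")) > max_bytes:
        # os.replace overwrites any earlier .old file, so at most two
        # files (about twice max_bytes) exist for each log.
        os.replace(path, path + ".old")
    with open(path, "a", encoding="utf-8") as log:
        log.write(message)
```

A caller would invoke append_log once per log record, for example append_log("/tmp/MasterLog", "daemon started\n") with a path of its own choosing.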

Controlling Debugging Output

You can control the level of debugging output logged by LoadLeveler programs. The following flags are presented here for your information, though they are used primarily by IBM personnel for debugging purposes:

D_ACCOUNT
Logs accounting information about processes. If used, it may slow down the network.
D_CKPT
Logs various steps in the checkpointing process. Logs calls to read and write by the xdr routines.
D_DAEMON
Logs information regarding basic daemon set up and operation, including information on the communication between daemons.
D_DBX
Bypasses certain signal settings to permit debugging of the processes as they execute in certain critical regions.
D_EXPR
Logs steps in parsing and evaluating control expressions.
D_FULLDEBUG
Logs details about most actions performed by each daemon but doesn't log as much activity as setting all the flags.
D_JOB
Logs job requirements and preferences when making decisions regarding whether a particular job should run on a particular machine.
D_LOAD
Displays the load average on the startd machine.
D_MACHINE
Logs machine control functions and variables when making decisions regarding starting, suspending, resuming, and aborting remote jobs.
D_NEGOTIATE
Displays the process of looking for a job to run in the negotiator. It only pertains to this daemon.
D_NQS
Provides more information regarding the processing of NQS files.
D_PROC
Logs information about jobs being started remotely such as the number of bytes fetched and stored for each job.
D_STANZAS
Displays internal information about the parsing of the administration file.
D_SCHEDD
Displays how the schedd works internally.
D_STARTD
Displays how the startd works internally.
D_STARTER
Displays how the starter works internally.
D_THREAD
Displays the ID of the thread producing the log message. The thread ID is displayed immediately following the date and time. This flag is useful for debugging threaded daemons.
D_XDR
Logs information regarding External Data Representation (XDR) communication protocols.

For example,

SCHEDD_DEBUG = D_CKPT  D_XDR

causes the schedd daemon to log information about checkpointing user jobs and about the XDR messages it exchanges with other LoadLeveler daemons. These flags will primarily be of interest to LoadLeveler implementers and debuggers.

Step 12: Define Network Characteristics

A port number is an integer that identifies the port used to connect to the specified daemon. You can define these port numbers in the configuration file or the /etc/services file, or you can accept the defaults. LoadLeveler first looks in the configuration file for these port numbers. If the port number is in the configuration file and is valid, this value is used. If it is an invalid value, the default value is used.

If LoadLeveler does not find the value in the configuration file, it looks in the /etc/services file. If the value is not found in this file, the default is used.

The configuration file keywords associated with port numbers are the following:

CLIENT_TIMEOUT = number

where number specifies the maximum time, in seconds, that a LoadLeveler daemon waits for a response over TCP/IP from a process. If the waiting time exceeds the specified amount, the daemon tries again to communicate with the process. The default is 30 seconds. In general, you should use this default setting unless you are experiencing delays due to an excessively loaded network. If so, you should try increasing this value. CLIENT_TIMEOUT is used by all LoadLeveler daemons.

MASTER_STREAM_PORT = port number

The default is 9616.

NEGOTIATOR_STREAM_PORT = port number

The default is 9614.

SCHEDD_STREAM_PORT = port number

The default is 9605.

STARTD_STREAM_PORT = port number

The default is 9611.

COLLECTOR_DGRAM_PORT = port number

The default is 9613. This keyword is used by the negotiator daemon.

STARTD_DGRAM_PORT = port number

The default is 9615.

MASTER_DGRAM_PORT = port number

The default is 9617.

As stated earlier, if LoadLeveler does not find the value in the configuration file, it looks in the /etc/services file. If the value is not found in this file, the default is used. The following is an example of this file illustrating the port numbers:

LoadL_master     9616/tcp   # Master port number for stream port
LoadL_negotiator 9614/tcp   # Negotiator port number
LoadL_schedd     9605/tcp   # Schedd port number for stream port
LoadL_startd     9611/tcp   # Startd port number for stream port
LoadL_negotiator 9613/udp   # Negotiator port number for dgram port
LoadL_startd     9615/udp   # Startd port number for dgram port
LoadL_master     9617/udp   # Master port number for dgram port
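To check which LoadLeveler entries a services file defines, lines in the format shown above can be read with a short script. The sketch below is illustrative only: parse_services is a hypothetical helper, and it handles just the "name port/protocol # comment" layout used in this example.

```python
SAMPLE = """\
LoadL_master     9616/tcp   # Master port number for stream port
LoadL_negotiator 9614/tcp   # Negotiator port number
LoadL_schedd     9605/tcp   # Schedd port number for stream port
LoadL_startd     9611/tcp   # Startd port number for stream port
LoadL_negotiator 9613/udp   # Negotiator port number for dgram port
LoadL_startd     9615/udp   # Startd port number for dgram port
LoadL_master     9617/udp   # Master port number for dgram port
"""


def parse_services(text):
    """Map (service name, protocol) to a port number."""
    ports = {}
    for line in text.splitlines():
        # Drop the trailing comment, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        port, _, protocol = fields[1].partition("/")
        if port.isdigit() and protocol:
            ports[(fields[0], protocol)] = int(port)
    return ports


if __name__ == "__main__":
    for (name, protocol), port in sorted(parse_services(SAMPLE).items()):
        print(name, protocol, port)
```

The same service name appears once per protocol, which is why the lookup key is the (name, protocol) pair rather than the name alone.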

Step 13: Enable Checkpointing

This section tells you how to set up checkpointing for jobs. For more information on the job command file keywords mentioned here, see "Job Command File Keywords". To enable checkpointing for parallel jobs, you must use the APIs provided with the Parallel Environment (PE) program. For information on parallel checkpointing, see IBM Parallel Environment for AIX: Operation and Use, Volume 1.

Checkpointing is a method of periodically saving the state of a job so that if the job does not complete it can be restarted from the saved state. You can checkpoint both serial and parallel jobs.

You can specify the following types of checkpointing:

user initiated
The user's application program determines when the checkpoint is taken. This type of checkpointing is available to both serial and parallel jobs.

system initiated
The checkpoint is taken at administrator-defined intervals. This type of checkpointing is available only to serial jobs.

At checkpoint time, a checkpoint file is created, by default, on the executing machine and stored on the scheduling machine. You can control where the file is created and stored by using the CHKPT_FILE and CHKPT_DIR environment variables, which are described in "Set the Appropriate Environment Variables". The checkpoint file contains the program's data segment, stack, heap, register contents, signal state and the states of the open files at the time of the checkpoint. The checkpoint file is often much larger in size than the executable.

When a job is vacated, the most recent checkpoint file taken before the job was vacated is used to restart the job when it is scheduled to run on a new machine. Note that a vacating job may be killed by LoadLeveler if the job takes too long to write its checkpoint file. This occurs only when a job is vacated by the executing machine after the job's VACATE expression evaluates to TRUE. See "Step 7: Manage a Job's Status Using Control Expressions" for more information on the VACATE and KILL expressions.

If the executing machine fails, then when the machine restarts LoadLeveler reschedules the job, which restores its state from the most recent checkpoint file. LoadLeveler waits for the original executing machine to restart before scheduling the job to run on another machine in order to ensure that only one copy of the job will run.

Planning Considerations for Checkpointing Jobs

Review the following guidelines before you submit a checkpointing job:

Set the Appropriate Environment Variables

This section discusses the CHKPT_STATE, CHKPT_FILE, and CHKPT_DIR environment variables.

The CHKPT_STATE environment variable allows you to enable and disable checkpointing. CHKPT_STATE can be set to the following:

enable
Enables checkpointing.

restart
Restarts the executable from an existing checkpoint file.

If you set checkpoint=no in your job command file, no checkpoints are taken, regardless of the value of the CHKPT_STATE environment variable. See "checkpoint" for more information.

The CHKPT_FILE and CHKPT_DIR environment variables help you manage your checkpoint files. For parallel jobs, you must specify at least one of these variables in order to designate the location of the checkpoint file. For serial jobs, if you do not specify either of these variables, LoadLeveler manages your checkpoint files. LoadLeveler stores the checkpoint file in its working directories and deletes the file as soon as the job terminates (that is, when the job exits the LoadLeveler system). If your job terminates abnormally, there is no checkpoint file from which LoadLeveler can restart the job. When you resubmit the job, it will start running from the beginning.

To avoid this problem, use CHKPT_FILE and CHKPT_DIR to control where your checkpoint file is stored. CHKPT_DIR specifies the directory where it is stored, and CHKPT_FILE specifies the checkpoint file name. (You can use just CHKPT_FILE provided you specify a full path name. Also, you can use just CHKPT_DIR; in this case the checkpoint file is copied to the directory you specify with a file name of executable.chkpt.) You can use these variables to have your checkpoint file written to the file system of your choice. This allows you to resubmit your job and have it restart from the last checkpoint file, since the file will not be erased if your job is terminated. If your job completes normally, the checkpoint library deletes all checkpoint files associated with the job.
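The combinations of CHKPT_DIR and CHKPT_FILE described above can be summarized in a short sketch. The function name is hypothetical, and three behaviors are assumptions not stated in this section: a full path in CHKPT_FILE takes precedence over CHKPT_DIR, a relative CHKPT_FILE with no CHKPT_DIR is treated as an error, and "executable" in executable.chkpt means the base name of the program.

```python
import posixpath


def checkpoint_path(executable, chkpt_dir=None, chkpt_file=None):
    """Return the checkpoint file location, or None when neither variable
    is set (for serial jobs LoadLeveler then manages the file itself)."""
    if chkpt_file and posixpath.isabs(chkpt_file):
        # CHKPT_FILE alone is enough when it is a full path name.
        return chkpt_file
    if chkpt_dir and chkpt_file:
        # Directory from CHKPT_DIR, file name from CHKPT_FILE.
        return posixpath.join(chkpt_dir, chkpt_file)
    if chkpt_dir:
        # CHKPT_DIR alone: the file is named executable.chkpt.
        name = posixpath.basename(executable) + ".chkpt"
        return posixpath.join(chkpt_dir, name)
    if chkpt_file:
        # Assumption: a relative name without a directory is rejected.
        raise ValueError("CHKPT_FILE alone must be a full path name")
    return None
```

For instance, checkpoint_path("myprog", chkpt_dir="/gpfs/ckpt") yields /gpfs/ckpt/myprog.chkpt; the directory name here is only a placeholder.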

Note that two or more job steps running at the same time cannot both write to the same checkpoint file, since the file will be corrupted.

See "How to Checkpoint a Job" for more information.

Plan for Jobs that You Will Migrate

If you plan to migrate jobs (restart jobs on a different node or set of nodes), you should understand the difference between writing checkpoint files to a local file system (such as JFS) versus a global file system (such as AFS or GPFS). The CHKPT_DIR and CHKPT_FILE environment variables allow you to write to either type of file system. If you are using a local file system, you must first move the checkpoint file(s) to the target node(s) before resubmitting the job. Then you must ensure that the job runs on those specific nodes. If you are using a global file system, the checkpointing may take longer, but there is no additional work required to migrate the job.

Reserve Adequate Disk Space in the Execute Directory

A checkpoint file requires a significant amount of disk space. Your job may fail if the directory where the checkpoint file is written does not have adequate space. For serial jobs, the directory must be able to contain two checkpoint files. For parallel jobs, the directory must be able to contain 2*n checkpoint files, where n is the number of tasks. You can make an accurate size estimate only after you've run your job and noticed the size of the checkpoint file that is created. LoadLeveler attempts to reserve enough disk space for the checkpoint file when the job is started. However, only you can ensure that enough space is available.
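As a rough planning aid, the space guideline above reduces to simple arithmetic. The sketch below assumes you already know the size of one checkpoint file from an earlier run; the function name is hypothetical.

```python
def checkpoint_space_needed(checkpoint_file_bytes, tasks=1):
    """Bytes the checkpoint directory should be able to hold.

    A serial job (tasks=1) needs room for two checkpoint files;
    a parallel job needs room for 2*n files, where n is the number of tasks.
    """
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return 2 * tasks * checkpoint_file_bytes
```

With a 500 MB checkpoint file, a serial job needs about 1000 MB in the directory and an 8-task parallel job needs about 8000 MB.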

Set your Checkpoint File Size to the Maximum

To make sure that your job is not prevented from writing a checkpoint file due to system limits, assign your job to a job class that has its file creation limit set to the maximum (unlimited). In the administration file, set up a class stanza for checkpointing jobs with the following entry:

  file_limit = unlimited,unlimited

This statement specifies that there is no limit on the maximum size of a file that your program can create.

Checkpoint Programs Whose States are Simple to Checkpoint and Recreate

For some processes, it is impossible to obtain or recreate the state of the process. For this reason, you should only checkpoint programs whose states are simple to checkpoint and recreate. A program that is long-running, computation-intensive, and does not fork any processes is an example of a job well suited for checkpointing.

Avoid Using Certain System Services in Checkpointed Jobs

In order to prevent unpredictable results from occurring, checkpointing jobs should not use the following system services:

Another limitation of checkpointing jobs is file I/O. Since individual write calls are not traced, the file recovery scheme requires that all I/O operations, when repeated, must yield the same result. A job that opens all files as read only can be checkpointed. A job that writes to a file and then reads the data back may also be checkpointed. An example of I/O that could cause unpredictable results is reading, writing, and then reading again the same area of a file.

Choose a Supported Compiler

Compile your program with one of the following supported compilers:

Ensure all User's Jobs are Linked to Checkpointing Libraries

All serial checkpointing programs must be linked with the LoadLeveler libraries libchkrst.a and chkrst_wrap.o. To ensure your checkpointing jobs are linked correctly, compile your programs using the compile scripts found in the bin subdirectory of the LoadLeveler release directory. These compile scripts are as follows:

crxlc (for use with C)
crxlC (for use with C++)
crxlf (for use with FORTRAN)

In all these scripts, be sure to substitute all occurrences of "RELEASEDIR" with the location of the LoadLeveler release directory.

C Syntax

crxlc executable [args] source_file

Where:

executable
Is your checkpointable binary.

args
Is one or more arguments you supply to the compiler (xlc -c).

source_file
Is your C source code.

Some examples are:

   crxlc myprog myprog.c
   crxlc myprog -qlanglvl=extended myprog.c

C++ Syntax

crxlC executable [args] source_file

Where:

executable
Is your checkpointable binary.

args
Is one or more arguments you supply to the compiler (xlC -c).

source_file
Is your C++ source code.

Some examples are:

   crxlC myprog myprog.C
   crxlC myprog -qlanglvl=extended myprog.C

FORTRAN Syntax

crxlf executable [args] source_file

Where:

executable
Is your checkpointable binary.

args
Is one or more arguments you supply to the compiler (xlf -c).

source_file
Is your FORTRAN source code.

Some examples are:

   crxlf myprog myprog.f
   crxlf myprog -qintlog -qfullpath myprog.f

How to Checkpoint a Job

There are several ways to checkpoint a job. To determine which type of checkpointing is appropriate for your situation, refer to the following table:
To specify that: Your serial job determines when the checkpoint occurs
Do this: Add the following option to your job command file:

checkpoint = user_initiated

You can also select this option on the Build a Job window of the GUI.

User initiated checkpointing is available to FORTRAN, C, and C++ programs which call the ckpt serial checkpointing API. See "Serial Checkpointing API" for more information.

To specify that: LoadLeveler automatically checkpoints your serial job
Do this: Add the following option to your job command file:

checkpoint = system_initiated

You can also select this option on the Build a Job window of the GUI.

For this type of checkpointing to work, system administrators must set two keywords in the configuration file to specify how often LoadLeveler takes a checkpoint of the job. These two keywords are:

MIN_CKPT_INTERVAL = number
MAX_CKPT_INTERVAL = number

where number specifies a period, in seconds, between checkpoints taken for running jobs. The time between checkpoints will be increased after each checkpoint within these limits as follows:

  • The first checkpoint is taken after a period of time equal to the MIN_CKPT_INTERVAL has passed.

  • The second checkpoint is taken after LoadLeveler waits twice as long (MIN_CKPT_INTERVAL X 2).

  • The third checkpoint is taken after LoadLeveler waits twice as long again (MIN_CKPT_INTERVAL X 4).

LoadLeveler continues to double this period until the value of MAX_CKPT_INTERVAL has been reached, where it stays for the remainder of the job.

A minimum value of 900 (15 minutes) and a maximum value of 7200 (2 hours) are the defaults.
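The doubling schedule can be worked through with a short sketch. It assumes that a doubled interval which would exceed MAX_CKPT_INTERVAL is capped at that value; the function name and the count parameter are hypothetical.

```python
def checkpoint_intervals(min_interval=900, max_interval=7200, count=6):
    """Return the waits, in seconds, before each of the first `count`
    system initiated checkpoints."""
    intervals = []
    interval = min_interval
    for _ in range(count):
        intervals.append(min(interval, max_interval))
        # Double the wait, but never go past the maximum.
        interval = min(interval * 2, max_interval)
    return intervals


if __name__ == "__main__":
    print(checkpoint_intervals())
```

With the default values the waits are 900, 1800, 3600, and 7200 seconds, and every later wait stays at 7200 seconds.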

You can set these keyword values globally in the global configuration file so that all machines in the cluster have the same value, or you can specify a different value for each machine by modifying the local configuration files.

To enable both user initiated and system initiated checkpointing for a job, specify checkpoint=system_initiated in your job command file, and code the ckpt API call in your program.

System initiated checkpointing is not available to parallel jobs.

To specify that: LoadLeveler restarts your executable from an existing checkpoint file when you submit the job
Do this: Pass the CHKPT_STATE environment variable, set to restart, using the LoadLeveler environment keyword in your job command file. For more information, see "environment". You must also set the CHKPT_DIR and/or CHKPT_FILE environment variables.

To specify that: Your job is not checkpointed
Do this: Add the following option to your job command file:

checkpoint = no

You can also select this option on the Build a Job window of the GUI. This option is the default.

Step 14: Specify Additional Configuration File Keywords

This section describes keywords that were not mentioned in the previous configuration steps. Unless your installation has special requirements for any of these keywords, you can use them with their default settings.
Note: For the keywords listed below that have a number as the value on the right side of the equal sign, that number must be a numerical value and cannot be an arithmetic expression.

ACTION_ON_MAX_REJECT = HOLD | SYSHOLD | CANCEL

Specifies the state in which jobs are placed when their rejection count has reached the value of the MAX_JOB_REJECT keyword. HOLD specifies that jobs are placed in User Hold status; SYSHOLD specifies that jobs are placed in System Hold status; CANCEL specifies that jobs are canceled. The default is HOLD. When a job is rejected, LoadLeveler sends a mail message stating why the job was rejected.

AFS_GETNEWTOKEN = myprog

where myprog is an administrator supplied program that, for example, can be used to refresh an AFS token. The default is to not run a program.

For more information, see "Handling an AFS Token".

DCE_AUTHENTICATION_PAIR = program1, program2

Where program1 and program2 are LoadLeveler or installation supplied programs that are used to authenticate DCE security credentials. program1 obtains a handle (an opaque credentials object), at the time the job is submitted, which is used to authenticate to DCE. program2 is the path name of a LoadLeveler or an installation supplied program that uses the handle obtained by program1 to authenticate to DCE before starting the job on the executing machine(s).

You must specify this keyword in order to enable DCE authentication. To use LoadLeveler's default DCE authentication method, specify the following:

DCE_AUTHENTICATION_PAIR = $(BIN)/llgetdce, $(BIN)/llsetdce

To use your own DCE authentication method, substitute your own programs into the keyword definition. For more information, see "Handling DCE Security Credentials".

MACHINE_UPDATE_INTERVAL = number

where number specifies the time period, in seconds, during which machines must report to the central manager. Machines that do not report in this number of seconds are considered down. The default is 300 seconds.

MAX_JOB_REJECT = number

where number specifies the number of times a job can be rejected before it is removed (cancelled) or put in User Hold or System Hold status. That is, a rejected job is redispatched until the MAX_JOB_REJECT value is reached. The default is -1, meaning a job is redispatched an unlimited number of times. A job that cannot run for various reasons (such as a uid mismatch, unavailable resources, or wrong permissions) on one machine will be rejected on that machine, and LoadLeveler will attempt to run the job on another machine. A value of 0 means that if the job is rejected, it is immediately removed. (For related information, see the NEGOTIATOR_REJECT_DEFER keyword in this section.)

MOUSE_DEVICE = filename

where filename specifies the mouse device file. This keyword only applies to Solaris machines and is used by the startd daemon when monitoring X events. The directory /dev is assumed. The default is mouse.

NEGOTIATOR_INTERVAL = number

where number specifies the interval, in seconds, at which the negotiator daemon negotiates with machines that are available to run jobs. This daemon also negotiates with machines whenever job states or machine states change. The default is 30 seconds.

NEGOTIATOR_LOADAVG_INCREMENT = number

where number specifies the value the negotiator adds to the startd machine's load average whenever a job in the Pending state is queued on that machine. This value is used to compensate for the increased load caused by starting another job. The default value is .5.

NEGOTIATOR_PARALLEL_DEFER = number

where number specifies the amount of time in seconds that defines how long a job stays out of the queue after it fails to get the correct number of processors. This keyword applies only to the default LoadLeveler scheduler. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used. The default, set internally by LoadLeveler, is NEGOTIATOR_INTERVAL multiplied by 5.

NEGOTIATOR_PARALLEL_HOLD = number

where number specifies the amount of time in seconds that defines how long a job is given to accumulate processors. This keyword applies only to the default LoadLeveler scheduler. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used. The default, set internally by LoadLeveler, is NEGOTIATOR_INTERVAL multiplied by 5.
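The rule shared by NEGOTIATOR_PARALLEL_DEFER and NEGOTIATOR_PARALLEL_HOLD (the value must exceed NEGOTIATOR_INTERVAL, otherwise the default of five times NEGOTIATOR_INTERVAL applies) can be sketched as follows. The function name is hypothetical, and None stands for a keyword that is not set.

```python
def effective_parallel_setting(configured, negotiator_interval=30):
    """Value actually used for NEGOTIATOR_PARALLEL_DEFER or
    NEGOTIATOR_PARALLEL_HOLD, in seconds."""
    default = negotiator_interval * 5
    # An unset value, or one not greater than NEGOTIATOR_INTERVAL,
    # falls back to the internally computed default.
    if configured is None or configured <= negotiator_interval:
        return default
    return configured
```

With the default NEGOTIATOR_INTERVAL of 30 seconds, a setting of 20 is ignored and 150 is used instead, while a setting of 600 is honored.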

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = number

where number specifies the amount of time in seconds between calculation of the SYSPRIO values for waiting jobs. The default is 120 seconds. Recalculating the priority can be CPU-intensive; specifying low values for the NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL keyword may lead to a heavy CPU load on the negotiator if a large number of jobs are running or waiting for resources. A value of 0 means the SYSPRIO values are not recalculated.

You can use this keyword to base the order in which jobs are run on the current number of running, queued, or total jobs for a user or a group. For more information, see "Step 5: Prioritize the Queue Maintained by the Negotiator".

NEGOTIATOR_REJECT_DEFER = number

where number specifies the amount of time in seconds the negotiator waits before it considers scheduling a job to a machine that recently rejected the job. The default is 120 seconds. (For related information, see the MAX_JOB_REJECT keyword in this section.)

NEGOTIATOR_REMOVE_COMPLETED = number

where number is the amount of time in seconds that you want the negotiator to keep information regarding completed and removed jobs so that you can query this information using the llq command. The default is 0 seconds.

NEGOTIATOR_RESCAN_QUEUE = number

where number specifies the amount of time in seconds that defines how long the negotiator waits to rescan the job queue for machines which have bypassed jobs which could not run due to conditions which may change over time. This keyword must be greater than the NEGOTIATOR_INTERVAL value; if it is not, the default is used. The default is 900 seconds.

OBITUARY_LOG_LENGTH = number

where number specifies the number of lines from the end of the file that are appended to the mail message. The master daemon mails this log to the LoadLeveler administrators when one of the daemons dies. The default is 25.

POLLING_FREQUENCY = number

where number specifies the frequency, in seconds, with which the startd daemon evaluates the load on the local machine and decides whether to suspend, resume, or abort jobs. This is also the minimum interval at which the kbdd daemon reports keyboard or mouse activity to the startd daemon. A value of 5 is the default.

POLLS_PER_UPDATE = number

where number specifies how often, in POLLING_FREQUENCY intervals, the startd daemon updates the central manager. Due to the communication overhead, it is impractical to do this with the frequency defined by the POLLING_FREQUENCY keyword. Therefore, the startd daemon only updates the central manager every nth (where n is the number specified for POLLS_PER_UPDATE) local update. Change POLLS_PER_UPDATE when changing the POLLING_FREQUENCY. The default is 6.
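The relationship between POLLING_FREQUENCY and POLLS_PER_UPDATE is a simple product, worked through below. Comparing the result with MACHINE_UPDATE_INTERVAL is offered only as a planning aid; both function names are hypothetical.

```python
def central_manager_update_period(polling_frequency=5, polls_per_update=6):
    """Seconds between startd updates to the central manager."""
    return polling_frequency * polls_per_update


def update_periods_in_window(machine_update_interval=300,
                             polling_frequency=5, polls_per_update=6):
    """Whole update periods that fit inside MACHINE_UPDATE_INTERVAL."""
    period = central_manager_update_period(polling_frequency, polls_per_update)
    return machine_update_interval // period
```

With the defaults, the startd daemon reports every 30 seconds, and ten such periods fit in the default MACHINE_UPDATE_INTERVAL of 300 seconds. If you raise POLLING_FREQUENCY without lowering POLLS_PER_UPDATE, the update period grows proportionally.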

PUBLISH_OBITUARIES = true | false

where true specifies that the master daemon sends mail to the administrator(s), identified by the LOADL_ADMIN keyword, when any of the daemons it manages dies abnormally.

RESTARTS_PER_HOUR = number

where number specifies how many times the master daemon attempts to restart a daemon that dies abnormally. Because one or more of the daemons may be unable to run due to a permanent error, the master only attempts $(RESTARTS_PER_HOUR) restarts within a 60 minute period. Failing that, it sends mail to the administrator(s) identified by the LOADL_ADMIN keyword and exits. The default is 12.

SCHEDD_INTERVAL = number

where number specifies the interval, in seconds, at which the schedd daemon checks the local job queue and updates the negotiator daemon. The default is 60 seconds.

User-Defined Variables

This type of variable, which is generally created and defined by the user, can be named using any combination of letters and numbers. A user-defined variable is set equal to a value that defines a condition, names a file, or sets a number. For example, you can create a variable named MY_MACHINE and set it equal to the name of your machine named iron as follows:

  MY_MACHINE = iron.ore.met.com

You can then identify the keyword using a dollar sign ($) and parentheses. For example, the literal $(MY_MACHINE) following the definition in the previous example results in the automatic substitution of iron.ore.met.com in place of $(MY_MACHINE).

User-defined definitions may contain references, enclosed in parentheses, to previously defined keywords. Therefore:

  A = xxx
  C = $(A)

is a valid expression and the resulting value of C is xxx. Note that C is actually bound to A, not to its value, so that

  A = xxx
  C = $(A)
  A = yyy

is also legal and the resulting value of C is yyy.
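The binding behavior described above (C refers to A itself, so a later redefinition of A changes C) can be modeled with a small sketch. This is not LoadLeveler's parser: the function names are hypothetical, only simple "NAME = value" lines are handled, and the circular-reference check is an added safeguard.

```python
import re

# Matches references of the form $(NAME).
_REFERENCE = re.compile(r"\$\((\w+)\)")


def parse_definitions(text):
    """Store each definition unexpanded; a later line replaces an earlier one."""
    definitions = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        definitions[name.strip()] = value.strip()
    return definitions


def expand(name, definitions, _seen=()):
    """Resolve references at lookup time, so the final definition wins."""
    if name in _seen:
        raise ValueError("circular reference: " + name)
    raw = definitions[name]
    return _REFERENCE.sub(
        lambda match: expand(match.group(1), definitions, _seen + (name,)),
        raw,
    )
```

Parsing the three lines A = xxx, C = $(A), A = yyy and then expanding C gives yyy, because the reference to A is resolved only when C is looked up.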

The sample configuration file shipped with the product defines and uses some "user-defined" variables. See Appendix C. "Sample Files" for more information.

LoadLeveler Variables

The LoadLeveler product includes variables that you can use in the configuration file. LoadLeveler variables are evaluated by the LoadLeveler daemons at various stages. They do not require you to use any special characters (such as a parenthesis or a dollar sign) to identify them.

LoadLeveler provides the following variables that you can use in your configuration file statements. For examples of using these variables, see Appendix C. "Sample Files".

Arch

indicates the system architecture. Note that Arch is a special case of a LoadLeveler variable called a machine variable. You specify a machine variable using the following format:
  variable : $(value)

Cpus

the number of CPUs installed.

CurrentTime

the UNIX date; the current system time, in seconds, since January 1, 1970, as returned by the time() function.

CustomMetric

sets a relative machine priority.

Disk

the free disk space in kilobytes on the file system where the executables for the LoadLeveler jobs assigned to this machine are stored. This refers to the file system that is defined by the execute keyword.

domain  or  domainname

dynamically indicates the official name of the domain of the current host machine where the program is running. Whenever a machine name can be specified or one is assumed, a domain name is assigned if none is present.

EnteredCurrentState

the value of CurrentTime when the current state (START, SUSPEND, etc) was entered.

host  or  hostname

dynamically indicates the official name of the host machine where the program is running. host returns the machine name without the domain name; hostname returns the machine and the domain.

KeyboardIdle

the number of seconds since the keyboard or mouse was last used. It also includes any telnet or interactive activity from any remote machine.

LoadAvg

The Berkeley one-minute load average, a measure of the CPU load on the system. The load average is the average of the number of processes ready to run or waiting for disk I/O to complete. The load average does not map to CPU time.

Machine

indicates the name of the current machine. Note that Machine is a special case of a LoadLeveler variable called a machine variable. See the description of the Arch variable for more information.

Memory

the physical memory installed on the machine in megabytes.

MasterMachPrio

a value that is equal to 1 for nodes which are master nodes, and is equal to 0 otherwise.

OpSys

indicates the operating system on the host where the program is running. This value is automatically determined and need not be defined in the configuration file. Note that OpSys is a special case of a LoadLeveler variable called a machine variable. See the description of the Arch variable for more information.

QDate

the difference in seconds between when LoadLeveler (specifically the negotiator daemon) comes up and when the job is submitted using llsubmit.

Speed

the relative speed of a machine.

State

the state of the startd daemon.

tilde

the home directory for the LoadLeveler userid.

UserPrio

the user defined priority of the job. The priority ranges from 0 to 100, with higher numbers corresponding to greater priority.

VirtualMemory

the size of available swap space on the machine in kilobytes.

Time

You can use the following time variables in the START, SUSPEND, CONTINUE, VACATE, and KILL expressions. If you use these variables in the START expression and you are operating across multiple time zones, unexpected results may occur. This is because the negotiator daemon evaluates the START expressions and this evaluation is done in the time zone in which the negotiator resides. Your executing machine also evaluates the START expression, and if your executing machine is in a different time zone, the results may be inconsistent. To prevent this inconsistency, ensure that both your negotiator daemon and your executing machine are in the same time zone.

tm_hour

the number of hours since midnight (0-23).

tm_min

number of minutes after the hour (0-59).

tm_sec

number of seconds after the minute (0-59).

tm_isdst

Daylight Savings Time flag: positive when in effect, zero when not in effect, negative when information is unavailable. For example, to start jobs between 5PM and 8AM during the month of October, factoring in an adjustment for Daylight Savings Time, you can issue:
START: (tm_mon == 9) && ((tm_hour < 8) || (tm_hour >= 17)) && (tm_isdst == 1)
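To check such a time window before putting it in a configuration file, the condition can be modeled outside LoadLeveler. The sketch below reads "between 5PM and 8AM" as an hour of 17 or later, or an hour earlier than 8, and treats any positive tm_isdst as "in effect". The function names are hypothetical. Note that Python's time.localtime() numbers months 1-12, while the tm_mon variable described in this section counts months since January (0-11), so the sketch subtracts 1.

```python
import time


def start_allowed(tm_mon, tm_hour, tm_isdst):
    """Evaluate the window using C-style fields (tm_mon is 0-11)."""
    in_october = tm_mon == 9
    # The window wraps past midnight, so the two hour tests are OR-ed.
    overnight = tm_hour >= 17 or tm_hour < 8
    dst_in_effect = tm_isdst > 0
    return in_october and overnight and dst_in_effect


def start_allowed_now():
    """Apply the same test to the local clock of this machine."""
    now = time.localtime()
    # Convert Python's 1-12 month to the 0-11 convention.
    return start_allowed(now.tm_mon - 1, now.tm_hour, now.tm_isdst)
```

Because the window spans midnight, no single hour can satisfy both hour tests at once; combining them with a logical AND would never be true.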

Date

tm_mday

the number of the day of the month (1-31).

tm_wday

number of days since Sunday (0-6).

tm_yday

number of days since January 1 (0-365).

tm_mon

number of months since January (0-11).

tm_year

the number of years since 1900 (0-9999).

Keyword Summary

This section contains summaries of the keywords you can use in the administration file and those you can use in the configuration file.

Administration File Keywords

The following table contains a brief description of the keywords you can use in the administration file. For more information on a specific keyword, see the section referenced in the "For Details" column.
| Admin. File Keyword | Stanza(s) | Brief Description | For Details |
|---|---|---|---|
| account | User, Group | A list of account numbers available to a user submitting jobs. | "Step 2: Specify User Stanzas" |
| adapter_name | Adapter | Specifies the name the operating system uses to refer to an interface card installed on a node (such as en0). | "Step 5: Specify Adapter Stanzas" |
| adapter_stanzas | Machine | A list of adapter stanza names that define the adapters on a machine which can be requested. | "Step 1: Specify Machine Stanzas" |
| admin | Group, Class | A list of administrators for a group or class. | "Step 3: Specify Class Stanzas" |
| alias | Machine | Lists one or more alias names to associate with the machine name. | "Step 1: Specify Machine Stanzas" |
| central_manager | Machine | When true, this designates the machine as the LoadLeveler central manager. | "Step 1: Specify Machine Stanzas" |
| class_comment | Class | Text characterizing the class. | "Step 3: Specify Class Stanzas" |
| core_limit | Class | Specifies the hard limit and/or soft limit for the size of a core file a job can create. | "Limit Keywords" |
| cpu_limit | Class | Specifies the hard limit and/or soft limit for the CPU time a job can use. | "Limit Keywords" |
| cpu_speed_scale | Machine | Determines whether CPU time is normalized according to machine speed. | "Step 1: Specify Machine Stanzas" |
| data_limit | Class | Specifies the hard limit and/or soft limit for the size of a data segment a job can use. | "Limit Keywords" |
| default_class | User | A class name that is the default value assigned to jobs submitted by users for which no class statement appears. | "Step 2: Specify User Stanzas" |
| default_group | User | A group name to which the user belongs. | "Step 2: Specify User Stanzas" |
| default_interactive_class | User | A class to which interactive jobs are assigned for jobs submitted by users who do not specify a class using LOADL_INTERACTIVE_CLASS. | "Step 2: Specify User Stanzas" |
| exclude_groups | Class | A list of group names identifying those who cannot submit jobs of a particular class. | "Step 3: Specify Class Stanzas" |
| exclude_users | Class, Group | A list of user names identifying those who cannot submit jobs of a particular class or who are not members of the group. | "Step 3: Specify Class Stanzas" |
| file_limit | Class | Specifies the hard limit and/or soft limit for the size of a file that a job can create. | "Limit Keywords" |
| include_groups | Class | A list of group names identifying those who can submit jobs of a particular class. | "Step 3: Specify Class Stanzas" |
| include_users | Class, Group | A list of user names identifying those who can submit jobs of a particular class or who do belong to the group. | "Step 3: Specify Class Stanzas" |
| interface_address | Adapter | Specifies the IP address by which the adapter is known to other nodes in the network. | "Step 5: Specify Adapter Stanzas" |
| interface_name | Adapter | Specifies the name by which the adapter is known to other nodes in the network. | "Step 5: Specify Adapter Stanzas" |
| job_cpu_limit | Class | Specifies the hard limit and/or soft limit for the amount of CPU time an individual job step can use per processor. | "Limit Keywords" |
| machine_mode | Machine | Specifies the type of jobs this machine can run (batch, interactive, or both). | "Step 1: Specify Machine Stanzas" |
| master_node_exclusive | Machine | When true, this machine is used only as a master node for parallel jobs. | "Step 1: Specify Machine Stanzas" |
| master_node_requirement | Class | When true, jobs in this class have the requirement that they run on a master node having the master_node_exclusive setting. | "Step 3: Specify Class Stanzas" |
| maxidle | User, Group | Maximum number of idle jobs this user or group can have simultaneously. | "Step 2: Specify User Stanzas" |
| maxjobs | User, Class, Group | Maximum number of jobs this user, class, or group can have running simultaneously. | "Step 2: Specify User Stanzas" |
| max_jobs_scheduled | Machine | The maximum number of jobs that this machine can run. | "Step 1: Specify Machine Stanzas" |
| max_node | User, Class, Group | The maximum number of nodes a user can request for a parallel job. | "Step 2: Specify User Stanzas" |
| max_processors | User, Class, Group | The maximum number of processors a user can request for a parallel job. | "Step 2: Specify User Stanzas" |
| maxqueued | Group, User | The maximum number of jobs a single group or user can have queued at the same time. | "Step 2: Specify User Stanzas" |
| name_server | Machine | A list of nameservers used for a machine. | "Step 1: Specify Machine Stanzas" |
| network_type | Adapter | The type of network the adapter supports (for example, Ethernet). This is an administrator defined name. | "Step 5: Specify Adapter Stanzas" |
| nice | Class | Increments the nice value of a job. | "Step 3: Specify Class Stanzas" |
| NQS_class | Class | When true, any job submitted to this class is routed to an NQS machine. | "Step 3: Specify Class Stanzas" |
| NQS_query | Class | A list of queue names to use to monitor and cancel jobs. | "Step 3: Specify Class Stanzas" |
| NQS_submit | Class | A name that identifies the name of the NQS pipe queue to which the job will be routed. | "Step 3: Specify Class Stanzas" |
| pool_list | Machine | Specifies a list of pool numbers to which the machine belongs. | "Step 1: Specify Machine Stanzas" |
| priority | User, Class, Group | A number that identifies the priority of the appropriate user, class, or group. | "Step 2: Specify User Stanzas" |
| pvm_root | Machine | A directory in which PVM 3.3 is installed. | "Step 1: Specify Machine Stanzas" |
| rss_limit | Class | Specifies the hard limit and/or soft limit for the resident set size for a job. | "Limit Keywords" |
| schedd_host | Machine | When true, this machine is used to help submit-only machines access LoadLeveler hosts that run LoadLeveler jobs. | "Step 1: Specify Machine Stanzas" |
| spacct_excluse_enable | Machine | Specifies whether the SP accounting function is informed whenever this machine is being used exclusively by a particular job. | "Step 1: Specify Machine Stanzas" |
| speed | Machine | The weight associated with the machine. | "Step 1: Specify Machine Stanzas" |
| stack_limit | Class | Specifies the hard limit and/or soft limit for the size of a stack. | "Limit Keywords" |
| submit_only | Machine | When true, designates this as a submit-only machine. | "Step 1: Specify Machine Stanzas" |
| switch_node_number | Adapter | The node on which the SP switch adapter is installed. | "Step 5: Specify Adapter Stanzas" |
| total_tasks | User, Class, Group | The maximum number of tasks a user can request for a parallel job. | "Step 2: Specify User Stanzas" |
| type | All | The type of stanza. | "Administering LoadLeveler" |
wall_clock_limit Class Specifies the hard limit and/or soft limit for the amount of elapsed time for which a job can run. "Limit Keywords"

Configuration File Keywords and LoadLeveler Variables

The following tables contain a brief description of the keywords you can use in the configuration file. The term configuration file keywords refers to keywords, user-defined variables, and LoadLeveler variables. A summary table is provided for each of the three types of configuration file keywords.

Keywords

The following table serves only as a reference. For more information on a specific keyword, see the section referenced in the "For Details" column.
Configuration File Keyword Brief Description For Details
ACCT Turns the accounting function on (or off). "Step 8: Define Job Accounting"
ACCT_VALIDATION The module called to perform account validation. "Step 8: Define Job Accounting"
ACTION_ON_MAX_REJECT Specifies whether a job is cancelled or put in User Hold or System Hold status when the job exceeds the MAX_JOB_REJECT value. "Step 14: Specify Additional Configuration File Keywords"
ADMIN_FILE Points to the administration file containing user, class, and machine list stanzas. "Step 10: Specify Where Files and Directories are Located"
AFS_GETNEWTOKEN A filter which can be used to renew an AFS token. "Step 14: Specify Additional Configuration File Keywords"
ARCH The standard architecture of the system. "Step 3: Define LoadLeveler Machine Characteristics"
BIN The directory where LoadLeveler binaries are kept. "Step 10: Specify Where Files and Directories are Located"
CENTRAL_MANAGER_HEARTBEAT_INTERVAL The amount of time, in seconds, that defines how frequently the primary and alternate central managers communicate with each other. "Step 9: Specify Alternate Central Managers"
CENTRAL_MANAGER_TIMEOUT The number of heartbeat intervals that an alternate central manager will wait before declaring that the primary central manager is not operating. "Step 9: Specify Alternate Central Managers"
Class The class of jobs that can run on the machine. "Step 3: Define LoadLeveler Machine Characteristics"
CLIENT_TIMEOUT The maximum time, in seconds, that a daemon waits to respond to a process over TCP/IP. "Step 12: Define Network Characteristics"
COLLECTOR_DGRAM_PORT The port number used when connecting to a daemon. "Step 12: Define Network Characteristics"
CONTINUE Continue expression. Determines if a job should continue. "Step 7: Manage a Job's Status Using Control Expressions"
CUSTOM_METRIC A machine's relative priority to run jobs. "Step 2: Define LoadLeveler Cluster Characteristics"
CUSTOM_METRIC_COMMAND An executable whose exit code is assigned to CUSTOM_METRIC. "Step 2: Define LoadLeveler Cluster Characteristics"
DCE_AUTHENTICATION_PAIR A pair of installation supplied programs that are used to authenticate DCE security credentials. "Step 14: Specify Additional Configuration File Keywords"
EXECUTE The local directory to store the executable checkpoints of jobs submitted by other machines. "Step 10: Specify Where Files and Directories are Located"
Feature A string specifying unique characteristics of a machine. "Step 3: Define LoadLeveler Machine Characteristics"
GLOBAL_HISTORY The directory containing the global history files. "Step 8: Define Job Accounting"
HISTORY The pathname of the history file for local LoadLeveler jobs. "Step 10: Specify Where Files and Directories are Located"
JOB_ACCT_Q_POLICY The amount of time in seconds that determines how often the startd daemon updates the schedd daemon with accounting data of running jobs. Chapter 7. "Gathering Job Accounting Data"
JOB_EPILOG Pathname of the epilog program. "Writing Prolog and Epilog Programs"
JOB_LIMIT_POLICY The interval, in seconds, at which LoadLeveler checks whether job_cpu_limit has been exceeded. Chapter 7. "Gathering Job Accounting Data"
JOB_PROLOG Pathname of the prolog program. "Writing Prolog and Epilog Programs"
JOB_USER_EPILOG Pathname of the user epilog program. "Writing Prolog and Epilog Programs"
JOB_USER_PROLOG Pathname of the user prolog program. "Writing Prolog and Epilog Programs"
KBDD KBDD expression. Location of the kbdd executable (LoadL_kbdd). "LoadLeveler Daemons"
KILL Kill expression. Determines if vacated jobs should be killed. "Step 7: Manage a Job's Status Using Control Expressions"
LIB The directory where LoadLeveler libraries are kept. "Step 10: Specify Where Files and Directories are Located"
LOADL_ADMIN List of LoadLeveler administrators. "Step 1: Define LoadLeveler Administrators"
LOCAL_CONFIG Pathname of the optional local configuration file containing information specific to a node in the LoadLeveler network. "Step 10: Specify Where Files and Directories are Located"
LOG Local directory for storing log files. "Step 10: Specify Where Files and Directories are Located"
MACHINE_AUTHENTICATE Specifies whether machine validation is performed. "Step 2: Define LoadLeveler Cluster Characteristics"
MACHINE_UPDATE_INTERVAL The time, in seconds, during which machines must report to the central manager. "Step 14: Specify Additional Configuration File Keywords"
MACHPRIO Machine priority expression. "Step 6: Prioritize the Order of Executing Machines Maintained by the Negotiator"
MAIL Name of a local mail program used to override default mail notification. "Using Your Own Mail Program"
MASTER Location of the master executable (LoadL_master). "LoadLeveler Daemons"
MASTER_DGRAM_PORT The port number used when connecting to the daemon. "Step 12: Define Network Characteristics"
MASTER_STREAM_PORT The port number used when connecting to the daemon. "Step 12: Define Network Characteristics"
MAX_CKPT_INTERVAL The maximum number of seconds between checkpoints for running jobs. "Step 13: Enable Checkpointing"
MAX_JOB_REJECT The number of times a job is rejected before it is cancelled or put in User Hold or System Hold status. "Step 14: Specify Additional Configuration File Keywords"
MAX_STARTERS The maximum number of jobs that can run simultaneously. "Step 4: Specify How Many Jobs a Machine Can Run"
MIN_CKPT_INTERVAL The minimum number of seconds between checkpoints for running jobs. "Step 13: Enable Checkpointing"
NEGOTIATOR Location of the negotiator executable (LoadL_negotiator). "LoadLeveler Daemons"
NEGOTIATOR_INTERVAL The time interval, in seconds, at which the negotiator daemon updates the status of jobs in the LoadLeveler cluster and negotiates with machines that are available to run jobs. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_LOADAVG_INCREMENT The factor added to the startd machine's load average to compensate for the increased load caused by starting another machine. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_PARALLEL_DEFER The length of time that a job is given to accumulate processors. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_PARALLEL_HOLD The length of time a job attempts to collect machines before releasing them. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL The amount of time in seconds between calculation of the SYSPRIO values for waiting jobs. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_REJECT_DEFER The amount of time in seconds the negotiator waits before it considers scheduling a job to a machine that recently rejected the job. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_REMOVE_COMPLETED The amount of time the negotiator keeps information on completed and removed jobs. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_RESCAN_QUEUE The amount of time the negotiator waits to rescan the job queue for machines that temporarily have non-runnable jobs. "Step 14: Specify Additional Configuration File Keywords"
NEGOTIATOR_STREAM_PORT The port number used when connecting to the daemon. "Step 12: Define Network Characteristics"
NQS_DIR The directory where NQS commands reside. "Step 10: Specify Where Files and Directories are Located"
OBITUARY_LOG_LENGTH The number of lines from the end of the file that are appended to the Master_Log. "Step 14: Specify Additional Configuration File Keywords"
POLLING_FREQUENCY The frequency in seconds the startd daemon uses to evaluate the load on the local machine and to decide whether to suspend, resume, or abort jobs. "Step 14: Specify Additional Configuration File Keywords"
POLLS_PER_UPDATE The frequency, in POLLING_FREQUENCY intervals, with which the startd daemon updates the central manager. "Step 14: Specify Additional Configuration File Keywords"
PUBLISH_OBITUARIES When true, specifies that the master daemon sends mail to the administrator(s) when any daemon it manages dies abnormally. "Step 14: Specify Additional Configuration File Keywords"
RELEASEDIR The directory where all the LoadLeveler software resides. "Step 10: Specify Where Files and Directories are Located"
RESTARTS_PER_HOUR The number of times the master daemon attempts to restart a daemon that dies abnormally. "Step 14: Specify Additional Configuration File Keywords"
SCHEDD Location of the schedd executable (LoadL_schedd). "LoadLeveler Daemons"
SCHEDD_INTERVAL Specifies the interval, in seconds, at which the schedd daemon checks the local job queue. "Step 14: Specify Additional Configuration File Keywords"
SCHEDD_RUNS_HERE Specifies whether this daemon will run on the host. "Step 3: Define LoadLeveler Machine Characteristics"
SCHEDD_STREAM_PORT The port number used when connecting to the daemon. "Step 12: Define Network Characteristics"
SCHEDULER_API When YES, disables the native LoadLeveler scheduling algorithm. "Step 2: Define LoadLeveler Cluster Characteristics"
SCHEDULER_TYPE Specifies the LoadLeveler Backfill scheduling algorithm. "Step 2: Define LoadLeveler Cluster Characteristics"
SPOOL The local directory where LoadLeveler keeps the local job queue and checkpoint files. "Step 10: Specify Where Files and Directories are Located"
START Start expression. Determines if a machine can run a job. "Step 7: Manage a Job's Status Using Control Expressions"
STARTD Location of the startd executable (LoadL_startd). "LoadLeveler Daemons"
STARTER Location of the starter executable (LoadL_starter). "LoadLeveler Daemons"
STARTD_RUNS_HERE Specifies whether this daemon will run on the host. "Step 3: Define LoadLeveler Machine Characteristics"
START_DAEMONS Specifies whether to start the daemons on the machine. "Step 3: Define LoadLeveler Machine Characteristics"
STARTD_DGRAM_PORT The port number used when connecting to the daemon. "Step 12: Define Network Characteristics"
STARTD_STREAM_PORT The port number used when connecting to the daemon. "Step 12: Define Network Characteristics"
SUBMIT_FILTER The program you want to run to filter a job script when the job is submitted. "Filtering a Job Script"
SUSPEND Suspend expression. Determines if a job should be suspended. "Step 7: Manage a Job's Status Using Control Expressions"
SYSPRIO System priority expression. "Step 5: Prioritize the Queue Maintained by the Negotiator"
TRUNC_KBDD_LOG_ON_OPEN When true, specifies the log file is restarted with every invocation of the daemon. "Step 11: Record and Control Log Files"
TRUNC_MASTER_LOG_ON_OPEN When true, specifies the log file is restarted with every invocation of the daemon. "Step 11: Record and Control Log Files"
TRUNC_NEGOTIATOR_LOG_ON_OPEN When true, specifies the log file is restarted with every invocation of the daemon. "Step 11: Record and Control Log Files"
TRUNC_SCHEDD_LOG_ON_OPEN When true, specifies the log file is restarted with every invocation of the daemon. "Step 11: Record and Control Log Files"
TRUNC_STARTD_LOG_ON_OPEN When true, specifies the log file is restarted with every invocation of the daemon. "Step 11: Record and Control Log Files"
TRUNC_STARTER_LOG_ON_OPEN When true, specifies the log file is restarted with every invocation of the daemon. "Step 11: Record and Control Log Files"
VACATE The vacate expression. Determines whether suspended jobs should be vacated. "Step 7: Manage a Job's Status Using Control Expressions"
X_RUNS_HERE When true, specifies you want to start the keyboard daemon (unless you are running on a Sun machine or an HP machine). "Step 3: Define LoadLeveler Machine Characteristics"
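
Two of the keywords above are counted in units of another keyword rather than in seconds: CENTRAL_MANAGER_TIMEOUT is a number of heartbeat intervals, and POLLS_PER_UPDATE is a number of POLLING_FREQUENCY intervals. The following Python sketch works through that arithmetic as the table describes it. The function names are hypothetical, and the numbers used in the usage note are illustrative placeholders, not LoadLeveler defaults.

```python
def central_manager_failover_seconds(heartbeat_interval: int,
                                     timeout_intervals: int) -> int:
    """Approximate seconds an alternate central manager waits before
    declaring that the primary central manager is not operating.

    heartbeat_interval -- value of CENTRAL_MANAGER_HEARTBEAT_INTERVAL (seconds)
    timeout_intervals  -- value of CENTRAL_MANAGER_TIMEOUT (heartbeat intervals)
    """
    return heartbeat_interval * timeout_intervals


def startd_update_seconds(polling_frequency: int,
                          polls_per_update: int) -> int:
    """Approximate seconds between startd updates to the central manager.

    polling_frequency -- value of POLLING_FREQUENCY (seconds)
    polls_per_update  -- value of POLLS_PER_UPDATE (polling intervals)
    """
    return polling_frequency * polls_per_update
```

For example, with a hypothetical heartbeat interval of 300 seconds and a timeout of 6 intervals, central_manager_failover_seconds(300, 6) gives 1800 seconds, so the alternate would wait roughly 30 minutes before taking over.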

User-Defined Keywords

The following table serves only as a reference. These keywords are described in more detail in "User-Defined Variables".
Keyword Brief Description
BackgroundLoad Defines the variable BackgroundLoad and assigns to it a floating point constant. This might be used as a noise factor indicating no activity.
CPU_Busy Defines the variable CPU_Busy and reassigns to it at each evaluation the Boolean value True or False, depending on whether the Berkeley one-minute load average is equal to or greater than the saturation level of 1.5.
CPU_Idle Defines the variable CPU_Idle and reassigns to it at each evaluation the Boolean value True or False, depending on whether the Berkeley one-minute load average is equal to or less than 0.7.
HighLoad Is a keyword that the user can define to use as a saturation level at which no further jobs should be started.
HOUR Defines the variable HOUR and assigns to it a constant integer value.
JobLoad Defines the variable JobLoad, which represents the load on the machine caused by running the job.
KeyboardBusy Defines the variable KeyboardBusy and reassigns to it at each evaluation the Boolean value True or False, depending on whether the keyboard and mouse have been idle for fifteen minutes.
LowLoad Defines the variable LowLoad and assigns to it the value of BackgroundLoad. This might be used as a restart level at which jobs can be started again, and it assumes that only one job is running on the machine.
mail Specifies a local program you want to use in place of the LoadLeveler default mail notification method.
MINUTE Defines the variable MINUTE and assigns to it a constant integer value.
StateTimer Defines the variable StateTimer and reassigns to it at each evaluation the number of seconds since the current state was entered.
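
To make the table above concrete, the following Python sketch models how several of these user-defined variables could be evaluated from the LoadLeveler variables LoadAvg, CurrentTime, EnteredCurrentState, and KeyboardIdle. It is an illustration of the descriptions in the table, not LoadLeveler configuration syntax. The function names are hypothetical, the thresholds 1.5 and 0.7 and the fifteen-minute window are taken from the table, and the assumption that KeyboardBusy is True when the keyboard and mouse have been idle for less than fifteen minutes is mine.

```python
MINUTE = 60            # constant integer value, in seconds
HOUR = 60 * MINUTE     # constant integer value, in seconds


def cpu_busy(load_avg: float, saturation: float = 1.5) -> bool:
    """True when the one-minute load average is at or above saturation."""
    return load_avg >= saturation


def cpu_idle(load_avg: float, idle_level: float = 0.7) -> bool:
    """True when the one-minute load average is at or below idle_level."""
    return load_avg <= idle_level


def state_timer(current_time: int, entered_current_state: int) -> int:
    """Seconds since the current state was entered."""
    return current_time - entered_current_state


def keyboard_busy(keyboard_idle: int, window: int = 15 * MINUTE) -> bool:
    """Assumed sense: True when the keyboard or mouse was used within
    the last `window` seconds."""
    return keyboard_idle < window
```

Each function is re-evaluated on every call, which mirrors the table's wording that these variables are reassigned "at each evaluation".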

LoadLeveler Variables

The following table serves only as a reference. For more information on a specific keyword, see the section referenced in the "For Details" column.
Keyword Brief Description For Details
Arch Standard architecture of the system. "LoadLeveler Variables"
ClassSysprio Job priority for the class. "Step 5: Prioritize the Queue Maintained by the Negotiator"
Cpus Number of CPUs installed. "LoadLeveler Variables"
CurrentTime The UNIX date that includes the current system time, in seconds, since January 1, 1970. "LoadLeveler Variables"
CustomMetric The relative machine priority. "LoadLeveler Variables"
Disk Free disk in megabytes on the filesystem where checkpoints are stored. "LoadLeveler Variables"
domain or domainname Dynamically indicates the domain name of the current host machine where the program is running. "LoadLeveler Variables"
EnteredCurrentState Value of CurrentTime when the current state was entered. "LoadLeveler Variables"
GroupQueuedJobs The number of jobs either running or queued for the LoadLeveler group. "Step 5: Prioritize the Queue Maintained by the Negotiator"
GroupRunningJobs The number of jobs currently running for the LoadLeveler group. "Step 5: Prioritize the Queue Maintained by the Negotiator"
GroupSysprio The job priority for the group. "Step 5: Prioritize the Queue Maintained by the Negotiator"
GroupTotalJobs The total number of jobs associated with the LoadLeveler group. "Step 5: Prioritize the Queue Maintained by the Negotiator"
host or hostname Dynamically indicates the name of the host machine where the program is running. "LoadLeveler Variables"
KeyboardIdle Number of seconds since the keyboard or mouse was last used. "LoadLeveler Variables"
LoadAvg Berkeley one-minute load average. "LoadLeveler Variables"
Machine Name of the current machine. "LoadLeveler Variables"
MasterMachPrio A value that is 1 for master nodes and is 0 otherwise. "LoadLeveler Variables"
Memory Physical memory installed on the machine in megabytes. "LoadLeveler Variables"
OpSys Indicates the operating system on the host where the program is running. "LoadLeveler Variables"
QDate Difference in seconds between when the negotiator starts up and when the job is submitted. "LoadLeveler Variables"
Speed The relative machine speed. "LoadLeveler Variables"
State State of the startd. Can be None, Busy, Running, Idle, Suspend, Flush, or Drain. "LoadLeveler Variables"
tilde Dynamically defines the pathname of the LoadLeveler home directory. "LoadLeveler Variables"
tm_hour Number of hours since midnight (0-23). "LoadLeveler Variables"
tm_isdst Daylight Savings Time flag: positive when in effect, zero when not in effect, negative when information is unavailable. "LoadLeveler Variables"
tm_mday Number of the day of the month (1-31). "LoadLeveler Variables"
tm_min Number of minutes after the hour (0-59). "LoadLeveler Variables"
tm_mon Number of months since January (0-11). "LoadLeveler Variables"
tm_sec Number of seconds after the minute (0-59). "LoadLeveler Variables"
tm_wday Number of days since Sunday (0-6). "LoadLeveler Variables"
tm_yday Number of days since January 1 (0-365). "LoadLeveler Variables"
tm_year Number of years since 1900 (0-99). "LoadLeveler Variables"
UserPrio User defined priority of a job. "Step 5: Prioritize the Queue Maintained by the Negotiator"
UserQueuedJobs The number of jobs either running or queued for the user. "Step 5: Prioritize the Queue Maintained by the Negotiator"
UserRunningJobs The number of jobs currently running for the user. "Step 5: Prioritize the Queue Maintained by the Negotiator"
UserSysprio The priority of the user who submitted the job. "Step 5: Prioritize the Queue Maintained by the Negotiator"
UserTotalJobs The total number of jobs associated with this user. "Step 5: Prioritize the Queue Maintained by the Negotiator"
VirtualMemory The size of the available swap space on the machine in kilobytes. "LoadLeveler Variables"
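
The tm_ variables in the table above use zero-based ranges for the month, weekday, and day of the year, and count years from 1900. The following Python sketch derives fields with those ranges from a CurrentTime-style value (seconds since January 1, 1970). The function name is hypothetical, and the sketch uses UTC so that the result is reproducible; whether LoadLeveler evaluates these variables in local time is not stated in this table. Note that Python's own time.struct_time numbers months from 1, weekdays from Monday, and days of the year from 1, so the sketch adjusts those fields.

```python
import time


def tm_fields(epoch_seconds: float) -> dict:
    """Return tm_* values using the ranges listed in the table above."""
    t = time.gmtime(epoch_seconds)  # UTC, chosen for reproducibility
    return {
        "tm_sec": t.tm_sec,                # 0-59
        "tm_min": t.tm_min,                # 0-59
        "tm_hour": t.tm_hour,              # 0-23
        "tm_mday": t.tm_mday,              # 1-31
        "tm_mon": t.tm_mon - 1,            # months since January
        "tm_year": t.tm_year - 1900,       # years since 1900
        "tm_wday": (t.tm_wday + 1) % 7,    # days since Sunday
        "tm_yday": t.tm_yday - 1,          # days since January 1
        "tm_isdst": t.tm_isdst,            # always 0 for UTC
    }
```

A START expression that restricts jobs to certain hours or weekdays would compare against values in these ranges, for example treating tm_wday values 1 through 5 as Monday through Friday.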

