This section contains tips on running LoadLeveler, including some productivity aids.
If you are running LoadLeveler on a large number of nodes (128 or more), network traffic between LoadLeveler daemons can become excessive to the point of overwhelming a receiving daemon. To reduce network traffic, consider the following daemon, keyword, and command recommendations for large installations.
By reading the notification mail you receive after submitting a job, you can determine the time the job was submitted, started, and stopped. Suppose you submit a job and receive the following mail when the job finishes:
Submitted at: Sun Apr 30 11:40:41 1996
Started at:   Sun Apr 30 11:45:00 1996
Exited at:    Sun Apr 30 12:49:10 1996
Real Time:            0 01:08:29
Job Step User Time:   0 00:30:15
Job Step System Time: 0 00:12:55
Total Job Step Time:  0 00:43:10
Starter User Time:    0 00:00:00
Starter System Time:  0 00:00:00
Total Starter Time:   0 00:00:00
This mail tells you when the job was submitted, when it started, and when it exited, along with the resources it used: Real Time is the elapsed wall clock time (days, then hours:minutes:seconds), the Job Step times are the user and system CPU time consumed by the job step itself, and the Starter times are the user and system CPU time consumed by the LoadLeveler starter process on behalf of the job.
You can also get the starting time by issuing llsummary -l -x and then issuing awk '/Date|Event/' against the resulting output. For this to work, you must have ACCT = A_ON A_DETAIL set in the LoadL_config file.
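For example, a minimal sketch of this sequence (the file name joblist is arbitrary):

   llsummary -l -x > joblist       # write the detailed accounting report to a file
   awk '/Date|Event/' joblist      # print only the date and event lines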
Using a machine's local configuration file, you can set up the machine to run jobs at a certain time of day (sometimes called an execution window). The following coding in the local configuration file runs jobs between 5:00 PM and 8:00 AM daily, and suspends jobs the rest of the day:
START:    (tm_day >= 1700) || (tm_day <= 0800)
SUSPEND:  (tm_day > 0800) && (tm_day < 1700)
CONTINUE: (tm_day >= 1700) || (tm_day <= 0800)
Three keywords determine the mix of idle and running jobs for a user. By a running job, we mean a job that is in one of the following states: Running, Pending, or Starting. These keywords, which are described in detail in Step 2: Specify User Stanzas, are:

maxqueued
maxidle
maxjobs
For a user's job to be allowed into the job queue, the total of that user's other jobs in the Idle, Pending, Starting, and Running states must be less than the maxqueued value for that user, and the total of that user's idle jobs (those in the Idle, Pending, and Starting states) must be less than the maxidle value. If either limit has already been reached, the new job is placed in the Not Queued state until one of the other jobs changes state: if the user is at the maxqueued limit, a job must complete, be cancelled, or be held before the new job can enter the queue; if the user is at the maxidle limit, a job must start running, be cancelled, or be held before the new job can enter the queue. For example, if maxqueued is 5 and a user already has three Idle jobs and two Running jobs, that user's next job is placed in the Not Queued state until one of those five jobs leaves the queue.
Once a job is in the queue, it is not taken out of the queue unless the user places a hold on the job, the job completes, or the job is cancelled. (An exception to this, when you are running the default LoadLeveler scheduler, is a parallel job that does not accumulate sufficient machines in a given time period. Such a job is moved to the Deferred state, meaning it must vie for the queue again when its Deferred period expires.)
Once a job is in the queue, the job will run unless the user is already running the maximum number of jobs allowed by the maxjobs limit.
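As a sketch, these limits are set in user stanzas in the LoadL_admin file (the user name userA and the values shown are hypothetical):

   default: type = user
            maxqueued = 10    # at most 10 jobs in the queue (Idle, Pending, Starting, or Running)
            maxidle = 5       # at most 5 idle jobs (Idle, Pending, or Starting)
            maxjobs = 2       # at most 2 running jobs

   userA:   type = user
            maxjobs = 4       # override the default running-job limit for this user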
Note the following restrictions for using these keywords:
You can use dependencies in your job command file to send the output from many job steps to the same output file. For example:
# @ step_name = step1
# @ executable = ssba.job
# @ output = ssba.tmp
# @ ...
# @ queue
#
# @ step_name = append1
# @ dependency = (step1 != CC_REMOVED)
# @ executable = append.ksh
# @ output = /dev/null
# @ queue
# @
# @ step_name = step2
# @ dependency = (append1 == 0)
# @ executable = ssba.job
# @ output = ssba.tmp
# @ ...
# @ queue
# @
# @ step_name = append2
# @ dependency = (step2 != CC_REMOVED)
# @ executable = append.ksh
# @ output = /dev/null
# @ queue
#
# ...
Then, the file append.ksh could contain the line cat ssba.tmp >> ssba.log. All your output will reside in ssba.log. (Your dependencies can look for different return values, depending on what you need to accomplish.)
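For example, a minimal sketch of append.ksh (assuming ssba.tmp and ssba.log reside in the job's working directory):

   #!/bin/ksh
   # Append this step's output to the cumulative log
   cat ssba.tmp >> ssba.log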
You can achieve the same result from within ssba.job by appending your output to an output file rather than writing it to stdout. Then your output statement for each step would be /dev/null and you wouldn't need the append steps.
You can define a machine to have multiple job classes which are active at different times. For example, suppose you want a machine to run jobs of Class A any time, and you want the same machine to run Class B jobs between 6 p.m. and 8 a.m.
You can combine the Class keyword with a user-defined macro (called Off_Shift in this example).
For example:
Off_Shift = ((tm_hour >= 18) || (tm_hour < 8))
Then define your START statement:
START : (Class == "A") || ((Class == "B") && $(Off_Shift))
Make sure you have the parentheses around the Off_Shift macro, since the logical OR has lower precedence than the logical AND in the START statement.
Also, to take weekends into account, code the following statements. Remember that Saturday is day 6 and Sunday is day 0.
Off_Shift = ((tm_wday == 6) || (tm_wday == 0) || (tm_hour >= 18) \
            || (tm_hour < 8))
Prime_Shift = ((tm_wday != 6) && (tm_wday != 0) && (tm_hour >= 8) \
              && (tm_hour < 18))
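As a sketch, if you instead wanted Class A jobs to run only during the prime shift and Class B jobs only during the off shift (a variation on the example above), the START statement could use both macros:

   START : ((Class == "A") && $(Prime_Shift)) || ((Class == "B") && $(Off_Shift))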
You can use the /usr/bin/rup command to report the load average on a machine. The rup machine_name command gives you a report that looks similar to the following:
localhost up 23 days, 10:25, load average: 1.72, 1.05, 1.17
You can use this command to report the load average of your local machine or of remote machines. Another command, /usr/bin/uptime, returns the load average information for only your local host.
The schedd daemon writes to the spool/history file only when a job is completed or removed. Therefore, you can delete the history file and restart schedd even when some jobs are scheduled to run on other hosts.
However, you should clean up the spool/job_queue.dir and spool/job_queue.pag files only when no jobs are being scheduled on the machine.
You should not delete these files if there are any jobs in the job queue that are being scheduled from this machine (for example, jobs with names such as thismachine.clusterno.jobno).
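A minimal sketch of the history cleanup, assuming no jobs are being scheduled from this machine and that the spool directory is /var/loadl/spool (your path, and where you choose to archive the file, may differ):

   llctl stop                                      # stop the LoadLeveler daemons on this machine
   mv /var/loadl/spool/history /tmp/history.save   # archive (or delete) the history file
   llctl start                                     # restart the daemons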