
Appendix A. Troubleshooting


Troubleshooting LoadLeveler

This chapter is divided into the following sections:

It is helpful to create error logs when you are diagnosing a problem. See "Step 11: Record and Control Log Files" for information on setting up error logs.

Frequently Asked Questions

This section contains answers to questions frequently asked by LoadLeveler customers.

Why Won't My Job Run?

If you submitted your job and it is in the LoadLeveler queue but has not run, issue llq -s first to help diagnose the problem. If you need more help diagnosing the problem, refer to the following table:
Why Your Job May Not Be Running: Possible Solution
Job requires specific machine, operating system, or other resource.
  • Does the resource exist in the LoadLeveler cluster? If yes, wait until it becomes available.

Check the GUI to compare the job requirements to the machine details, especially Arch, OpSys, and Class. Ensure that the spelling and capitalization match.

Job requires specific job class
  • Is the class defined in the administration file? Use the llclass command to determine this.
  • If it is, is there a machine in the cluster that supports that class? If so, you need to wait until that machine becomes available to run your job.

The maximum number of jobs is already running on all the eligible machines. Wait until one of the machines finishes a job before scheduling your job.
The start expression evaluates to false. Examine the configuration files (both LoadL_config and LoadL_config.local) to determine the START control function expression used by LoadLeveler to start a job. As a problem determination measure, set the START and SUSPEND values, as shown in this example:
START: T
SUSPEND: F
The priority of your job is lower than the priority of other jobs. You cannot affect the system priority that the negotiator daemon assigns to this job, but you can use the llprio command or the GUI to change your user priority and move this job ahead of other jobs you previously submitted.
The information the central manager has about machines and jobs may not be current. Wait a few minutes for the central manager to be updated; the job may then be dispatched. This time (a few minutes) depends upon the polling frequency and polls per update set in the LoadL_config file. The default polling frequency is five minutes.
You do not have the same user ID on all the machines in the cluster. To run jobs on any machine in the cluster, you must have the same user ID and the same uid number on every machine in the pool. If you do not have a user ID on one machine, your jobs will not be scheduled to that machine.
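
A quick way to verify this is to run the id command for your user name on each machine in the cluster and compare the uid numbers; the user name shown here is hypothetical:

  id jsmith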

You can use the llq command to query the status of your job or the llstatus command to query the status of machines in the cluster. Refer to Chapter 9. "LoadLeveler Commands" for information on these commands.
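For example, to see why a specific job step has not started and to check machine status (the job step identifier shown is hypothetical):

  llq -s myhost.22.0
  llstatus
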

Why Won't My Parallel Job Run?

If you submitted your parallel job and it is in the LoadLeveler queue but has not run, issue llq -s first to help diagnose the problem. If issuing this command does not help, refer to the previous table and to the following table for more information:
Why Your Job May Not Be Running: Possible Solution
The minimum number of processors requested by your job is not available. Sufficient resources must be available. Specifying a smaller number of processors may help if your job can run with fewer resources.
Your requirements statement specifies a pool that is invalid or not available. The specified pool must be valid and available.
The adapter specified in the requirements statement or the network statement is invalid or not available. The specified adapter must be valid and available.
PVM3 is not installed. PVM3 must be installed on any machine you wish to use for PVM; the PVM3 system itself is not supplied with LoadLeveler.
You are already running a PVM3 job on one of the LoadLeveler machines. PVM3 restrictions prevent a user from running more than one PVM daemon per machine. If you want to run PVM3 jobs under LoadLeveler, you must not run any PVM3 jobs outside of LoadLeveler control on any machine managed by LoadLeveler.
The parallel_path keyword in your job command file is incorrect. Use parallel_path to tell LoadLeveler where the binaries that run your PVM tasks are located, for use by the pvm_spawn() command. If this path is incorrect, the job may not run.
The pvm_root keyword in the administration file is incorrect. This keyword corresponds to the pvm ep keyword and is required to tell LoadLeveler where the PVM system is installed.
The file /tmp/pvmd.userid exists on a LoadLeveler machine but no PVM jobs are running. If PVM3 exits unexpectedly, it does not properly clean up after itself. Although LoadLeveler attempts to clean up after PVM, some situations are ambiguous and you may have to remove this file yourself. Check all the systems specified as capable of running PVM3, and remove this file if it exists.
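
For example, to check for and remove a leftover daemon file on one machine (the file name suffix is your user ID, so the name shown is illustrative):

  ls /tmp/pvmd.*
  rm /tmp/pvmd.jsmith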

Common Set Up Problems with Parallel Jobs

This section presents a list of common problems found in setting up parallel jobs:

PVM Problem Determination

If LoadLeveler is to manage PVM jobs on a machine for a user, that user should not attempt to run PVM jobs on that machine outside of LoadLeveler control. Because of PVM restrictions, only a single PVM daemon per user per machine is permitted. If a user tries to run PVM jobs without using LoadLeveler and LoadLeveler later attempts to start a job for that user on the same machine, LoadLeveler may not be able to start PVM for the job. This will cause the LoadLeveler job to be cancelled.

If a PVM job submitted through LoadLeveler is rejected, it is probably because PVM was not correctly terminated the last time it ran on the rejecting machine. LoadLeveler attempts to handle this by making sure that it cleans up PVM jobs when they complete, but remember that you may need to clean up after the job yourself. If a machine refuses to start a PVM job, check the following:

Why Won't My Submit-Only Job Run?

If a job you submitted from a submit-only machine does not run, verify that you have defined the following statements in the machine stanza of the administration file of the submit-only machine:

submit_only = true
schedd_host = false
central_manager = false

For other submit-only requirements, see the submit-only section.

Why Does a Job Stay in the Pending (or Starting) State?

If a job appears to stay in the Pending or Starting state, it is possible the job is continually being dispatched and rejected. Check the setting of the MAX_JOB_REJECT keyword. If it is set to the default, -1, the job will be rejected an unlimited number of times. Try resetting this keyword to some finite number. Also, check the setting of the ACTION_ON_MAX_REJECT keyword. These keywords are described in "Step 14: Specify Additional Configuration File Keywords".
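
For example, a minimal sketch of these settings in the LoadL_config file; the values shown are illustrative, so check the keyword descriptions for the settings appropriate to your site:

MAX_JOB_REJECT = 10
ACTION_ON_MAX_REJECT = HOLD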

What Happens to Running Jobs When a Machine Goes Down?

Both the startd daemon and the schedd daemon maintain persistent states of all jobs. Both daemons use a specific protocol to ensure that the state of all jobs is consistent across LoadLeveler. In the event of a failure, the state can be recovered. Neither the schedd daemon nor the startd daemon discards the job state information until it is passed on to and accepted by another daemon in the process.
The network goes down but the machines are still running. When LoadLeveler is restarted, it looks for all jobs that were marked running when it went down. On the machine where the job is running, the startd daemon searches for the job and, if it can verify that the job is still running, it continues to manage the job through completion. On the machine where schedd is running, schedd queues a transaction to the startd to re-establish the state of the job. This transaction stays queued until the state is established. Until that time, LoadLeveler assumes the state is the same as when the system went down.

The network partitions or goes down. All transactions are left queued until the recipient has acknowledged them. Critical transactions such as those between the schedd and startd are recorded on disk. This ensures complete delivery of messages and prevents incorrect decisions based on incomplete state information.
The machine with startd goes down. Because job state is maintained on disk by startd, when LoadLeveler is restarted it can forward the correct status to the rest of LoadLeveler. In the case of total machine failure, this is usually JOB VACATED, which causes the job to be restarted elsewhere. If only LoadLeveler failed, it is often possible to find the job if it is still running and resume management of it. In this case LoadLeveler sends JOB RUNNING to the schedd and the central manager, permitting the job to run to completion.
The central manager machine goes down. All machines in the cluster send current status to the central manager on a regular basis. When the central manager restarts, it queries each machine that checks in, requesting the entire queue from each machine. Over a period of a few minutes the central manager restores itself to the state it was in before the failure. Each schedd is responsible for maintaining the correct state of each job while the central manager is down; therefore, the central manager is guaranteed to rebuild itself correctly.

All jobs started when the central manager was down will continue to run and complete normally with no loss of information. Users may continue to submit jobs. These new jobs will be forwarded correctly when the central manager is restarted.

The schedd machine goes down. When schedd starts up again, it reads its queue of jobs and, for every job that was in an active state (that is, PENDING, STARTING, or RUNNING), it queries the machine where the job is marked active.

The running machine is required to return current status of the job. If the job completed while schedd was down, JOB COMPLETE is returned with exit status and accounting information. If the job is running, JOB RUNNING is returned. If the job was vacated, JOB VACATED is returned. Because these messages are left queued until delivery is confirmed, no job will be lost or incorrectly dispatched due to schedd failure.

During the time the schedd is down, the central manager cannot start new jobs that were submitted to that schedd.

The llsubmit machine goes down. The schedd daemon keeps its own copy of the executable, so it does not matter if the llsubmit machine goes down.

Why Does llstatus Indicate that a Machine is Down when llq Indicates a Job is Running on the Machine?

If a machine fails while a job is running on it, the central manager does not change the status of any job on that machine. When the machine comes back up, the central manager will be updated.

What Happens if the Central Manager Isn't Operating?

In one of the machine stanzas in the administration file, you specified a machine to serve as the central manager. It is possible for a problem, such as a network communication, software, or hardware failure, to make this central manager unusable. In such cases, the other machines in the LoadLeveler cluster believe that the central manager machine is no longer operating. If you assigned one or more alternate central managers in the machine stanza, a new central manager will take control. The alternate central manager is chosen based upon the order in which its machine stanza appears in the administration file.

Once an alternate central manager takes control, it starts up its negotiator daemon and notifies all of the other machines in the LoadLeveler cluster that a new central manager has been selected. The following diagram illustrates how a machine can become the alternate central manager:

Figure 36. When the Primary Central Manager is Unavailable



The diagram illustrates that Machine Z is the primary central manager but Machine A took control of the LoadLeveler cluster by becoming the alternate central manager. Machine A remains in control as the alternate central manager until either:

The following diagram illustrates how multiple central managers can function within the same LoadLeveler pool:

Figure 37. Multiple Central Managers



In this diagram, the primary central manager is serving Machines A and B. Due to a network failure, Machines C, D, and E have lost contact with the primary central manager machine; therefore, Machine C, which is authorized to serve as an alternate central manager, assumes that role. Machine C remains the alternate central manager until either:

While LoadLeveler can handle this situation of two concurrent central managers without any loss of integrity, some installations may find it somewhat confusing to administer. To avoid confusion, specify all primary and alternate central managers on the same LAN segment.

For information on selecting alternate central managers, refer to "Step 1: Specify Machine Stanzas".

Other Questions

Why do I have to setuid = 0?

The master daemon starts the startd daemon, and the startd daemon starts the starter process. The starter process runs the job, and the job must run under the user ID of the submitter. You would either have to run a separate master daemon for every user ID on the system, or the master daemon must be able to su to every user ID; the only user ID that can su to any other user ID is root.

Why Doesn't LoadLeveler Execute my .profile or .login Script?

When you submit a batch job to LoadLeveler, the operating system will execute your .profile script before executing the batch job if your login shell is the Korn shell. On the other hand, if your login shell is the Bourne shell, on most operating systems (including AIX), the .profile script is not executed. Similarly, if your login shell is the C shell then AIX will execute your .login script before executing your LoadLeveler batch job but some other variants of UNIX may not invoke this script.

This discrepancy is due to the interaction between the shells and the operating system. To understand the nature of the problem, examine the following C program, which attempts to open a login Korn shell and execute the ls command:

#include <unistd.h>   /* for execl() */

int main(void)
{
    /* An arg0 of "-" requests a login shell; "-c ls" runs the ls command */
    execl("/bin/ksh", "-", "-c", "ls", (char *)NULL);
    return 1;   /* reached only if execl() fails */
}

UNIX documentation in general (SunOS, HP-UX, AIX, IRIX) gives the impression that if the second argument is "-", you get a login shell regardless of whether the first argument is /bin/ksh, /bin/csh, or /bin/sh. In practice, this is not the case. Whether you get a login shell is implementation dependent and varies with the UNIX version you are using. On AIX you get a login shell for /bin/ksh and /bin/csh, but not for the Bourne shell.

If your login shell is the Bourne shell and you would like the operating system to execute your .profile script before starting your batch job, add the following statement to your job command file:

# @ shell = /bin/ksh

LoadLeveler will open a login Korn shell to start your batch job, which may be a shell script of any type (Bourne shell, C shell, or Korn shell) or just a simple executable.

Helpful Hints

This section contains tips on running LoadLeveler, including some productivity aids.

Hints for Running Jobs

Determining When Your Job Started and Stopped

By reading the notification mail you receive after submitting a job, you can determine the time the job was submitted, started, and stopped. Suppose you submit a job and receive the following mail when the job finishes:

 
Submitted at: Sun Apr 30 11:40:41 1996
Started   at: Sun Apr 30 11:45:00 1996
Exited    at: Sun Apr 30 12:49:10 1996
 
Real Time:   0 01:08:29
Job Step User Time:   0 00:30:15
Job Step System Time:   0 00:12:55
Total Job Step Time:   0 00:43:10
 
Starter User Time:   0 00:00:00
Starter System Time:   0 00:00:00
Total Starter Time:   0 00:00:00

This mail tells you the following:

Submitted at
The time you issued the llsubmit command or the time you submitted the job with the graphical user interface.

Started at
The time the starter process executed the job.

Exited at
The actual time your job completed.

Real Time
The wall clock time from submit to completion.

Job Step User Time
The CPU time the job consumed executing in user space.

Job Step System Time
The CPU time the system (AIX) consumed on behalf of the job.

Total Job Step Time
The sum of the two fields above.

Starter User Time
The CPU time consumed by the LoadLeveler starter process for this job, executing in user space. Time consumed by the starter process is the only LoadLeveler overhead which can be directly attributed to a user's job.

Starter System Time
The CPU time the system (AIX) consumed on behalf of the LoadLeveler starter process running for this job.

Total Starter Time
The sum of the two fields above.

You can also get the starting time by issuing llsummary -l -x and then issuing awk '/Date|Event/' against the resulting file. For this to work, you must have ACCT = A_ON A_DETAIL set in the LoadL_config file.
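
For example (the output file name here is arbitrary):

  llsummary -l -x > /tmp/llsum.long
  awk '/Date|Event/' /tmp/llsum.long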

Running Jobs at a Specific Time of Day

Using a machine's local configuration file, you can set up the machine to run jobs only at certain times of day (sometimes called an execution window). The following statements in the local configuration file run jobs between 5:00 PM and 8:00 AM daily, and suspend jobs the rest of the day:

START: (tm_day >= 1700) || (tm_day <= 0800)
SUSPEND: (tm_day > 0800)  && (tm_day < 1700)
CONTINUE: (tm_day >= 1700) || (tm_day <= 0800)

Controlling the Mix of Idle and Running Jobs

Three keywords determine the mix of idle and running jobs for a user. By a running job, we mean a job that is in one of the following states: Running, Pending, or Starting. These keywords, which are described in detail in "Step 2: Specify User Stanzas", are:

maxqueued

Controls the number of jobs in any of these states: Idle, Running, Pending, or Starting.

maxjobs

Controls the number of jobs in any of these states: Running, Pending, or Starting; thus it controls a subset of what maxqueued controls. maxjobs effectively controls the number of jobs in the Running state, since Pending and Starting are usually temporary states.

maxidle

Controls the number of jobs in any of these states: Idle, Pending, or Starting; thus it controls a subset of what maxqueued controls. maxidle effectively controls the number of jobs in the Idle state, since Pending and Starting are usually temporary states.
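
For example, a sketch of a user stanza in the administration file; the user name and limit values are illustrative:

jsmith: type = user
        maxqueued = 10
        maxjobs = 3
        maxidle = 6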

What Happens When You Submit a Job

For a user's job to be allowed into the job queue, the total of that user's other jobs (in the Idle, Pending, Starting, and Running states) must be less than the maxqueued value for that user. Also, the total of idle jobs (those in the Idle, Pending, and Starting states) must be less than the maxidle value for the user. If either of these counts is at its maximum, the job is placed in the Not Queued state until one of the other jobs changes state. If the user is at the maxqueued limit, a job must complete, be cancelled, or be held before the new job can enter the queue. If the user is at the maxidle limit, a job must start running, be cancelled, or be held before the new job can enter the queue.

Once a job is in the queue, it is not taken out of the queue unless the user places a hold on the job, the job completes, or the job is cancelled. (An exception, when you are running the default LoadLeveler scheduler, is a parallel job that does not accumulate sufficient machines in a given time period. Such jobs are moved to the Deferred state, meaning they must vie for the queue again when their Deferred period expires.)

Once a job is in the queue, the job will run unless the user is at the maxjobs limit.

Note the following restrictions for using these keywords:

Sending Output from Several Job Steps to One Output File

You can use dependencies in your job command file to send the output from many job steps to the same output file. For example:

# @ step_name = step1
# @ executable = ssba.job
# @ output = ssba.tmp
# @ ...
# @ queue
#
# @ step_name = append1
# @ dependency = (step1 != CC_REMOVED)
# @ executable = append.ksh
# @ output = /dev/null
# @ queue
# @
# @ step_name = step2
# @ dependency = (append1 == 0)
# @ executable = ssba.job
# @ output = ssba.tmp
# @ ...
# @ queue
# @
# @ step_name = append2
# @ dependency = (step2 != CC_REMOVED)
# @ executable = append.ksh
# @ output = /dev/null
# @ queue
#
# ...

Then, the file append.ksh could contain the line cat ssba.tmp >> ssba.log. All of your output will reside in ssba.log. (Your dependencies can look for different return values, depending on what you need to accomplish.)
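
For example, append.ksh could be as simple as:

#!/bin/ksh
# Append the output of the previous step to the cumulative log
cat ssba.tmp >> ssba.log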

You can achieve the same result from within ssba.job by appending your output to an output file rather than writing it to stdout. Then your output statement for each step would be /dev/null and you wouldn't need the append steps.

Hints for Using Machines

Setting Up a Single Machine To Have Multiple Job Classes

You can define a machine to have multiple job classes which are active at different times. For example, suppose you want a machine to run jobs of Class A any time, and you want the same machine to run Class B jobs between 6 p.m. and 8 a.m.

You can combine the Class keyword with a user-defined macro (called Off_Shift in this example).

For example:

Off_Shift = ((tm_hour >= 18) || (tm_hour < 8))

Then define your START statement:

START : (Class == "A") || ((Class == "B") && $(Off_Shift))

Make sure you have the parentheses around the Off_Shift macro, since the logical OR has a lower precedence than the logical AND in the START statement.

Also, to take weekends into account, code the following statements. Remember that Saturday is day 6 and Sunday is day 0.

Off_Shift = ((tm_wday == 6) || (tm_wday == 0) || (tm_hour >=18) \
|| (tm_hour < 8))
 
Prime_Shift = ((tm_wday != 6) && (tm_wday != 0) && (tm_hour >= 8) \
&& (tm_hour < 18))
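
With these macros defined, the START statement shown earlier still works unchanged. If you also had a class that should run only during the prime shift (Class C here is hypothetical), you could write:

START : (Class == "A") || ((Class == "B") && $(Off_Shift)) || ((Class == "C") && $(Prime_Shift))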

Reporting the Load Average on Machines

You can use the /usr/bin/rup command to report the load average on a machine. The rup machine_name command gives you a report that looks similar to the following:

localhost    up 23 days, 10:25,    load average: 1.72, 1.05, 1.17

You can use this command to report the load average of your local machine or of remote machines. Another command, /usr/bin/uptime, returns the load average information for only your local host.

History Files and schedd

The schedd daemon writes to the spool/history file only when a job is completed or removed. Therefore, you can delete the history file and restart schedd even when some jobs are scheduled to run on other hosts.

However, you should clean up the spool/job_queue.dir and spool/job_queue.pag files only when no jobs are being scheduled on the machine.

You should not delete these files if there are any jobs in the job queue that are being scheduled from this machine (for example, jobs with names such as thismachine.clusterno.jobno).
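
For example, assuming LoadLeveler's spool directory is /u/loadl/spool and that you stop and restart LoadLeveler with llctl (the path and the use of llctl stop/start here are assumptions; adapt them to your installation), the cleanup might look like this:

  llctl stop
  rm /u/loadl/spool/history
  llctl start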

Getting Help from IBM

Should you require help from IBM in resolving a LoadLeveler problem, you can get assistance by calling IBM Support. Before you call, be sure you have the following information:

  1. Your access code (customer number).

  2. The LoadLeveler product number (5765-D61).

  3. The name and version of the operating system you are using.

  4. A telephone number where you can be reached.

In addition, issue the following command:

  llctl version

This command will provide you with code level information. Provide this information to the IBM representative.

The number for IBM support in the United States is 1-800-IBM-4YOU (426-4968).

The facsimile number is 800-2IBM-FAX (242-6329).

