This section contains answers to questions frequently asked by LoadLeveler customers.
If you submitted your job and it is in the LoadLeveler queue but has not
run, issue llq -s first to help diagnose the problem. If you
need more help diagnosing the problem, refer to the following table:
Why Your Job May Not Be Running | Possible Solution |
---|---|
Job requires a specific machine, operating system, or other resource. | Check the GUI to compare the job requirements to the machine details, especially Arch, OpSys, and Class. Ensure that the spelling and capitalization match. |
Job requires a specific job class. | |
The maximum number of jobs are already running on all the eligible machines | Wait until one of the machines finishes a job before scheduling your job. |
The start expression evaluates to false. | Examine the configuration files (both LoadL_config and LoadL_config.local) to determine the START control function expression used by LoadLeveler to start a job. As a problem determination measure, set the START and SUSPEND values as shown in this example (see also the sketch following this table): START: T SUSPEND: F |
The priority of your job is lower than the priority of other jobs. | You cannot affect the system priority that the negotiator daemon gives to this job, but you can use the llprio command or the GUI to change your user priority and move this job ahead of other jobs you previously submitted. |
The information the central manager has about machines and jobs may not be current. | Wait a few minutes for the central manager to be updated and then the job may be dispatched. This time limit (a few minutes) depends upon the polling frequency and polls per update set in the LoadL_config file. The default polling frequency is five minutes. |
You do not have the same user ID on all the machines in the cluster. | To run jobs on any machine in the cluster, you must have the same user name and the same uid number on every machine in the pool. If you do not have a user ID on a machine, your jobs will not be scheduled to that machine. |
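As suggested in the table above, temporarily forcing the START and SUSPEND expressions to constant values can help isolate scheduling problems. A minimal sketch of that override follows; placing it in LoadL_config.local is one option, and you should restore your normal expressions once the diagnosis is complete.

    # Problem-determination override (sketch): jobs always start and are never suspended.
    # Add to LoadL_config.local on the machine under test, then reconfigure LoadLeveler.
    START:   T
    SUSPEND: F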
You can use the llq command to query the status of your job or the llstatus command to query the status of machines in the cluster. Refer to Chapter 9, LoadLeveler Commands for information on these commands.
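For example, the following commands are one way to check on a stalled job and on the machines that could run it (the job step ID is a placeholder):

    llq -s <job_step_id>    # detailed explanation of why the specified job step is not running
    llstatus                # current state of the machines in the cluster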
If you submitted your parallel job and it is in the LoadLeveler queue but
has not run, issue llq -s first to help diagnose the
problem. If issuing this command does not help, refer to the previous
table and to the following table for more information:
Why Your Job May Not Be Running | Possible Solution |
---|---|
The minimum number of processors requested by your job is not available. | Sufficient resources must be available. Specifying a smaller number of processors may help if your job can run with fewer resources. |
The pool in your requirements statement specifies a pool which is invalid or not available. | The specified pool must be valid and available. |
The adapter specified in the requirements statement or the network statement identifies an adapter which is invalid or not available. | The specified adapter must be valid and available. |
PVM3 is not installed. | PVM3 must be installed on any machine you wish to use for PVM jobs. The PVM3 system itself is not supplied with LoadLeveler. |
You are already running a PVM3 job on one of the LoadLeveler machines. | PVM3 restrictions prevent a user from running more than one PVM daemon per user per machine. If you want to run PVM3 jobs under LoadLeveler, you must not run any PVM3 jobs outside of LoadLeveler control on any machine being managed by LoadLeveler. |
The parallel_path keyword in your job command file is incorrect. | The parallel_path keyword tells LoadLeveler where the binaries that run your PVM tasks (started through the pvm_spawn() call) are located. If this path is incorrect, the job may not run. (See the sketch following this table.) |
The pvm_root keyword in the administration file is incorrect. | This keyword corresponds to the pvm ep keyword and is required to tell LoadLeveler where the pvm system is installed. |
The file /tmp/pvmd.userid exists on some LoadLeveler machine but no PVM jobs are running. | If PVM3 exits unexpectedly, it will not properly clean up after itself. Although LoadLeveler attempts to clean up after pvm, some situations are ambiguous and you may have to remove this file yourself. Check all the systems specified as being capable of running PVM3, and remove this file if it exists. |
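To illustrate the parallel_path and pvm_root rows above, here is a rough sketch of where those keywords appear. The path names and the stanza label are made-up examples; the correct parallel_path value depends on where the PVM task binaries are installed at your site.

    # Job command file (sketch): where LoadLeveler should look for PVM task binaries
    # @ parallel_path = /u/someuser/pvm3/bin/RS6K:/usr/lpp/pvm3/bin/RS6K

    # LoadL_admin machine stanza (sketch): where the PVM system itself is installed
    somehost: type = machine
        pvm_root = /usr/lpp/pvm3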
This section presents a list of common problems found in setting up parallel jobs:
If LoadLeveler is to manage PVM jobs on a machine for a user, that user should not attempt to run PVM jobs on that machine outside of LoadLeveler control. Because of PVM restrictions, only a single PVM daemon per user per machine is permitted. If a user tries to run PVM jobs without using LoadLeveler and LoadLeveler later attempts to start a job for that user on the same machine, LoadLeveler may not be able to start PVM for the job. This will cause the LoadLeveler job to be cancelled.
If a PVM job submitted through LoadLeveler is rejected, it is probably because PVM was not correctly terminated the last time it ran on the rejecting machine. LoadLeveler attempts to handle this by cleaning up PVM jobs when they complete, but you may need to clean up after the job yourself. If a machine refuses to start a PVM job, check whether a leftover pvmd daemon is still running under your user ID and, if so, stop it with SIGTERM:
    ps -ef | grep pvmd
    kill -TERM pid
Do not use either of the following variations to stop the daemon, because they prevent pvmd from cleaning up after itself and jobs will still not start:
    kill -9 pid
    kill -KILL pid
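Putting these checks together, a minimal cleanup sequence before resubmitting the job might look like the following. The grep pattern, the pid placeholder, and the uid suffix are illustrative; the daemon file name ends in your numeric user ID, as noted in the table above.

    # Find any leftover pvmd owned by your user ID and stop it cleanly (never with -9 or -KILL)
    ps -ef | grep '[p]vmd'
    kill -TERM <pid>

    # If no PVM job is running but a stale daemon file remains, remove it
    rm /tmp/pvmd.<your_uid>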
If a job you submitted from a submit-only machine does not run, verify that you have defined the following statements in the machine stanza of the administration file of the submit-only machine:
    submit_only = true
    schedd_host = false
    central_manager = false
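In context, the machine stanza on the submit-only machine might look roughly like the following; the stanza label is a made-up host name, so substitute your own.

    # LoadL_admin on the submit-only machine (sketch)
    submitnode.example.com: type = machine
        submit_only     = true
        schedd_host     = false
        central_manager = false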
If a job appears to stay in the Pending or Starting state, it is possible the job is continually being dispatched and rejected. Check the setting of the MAX_JOB_REJECT keyword. If it is set to the default, -1, the job will be rejected an unlimited number of times. Try resetting this keyword to some finite number. Also, check the setting of the ACTION_ON_MAX_REJECT keyword. These keywords are described in Step 17: Specify Additional Configuration File Keywords.
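A sketch of how these keywords might be set in LoadL_config follows. The values shown are examples for problem determination rather than recommendations, and the ACTION_ON_MAX_REJECT value is an assumption; see Step 17 for the actions supported at your level of LoadLeveler.

    # LoadL_config (sketch): stop a job from being dispatched and rejected indefinitely
    MAX_JOB_REJECT       = 10      # the default of -1 allows unlimited rejections
    ACTION_ON_MAX_REJECT = HOLD    # assumed value; see Step 17 for the supported actions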
Both the startd daemon and the schedd daemon maintain a persistent state for every job. Both daemons use a specific protocol to ensure that the state of all jobs is consistent across LoadLeveler, so that in the event of a failure the state can be recovered. Neither the schedd daemon nor the startd daemon discards job state information until it has been passed on to, and accepted by, another daemon in the process.
If | Then |
---|---|
The network goes down but the machines are still running. | When LoadLeveler is restarted, it looks for all jobs that were marked running when it went down. On the machine where the job is running, the startd daemon searches for the job and, if it can verify that the job is still running, continues to manage the job through completion. On the machine where the schedd daemon is running, schedd queues a transaction to the startd to re-establish the state of the job. This transaction stays queued until the state is established; until then, LoadLeveler assumes the state is the same as when the system went down. |
The network partitions or goes down. | All transactions are left queued until the recipient has acknowledged them. Critical transactions such as those between the schedd and startd are recorded on disk. This ensures complete delivery of messages and prevents incorrect decisions based on incomplete state information. |
The machine with startd goes down. | Because job state is maintained on disk in startd, when LoadLeveler is restarted it can forward correct status to the rest of LoadLeveler. In the case of total machine failure, this is usually "JOB VACATED", which causes the job to be restarted elsewhere. In the case that only LoadLeveler failed, it is often possible to "find" the job if it is still running and resume management of it. In this case LoadLeveler sends JOB RUNNING to the schedd and central manager, thereby permitting the job to run to completion. |
The central manager machine goes down. | All machines in the cluster send current status to the central manager on a regular basis. When the central manager restarts, it queries each machine that checks in, requesting the entire queue from each machine. Over a period of a few minutes the central manager restores itself to the state it was in before the failure. Each schedd is responsible for maintaining the correct state of each job as it progresses while the central manager is down, so the central manager is guaranteed to rebuild itself correctly. All jobs started while the central manager was down continue to run and complete normally with no loss of information. Users may continue to submit jobs; these new jobs are forwarded correctly when the central manager is restarted. |
The schedd machine goes down. | When schedd starts up again, it reads its queue of jobs and, for every job that was in an active state (that is, PENDING, STARTING, or RUNNING), queries the machine where the job is marked active. The running machine is required to return the current status of the job. If the job completed while schedd was down, JOB COMPLETE is returned with exit status and accounting information. If the job is running, JOB RUNNING is returned. If the job was vacated, JOB VACATED is returned. Because these messages are left queued until delivery is confirmed, no job will be lost or incorrectly dispatched due to a schedd failure. While the schedd is down, the central manager cannot start new jobs that were submitted to that schedd. To recover the resources allocated to jobs scheduled by a schedd machine, see How Do I Recover Resources Allocated by a schedd Machine?. |
The llsubmit machine goes down | schedd gets its own copy of the executable so it does not matter if the llsubmit machine goes down. |
If a machine fails while a job is running on the machine, the central manager does not change the status of any job on the machine. When the machine comes back up the central manager will be updated.
In one of the machine stanzas in the administration file, you specified a machine to serve as the central manager. A problem such as a network communication failure, or a software or hardware failure, can make this central manager unusable. In such cases, the other machines in the LoadLeveler cluster believe that the central manager machine is no longer operating. If you assigned one or more alternate central managers in the machine stanza, a new central manager will take control. The alternate central manager is chosen based upon the order in which its machine stanza appears in the administration file.
Once an alternate central manager takes control, it starts up its negotiator daemon and notifies all of the other machines in the LoadLeveler cluster that a new central manager has been selected. The following diagram illustrates how a machine can become the alternate central manager:
Figure 36. When the Primary Central Manager is Unavailable
The diagram illustrates that Machine Z is the primary central manager but Machine A took control of the LoadLeveler cluster by becoming the alternate central manager. Machine A remains in control as the alternate central manager until either:
The following diagram illustrates how multiple central managers can function within the same LoadLeveler pool:
Figure 37. Multiple Central Managers
In this diagram, the primary central manager is serving Machines A and B. Due to a network failure, Machines C, D, and E have lost contact with the primary central manager machine; therefore, Machine C, which is authorized to serve as an alternate central manager, assumes that role. Machine C remains as the alternate central manager until either:
While LoadLeveler can handle this situation of two concurrent central managers without any loss of integrity, some installations may find administering it somewhat confusing. To avoid any confusion, you should specify all primary and alternate central managers on the same LAN segment.
For information on selecting alternate central managers, refer to Step 1: Specify Machine Stanzas.
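As a rough illustration of how a primary and an alternate central manager are designated in their machine stanzas, consider the following sketch. The host names are made up, and the keyword value used for the alternate (alt) should be verified against Step 1: Specify Machine Stanzas.

    # LoadL_admin (sketch): one primary and one alternate central manager
    machine_z.example.com: type = machine
        central_manager = true      # primary central manager
    machine_a.example.com: type = machine
        central_manager = alt       # alternate; chosen by order of appearance in the file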
If a node running the schedd daemon fails, resources allocated to jobs scheduled by this schedd cannot be freed up until you restart the schedd. Administrators must do the following to enable the recovery of schedd resources:
The master daemon starts the startd daemon, and the startd daemon starts the starter process. The starter process runs the job, and the job must run under the user ID of the submitter. You would either have to run a separate master daemon for every user ID on the system, or the master daemon must be able to su to every user ID; the only user ID that can su to any other user ID is root.
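As a quick illustration of that last point (the user name is a made-up example):

    # Run as root: su switches to the target user with no password prompt
    su - someuser -c 'id'
    # Run as an ordinary user: the same command prompts for someuser's password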
When you submit a batch job to LoadLeveler, the operating system will execute your .profile script before executing the batch job if your login shell is the Korn shell. On the other hand, if your login shell is the Bourne shell, on most operating systems (including AIX), the .profile script is not executed. Similarly, if your login shell is the C shell then AIX will execute your .login script before executing your LoadLeveler batch job but some other variants of UNIX may not invoke this script.
This discrepancy is due to the way the shells interact with the operating system. To understand the nature of the problem, examine the following C program that attempts to open a login Korn shell and execute the "ls" command:
    #include <stdio.h>
    #include <unistd.h>    /* for execl() */

    int main(void)
    {
        execl("/bin/ksh", "-", "-c", "ls", (char *)NULL);
        return 1;          /* reached only if execl() fails */
    }
UNIX documentation in general (SunOS, HP-UX, AIX, IRIX) gives the impression that if the second argument is "-", you get a login shell regardless of whether the first argument is /bin/ksh, /bin/csh, or /bin/sh. In practice, this is not the case: whether you get a login shell is implementation dependent and varies with the version of UNIX you are using. On AIX, you get a login shell for /bin/ksh and /bin/csh, but not for /bin/sh (the Bourne shell).
If your login shell is the Bourne shell and you would like the operating system to execute your .profile script before starting your batch job, add the following statement to your job command file:
    # @ shell = /bin/ksh
LoadLeveler will open a login Korn shell to start your batch job which may be a shell script of any type (Bourne shell, C shell, or Korn shell) or just a simple executable.
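For example, a small job command file along these lines requests a login Korn shell for the job; the executable and output file names are placeholders.

    # Job command file (sketch)
    # @ shell      = /bin/ksh
    # @ executable = /u/someuser/myjob.sh
    # @ output     = myjob.out
    # @ error      = myjob.err
    # @ queue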
When you create a mksysb (an image of the currently installed operating system) at a time when LoadLeveler is running jobs, the state of the jobs is saved as part of the mksysb. When the mksysb is restored on a node, those jobs will appear to be on the node, in the same state as when they were saved, even though the jobs are not actually there. To delete these phantom jobs, you must remove all files from the LoadLeveler spool and execute directories and then restart LoadLeveler.
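A hedged sketch of that cleanup follows. The directory paths are assumptions; check the spool and execute directory settings in your configuration for the actual locations before removing anything.

    # Stop LoadLeveler on the restored node, clear the saved job state, then restart
    llctl stop
    rm -f /var/loadl/spool/*       # assumed spool directory; verify against your configuration
    rm -f /var/loadl/execute/*     # assumed execute directory; verify against your configuration
    llctl start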