This section contains answers to questions frequently asked by LoadLeveler customers.
If you submitted your job and it is in the LoadLeveler queue but has not
run, issue llq -s first to help diagnose the problem. If you
need more help diagnosing the problem, refer to the following table:
Why Your Job May Not Be Running | Possible Solution |
---|---|
Job requires a specific machine, operating system, or other resource. | Check the GUI to compare the job requirements to the machine details, especially Arch, OpSys, and Class. Ensure that the spelling and capitalization match. |
Job requires a specific job class. | |
The maximum number of jobs are already running on all the eligible machines | Wait until one of the machines finishes a job before scheduling your job. |
The start expression evaluates to false. | Examine the configuration files (both LoadL_config and LoadL_config.local) to determine the START control function expression used by LoadLeveler to start a job. As a problem determination measure, set the START and SUSPEND values as shown in this example (see also the sketch following this table): START: T SUSPEND: F |
The priority of your job is lower than the priority of other jobs. | You cannot affect the system priority that the negotiator daemon gives to this job, but you can use the llprio command or the GUI to change your user priority and move this job ahead of other jobs you previously submitted. |
The information the central manager has about machines and jobs may not be current. | Wait a few minutes for the central manager to be updated and then the job may be dispatched. This time limit (a few minutes) depends upon the polling frequency and polls per update set in the LoadL_config file. The default polling frequency is five minutes. |
You do not have the same user ID on all the machines in the cluster. | To run jobs on any machine in the cluster, you must have the same user name and the same uid number on every machine in the pool. If you do not have a user ID on a machine, your jobs will not be scheduled to that machine. |
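As suggested in the table above, temporarily forcing the START and SUSPEND expressions to constant values can help isolate scheduling problems. A minimal sketch of that override follows; placing it in LoadL_config.local is one option, and you should restore your normal expressions once the diagnosis is complete.

    # Problem-determination override (sketch): jobs always start and are never suspended.
    # Add to LoadL_config.local on the machine under test, then reconfigure LoadLeveler.
    START:   T
    SUSPEND: F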
You can use the llq command to query the status of your job or the llstatus command to query the status of machines in the cluster. Refer to Chapter 9, LoadLeveler Commands for information on these commands.
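For example, the following commands are one way to check on a stalled job and on the machines that could run it (the job step ID is a placeholder):

    llq -s <job_step_id>    # detailed explanation of why the specified job step is not running
    llstatus                # current state of the machines in the cluster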
If you submitted your parallel job and it is in the LoadLeveler queue but
has not run, issue llq -s first to help diagnose the
problem. If issuing this command does not help, refer to the previous
table and to the following table for more information:
Why Your Job May Not Be Running | Possible Solution |
---|---|
The minimum number of processors requested by your job is not available. | Sufficient resources must be available. Specifying a smaller number of processors may help if your job can run with fewer resources. |
The pool in your requirements statement specifies a pool which is invalid or not available. | The specified pool must be valid and available. |
The adapter specified in the requirements statement or the network statement identifies an adapter which is invalid or not available. | The specified adapter must be valid and available. |
PVM3 is not installed. | PVM3 must be installed on any machine you wish to use for PVM jobs. The PVM3 system itself is not supplied with LoadLeveler. |
You are already running a PVM3 job on one of the LoadLeveler machines. | PVM3 restrictions prevent a user from running more than one PVM daemon per user per machine. If you want to run PVM3 jobs under LoadLeveler, you must not run any PVM3 jobs outside of LoadLeveler control on any machine being managed by LoadLeveler. |
The parallel_path keyword in your job command file is incorrect. | The parallel_path keyword tells LoadLeveler where the binaries that run your PVM tasks (started through the pvm_spawn() call) are located. If this path is incorrect, the job may not run. (See the sketch following this table.) |
The pvm_root keyword in the administration file is incorrect. | This keyword corresponds to the pvm ep keyword and is required to tell LoadLeveler where the pvm system is installed. |
The file /tmp/pvmd.userid exists on some LoadLeveler machine but no PVM jobs are running. | If PVM3 exits unexpectedly, it will not properly clean up after itself. Although LoadLeveler attempts to clean up after pvm, some situations are ambiguous and you may have to remove this file yourself. Check all the systems specified as being capable of running PVM3, and remove this file if it exists. |
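To illustrate the parallel_path and pvm_root rows above, here is a rough sketch of where those keywords appear. The path names and the stanza label are made-up examples; the correct parallel_path value depends on where the PVM task binaries are installed at your site.

    # Job command file (sketch): where LoadLeveler should look for PVM task binaries
    # @ parallel_path = /u/someuser/pvm3/bin/RS6K:/usr/lpp/pvm3/bin/RS6K

    # LoadL_admin machine stanza (sketch): where the PVM system itself is installed
    somehost: type = machine
        pvm_root = /usr/lpp/pvm3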
This section presents a list of common problems found in setting up parallel jobs:
If LoadLeveler is to manage PVM jobs on a machine for a user, that user should not attempt to run PVM jobs on that machine outside of LoadLeveler control. Because of PVM restrictions, only a single PVM daemon per user per machine is permitted. If a user tries to run PVM jobs without using LoadLeveler and LoadLeveler later attempts to start a job for that user on the same machine, LoadLeveler may not be able to start PVM for the job. This will cause the LoadLeveler job to be cancelled.
If a PVM job submitted through LoadLeveler is rejected, it is probably because PVM was not correctly terminated the last time it ran on the rejecting machine. LoadLeveler attempts to handle this by cleaning up PVM jobs when they complete, but you may need to clean up after the job yourself. If a machine refuses to start a PVM job, check whether a leftover pvmd daemon is still running under your user ID and, if so, stop it with SIGTERM:
    ps -ef | grep pvmd
    kill -TERM pid
Do not use either of the following variations to stop the daemon, because they prevent pvmd from cleaning up after itself and jobs will still not start:
    kill -9 pid
    kill -KILL pid
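Putting these checks together, a minimal cleanup sequence before resubmitting the job might look like the following. The grep pattern, the pid placeholder, and the uid suffix are illustrative; the daemon file name ends in your numeric user ID, as noted in the table above.

    # Find any leftover pvmd owned by your user ID and stop it cleanly (never with -9 or -KILL)
    ps -ef | grep '[p]vmd'
    kill -TERM <pid>

    # If no PVM job is running but a stale daemon file remains, remove it
    rm /tmp/pvmd.<your_uid>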
If a job you submitted from a submit-only machine does not run, verify that you have defined the following statements in the machine stanza of the administration file of the submit-only machine:
    submit_only = true
    schedd_host = false
    central_manager = false
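In context, the machine stanza on the submit-only machine might look roughly like the following; the stanza label is a made-up host name, so substitute your own.

    # LoadL_admin on the submit-only machine (sketch)
    submitnode.example.com: type = machine
        submit_only     = true
        schedd_host     = false
        central_manager = false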
If a job appears to stay in the Pending or Starting state, it is possible the job is continually being dispatched and rejected. Check the setting of the MAX_JOB_REJECT keyword. If it is set to the default, -1, the job will be rejected an unlimited number of times. Try resetting this keyword to some finite number. Also, check the setting of the ACTION_ON_MAX_REJECT keyword. These keywords are described in Step 17: Specify Additional Configuration File Keywords.
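A sketch of how these keywords might be set in LoadL_config follows. The values shown are examples for problem determination rather than recommendations, and the ACTION_ON_MAX_REJECT value is an assumption; see Step 17 for the actions supported at your level of LoadLeveler.

    # LoadL_config (sketch): stop a job from being dispatched and rejected indefinitely
    MAX_JOB_REJECT       = 10      # the default of -1 allows unlimited rejections
    ACTION_ON_MAX_REJECT = HOLD    # assumed value; see Step 17 for the supported actions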
Both the startd daemon and the schedd daemon maintain a persistent state for every job. Both daemons use a specific protocol to ensure that the state of all jobs is consistent across LoadLeveler, so that in the event of a failure the state can be recovered. Neither the schedd daemon nor the startd daemon discards job state information until it has been passed on to, and accepted by, another daemon in the process.
If | Then |
---|---|
The network goes down but the machines are still running. | When LoadLeveler is restarted, it looks for all jobs that were marked running when it went down. On the machine where the job is running, the startd daemon searches for the job and, if it can verify that the job is still running, continues to manage the job through completion. On the machine where the schedd daemon is running, schedd queues a transaction to the startd to re-establish the state of the job. This transaction stays queued until the state is established; until then, LoadLeveler assumes the state is the same as when the system went down. |
The network partitions or goes down. | All transactions are left queued until the recipient has acknowledged them. Critical transactions such as those between the schedd and startd are recorded on disk. This ensures complete delivery of messages and prevents incorrect decisions based on incomplete state information. |
The machine with startd goes down. | Because job state is maintained on disk in startd, when LoadLeveler is restarted it can forward correct status to the rest of LoadLeveler. In the case of total machine failure, this is usually "JOB VACATED", which causes the job to be restarted elsewhere. In the case that only LoadLeveler failed, it is often possible to "find" the job if it is still running and resume management of it. In this case LoadLeveler sends JOB RUNNING to the schedd and central manager, thereby permitting the job to run to completion. |
The central manager machine goes down. | All machines in the cluster send current status to the central manager on a regular basis. When the central manager restarts, it queries each machine that checks in, requesting the entire queue from each machine. Over a period of a few minutes the central manager restores itself to the state it was in before the failure. Each schedd is responsible for maintaining the correct state of each job as it progresses while the central manager is down, so the central manager is guaranteed to rebuild itself correctly. All jobs started while the central manager was down continue to run and complete normally with no loss of information. Users may continue to submit jobs; these new jobs are forwarded correctly when the central manager is restarted. |
The schedd machine goes down. | When schedd starts up again, it reads its queue of jobs and, for every job that was in an active state (that is, PENDING, STARTING, or RUNNING), queries the machine where the job is marked active. The running machine is required to return the current status of the job. If the job completed while schedd was down, JOB COMPLETE is returned with exit status and accounting information. If the job is running, JOB RUNNING is returned. If the job was vacated, JOB VACATED is returned. Because these messages are left queued until delivery is confirmed, no job will be lost or incorrectly dispatched due to a schedd failure. While the schedd is down, the central manager cannot start new jobs that were submitted to that schedd. To recover the resources allocated to jobs scheduled by a schedd machine, see How Do I Recover Resources Allocated by a schedd Machine?. |
The llsubmit machine goes down | schedd gets its own copy of the executable so it does not matter if the llsubmit machine goes down. |
If a machine fails while a job is running on the machine, the central manager does not change the status of any job on the machine. When the machine comes back up the central manager will be updated.
In one of the machine stanzas in the administration file, you specified a machine to serve as the central manager. A problem such as a network communication failure, or a software or hardware failure, can make this central manager unusable. In such cases, the other machines in the LoadLeveler cluster believe that the central manager machine is no longer operating. If you assigned one or more alternate central managers in the machine stanza, a new central manager will take control. The alternate central manager is chosen based upon the order in which its machine stanza appears in the administration file.
Once an alternate central manager takes control, it starts up its negotiator daemon and notifies all of the other machines in the LoadLeveler cluster that a new central manager has been selected. The following diagram illustrates how a machine can become the alternate central manager:
Figure 36. When the Primary Central Manager is Unavailable
The diagram illustrates that Machine Z is the primary central manager but Machine A took control of the LoadLeveler cluster by becoming the alternate central manager. Machine A remains in control as the alternate central manager until either:
The following diagram illustrates how multiple central managers can function within the same LoadLeveler pool:
Figure 37. Multiple Central Managers
In this diagram, the primary central manager is serving Machines A and B. Due to a network failure, Machines C, D, and E have lost contact with the primary central manager machine; therefore, Machine C, which is authorized to serve as an alternate central manager, assumes that role. Machine C remains as the alternate central manager until either:
While LoadLeveler can handle this situation of two concurrent central managers without any loss of integrity, some installations may find administering it somewhat confusing. To avoid any confusion, you should specify all primary and alternate central managers on the same LAN segment.
For information on selecting alternate central managers, refer to Step 1: Specify Machine Stanzas.
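As a rough illustration of how a primary and an alternate central manager are designated in their machine stanzas, consider the following sketch. The host names are made up, and the keyword value used for the alternate (alt) should be verified against Step 1: Specify Machine Stanzas.

    # LoadL_admin (sketch): one primary and one alternate central manager
    machine_z.example.com: type = machine
        central_manager = true      # primary central manager
    machine_a.example.com: type = machine
        central_manager = alt       # alternate; chosen by order of appearance in the file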
If a node running the schedd daemon fails, resources allocated to jobs scheduled by this schedd cannot be freed up until you restart the schedd. Administrators must do the following to enable the recovery of schedd resources:
The master daemon starts the startd daemon, and the startd daemon starts the starter process. The starter process runs the job, and the job must run under the user ID of the submitter. You would either have to run a separate master daemon for every user ID on the system, or the master daemon must be able to su to every user ID; the only user ID that can su to any other user ID is root.
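As a quick illustration of that last point (the user name is a made-up example):

    # Run as root: su switches to the target user with no password prompt
    su - someuser -c 'id'
    # Run as an ordinary user: the same command prompts for someuser's password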
When you submit a batch job to LoadLeveler, the operating system will execute your .profile script before executing the batch job if your login shell is the Korn shell. On the other hand, if your login shell is the Bourne shell, on most operating systems (including AIX), the .profile script is not executed. Similarly, if your login shell is the C shell then AIX will execute your .login script before executing your LoadLeveler batch job but some other variants of UNIX may not invoke this script.
This discrepancy is due to the way the shells interact with the operating system. To understand the nature of the problem, examine the following C program that attempts to open a login Korn shell and execute the "ls" command:
    #include <stdio.h>
    #include <unistd.h>    /* for execl() */

    int main(void)
    {
        execl("/bin/ksh", "-", "-c", "ls", (char *)NULL);
        return 1;          /* reached only if execl() fails */
    }
UNIX documentation in general (SunOS, HP-UX, AIX, IRIX) gives the impression that if the second argument is "-", you get a login shell regardless of whether the first argument is /bin/ksh, /bin/csh, or /bin/sh. In practice, this is not the case: whether you get a login shell is implementation dependent and varies with the version of UNIX you are using. On AIX, you get a login shell for /bin/ksh and /bin/csh, but not for /bin/sh (the Bourne shell).
If your login shell is the Bourne shell and you would like the operating system to execute your .profile script before starting your batch job, add the following statement to your job command file:
    # @ shell = /bin/ksh
LoadLeveler will open a login Korn shell to start your batch job which may be a shell script of any type (Bourne shell, C shell, or Korn shell) or just a simple executable.
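For example, a small job command file along these lines requests a login Korn shell for the job; the executable and output file names are placeholders.

    # Job command file (sketch)
    # @ shell      = /bin/ksh
    # @ executable = /u/someuser/myjob.sh
    # @ output     = myjob.out
    # @ error      = myjob.err
    # @ queue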
When you create a mksysb (an image of the currently installed operating system) at a time when LoadLeveler is running jobs, the state of the jobs is saved as part of the mksysb. When the mksysb is restored on a node, those jobs will appear to be on the node, in the same state as when they were saved, even though the jobs are not actually there. To delete these phantom jobs, you must remove all files from the LoadLeveler spool and execute directories and then restart LoadLeveler.
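A hedged sketch of that cleanup follows. The directory paths are assumptions; check the spool and execute directory settings in your configuration for the actual locations before removing anything.

    # Stop LoadLeveler on the restored node, clear the saved job state, then restart
    llctl stop
    rm -f /var/loadl/spool/*       # assumed spool directory; verify against your configuration
    rm -f /var/loadl/execute/*     # assumed execute directory; verify against your configuration
    llctl start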