A network job management and job scheduling system, such as LoadLeveler, is a software program that schedules and manages jobs that you submit to one or more machines under its control. LoadLeveler accepts jobs that users submit and reviews the job requirements. LoadLeveler then examines the machines under its control to determine which machines are best suited to run each job.
LoadLeveler schedules your jobs on one or more machines for processing. In this context, a job is a set of job steps. For each job step, you can specify a different executable (the executable is the part of the job that gets processed). You can use LoadLeveler to submit jobs that are made up of one or more job steps, where each job step can depend on the completion status of a previous job step. For example, Figure 2 illustrates a stream of job steps:
Figure 2. LoadLeveler Job Steps
Each of these job steps is defined in a single job command file. A job command file specifies the name of the job, as well as the job steps that you want to submit, and can contain other LoadLeveler statements.
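As an illustration of the description above, the following is a minimal sketch of a job command file with two job steps, where the second step runs only if the first one exits with a status of 0. The job name, step names, paths, and file names are hypothetical and chosen only for this example. The keywords shown (job_name, step_name, dependency, executable, output, error, queue) follow the usual job command file syntax, but check them against the keyword reference for your LoadLeveler release.

```
# Hypothetical job command file with two dependent job steps.
# All names and paths below are illustrative assumptions.
# @ job_name   = example_job
#
# First job step: runs unconditionally.
# @ step_name  = step1
# @ executable = /u/example/bin/prepare_data
# @ output     = step1.out
# @ error      = step1.err
# @ queue
#
# Second job step: runs only if step1 completed with exit status 0.
# @ step_name  = step2
# @ dependency = (step1 == 0)
# @ executable = /u/example/bin/process_data
# @ output     = step2.out
# @ error      = step2.err
# @ queue
```

Each queue statement marks the end of one job step, so this file describes a single job made up of two steps.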
LoadLeveler tries to execute each of your job steps on a machine that has enough resources to support executing and checkpointing each step. If your job command file has multiple job steps, the job steps will not necessarily run on the same machine, unless you explicitly request that they do.
You can submit batch jobs to LoadLeveler for scheduling. Batch jobs run in the background and generally do not require any input from the user. Batch jobs can be either serial or parallel. A serial job runs on a single machine. A parallel job is a program designed to execute as a number of individual, but related, processes on one or more of your system's nodes. When executed, these related processes can communicate with each other (through message passing or shared memory) to exchange data or synchronize their execution.
LoadLeveler will execute two different types of parallel jobs:
job_type = PVM
job_type = parallel
With a job_type of PVM, LoadLeveler supports a PVM API to allocate nodes and launch tasks. With a job_type of parallel, LoadLeveler interacts with the Parallel Operating Environment (POE) to allocate nodes, assign tasks to nodes, and launch tasks.
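To make the second case concrete, here is a hedged sketch of a job command file for a job_type of parallel that starts a program through POE. The job name, program path, and file names are hypothetical, and the location /usr/bin/poe is an assumption about where POE is installed on your system. The keywords that request a number of nodes or tasks differ between LoadLeveler releases, so they are deliberately left out here.

```
# Hypothetical job command file for a parallel job run through POE.
# Paths and names are illustrative assumptions.
# @ job_name   = parallel_example
# @ job_type   = parallel
# POE is the executable; the parallel program is passed as its argument.
# @ executable = /usr/bin/poe
# @ arguments  = /u/example/bin/my_parallel_program
# @ output     = parallel_example.out
# @ error      = parallel_example.err
# @ queue
```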
In order for LoadLeveler to schedule a job on a machine, the machine must be a valid member of the LoadLeveler cluster. A cluster is the combination of all of the different types of machines that use LoadLeveler. The following types of machines can comprise a LoadLeveler cluster:
To make a machine a member of the LoadLeveler cluster, the administrator has to install the LoadLeveler software onto the machine and identify the central manager (described in Roles of Machines). Once a machine becomes a valid member of the cluster, LoadLeveler can schedule jobs to it.
Each machine in the LoadLeveler cluster performs one or more roles in scheduling jobs. These roles are described below:
Keep in mind that one machine can assume multiple roles.
There may be times when some of the machines in the LoadLeveler cluster are not available to process jobs; for instance, when the owners of the machines have decided to make them unavailable. This ability of LoadLeveler to allow users to restrict the use of their machines provides flexibility and control over the resources.
Machine owners can make their personal workstations available to other LoadLeveler users in several ways. For example, you can specify that:
Owners can also specify that their personal workstations never be made available to other LoadLeveler users.