IBM Books

Using and Administering


Appendix B. Customer Case Studies

This appendix profiles several LoadLeveler customers and includes configuration information for each. The profiles highlight how customers in different industries use LoadLeveler.

Note that all of these configurations apply to Version 1 Release 3 of the default LoadLeveler scheduler unless otherwise noted.


Customer 1: Technical Computing at the Cornell Theory Center

The Cornell Theory Center (CTC) of Cornell University provides a high-performance computing environment to advance and facilitate research and education.

System Configuration

The CTC runs a 160-node SP with 16 wide nodes and 144 thin nodes. The SP nodes include two interactive nodes and two submit-only nodes. The majority of the other SP nodes run batch jobs. The LoadLeveler central manager runs on a workstation outside of the SP. Also, two other non-SP workstations act as schedd hosts.

LoadLeveler Configuration

The CTC runs parallel jobs by disabling the default LoadLeveler scheduler (SCHEDULER_API=YES) and running an external scheduler. The CTC developed this scheduler to meet the needs of its users.

The following figures represent sections of the CTC's LoadL_admin file. Note that not all nodes are shown here.

#############################################################################
# DEFAULTS FOR MACHINE, CLASS, USER, AND GROUP STANZAS:
# Remove initial # (comment), and edit to suit.
#############################################################################
default:        type = machine
                central_manager = false  # default not central manager
                schedd_host = false      # default not a public scheduler
                submit_only = false      # default not a submit-only machine
                pvm_root = /usr/local/app/pvm3  # default pvm3 directory
                rm_host = true           # default is parallel SP2 node
#               speed = 1                # default machine speed
#               cpu_speed_scale = false  # scale cpu limits by speed
 
default:        type = class             # default class stanza
#               priority = 0             # default ClassSysprio
#               max_processors = -1      # default max processors for class (no
                                         # limit)
default:        type = user              # default user stanza
#               priority = 0             # default UserSysprio
                default_class = DSI      # default class
                default_group = No_Group # default group = No_Group (not
                                         # optional)
#               maxjobs = -1             # default maximum jobs user is allowed
                                         # to run simultaneously (no limit)
#               maxqueued = -1           # default maximum jobs user is allowed
                                         # on system queue (no limit).  does not
                                         # limit jobs submitted.
 
default:        type = group             # default group stanza
#               priority = 0             # default GroupSysprio
#               maxjobs = -1             # default maximum jobs group is allowed
                                         # to run simultaneously (no limit)
#               maxqueued = -1           # default maximum jobs group is allowed
                                         # on system queue (no limit).  does not
                                         # limit jobs submitted.
#############################################################################
# MACHINE STANZAS:
# These are the machine stanzas; the first machine is defined as
# the central manager.  mach1:, mach2:, etc. are machine name labels -
# revise these placeholder labels with the names of the machines in the
# pool, and specify any schedd_host and submit_only keywords and values
# (true or false), if required.
#############################################################################
 
# spscheduler is a 43P running EASY-LL and the Central Manager
spscheduler.tc.cornell.edu:   type = machine
                              central_manager = true
                              rm_host =false
 
# ctc1 and ctc2 are two 43P's running as dedicated SchedDs
ctc1.tc.cornell.edu: type = machine
                            schedd_host = true
 
ctc2.tc.cornell.edu: type = machine
                            schedd_host = true
 
# Submit only node for Sweb server
arms.tc.cornell.edu:  type = machine
                      submit_only = true
#
#   Nodes of the SP2
#
# Rack 1
#
# PIOFS name server, HiPPi router, Switch & JMD primary
#r01n01.tc.cornell.edu:   type = machine
#                         alias = r01n01-css
# r01n02 & r01n05 are interactive nodes
r01n03.tc.cornell.edu:   type = machine
                         alias = r01n03-css
                         submit_only = true
r01n05.tc.cornell.edu:   type = machine
                         alias = r01n05-css
                         submit_only = true
r01n07.tc.cornell.edu:   type = machine
                         alias = r01n07-css
r01n09.tc.cornell.edu:   type = machine
                         alias = r01n09-css
r01n11.tc.cornell.edu:   type = machine
                         alias = r01n11-css
r01n13.tc.cornell.edu:   type = machine
                         alias = r01n13-css
r01n15.tc.cornell.edu:   type = machine
                         alias = r01n15-css
#
# Rack 2
#
# HPSS/PIOFS backup
#r02n01.tc.cornell.edu:   type = machine
#                         alias = r02n01-css
# r02n03, r02n05, r02n07, r02n09 are splong nodes
r02n03.tc.cornell.edu:   type = machine
                         alias = r02n03-css
                         submit_only = true
r02n05.tc.cornell.edu:   type = machine
                         alias = r02n05-css
                         submit_only = true
r02n07.tc.cornell.edu:   type = machine
                         alias = r02n07-css
                         submit_only = true
r02n09.tc.cornell.edu:   type = machine
                         alias = r02n09-css
                         submit_only = true
# VIS node
#r02n11.tc.cornell.edu:   type = machine
#                         alias = r02n11-css
r02n13.tc.cornell.edu:   type = machine
                         alias = r02n13-css
r02n15.tc.cornell.edu:   type = machine
                         alias = r02n15-css
#
# Rack 3
#
r03n01.tc.cornell.edu:   type = machine
                         alias = r03n01-css
r03n02.tc.cornell.edu:   type = machine
                         alias = r03n02-css
r03n03.tc.cornell.edu:   type = machine
                         alias = r03n03-css
r03n04.tc.cornell.edu:   type = machine
                         alias = r03n04-css
r03n05.tc.cornell.edu:   type = machine
                         alias = r03n05-css
r03n06.tc.cornell.edu:   type = machine
                         alias = r03n06-css
r03n07.tc.cornell.edu:   type = machine
                         alias = r03n07-css
r03n08.tc.cornell.edu:   type = machine
                         alias = r03n08-css
r03n09.tc.cornell.edu:   type = machine
                         alias = r03n09-css
r03n10.tc.cornell.edu:   type = machine
                         alias = r03n10-css
r03n11.tc.cornell.edu:   type = machine
                         alias = r03n11-css
r03n12.tc.cornell.edu:   type = machine
                         alias = r03n12-css
r03n13.tc.cornell.edu:   type = machine
                         alias = r03n13-css
r03n14.tc.cornell.edu:   type = machine
                         alias = r03n14-css
r03n15.tc.cornell.edu:   type = machine
                         alias = r03n15-css
# ATM/FDDI routing node
#r03n16.tc.cornell.edu:   type = machine
#                         alias = r03n16-css
 
 
#
# Rack 4
#
r04n01.tc.cornell.edu:   type = machine
                         alias = r04n01-css
r04n02.tc.cornell.edu:   type = machine
                         alias = r04n02-css
r04n03.tc.cornell.edu:   type = machine
                         alias = r04n03-css
r04n04.tc.cornell.edu:   type = machine
                         alias = r04n04-css
r04n05.tc.cornell.edu:   type = machine
                         alias = r04n05-css
r04n06.tc.cornell.edu:   type = machine
                         alias = r04n06-css
r04n07.tc.cornell.edu:   type = machine
                         alias = r04n07-css
r04n08.tc.cornell.edu:   type = machine
                         alias = r04n08-css
r04n09.tc.cornell.edu:   type = machine
                         alias = r04n09-css
r04n10.tc.cornell.edu:   type = machine
                         alias = r04n10-css
r04n11.tc.cornell.edu:   type = machine
                         alias = r04n11-css
# r04n12 - r04n16 HPSS nodes
#r04n12.tc.cornell.edu:   type = machine
#                         alias = r04n12-css
#r04n13.tc.cornell.edu:   type = machine
#                         alias = r04n13-css
#r04n14.tc.cornell.edu:   type = machine
#                         alias = r04n14-css
#r04n15.tc.cornell.edu:   type = machine
#                         alias = r04n15-css
#r04n16.tc.cornell.edu:   type = machine
#                         alias = r04n16-css
#
#############################################################################
# CLASS STANZAS: (optional)
# These are sample class stanzas; small, medium, large, and nqs are sample
# labels for job classes - revise these labels and specify attributes
# to each class.
#############################################################################
DSI:       type = class
 
piofs:     type = class
#############################################################################
 
 

The following represents the CTC's LoadL_config file.

#
# Machine Description
#
ARCH = R6000
 
#
#  Specify LoadLeveler Administrators here:
#
LOADL_ADMIN = loadl admin1 admin2 admin3 admin4
 
#
# Default to starting LoadLeveler daemons when requested
#
START_DAEMONS = TRUE
 
#
# Machine authentication
#
# If TRUE, only connections from machines in the ADMIN_LIST are accepted.
# If FALSE, connections from any machine are accepted.  Default if not
# specified is FALSE.
#
MACHINE_AUTHENTICATE = FALSE
 
#
# Specify which daemons run on each node
#
SCHEDD_RUNS_HERE = False
STARTD_RUNS_HERE = True
 
#
# Specify information for backup central manager
#
# CENTRAL_MANAGER_HEARTBEAT_INTERVAL = 300
# CENTRAL_MANAGER_TIMEOUT = 6
#
# Specify pathnames
#
RELEASEDIR = /usr/lpp/LoadL/nfs
LOCAL_CONFIG = $(tilde)/local/configs/LoadL_config.$(host)
ADMIN_FILE = $(tilde)/LoadL_admin
LOG = /var/loadl/log
SPOOL = /var/loadl/spool
EXECUTE = /var/loadl/execute
HISTORY = $(SPOOL)/history
BIN = $(RELEASEDIR)/bin
LIB = $(RELEASEDIR)/lib
ETC = $(RELEASEDIR)/etc
#
# Specify port numbers
#
COLLECTOR_STREAM_PORT  = 9612
MASTER_STREAM_PORT     = 9616
NEGOTIATOR_STREAM_PORT = 9614
SCHEDD_STREAM_PORT     = 9605
STARTD_STREAM_PORT     = 9611
COLLECTOR_DGRAM_PORT   = 9613
STARTD_DGRAM_PORT      = 9615
MASTER_DGRAM_PORT      = 9617
SCHEDULER_API          = YES
SCHEDULER_PORT         = 9624
 
#
# Specify accounting controls
#
ACCT    = A_ON
ACCT_VALIDATION  = $(BIN)/llacctval
GLOBAL_HISTORY  = $(SPOOL)
 
#
# Specify prolog and epilog path names
#
JOB_PROLOG = $(ETC)/llprolog
JOB_EPILOG = $(ETC)/llepilog
JOB_USER_PROLOG = $(ETC)/ll_user_prolog
JOB_USER_EPILOG = $(ETC)/ll_user_epilog
#
#
# Refresh AFS token program.
#
AFS_GETNEWTOKEN = $(ETC)/tokenreviveclient
#
# Customized mail delivery program.
#
# MAIL =
 
#
# Customized submit (job command file) filter program.
#
# SUBMIT_FILTER =
 
#
# Specify checkpointing intervals
#
MIN_CKPT_INTERVAL    = 900
MAX_CKPT_INTERVAL    = 7200
 
# LoadL_KeyboardD Macros
#
KBDD                = $(BIN)/LoadL_kbdd
KBDD_LOG            = $(LOG)/KbdLog
MAX_KBDD_LOG        = 64000
KBDD_DEBUG          =
 
#
# Specify whether to start the keyboard daemon
#
 
X_RUNS_HERE   = False
 
#
# Specify whether to use X server XGetIdleTime() protocol extension
#
 
USE_X_IDLE_EXTENSION = False
 
#
#  LoadL_StartD Macros
#
STARTD   = $(BIN)/LoadL_startd
STARTD_LOG  = $(LOG)/StartLog
MAX_STARTD_LOG  = 5000000
#STARTD_DEBUG  = D_STARTD D_FULLDEBUG D_THREAD
STARTD_DEBUG  = D_FULLDEBUG
POLLING_FREQUENCY = 10
POLLS_PER_UPDATE = 24
JOB_LIMIT_POLICY = 240
JOB_ACCT_Q_POLICY = 3600
 
#
#  LoadL_SchedD Macros
#
SCHEDD   = $(BIN)/LoadL_schedd
SCHEDD_LOG  = $(LOG)/SchedLog
MAX_SCHEDD_LOG  = 5000000
SCHEDD_DEBUG  = D_SCHEDD
SCHEDD_INTERVAL  = 180
 
CLIENT_TIMEOUT  = 300
#
# Negotiator Macros
#
NEGOTIATOR  = $(BIN)/LoadL_negotiator
NEGOTIATOR_DEBUG  = D_FULLDEBUG D_ALWAYS D_NEGOTIATE
NEGOTIATOR_LOG = $(LOG)/NegotiatorLog
MAX_NEGOTIATOR_LOG = 5000000
NEGOTIATOR_INTERVAL = 60
MACHINE_UPDATE_INTERVAL = 600
NEGOTIATOR_PARALLEL_DEFER = 1800
NEGOTIATOR_PARALLEL_HOLD = 300
NEGOTIATOR_REDRIVE_PENDING = 1800
NEGOTIATOR_RESCAN_QUEUE  = 180
NEGOTIATOR_REMOVE_COMPLETED = 0
 
#
# Sets the interval between recalculation of the SYSPRIO values
# for all the jobs in the queue
#
NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 0
 
#
# Starter Macros
#
STARTER = $(BIN)/LoadL_starter
STARTER_DEBUG = D_FULLDEBUG
STARTER_LOG = $(LOG)/StarterLog
MAX_STARTER_LOG = 500000
 
#
# LoadL_Master Macros
#
MASTER   = $(BIN)/LoadL_master
MASTER_LOG  = $(LOG)/MasterLog
MASTER_DEBUG  = D_FULLDEBUG
MAX_MASTER_LOG  = 64000
RESTARTS_PER_HOUR = 12
PUBLISH_OBITUARIES = TRUE
OBITUARY_LOG_LENGTH = 25
 
#
# Specify whether log files are truncated when opened
#
TRUNC_MASTER_LOG_ON_OPEN     = False
TRUNC_STARTD_LOG_ON_OPEN     = False
TRUNC_SCHEDD_LOG_ON_OPEN     = False
TRUNC_KBDD_LOG_ON_OPEN       = False
TRUNC_STARTER_LOG_ON_OPEN    = False
TRUNC_COLLECTOR_LOG_ON_OPEN  = False
TRUNC_NEGOTIATOR_LOG_ON_OPEN = False
#       NQS Directory
#
#
# For users of NQS resources:
# Specify the directory containing qsub, qstat, qdel
#
# NQS_DIR   = /usr/bin
 
#
# Specify Custom metric keywords
#
# CUSTOM_METRIC  =
# CUSTOM_METRIC_COMMAND = $(ETC)/sw_chip_number
#
# Machine control expressions and macros
#
 
OpSys :  $(OPSYS)
Arch  :  $(ARCH)
Machine :  $(HOST).$(DOMAIN)
 
#
# Expressions used to control starting and stopping of foreign jobs
#
MINUTE  = 60
HOUR  = (60 * $(MINUTE))
StateTimer = (CurrentTime - EnteredCurrentState)
 
BackgroundLoad  = 0.7
HighLoad  = 1.5
StartIdleTime  = 15 * $(MINUTE)
ContinueIdleTime = 5 * $(MINUTE)
MaxSuspendTime  = 10 * $(MINUTE)
MaxVacateTime  = 10 * $(MINUTE)
 
KeyboardBusy= KeyboardIdle < $(POLLING_FREQUENCY)
CPU_Idle = LoadAvg <= $(BackgroundLoad)
CPU_Busy = LoadAvg >= $(HighLoad)
# START  : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)
# SUSPEND : $(CPU_Busy) || $(KeyboardBusy)
# CONTINUE : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)
# VACATE : $(StateTimer) > $(MaxSuspendTime)
# KILL  : $(StateTimer) > $(MaxVacateTime)
 
START  : T
SUSPEND  : F
CONTINUE : T
VACATE  : F
KILL  : F
#
# Expressions used to prioritize job queue
#
# Values which can be part of the SYSPRIO expression are:
#
# QDate    Job submission time
# UserPrio   User priority
# UserSysprio   System priority value based on userid (from the user
#     list file with default of 0)
# ClassSysprio   System priority value based on job class (from the class
#     list file with default of 0)
# UserRunningProcs  Number of jobs running for the user
# GroupRunningProcs Number of jobs running for the group
#
# The following expression is an example.
#
#SYSPRIO: (ClassSysprio * 100) + (UserSysprio * 10) + (GroupSysprio * 1) - (QDate)
#
# The following (default) expression for SYSPRIO creates a FIFO job queue.
#
SYSPRIO: (ClassSysprio * 100) - (QDate)
#
# Expressions used to prioritize machines
#
# The following example orders machines by the load average
# normalized for machine speed:
#
#MACHPRIO: 0 - (1000 * (LoadAvg / (Cpus * Speed)))
#
# The following (default) expression for MACHPRIO orders
# machines by load average.
#
#MACHPRIO: 0 - (LoadAvg) + (MasterMachPriority * 10000)
#       The following  expression for MACHPRIO orders
#       machines by increasing amount of memory and
#       decreasing node number.
#
MACHPRIO: 0 - (100 * Memory) + CustomMetric + (MasterMachPriority * 10000)
 
#
# The MAX_JOB_REJECT value determines how many times a job can be
# rejected before it is canceled or put on hold.  The default value
# is -1, which indicates no limit to the number of times a job can be
# rejected.
 
#
MAX_JOB_REJECT = 0
#
# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold
# when the number of rejects reaches the MAX_JOB_REJECT value.  When
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the
# number of rejects reaches the MAX_JOB_REJECT value. The default
# value is HOLD.
#
ACTION_ON_MAX_REJECT = CANCEL
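
To illustrate the SYSPRIO expression in the file above, the following Python sketch evaluates (ClassSysprio * 100) - (QDate) for three jobs and sorts them. It is an illustration only, not LoadLeveler code: the Job type, the job names, and the timestamps are hypothetical, QDate is assumed to be a submission timestamp in seconds, and jobs with a higher SYSPRIO are assumed to be considered first.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str           # hypothetical job label
    class_sysprio: int  # ClassSysprio from the job's class stanza
    qdate: int          # QDate: assumed submission time in seconds


def sysprio(job: Job) -> int:
    # Mirrors SYSPRIO: (ClassSysprio * 100) - (QDate)
    return job.class_sysprio * 100 - job.qdate


def queue_order(jobs):
    # Assumption: the job with the highest SYSPRIO is considered first.
    return sorted(jobs, key=sysprio, reverse=True)


jobs = [
    Job("late", 0, 900_000_300),
    Job("early", 0, 900_000_000),
    Job("middle", 0, 900_000_100),
]
ordered = [job.name for job in queue_order(jobs)]
print(ordered)  # ['early', 'middle', 'late']
```

Because every job in this sketch has the same ClassSysprio, the job submitted first has the highest SYSPRIO, which matches the FIFO behavior that the comment in the configuration file describes.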

Customer 2: Circuit Simulation

This customer performs CPU-intensive work in the area of circuit simulation using Electronic Design Automation (EDA).

System Configuration

The customer has 752 batch servers; 209 are dedicated to run LoadLeveler jobs 24 hours a day (the central manager is excluded). The rest are used by LoadLeveler when they are not in use by their respective owners.

The LoadLeveler administrators control all 173 dedicated machines, so users cannot get onto these systems without submitting a LoadLeveler job. Of the dedicated machines, 117 are public schedulers. The user machines are submit-only machines, and users do not have access to their root password. A user who needs root access to his or her own machine is allowed alternate root access only and cannot get global root access to all the machines on site. (Site administrators use a common global root password.)

This site runs over 31,000 jobs per week, accounting for about 2,800 CPU days of resource utilization. The central manager is a RISC System/6000 Model 370 with 128MB of RAM. The batch machines are generally 80 percent busy. The central manager is about 35 percent to 70 percent busy. The central manager does not run any jobs; it only manages the cluster. All of the LoadLeveler machines run one job at a time (that is, MAX_STARTERS=1).

This customer occasionally sees some machines in a down state. The administrator believes that the CPUs on these machines are too busy for LoadLeveler to get a time slice and report the machine's state to the central manager. However, this down state does not cause any problems for this customer.

The 117 public schedulers are a subset of the 173 dedicated machines and are listed in the admin file.

LoadLeveler Configuration

The following figures represent sections of this customer's LoadL_admin file for dedicated machines. Notice the default stanza. Also, every machine in the LoadLeveler cluster is listed in this file.

#=============================================================================#
# type = machine default stanza
#=============================================================================#
 
default:  type = machine             # defaults for machine stanzas
central_manager = false    #  no central manager on machine
schedd_host = true         #  public schedd on machine
#=============================================================================#
# Central Manager
#=============================================================================#
 
mips1:    type = machine          # PRIMARY server - MANAGER   370 128M 3.2.5
central_manager = true  #  runs negotiator
#=============================================================================#
#                               Primary Servers
#=============================================================================#
 
beast100: type = machine
# PRIMARY C=a/b/o/s2/t2           . . 550    128M 3.2.5
beast101: type = machine
# PRIMARY C=a/b/b1/b4/c/o/r/s/t   F . 550    128M 3.2.5
beast102: type = machine
# PRIMARY C=a                     F . 550    128M 3.2.5
beast103: type = machine
# PRIMARY C=a                     . . 550    128M 3.2.5

Later in the LoadL_admin file, user machines are defined. Notice the default stanza.

#=============================================================================#
 
default:  type = machine             # defaults for machine stanzas
central_manager = false    #  no central manager on machine
schedd_host = false        #  no public schedd on machine
#=============================================================================#
 
agni:      type = machine
# SECONDARY server - rmkohn           550    64M 3.2.5
akama:     type = machine
# SECONDARY server - poulter          365    64M 3.2.5
alaska:    type = machine
# SECONDARY server - jcahill          340    64M 3.2.5
alcor:     type = machine
# SECONDARY server - drolson          340    64M 3.2.5

The following represents a local configuration file for a dedicated, public scheduler machine:

#                         PRIMARY LoadL SERVER ==> mips27
#
# this loadl.config.local is tuned for a machine that is part of a compute
# farm.  Interactive users are discouraged.
#
# Run up to one job at a time.
#
# Always start a job if there is a class available.
#
# Never suspend a job.
#
# Since jobs never get suspended they never get vacated or killed.
#
 
SCHEDD_RUNS_HERE    = True
STARTD_RUNS_HERE    = True
 
Class = { "a" "b" "b1" "b4" "c" "k" "r" "s" "t" }
Feature = { "PRI" }
 
MAX_STARTERS = 1
 
POLLING_FREQUENCY       = 30
POLLS_PER_UPDATE        = 15
 
START           : T
SUSPEND         : F
 
START_DAEMONS = True
X_RUNS_HERE   = False
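
The polling keywords in this file determine how often the machine reports to the central manager. As a rough sketch, assuming POLLING_FREQUENCY is given in seconds and POLLS_PER_UPDATE is the number of polling intervals between updates sent to the central manager, the update interval can be estimated as follows. The function name is hypothetical; the second call uses the values from the CTC LoadL_config file shown earlier for comparison.

```python
def update_interval_seconds(polling_frequency: int, polls_per_update: int) -> int:
    # Assumption: one update is sent every polls_per_update polling intervals.
    return polling_frequency * polls_per_update


# Values from the dedicated public scheduler machine above.
farm_interval = update_interval_seconds(30, 15)

# Values from the CTC LoadL_config file (POLLING_FREQUENCY = 10,
# POLLS_PER_UPDATE = 24).
ctc_interval = update_interval_seconds(10, 24)

print(farm_interval, ctc_interval)  # 450 240
```

Under these assumptions, a longer interval means fewer status messages for the central manager to process, at the cost of a less current view of each machine.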

The following represents a local configuration file for a user's machine.

#                      SECONDARY SERVER ==> common
#
# This loadl_config.local is tuned to be "nice" to a workstation owner
# who permits loadl jobs on his system but wants good response whenever
# he is doing his own work.
#
# Run only one LoadLeveler job at a time.
#
# Check the keyboard for activity every five seconds.
#
#
# Suspend a job if the load average exceeds 1.4
#
# Continue a job when keyboard again goes idle for 10 minutes and the load
# average is <.5
 
SCHEDD_RUNS_HERE  = False
STARTD_RUNS_HERE  = True
 
Class = { "a" "b" "b1" "b4" "c" "o" "r" "s" "t" }
MAX_STARTERS = 1
 
START           : $(FirstShift_KB9999) && $(StartS1) || ($(Off_Shift) || $(Week_End)) && $(Mach_Idle_S)
SUSPEND         : $(CPU_Busy) || $(KeyboardBusy)
CONTINUE        : $(Mach_Idle_C)
VACATE          : ((Class == "a") && $(Vacate_A)) || ($(Vacate_ClassesB) && $(Vacate_B)) || $(Vacate_X)
KILL            : $(Kill_Job)
 
START_DAEMONS = True
X_RUNS_HERE   = True

Customer 3: High-Energy Physics

This scientific customer provides experimental facilities for physicists from its 17 member states and for visiting scientists from throughout the world. The computing requirements of these users vary from mail and text processing to heavy batch and parallel processing.

System Configuration

The system is an SP2 using RISC System/6000 nodes linked by an internal high-speed network, with a centrally managed software environment. The nodes are functionally divided into four groups of 16, one for each type of work: interactive logins; sequential batch job processing; parallel batch job processing; and data, tape, and network services.

This customer uses AFS heavily. It provides the single system image for users' home directories and the files common to their experiments. Many software products are served directly out of AFS using symbolic links.

LoadLeveler provides this customer with the following facilities:

LoadLeveler Batch Configuration

The batch configuration is designed to maximize short job turnaround while allowing the heavy CPU jobs to get good usage of the resources available.

The basic configuration uses a range of classes - short, medium, long and verylong - with maximum job CPU times ranging from five minutes to six days. An additional class, night, provides off-peak and weekend computing time on the interactive areas of the SP2 during periods of low demand. Access to this class is limited to specific users.

Users in different experiments are defined in LoadLeveler groups which provide associated queue priorities. This allows groups with a large computing budget to be given higher priorities. An automated procedure calculates each group's resource utilization over the last month and adjusts their priorities accordingly. This ensures a fair allocation of CPU time among the groups.
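
The customer's automated procedure is not described here. Purely as an illustration of the idea, the following Python sketch lowers the priority of a group that used more than its budgeted share of CPU time over the last month and raises the priority of a group that used less. The function name, the group names, the budget shares, the usage figures, and the scale factor are all hypothetical.

```python
def adjust_priorities(budget_share, cpu_hours_used, scale=100):
    """Return a priority per group from budget share minus usage share.

    budget_share:   group -> fraction of the total computing budget
    cpu_hours_used: group -> CPU hours consumed over the last month
    """
    total_used = sum(cpu_hours_used.values())
    priorities = {}
    for group, share in budget_share.items():
        used = cpu_hours_used.get(group, 0.0)
        used_share = used / total_used if total_used else 0.0
        # Groups under their budgeted share get a positive priority.
        priorities[group] = round(scale * (share - used_share))
    return priorities


budget = {"exp_a": 0.5, "exp_b": 0.3, "exp_c": 0.2}
usage = {"exp_a": 700.0, "exp_b": 200.0, "exp_c": 100.0}
priorities = adjust_priorities(budget, usage)
print(priorities)  # {'exp_a': -20, 'exp_b': 10, 'exp_c': 10}
```

In a scheme like this, the resulting values could be written to the priority keyword of each group stanza in the administration file; whether the customer does so is not stated.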

LoadLeveler Interactive Configuration

This customer uses the Interactive Session Support facility to provide a name server that returns the least loaded node according to a site-defined metric. A user is therefore given the least loaded operational node when he or she logs in.

This metric is based on the number of logged in users, with some weight given to those using Xstations. Every few minutes, the system is scanned to evaluate the following:

Xterminals*3 + Telnet*2 + Process

Where:

This metric tries to balance users across the system while providing some factor for their likely future utilization. A metric based on the CPU load average is too dependent on the current load to provide good balancing.

The metric can also be set to return a low priority if the file /etc/iss.nologin exists. This allows the administrator to drain the interactive use of a node before scheduled system maintenance. When the maintenance is completed, the file can be removed and the metric will return the correct value for the node. Users therefore see improved availability, since they are not given a node that is about to shut down.
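
The metric and the drain file can be sketched in Python as follows. This is a minimal illustration, not the customer's implementation: the three terms are assumed to be counts of X terminal sessions, telnet sessions, and processes; the penalty value, the function names, and the node names are hypothetical; and the node with the lowest score is assumed to be the least loaded.

```python
import os

DRAIN_PENALTY = 1_000_000  # hypothetical value that makes a node a last resort


def node_is_draining(nologin_path="/etc/iss.nologin"):
    # The administrator creates this file to drain interactive use of a node.
    return os.path.exists(nologin_path)


def load_metric(xterminals, telnet, process, draining=False):
    # Xterminals*3 + Telnet*2 + Process, as in the formula above.
    score = xterminals * 3 + telnet * 2 + process
    return score + DRAIN_PENALTY if draining else score


# Each node would evaluate the metric locally; here the inputs are made up.
nodes = {
    "node_a": load_metric(2, 1, 4),
    "node_b": load_metric(1, 0, 3),
    "node_c": load_metric(0, 0, 1, draining=True),
}
least_loaded = min(nodes, key=nodes.get)
print(least_loaded)  # node_b
```

In this sketch, node_c has the fewest users but is never chosen while its drain flag is set, which mirrors the effect of creating /etc/iss.nologin before maintenance.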

Processor Configuration

The processors are configured as follows:


Customer 4: Computer Chip Design

This customer uses EDA to perform work in the area of computer chip design.

System Configuration

The customer has seven clusters of RISC System/6000 machines. The largest cluster has 530 machines; the smallest cluster has 87 machines. The total number of machines at this installation is over 1200.

Interactive Configuration

This customer has defined two configuration files for interactive work: one for standard workstations and one for large interactive servers. These files are meant to be tailored to machines of differing processing power.

Standard Workstation Configuration

#==============================================================================#
# Description: LoadL_config.local for Standard Workstations (<370 Class)
#==============================================================================#
# Need 2x Paging Space to Real Memory ( minimum ) For Worst Case Of One
# Suspended and One Foreground Running Job.
#    *) All Jobs (btv,lp) Suspend on LoadAvg or Keyboard/Mouse Movement.
#==============================================================================#
# Class defines the permissible classes, MAX_STARTERS defines the max
# total jobs to be permitted.
#==============================================================================#
Class        = { "btv" "lp" }
MAX_STARTERS = 1
#==============================================================================#
# The next definitions are used in the expressions below to regulate the
# conditions under which jobs get started, suspended, and evicted.
#     All times are specified in units of seconds.
#==============================================================================#
BackgroundLoad   = 0.8
HighLoad         = 1.6
StartIdleTime    = 900
ContinueIdleTime = 900

#==============================================================================#
# LoadAvg is an internal variable whose value is the (Berkeley) load average
# of the machine.
#
#    CPU_Idle - No LoadL job running, or One job just finishing.
#    CPU_Busy - One LoadL job running, second job ( Foreground or Batch )
#               starting up.
#    CPU_Max  - Two LoadL jobs running.
#==============================================================================#
CPU_Idle = (LoadAvg <= $(BackgroundLoad))
CPU_Busy = (LoadAvg >= $(HighLoad))
 
#==============================================================================#
# This defines a boolean "KeyboardBusy" whose value is TRUE if the keyboard
# or mouse has been used since loadl last checked.  Thus if POLLING_FREQUENCY
# is 5 seconds, KeyboardBusy is TRUE if anybody has used the kbd or mouse in
# the last 5 seconds.
#==============================================================================#
KeyboardBusy = KeyboardIdle < $(POLLING_FREQUENCY)
 
#==============================================================================#
# This statement indicates when a job should be started on this machine
#==============================================================================#
Weekend   = ( (tm_wday >=  6) || (tm_wday <  1) )
Day       = ( (tm_hour >=  7) && (tm_hour < 18) )
Night     = ( (tm_hour >= 18) || (tm_hour <  4) )
Inactive  = ( (KeyboardIdle > $(StartIdleTime)) && $(CPU_Idle) )
 
HP        = ( (Class == "btv") )
LP        = ( ($(Weekend) || $(Night)) )
 
START     : ( ($(HP) || $(LP)) && $(Inactive) )
 
#==============================================================================#
# The SUSPEND statement here says that a job should be suspended but not
# killed if:
#                LoadAvg >= 1.6  Or  KeyboardIdle < 5
#==============================================================================#
SUSPEND  : ( $(CPU_Busy) || $(KeyboardBusy) )
 
#==============================================================================#
# This CONTINUE statement indicates that a suspended job should be continued
# if the cpu goes idle and the keyboard/mouse has not been used for the last
# 15 minutes.
#==============================================================================#
CONTINUE : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)
 
#==============================================================================#
# Jobs in the SUSPEND state are never killed; after 60 minutes they are
# relocated to a different machine if possible.
#==============================================================================#
MaxSuspendTime   = 60 * $(MINUTE)
VACATE    : $(StateTimer) > $(MaxSuspendTime)
KILL      : F

#==============================================================================#
# If you set START_DAEMONS to False loadl can never start on this machine.
# For example you may want to stop loadl for a couple days for maintenance
# and make sure no procedure automatically restarts it.
#==============================================================================#
START_DAEMONS = True
 
#==============================================================================#
# Set the maximum size each of the logs can reach before wrapping.
#==============================================================================#
MAX_SCHEDD_LOG    = 128000
MAX_COLLECTOR_LOG = 128000
MAX_STARTD_LOG    = 128000
MAX_SHADOW_LOG    = 128000
MAX_KBDD_LOG      = 128000
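
The control expressions in the standard workstation file can be traced with a small Python model. This is a sketch for reasoning about the expressions, not LoadLeveler code. It assumes that tm_wday and tm_hour follow the C struct tm convention (0 = Sunday, hours 0 to 23) and that POLLING_FREQUENCY is 5 seconds, as the comments in the file suggest; the MachineState type and the function names are hypothetical.

```python
from dataclasses import dataclass

BACKGROUND_LOAD = 0.8
HIGH_LOAD = 1.6
START_IDLE_TIME = 900       # seconds
CONTINUE_IDLE_TIME = 900    # seconds
MAX_SUSPEND_TIME = 60 * 60  # 60 * $(MINUTE)
POLLING_FREQUENCY = 5       # assumed; not set in the file shown


@dataclass
class MachineState:
    load_avg: float
    keyboard_idle: int  # seconds since the last keyboard or mouse use
    tm_wday: int        # assumed 0 = Sunday ... 6 = Saturday
    tm_hour: int
    state_timer: int = 0  # seconds spent in the current state


def start(m: MachineState, job_class: str) -> bool:
    weekend = m.tm_wday >= 6 or m.tm_wday < 1
    night = m.tm_hour >= 18 or m.tm_hour < 4
    cpu_idle = m.load_avg <= BACKGROUND_LOAD
    inactive = m.keyboard_idle > START_IDLE_TIME and cpu_idle
    hp = job_class == "btv"
    lp = weekend or night
    return (hp or lp) and inactive


def suspend(m: MachineState) -> bool:
    return m.load_avg >= HIGH_LOAD or m.keyboard_idle < POLLING_FREQUENCY


def continue_job(m: MachineState) -> bool:
    return m.load_avg <= BACKGROUND_LOAD and m.keyboard_idle > CONTINUE_IDLE_TIME


def vacate(m: MachineState) -> bool:
    return m.state_timer > MAX_SUSPEND_TIME


weekday_morning = MachineState(0.2, 1200, tm_wday=2, tm_hour=10)
weekday_night = MachineState(0.2, 1200, tm_wday=2, tm_hour=22)
owner_typing = MachineState(0.2, 2, tm_wday=2, tm_hour=10)

print(start(weekday_morning, "btv"))  # True
print(start(weekday_morning, "lp"))   # False
print(start(weekday_night, "lp"))     # True
print(suspend(owner_typing))          # True
```

On an idle workstation, a btv job can start at any time, while an lp job starts only at night or on the weekend. Any keyboard or mouse activity within the polling interval suspends the running job.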

Large Interactive Server Configuration

#==============================================================================#
# Description: LoadL_config.local for Interactive Large Servers (580-590 Class)
 
#==============================================================================#
# Need 3x Paging Space to Real Memory ( minimum ) For Worst Case Of Two
# Suspended and One Foreground Running Job.
#    *) All Jobs (btv,lp) Suspend on LoadAvg or Keyboard/Mouse Movement.
#    *) Real Memory >= 192meg.
#==============================================================================#
 
#==============================================================================#
# Class defines the permissible classes, MAX_STARTERS defines the max
# total jobs to be permitted.
#==============================================================================#
Class        = { "btv" "lp" }
MAX_STARTERS = 2
 
#==============================================================================#
# The next definitions are used in the expressions below to regulate the
# conditions under which jobs get started, suspended, and evicted.
#
#     All times are specified in units of seconds.
#==============================================================================#
BackgroundLoad   = 0.8
LowLoad          = 1.0
HighLoad         = 1.6
MaxLoad          = 2.0
StartIdleTime    = 900
ContinueIdleTime = 900

#==============================================================================#
# LoadAvg is an internal variable whose value is the (Berkeley) load average
# of the machine.
#
#    CPU_Idle - No LoadL job running, or One job just finishing.
#    CPU_Busy - One LoadL job running, second job ( Foreground or Batch )
#               starting up.
#    CPU_Max  - Two LoadL jobs running.
#==============================================================================#
CPU_Idle = (LoadAvg <= $(BackgroundLoad))
CPU_Run  = (LoadAvg <= $(LowLoad))
CPU_Busy = (LoadAvg >= $(HighLoad))
CPU_Max  = (LoadAvg >= $(MaxLoad))
 
#==============================================================================#
# This defines a boolean "KeyboardBusy" whose value is TRUE if the keyboard
# or mouse has been used since loadl last checked.  Thus if POLLING_FREQUENCY
# is 5 seconds, KeyboardBusy is TRUE if anybody has used the kbd or mouse in
# the last 5 seconds.
#==============================================================================#
KeyboardBusy = KeyboardIdle < $(POLLING_FREQUENCY)
#==============================================================================#
# This statement indicates when a job should be started on this machine
#==============================================================================#
Weekend   = ( (tm_wday >=  6) || (tm_wday <  1) )
Day       = ( (tm_hour >=  7) && (tm_hour < 18) )
Night     = ( (tm_hour >= 18) || (tm_hour <  4) )
Inactive1 = ( (KeyboardIdle > $(StartIdleTime)) )
Inactive2 = ( (KeyboardIdle > $(ContinueIdleTime)) )
 
HP        = ( (Class == "btv") )
LP        = ( (Class == "lp") && $(CPU_Idle) )
 
START     : ( ($(HP) || $(LP)) && $(Inactive1) )
 
#==============================================================================#
# The SUSPEND statement here says that a job should be suspended but not
# killed if:
#                KeyboardIdle < 5                 Or
#                lp  Class  And  LoadAvg >= 1.6   Or
#                btv Class  And  LoadAvg >= 2.0
#==============================================================================#
SUSPEND   : ( ( (Class == "lp")  && $(CPU_Busy) ) || \
              ( (Class == "btv") && $(CPU_Max)  ) || \
              (  $(KeyboardBusy)                )    )
 
#==============================================================================#
# This CONTINUE statement indicates that a suspended job should be continued
# if:
#               lp  Class  And  LoadAvg <= 0.8  And  KeyboardIdle > 15 min  Or
#               btv Class  And  LoadAvg <= 1.0  And  KeyboardIdle > 15 min
#==============================================================================#
CONTINUE  : ( ( (Class == "lp")  && $(CPU_Idle) && $(Inactive2) ) || \
              ( (Class == "btv") && $(CPU_Run)  && $(Inactive2) )    )

#==============================================================================#
# Jobs in the SUSPEND state are never killed; after 60 minutes they are
# relocated to a different box if possible.
#==============================================================================#
MaxSuspendTime   = 60 * $(MINUTE)
VACATE    : $(StateTimer) > $(MaxSuspendTime)
KILL      : F
 
#==============================================================================#
# If you set START_DAEMONS to False loadl can never start on this machine.
# For example you may want to stop loadl for a couple days for maintenance
# and make sure no procedure automatically restarts it.
#==============================================================================#
START_DAEMONS = True
 
#==============================================================================#
# Set the maximum size each of the logs can reach before wrapping.
#==============================================================================#
MAX_SCHEDD_LOG    = 128000
MAX_COLLECTOR_LOG = 128000
MAX_STARTD_LOG    = 128000
MAX_SHADOW_LOG    = 128000
MAX_KBDD_LOG      = 128000
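
The START, SUSPEND, and CONTINUE expressions above can be easier to follow when written out as ordinary functions. The following Python sketch is only an illustration of the logic, not part of LoadLeveler: the MachineState class and the start, suspend, and resume functions are hypothetical names, and POLLING_FREQUENCY is assumed to be 5 seconds, the value used as an example in the comments above.

```python
from dataclasses import dataclass

# Thresholds copied from the large interactive server configuration.
BACKGROUND_LOAD = 0.8
LOW_LOAD = 1.0
HIGH_LOAD = 1.6
MAX_LOAD = 2.0
START_IDLE_TIME = 900      # seconds
CONTINUE_IDLE_TIME = 900   # seconds
POLLING_FREQUENCY = 5      # seconds; assumed, taken from the comment's example


@dataclass
class MachineState:
    """Hypothetical stand-in for the values LoadLeveler tracks per machine."""
    load_avg: float      # corresponds to LoadAvg
    keyboard_idle: int   # corresponds to KeyboardIdle, in seconds


def start(job_class, m):
    """Mirror of START: ($(HP) || $(LP)) && $(Inactive1)."""
    hp = job_class == "btv"
    lp = job_class == "lp" and m.load_avg <= BACKGROUND_LOAD
    return (hp or lp) and m.keyboard_idle > START_IDLE_TIME


def suspend(job_class, m):
    """Mirror of SUSPEND: class-specific load limit, or keyboard activity."""
    keyboard_busy = m.keyboard_idle < POLLING_FREQUENCY
    return (
        (job_class == "lp" and m.load_avg >= HIGH_LOAD)
        or (job_class == "btv" and m.load_avg >= MAX_LOAD)
        or keyboard_busy
    )


def resume(job_class, m):
    """Mirror of CONTINUE (named resume because continue is a Python keyword)."""
    inactive = m.keyboard_idle > CONTINUE_IDLE_TIME
    return (
        (job_class == "lp" and m.load_avg <= BACKGROUND_LOAD and inactive)
        or (job_class == "btv" and m.load_avg <= LOW_LOAD and inactive)
    )
```

For example, with a load average of 1.7 and an idle keyboard, suspend("lp", ...) is true while suspend("btv", ...) is false, which is how the btv class keeps priority over lp on these servers.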

Batch Configuration

The following configuration file defines dedicated batch machines. Note, however, that jobs in the lp class are suspended when the machine's load average reaches 1.6, so the machines are not fully dedicated.

#==============================================================================#
# Description: LoadL_config.local for Large Batch Servers ( 580 - 590 Class )
#==============================================================================#
# Need 3x Real Memory To Paging Space ( minimum ) For Worst Case Of One
# Suspended and Two Foreground Running Job.
#    *) High Priority Jobs (btv) Never Suspend.
#    *) Job Suspension (lp) Based on LoadAvg Only.
#    *) Real Memory >= 192meg.
#==============================================================================#
 
#==============================================================================#
# Class defines the permissible classes, MAX_STARTERS defines the max
# total jobs to be permitted.
#==============================================================================#
Class        = { "btv" "lp" }
MAX_STARTERS = 2
 
#==============================================================================#
# The next definitions are used in the expressions below to regulate the
# conditions under which jobs get started, suspended, and evicted.
#
#     All times are specified in units of seconds.
#==============================================================================#
BackgroundLoad   = 0.5
HighLoad         = 1.6
StartIdleTime    = 900
ContinueIdleTime = 900
 
#==============================================================================#
# LoadAvg is an internal variable whose value is the (Berkeley) load average
# of the machine.
#
#    CPU_Idle - No LoadL job running, or One job just finishing.
#    CPU_Busy - One LoadL job running, second job ( Foreground or Batch )
#               starting up.
#    CPU_Max  - Two LoadL jobs running.
#==============================================================================#
CPU_Idle = (LoadAvg <= $(BackgroundLoad))
CPU_Busy = (LoadAvg >= $(HighLoad))

#==============================================================================#
# This defines a boolean "KeyboardBusy" whose value is TRUE if the keyboard
# or mouse has been used since loadl last checked.  Thus if POLLING_FREQUENCY
# is 5 seconds, KeyboardBusy is TRUE if anybody has used the kbd or mouse in
# the last 5 seconds.
#==============================================================================#
KeyboardBusy = KeyboardIdle < $(POLLING_FREQUENCY)
 
#==============================================================================#
# This statement indicates when a job should be started on this machine
#==============================================================================#
HP        = ( (Class == "btv") )
LP        = ( (Class == "lp") && $(CPU_Idle) )
 
START     : ( $(HP) || $(LP) )
 
#==============================================================================#
# The SUSPEND statement here says that a "lp" job should be suspended but not
# killed if a high priority job starts up or a foreground job causes the
# Loadavg to be greater than CPU_Busy ( 1.6 ).
#==============================================================================#
SUSPEND   : (Class == "lp") && $(CPU_Busy)
 
#==============================================================================#
# This CONTINUE statement indicates that a suspended job should be continued
# if the cpu goes idle and the keyboard/mouse has not been used for the last
# 15 minutes.
#==============================================================================#
CONTINUE  : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)
 
#==============================================================================#
# Jobs in the SUSPEND state are never killed; after 60 minutes they are
# relocated to a different box if possible.
#==============================================================================#
MaxSuspendTime   = 60 * $(MINUTE)
VACATE    : $(StateTimer) > $(MaxSuspendTime)
KILL      : F
 
#==============================================================================#
# If you set START_DAEMONS to False loadl can never start on this machine.
# For example you may want to stop loadl for a couple days for maintenance
# and make sure no procedure automatically restarts it.
#==============================================================================#
START_DAEMONS = True
 
#==============================================================================#
# Set the maximum size each of the logs can reach before wrapping.
#==============================================================================#
MAX_SCHEDD_LOG    = 128000
MAX_COLLECTOR_LOG = 128000
MAX_STARTD_LOG    = 128000
MAX_SHADOW_LOG    = 128000
MAX_KBDD_LOG      = 128000
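
Because the batch configuration suspends an lp job at a load average of 1.6 but continues it only at 0.5 or below, a job does not flip between states while the load hovers between those two values. The following Python sketch illustrates that gap on a series of load samples. It is a simplified model under stated assumptions: next_state and trace are hypothetical names, the keyboard is assumed to have been idle for an hour, and the VACATE rule that relocates a job after 60 minutes of suspension is ignored.

```python
# Thresholds copied from the batch configuration.
BACKGROUND_LOAD = 0.5
HIGH_LOAD = 1.6
CONTINUE_IDLE_TIME = 900   # seconds


def next_state(state, load_avg, keyboard_idle):
    """Return the next state ('running' or 'suspended') of one lp job."""
    # SUSPEND: (Class == "lp") && $(CPU_Busy)
    if state == "running" and load_avg >= HIGH_LOAD:
        return "suspended"
    # CONTINUE: $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)
    if (state == "suspended"
            and load_avg <= BACKGROUND_LOAD
            and keyboard_idle > CONTINUE_IDLE_TIME):
        return "running"
    return state


def trace(load_samples, keyboard_idle=3600):
    """Apply next_state to each load sample, starting from a running job."""
    state = "running"
    history = []
    for load in load_samples:
        state = next_state(state, load, keyboard_idle)
        history.append(state)
    return history


print(trace([1.0, 1.7, 1.2, 0.6, 0.4, 1.0]))
```

In this trace the job is suspended at the 1.7 sample and stays suspended through 1.2 and 0.6, resuming only when the load drops to 0.4.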

Configuration for a Machine That Schedules (But Doesn't Run) Jobs

The following statements define a machine that schedules jobs but does not run them. Notice that nothing in this file prevents the schedd daemon from running.

#
# This loadl local configuration file is set up to make a machine a
# submitter only.
#
# No jobs are allowed to run on this system.
#
MAX_STARTERS            = 0
 
START                   : F
#
# If you set START_DAEMONS to False loadl can never start on this machine.
# For example you may want to stop loadl for a couple days for maintenance
# and make sure no procedure automatically restarts it.
#
START_DAEMONS           = True

