The negotiator daemon detects down machines from the heartbeat updates it expects within each MACHINE_UPDATE_INTERVAL time period. If the negotiator has not received an update from a machine after two MACHINE_UPDATE_INTERVAL periods, it marks the machine as down and notifies the schedd to remove any jobs running on that machine. The gsmonitor daemon (LoadL_GSmonitor) allows this cleanup to occur more reliably. The gsmonitor daemon uses the Group Services Application Programming Interface (GSAPI) to monitor machine availability and to notify the negotiator quickly when a machine is no longer reachable. Because it uses the GSAPI, the gsmonitor daemon requires that the Group Services subsystem, which is provided by the IBM Parallel System Support Programs (PSSP), be installed and operational.
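The two-interval rule described above can be illustrated with a minimal sketch. This is not LoadLeveler code: the function names, the machine names, and the 300-second interval are assumptions chosen for illustration, and the real negotiator logic may differ in detail.

```python
from typing import Dict, List

# Assumed value in seconds, for illustration only; the real value
# comes from the LoadLeveler configuration.
MACHINE_UPDATE_INTERVAL = 300.0


def is_down(last_update: float, now: float,
            interval: float = MACHINE_UPDATE_INTERVAL) -> bool:
    """Return True when more than two intervals have passed without an update."""
    return (now - last_update) > 2 * interval


def find_down_machines(last_updates: Dict[str, float], now: float,
                       interval: float = MACHINE_UPDATE_INTERVAL) -> List[str]:
    """Return the sorted names of machines whose last update is too old."""
    return sorted(
        name for name, timestamp in last_updates.items()
        if is_down(timestamp, now, interval)
    )


if __name__ == "__main__":
    # Hypothetical machines with the time (in seconds) of their last update.
    updates = {"node01": 0.0, "node02": 500.0}
    print(find_down_machines(updates, now=700.0))
```

With this approach a failed machine can go unnoticed for up to two full intervals, which is the delay the gsmonitor daemon is meant to shorten.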
Run the gsmonitor daemon on one or two nodes in each Group Services domain. Running LoadL_GSmonitor on two nodes provides a backup in case one of the nodes goes down. A Group Services domain consists of the set of nodes that make up a system partition. LoadL_GSmonitor subscribes to the Group Services system-defined host membership group, which is represented by the HA_GS_HOST_MEMBERSHIP Group Services keyword. This group monitors every configured node in the system, including nodes that are not in the LoadLeveler cluster.
To start the gsmonitor daemon, set GSMONITOR_RUNS_HERE to True in the local config file. The default for GSMONITOR_RUNS_HERE is False.
Notes:
The Group Services routines need to be run as root, so the LoadL_GSmonitor executable must be owned by root and have the setuid permission bit enabled.
Running more than one LoadL_GSmonitor daemon per SP system partition does not cause a problem; the negotiator is simply notified by each running daemon.
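The ownership and setuid requirement in the first note can be verified with a short script. The following is a sketch only: the helper names are hypothetical, and the path to the LoadL_GSmonitor executable depends on the installation, so it is passed in rather than assumed.

```python
import os
import stat
from typing import Dict


def setuid_root_status(st_uid: int, st_mode: int) -> Dict[str, bool]:
    """Report whether the given owner uid and mode bits satisfy the note."""
    return {
        "owned_by_root": st_uid == 0,                # uid 0 is root
        "setuid": bool(st_mode & stat.S_ISUID),      # setuid permission bit
    }


def check_executable(path: str) -> Dict[str, bool]:
    """Stat the file at the given path and report its ownership and setuid state."""
    info = os.stat(path)
    return setuid_root_status(info.st_uid, info.st_mode)
```

Calling check_executable with the installed location of LoadL_GSmonitor returns both flags; either one being False indicates that the executable does not meet the requirement stated in the note.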
For more information about the Group Services subsystem, see PSSP: Administration Guide, SA22-7348. For more information about GSAPI, see Group Services Programming Guide and Reference, SA22-7355-00.