IBM PE for AIX V2R4.0: Operation and Use, Vol. 1

Managing POE Jobs

This chapter describes the tasks involved with managing POE jobs. It includes the following:

Scenarios for allocating nodes with LoadLeveler
Scenarios for allocating nodes with the Resource Manager
Information about IBM LoadLeveler support
Appropriate environment variable information to use when running your applications.

Multi-Task Core File

With the MP_COREDIR environment variable, you can create a separate directory to save a core file for each task. The corresponding command line option is -coredir. Creating this type of directory is useful when you are running a parallel job on one node, and the job dumps a core file. By checking the directory, you can see which task dumped the file. When setting MP_COREDIR, you specify the first attribute of the directory name. The second attribute is the task id. If you do not specify a directory, the default is coredir. The subdirectory containing each task's core file is named coredir.taskid. The following examples show what happens when you set the environment variable:

Example 1:
 
 
 
MP_COREDIR=my_parallel_cores
 
MP_PROCS=2
 
 
 
run generates core files
 
 
 
Core files will be located at:
 
 
 
/current directory/my_parallel_cores.0/core
 
/current directory/my_parallel_cores.1/core
 
 
 
Example 2:
 
 
 
MP_COREDIR not specified
 
MP_PROCS=2
 
 
 
run generates core files
 
 
 
Core files will be located at:
 
 
 
/current directory/coredir.0/core
 
/current directory/coredir.1/core

Stopping a POE Job

You can stop (suspend) a POE job by pressing <Ctrl-z> or by sending POE a SIGTSTP signal. POE stops, and sends a SIGSTOP signal to all the remote tasks, which stops them. To resume the parallel job, issue the fg or bg command to POE. A SIGCONT signal will be sent to all the remote tasks to resume them.

Cancelling and Killing a POE Job

You can cancel a POE job by pressing <Ctrl-c> or <Ctrl-\>. This sends POE a SIGINT or SIGQUIT signal respectively. POE terminates all the remote tasks, completes the generation of the VT trace file, and exits.

If POE is killed or terminated before the remote nodes are shut down, direct communication with the parallel job will be lost. In this situation, use the poekill script as a POE command, or individually via rsh, to terminate the partition. poekill kills all instantiations of the program name on a remote node by sending it a SIGTERM signal. See the poekill script in /usr/lpp/ppe.poe/bin, and the description of the poekill command in Appendix A. "Parallel Environment Commands".
Note: Do not kill the pmds using the poekill command. This will ensure that your remote processes will continue running.

Detecting Remote Node Failures

POE and the Partition Manager use a pulse detection mechanism to periodically check each remote node to ensure that it is actively communicating with the home node. You specify the time interval (or pulse interval), of these checks with the -pulse flag or the MP_PULSE environment variable. When POE starts a parallel job, During an execution of a POE job, POE and the Partition Manager daemons check at the interval you specify that each node is running. When a node failure is detected, POE terminates the job on all remaining nodes and issues an error message.

The default pulse interval is 600 seconds (10 minutes). You can increase or decrease this value with the -pulse flag or the MP_PULSE environment variable. To completely disable the pulse function, specify an interval value of 0 (zero). For the PE debugging facility MP_PULSE is disabled.

Considerations for Using the SP Switch

The SP switch supports dedicated User Space (US) and IP sessions, running concurrently on a single node. Users of IP communication programs that are not using a job management system (LoadLeveler or the Resource Manager), may treat this adapter like any other IP-supporting adapter. In this case, the adapter name is css0.

While US message passing programs must use a job management system to allocate nodes, IP message passing programs may use a job management system, but are not required to. When using LoadLeveler, nodes may be requested by name or number from one system pool only. When using the Resource Manager, nodes may be requested by number or by specifying one or more node pools to be used. When specifying node pools, the following rules apply:

All the nodes in a pool should support the same combination of IP and US protocols. In other words, all the nodes should be able to run:
- the IP protocol
  or
- the US protocol
  or
- the IP and US protocols concurrently.
In order to run the IP protocol, the IP switch addresses must be configured and started. In order to run the US protocol, the switch node numbers must be configured. For more information regarding these protocols and LoadLeveler, see Using and Administering LoadLeveler . For more information regarding these protocols and the Resource Manager, see IBM Parallel System Support Programs for AIX: Installation and Migration Guide .
By default, requests for the US message passing protocol also request exclusive use of the node; the job management system will not allocate concurrent IP message passing programs on this node. You can override this default so that the node can be used for both IP and US programs by specifying "multiple" CPU usage.
By default, requests for the IP message passing protocol also request multiple use of the node; the job management system can allocate both IP and US message passing programs on this node. You can override this default so that the node is designated for exclusive use by specifying "unique" CPU usage.
When running a batch parallel program under LoadLeveler, the adapter and CPU are allocated as specified by the network keyword in the LoadLeveler Job Command File. See Using and Administering LoadLeveler for more information.

Scenarios for Allocating Nodes With LoadLeveler

This section provides some examples of how someone would allocate nodes using LoadLeveler.

Scenario 1: Explicit Allocation

A POE user, Paul, wishes to run a US job 1 in nodes A, B, C, and D. He doesn't mind sharing the node with other jobs, as long as they are not also running in US. To do this, he specifies MP_EUIDEVICE=css0, MP_EUILIB=us, MP_PROCS=4, MP_CPU_USE=multiple, and MP_ADAPTER_USE=dedicated. In his host file, he also specifies:

node_A
node_B
node_C
node_D

The POE Partition Manager (PM) sees that this is a US job, and asks LoadLeveler for dedicated use of the css0 adapter on nodes A, B, C, and D and shared use of the CPU on those nodes. LoadLeveler then allocates the nodes to the job, recording that the css0/US session on A, B, C, and D has been reserved for dedicated use by this job, but that the node may also be shared by other users.

While job 1 is running, another POE user, Dan, wants to run another US job, job 2, on nodes B and C, and is willing to share the nodes with other users. He specifies MP_EUIDEVICE=css0, MP_EUILIB=us, and MP_PROCS=2, MP_CPU_USE=multiple, and MP_ADAPTER_USE=dedicated. In his host file, he also specifies:

node_B
node_C

The PM, as before, asks LoadLeveler for dedicated use of the css0/US adapter on nodes B and C. LoadLeveler determines that this adapter has already been reserved for dedicated use on nodes B and C, and does not allocate the nodes again to job 2. The allocation fails, and POE job 2 cannot run.

While job 1 is running, a second POE user, John, wishes to run IP/switch job 3 on nodes A, B, C, and D, but doesn't mind sharing the node and the SP switch with other users. He specifies MP_EUIDEVICE=css0, MP_EUILIB=ip, MP_PROCS=4, MP_CPU_USE=multiple, and MP_ADAPTER_USE=shared. In his host file, he also specifies;

node_A
node_B
node_C
node_D

The POE PM asks LoadLeveler, as requested by John, for shared use of the css0/ip adapter and CPU on nodes A, B, C, and D. LoadLeveler determines that job 1 permitted other jobs to run on those nodes as long as they did not use the css0/US session on them. The allocation succeeds, and POE IP/switch job 3 runs concurrently with POE US job 1 on A, B, C, and D.

The scenario above, illustrates a situation in which users do not mind sharing nodes with other users' jobs. If a user wants his POE job to have dedicated access to nodes or the css0 adapter on nodes, he would indicate that in the environment by setting MP_CPU_USE=unique instead of multiple. If job 1 had done that, then job 3 would not have been allocated to those nodes and, therefore, would not have been able to run.

Scenario 2: Implicit Allocation

In this scenario, all nodes have both css0/US and css0/IP sessions configured, and are assigned to pool 2.

In this example, we have eight nodes; A, B, C, D, E, F, G, H.

Job 1

Job1 is interactive, and requests 4 nodes for US using MP_RMPOOL.

MP_PROCS=4
 
MP_RMPOOL=2
 
MP_EUILIB=us

LoadLeveler allocates nodes A, B, C, and D for dedicated adapter (forced for US) and dedicated CPU (default for MP_RMPOOL).

Job 2

Job 2 is interactive, and requests six nodes for US using host.list.

MP_PROCS=6
 
MP_HOSTFILE=./host.list
 
MP_EUILIB=us
 
MP_CPU_USE=multiple
MP_ADAPTER_USE=shared
host.list
 
     @2

POE forces the adapter request to be dedicated, even though the user specified shared. Multiple (shared CPU) is supported, but in this case LoadLeveler doesn't have six nodes, either for CPU or for adapter, so the job fails.

Job 3

Job 3 is interactive and requests six nodes for IP using MP_RMPOOL.

MP_PROCS=6
 
MP_RMPOOL=2
 
MP_EUILIB=ip

The defaults are shared adapter and shared CPU, but LoadLeveler only has four nodes available for CPU use, so the job fails.

Job 4

Job 4 is interactive and requests three nodes for IP using MP_RMPOOL.

MP_PROCS=3
 
MP_RMPOOL=2
 
MP_EUILIB=ip

The defaults are shared adapter and shared CPU. LoadLeveler allocates nodes E, F, and G.

Job 5

Job 5 is interactive and requests two nodes for IP using MP_RMPOOL.

MP_PROCS=2
 
MP_RMPOOL=2
 
MP_EUILIB=ip

The defaults are shared adapter and shared CPU. LoadLeveler allocates two nodes from the list E, F, G, H (the others are assigned as dedicated to job 1).

Scenario 3: Implicit Allocation

In this scenario, all nodes have both css0/US and css0/IP sessions configured, and are assigned to pool 2.

In this example, we have eight nodes; A, B, C, D, E, F, G, H

Job 1

Job 1 is interactive and requests four nodes for US using host.list.

MP_PROCS=4
 
MP_HOSTFILE=./host.list
 
MP_EUILIB=us
 
MP_CPU_USE=multiple
MP_ADAPTER_USE=dedicated
host.list
 
     @2

LoadLeveler allocates nodes A, B, C, and D for dedicated adapter (forced for US), and shared CPU.

Job 2

Job 2 is interactive and requests six nodes for US using host.list.

MP_PROCS=6
 
MP_HOSTFILE=./host.list
 
MP_EUILIB=us
 
MP_CPU_USE=multiple
MP_ADAPTER_USE=shared
host.list
 
     @2

POE forces the adapter request to be dedicated, even though the user has specified shared. Multiple (shared CPU) is supported, but in this case, LoadLeveler doesn't have six nodes for the adapter request, so the job fails.

Job 3

Job 3 is interactive and requests six nodes for IP using MP_RMPOOL.

MP_PROCS=6
 
MP_HOSTFILE=NULL
 
MP_EUILIB=ip
 
MP_RMPOOL=2

The defaults are shared adapter and shared CPU. LoadLeveler allocates six nodes for IP from the pool.

Job 4

Job 4 is interactive and requests three nodes for IP using MP_RMPOOL.

MP_PROCS=3
 
MP_HOSTFILE=NULL
 
MP_EUILIB=ip
 
MP_RMPOOL=2

The defaults are shared adapter and shared CPU. LoadLeveler allocates three nodes from the pool.

Scenarios for Allocating Nodes With the Resource Manager

This section provides some examples of how someone would allocate nodes using the Resource Manager.

Scenario 1: Explicit Allocation

A POE user, Paul, wishes to run a US job 1 in nodes A, B, C, and D. He doesn't mind sharing the node with other jobs, as long as they are not also running in US. To do this, he specifies MP_EUIDEVICE=css0, MP_EUILIB=us, and MP_PROCS=4. In his host file, he also specifies:

node_A dedicated multiple
 
node_B dedicated multiple
 
node_C dedicated multiple
 
node_D dedicated multiple

The POE Partition Manager (PM) sees that this is a US job, and asks the RM for dedicated use of the css0 adapter on nodes A, B, C, and D (regardless of whether you specify dedicated or shared in the host file), and shared use of the CPU on those nodes. The RM then allocates the nodes to the job, recording that the css0/US session on A, B, C, and D has been reserved for dedicated use by this job, but that the node may also be shared by other users.

node_B dedicated multiple
 
node_C dedicated multiple

The PM, as before, asks the RM for dedicated use of the css0/US adapter on nodes B and C. The RM determines that this adapter has already been reserved for dedicated use on nodes B and C, and does not allocate the nodes again to job 2. The allocation fails, and POE job 2 cannot run.

While job 1 is running, a second POE user, John, wishes to run IP/switch job 3 on nodes A, B, C, and D, but doesn't mind sharing the node and the High Performance Communication Adapter with other users. He specifies MP_EUIDEVICE=css0, MP_EUILIB=ip, MP_PROCS=4. In his host file, he also specifies;

node_A shared multiple
 
node_B shared multiple
 
node_C shared multiple
 
node_D shared multiple

The POE PM asks the RM, as requested by John, for shared use of the css0/ip adapter and CPU on nodes A, B, C, and D. The RM determines that job 1 permitted other jobs to run on those nodes as long as they did not use the css0/US session on them. The allocation succeeds, and POE IP/switch job 3 runs concurrently with POE US job 1 on A, B, C, and D.

The scenario above, illustrates a situation in which users do not mind sharing nodes with other users' jobs. If a user wants his POE job to have dedicated access to nodes or the css0 adapter on nodes, he would indicate that in the host file by specifying unique instead of multiple. If job 1 had done that, then job 3 would not have been allocated to those nodes and, therefore, would not have been able to run.

Scenario 2: Implicit Allocation

In this scenario, all nodes have both css0/US and css0/IP sessions configured, and are assigned to pool 2.

In this example, we have eight nodes; A, B, C, D, E, F, G, H.

Job 1

Job1 is interactive, and requests 4 nodes for US using MP_RMPOOL.

MP_PROCS=4
 
MP_RMPOOL=2
 
MP_EUILIB=us

The RM allocates nodes A, B, C, and D for dedicated adapter (forced for US) and dedicated CPU (default for MP_RMPOOL).

Job 2

Job 2 is interactive, and requests six nodes for US using host.list.

MP_PROCS=6
 
MP_HOSTFILE=./host.list
 
MP_EUILIB=us
 
host.list
 
     @2 shared multiple

POE forces the adapter request to be dedicated, even though the user specified shared. Multiple (shared CPU) is supported, but in this case the RM doesn't have six nodes, either for CPU or for adapter, so the job fails.

Job 3

Job 3 is interactive and requests six nodes for IP using MP_RMPOOL.

MP_PROCS=6
 
MP_RMPOOL=2
 
MP_EUILIB=ip

The defaults are shared adapter and shared CPU, but the RM only has four nodes available for CPU use, so the job fails.

Job 4

Job 4 is interactive and requests three nodes for IP using MP_RMPOOL.

MP_PROCS=3
 
MP_RMPOOL=2
 
MP_EUILIB=ip

The defaults are shared adapter and shared CPU. The RM allocates nodes E, F, and G.

Job 5

Job 5 is interactive and requests two nodes for IP using MP_RMPOOL.

MP_PROCS=2
 
MP_RMPOOL=2
 
MP_EUILIB=ip

The defaults are shared adapter and shared CPU. The RM allocates two nodes from the list E, F, G, H (the others are assigned as dedicated to job 1).

Scenario 3: Implicit Allocation

In this scenario, all nodes have both css0/US and css0/IP sessions configured, and are assigned to pool 2.

In this example, we have eight nodes; A, B, C, D, E, F, G, H

Job 1

Job 1 is interactive and requests four nodes for US using host.list.

MP_PROCS=4
 
MP_HOSTFILE=./host.list
 
MP_EUILIB=us
 
host.list
 
     @2 dedicated multiple

The RM allocates nodes A, B, C, and D for dedicated adapter (forced for US), and shared CPU.

Job 2

Job 2 is interactive and requests six nodes for US using host.list.

MP_PROCS=6
 
MP_HOSTFILE=./host.list
 
MP_EUILIB=us
 
host.list
 
     @2 shared multiple

POE forces the adapter request to be dedicated, even though the user has specified shared. Multiple (shared CPU) is supported, but in this case, the RM doesn't have six nodes for the adapter request, so the job fails.

Job 3

Job 3 is interactive and requests six nodes for IP using MP_RMPOOL.

MP_PROCS=6
 
MP_HOSTFILE=NULL
 
MP_EUILIB=ip
 
MP_RMPOOL=2

The defaults are shared adapter and shared CPU. The RM allocates six nodes for IP from the pool. There is no attempt to load balance with Job 1.

Job 4

Job 4 is interactive and requests three nodes for IP using MP_RMPOOL.

MP_PROCS=3
 
MP_HOSTFILE=NULL
 
MP_EUILIB=ip
 
MP_RMPOOL=2

The defaults are shared adapter and shared CPU. The RM allocates three nodes from the pool. There is no attempt to load balance with jobs 1 and 3.

Submitting a Batch POE Job using IBM LoadLeveler

Note:

POE version 2.4.0 is only compatible with LoadLeveler version 2.1.0. Submitting a POE version 2.4.0 batch job with an earlier version of LoadLeveler is not supported.

This section is intended for users who wish to submit batch POE jobs using IBM LoadLeveler, version 2.1.0. Refer to Using and Administering LoadLeveler for more information on using this job management system.

To submit a POE job using LoadLeveler, you need to build a LoadLeveler job file, which specifies:

The number of nodes to be allocated
Any POE options, passed via environment variables using Loadleveler's environment keyword, or passed as command line options using LoadLeveler's argument keyword.
The path to your POE executable (usually /usr/bin/poe).
Adapter specifications using the network keyword.

The following POE environment variables, or associated command line options, are validated, but not used, for batch jobs submitted using LoadLeveler.

MP_PROCS
MP_RMPOOL
MP_EUIDEVICE
MP_HOSTFILE
MP_SAVEHOSTFILE
MP_PMDSUFFIX
MP_RESD
MP_RETRY
MP_RETRYCOUNT
MP_ADAPTER_USE
MP_CPU_USE
MP_NODES
MP_TASKS_PER_NODE

To run myprog on five nodes, using a Token ring adapter for IP message passing, with the message level set to the info threshold, you could use the following LoadLeveler job file. The arguments myarg1 and myarg2 are to be passed to myprog.

#!/bin/ksh
 
# @ input = myjob.in
 
# @ output = myjob.out
 
# @ error = myjob.error
 
# @ environment = COPY_ALL; \
 
    MP_EUILIB=ip; \
 
    MP_INFO_LEVEL=2
 
# @ executable = /usr/bin/poe
 
# @ arguments = myprog myarg1 myarg2
 
# @ min_processors = 5
 
# @ requirements = (Adapter == "tokenring")
 
# @ job_type = parallel
 
# @ checkpoint = no

To run myprog on 12 nodes from pool 2, using the User Space message passing interface with the message threshold set to warning, you could use the following LoadLeveler job file. See the documentation provided with the LoadLeveler program product for more information.

#!/bin/ksh
 
# @ input = myusjob.in
 
# @ output = myusjob.out
 
# @ error = myusjob.error
 
# @ environment = COPY_ALL; MP_EUILIB=us
 
# @ executable = /usr/bin/poe
 
# @ arguments = myprog -infolevel 1
 
# @ min_processors = 12
 
# @ requirements = (Pool == 2) && (Adapter == "hps_user")
 
# @ job_type = parallel
 
# @ checkpoint = no

Notes:

If you are using the POE dynamically linked message passing interface support, you must set the MP_EUILIB environment variable or the -euilib command line option.
The first token of the arguments string in the LoadLeveler job file must be the name of the program to be run under POE, unless:
- You use the MP_CMDFILE environment variable or the -cmdfile command line option
- The file you specify with the keyword input contains the name(s) of the programs to be run under POE.
When setting the environment string, make sure that no white space characters follow the backslash, and that there is a space in between the semicolon and backslash.
When LoadLeveler allocates nodes for parallel execution, POE and task 0 will be executed on the same node.
When LoadLeveler detects a condition that should terminate the parallel job, a SIGTERM will be sent to POE. POE will then send the SIGTERM to each parallel task in the partition. If this signal is caught or ignored by a parallel task, LoadLeveler will ultimately terminate the task.
Programs that call the usrinfo function with the getinfo parameter, or programs that use the getinfo function, are not guaranteed to receive correct information about the owner of the current process.
Programs that use LAPI and also the LoadLeveler requirements keyword to specify Adapter="hps_user", must set the MP_MSG_API environment variable or associated command line option accordingly.
If the value of the MP_EUILIB, MP_EUIDEVICE, or MP_MSG_API environment variables that is passed as an argument to POE differs from the specification in the network statement of the job command file, the network specification will be used, and a warning message will be printed.

Running Programs Under the C Shell

During normal configuration of an SP system, the Automount Daemon (amd) is used to mount user directories. amd's maps use the symbolic file system links, rather than the physical file system links. While the Korn shell keeps track of file system changes, so that a directory is always available, this mapping does not take place in the C shell. This is because the C shell only maintains the physical file system links. As a result, users that run POE from a C shell may find that their current directory (for example /a/moms/fileserver/sis), is not known to amd, and POE fails with message 0031-214 (unable to change directory).

By default, POE uses the Korn shell pwd command to obtain the name of the current directory. This works for C shell users if the current directory is either:

The home directory
Not mounted by amd.

If neither of the above are true (for example, if the user's current directory is a subdirectory of the home directory), then POE provides another mechanism to determine the correct amd name; the MP_REMOTEDIR environment variable.

POE recognizes the MP_REMOTEDIR environment variable as the name of a command or Korn shell script that echoes a fully-qualified file name. MP_REMOTEDIR is run from the current directory from which POE is started.

If you do not set MP_REMOTEDIR, the command defaults to pwd, and is run as ksh -c pwd. POE sends the output of this command to the remote nodes and uses it as the current directory name.

You can set MP_REMOTEDIR to some other value and then export it. For example, if you set MP_REMOTEDIR="echo /tmp", the current directory on the remote nodes becomes /tmp on that node, regardless of what it is on the home node.

The script mpamddir is also provided in /usr/lpp/ppe.poe/bin, and the setting MP_REMOTEDIR=mpamddir will run it. This script determines whether or not the current directory is a mounted file system. If it is, the script searches the amd maps for this directory, and constructs a name for the directory that is known to amd. You can modify this script or create additional ones that apply to your installation.
Note: Programs that depend upon the name of the current directory for correct operation may not function properly with an alternate directory name. In this case, you should carefully evaluate how to provide an appropriate name for the current directory on the home nodes.

If you are executing from a subdirectory of your home directory, and your home directory is a mounted file system, it may be sufficient to replace the C shell name of the mounted file system with the contents of $HOME. One approach would be:

export MP_REMOTEDIR=pwd.csh

or for C shell users:

setenv MP_REMOTEDIR pwd.csh

where the file pwd.csh is:

#!/bin/csh -fe
 
# save the current working directory name
 
set oldpwd = &rprime.pwd&rprime.
 
# get the name of the home directory
 
cd $HOME
 
set hmpwd = &rprime.pwd&rprime.
 
# replace the home directory prefix with the contents of $HOME
 
set sed_home = &rprime.echo $HOME | sed 's/\//\\\//g'&rprime.
 
set sed_hmpwd = &rprime.echo $hmpwd | sed 's/\//\\\//g'&rprime.
 
set newpwd = &rprime.echo $oldpwd | sed "s/$sed_hmpwd/$sed_home/"&rprime.
 
# echo the result to be used by amd
 
echo $newpwd

Using MP_CSS_INTERRUPT

The MP_CSS_INTERRUPT environment variable may take the value of either yes or no. By default it is set to no. In certain applications, setting this value to yes will provide improved performance.

The following briefly summarizes some general application characteristics that could potentially benefit from setting MP_CSS_INTERRUPT=yes.

Applications which have the following characteristics may see performance improvements from setting the POE environment variable MP_CSS_INTERRUPT to yes:

Applications that use nonblocking send or receive operations for communication.
Applications that have non-synchronized sets of send or receive pairs. In other words, the send from node0 is issued at a different point in time with respect to the matching receive in node1.
Applications that do not issue waits for nonblocking send or receive operations immediately after the send or receive, but rather do some computation prior to issuing the waits.

In all of the above cases, the application is taking advantage of the asynchronous nature of the nonblocking communication subroutines. This essentially means that the calls to the nonblocking send or receive routines do not actually ensure the transmission of data from one node to the next, but only post the send or receive and then return immediately back to the user application for continued processing. However, since the SP communication subsystem is a user space protocol and executes within the user's process, it must regain control from the application to complete asynchronous requests for communication.

The SP communication subsystem can regain control from the application in any one of three different methods:

Any subsequent calls to the SP communication subsystem to post send or receive, or to wait on messages.
A timer signal is received periodically to allow the communication subsystem to do recovery from transmission errors.
If the value of MP_CSS_INTERRUPT is set to yes, the communication subsystem device driver will send a signal to the user application when data is received or buffer space is available to transmit data.

Method 1 and Method 2 are always enabled. Method 3 is controlled by the POE environment variable MP_CSS_INTERRUPT, and is enabled when this variable is set to yes.

For those applications that have the characteristics mentioned above, this implies that when using asynchronous communication the completion of the communication must occur through one of the these three methods. In the case that MP_CSS_INTERRUPT is not enabled, only the first two methods are available to process communication. Depending upon the amount of time between the non-synchronized send or receive pairs, or between the nonblocking send or receive and the corresponding waits, the actual transmission of data may only complete at the matching wait call. If this is the case, it is possible that an application may see a performance degradation due to unnecessary processor stalling waiting for communication.

As an example, consider the following application template, where both processors execute the same code, and processor 0 sends and receives data from processor 1.

     DO LOOP
 
        MP_SEND (A ...., msgid1)
 
        MP_RECV (B ...., msgid2)
 
 
 
        MP_WAIT (msgid2, nbytes)
 
 
 
        COMPUTE LOOP1 (uses B)
 
 
 
        MP_WAIT (msgid1, nbytes)
 
 
 
        COMPUTE LOOP2 (modifies A)
 
     ENDDO

In this example, application B is guaranteed to be received after the wait for msgid2, and more than likely the data is actually received during the wait call. B can then be safely used in the compute loop1. A is not guaranteed to be sent until the wait for msgid1. Therefore, A cannot be modified until after this wait.

With MP_CSS_INTERRUPT=no, it is likely that processor0 receives B during the wait for msgid2, and enters the compute loop1 before the send of A has completed. In this case, processor1 will stall waiting for the completion of the wait for msgid2, which will not complete until processor0 completes the compute loop1 and reaches the wait for msgid1. The stalling of processor1 is directly related to the non-continuous flow of communication. If MP_CSS_INTERRUPT=yes, when the communication is ready to complete, the communication subsystem device driver sends a signal to the application and causes the application to immediately complete the communication. Therefore data flow is continuous and smooth. The send of A can be completed, even during the compute loop1, preventing the stalling of processor1 and improving overall performance of this application.

Finally, it should be noted that there is a cost associated with handling the signals when MP_CSS_INTERRUPT is set to yes. In some cases, this cost can degrade application performance. Therefore, MP_CSS_INTERRUPT should only be used for those applications that require it. For the IP version of the library, MP_CSS_INTERRUPT=yes enables UDP to send a SIGIO signal when a message packet is received.

Support for Performance Improvements

POE provides interfaces to improve interrupt mode latency in general, and to increase performance of the receive-and-call mechanism.

Interrupt Mode Improvements

When a node receives a packet and an interrupt is generated, the interrupt handler checks its tables for the process identifier (PID) of the user process and notifies the process. The signal handler or service thread waits for at least two times the interrupt delay, checking to see if more packets will arrive. Waiting for more packets avoids the cost of incurring an interrupt each time a new packet arrives (interrupt processing is very expensive). However, the more packets that arrive, the more delay time is increased. Therefore, with the functions you can either tune the delay parameter based on your application, and/or dynamically turn interrupts on or off at selected nodes.

For an application with few nodes exchanging small messages, it will help latency if you keep the interrupt delay small. For an application with a large number of nodes, or one which exchanges large messages, keeping the delay parameter large will help the bandwidth. A large delay allows multiple read transmissions to occur in a single read cycle. You should experiment with different values and use the functions described below to achieve desired performance, depending on the communication pattern.

MP_INTRDELAY is the environment variable which allows you to set the delay parameter for how long the signal handler or service thread waits for more data. The delay specified in the environment variable is set during initialization, before running the program. In this way, user programs can tune the delay parameter without having to recompile existing applications. If none is specified, the default value of 1 microsecond is used. The application can tune this parameter based on the communication pattern it has in different parts of the application.

Five application programming interfaces are provided to help you enable or disable interrupts on specific tasks, based on the communication patterns of the tasks. If a task is frequently in the communication library, then the application can turn interrupts off for that particular task for the duration of the program. The application can enable interrupts when the task is not going to be in the communication subsystem often. The enable or disable interfaces override the setting of the MP_CSS_INTERRUPT environment variable.

The first two functions allow you to query what the current delay parameter is and to set the delay parameter to a new value.

int mpc_queryintrdelay() - for C programs

void mp_queryintrdelay(int rc) - for Fortran programs

This function returns the current interrupt delay (in microseconds). If none was set by the user, the default is returned.

int mpc_setintrdelay(int val) - for C programs

void mp_setintrdelay(int val, int rc) - for Fortran programs

This function sets the delay parameter to the value, in microseconds, specified by "val". The function can be called at multiple places within the program to set the delay parameter to different values during execution.

The following three functions allow you to control dynamically masking interrupts on individual nodes, and query the state of interrupts. In the current system only "all" nodes or "none" can be selected to statically enable or disable running in interrupt mode.

int mpc_queryintr() - for C programs

void mp_queryintr(int rc) - for Fortran programs

This function returns 0 if the node on which it is executed has interrupts turned off, and it returns 1 otherwise.

int mpc_disableintr() - for C programs

void mp_disableintr(int rc) - for Fortran programs

This function disables interrupts on the node on which it is executed. Return code = 0, if successful, -1 otherwise.

int mpc_enableintr() - for C programs

void mp_enableintr(int rc) - for Fortran programs

This function enables interrupts on the node on which it is executed. Return code = 0, if successful, -1 otherwise.
Note: The last two of the above functions override the setting of the environment variable MP_CSS_INTERRUPT. If they are not used properly they can deadlock the application. Please use these functions only if you are sure of what you are doing. These functions are useful in reducing latency if the application is doing blocking recv/wait and interrupts are otherwise enabled. Interrupts should be turned off before executing blocking communication calls and turned on immediately after those calls.

All of the above functions can also be used for programs running IP.

Rcvncall Improvements

The mpc_wait function can be called just before re-posting the Rcvncall instead of in the beginning of the Rcvncall handler, if information provided by the wait function call (like length of message) is already available. This removes the wait time from the critical path for latency. The wait function provides the message id, the length of the message, and also cleans up the resources used by the previously posted Rcvncall. This applies to the signal-handling MPI/MPL library only.

Parallel File Copy Utilities

During the course of developing and running parallel applications on numerous nodes, the potential need exists to efficiently copy data and files to and from a number of places. POE provides three utilities for this reason:

mcp - to copy a single file from the home node to a number of remote nodes. This was discussed briefly in "Step 2: Copy Files to Individual Nodes".
mcpscat - to copy a number of files from task 0 and scatter them in sequence to all tasks, in a round robin order.
mcpgath - to copy (or gather) a number of files from all tasks back to task 0.

mcp is for copying the same file to all tasks. The input file must reside on task 0. You can copy it to a new name on the other tasks, or to a directory. It accepts the source file name and a destination file name or directory, in addition to any POE command line argument, as input parameters.

mcpscat is intended for distributing a number of files in sequence to a series of tasks, one at a time. It will use a round robin ordering to send the files in a one to one correspondence to the tasks. If the number of files exceeds the number of tasks, the remaining files are sent in another round through the tasks.

mcpgath is for when you need to copy a number of files from each of the tasks back to a single location, task 0. The files must exist on each task. You can optionally specify to have the task number appended to the file name when it is copied.

Both mcpscat and mcpgath accept the source file names and a destination directory, in addition to any POE command line argument, as input parameters. You can specify multiple file names, a directory name (where all files in that directory, not including subdirectories, are copied), or use wildcards to expand into a list of files as the source. Wildcards should be enclosed in double quotes, otherwise they will be expanded locally, which may not produce the intended file name resolution.

These utilities are actually message passing applications provided with POE. Their syntax is described in Appendix A. "Parallel Environment Commands".

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]