This appendix documents various limitations, restrictions, and programming considerations for user applications written to run under the IBM Parallel Environment for AIX (PE) licensed program.

PE includes two versions of the message passing libraries. These are called the signal-handling library and the threaded library.

The signal-handling library uses AIX signals as an asynchronous way to move data in and out of message buffers. The signals also ensure that message packets are acknowledged and retransmitted when necessary. The library supports both MPL and MPI calls.
The threaded library uses AIX kernel threads for the same message passing tasks. It supports MPI only. The threaded library also supports message passing on user-created threads. The threaded library is required if MPI coexists with the other user space protocols, for example, the LAPI interface on the IBM RS/6000 SP.

This appendix consists of sections that list the programming considerations common to both libraries, as well as those unique to either the signal-handling library or the threaded library. There is also a subsection on using POE and the Fortran compiler. Specifically, the sections are as follows:

"MPI Signal-Handling and MPI Threaded Library Considerations"
"MPI Signal-Handling Library Considerations"
"MPI Threaded Library Considerations"
"Fortran Considerations"

MPI Signal-Handling and MPI Threaded Library Considerations

The information in this section pertains to both the (MPL/MPI) signal-handling library and the MPI threaded library.

Environment Overview

As the end user, you are encouraged to think of the Parallel Operating Environment (POE) (also referred to as the poe command) as an ordinary (serial) command. It accepts redirected I/O, can be run under the nice and time commands, interprets command flags, and can be invoked in shell scripts.

An n-task parallel job running in the Parallel Operating Environment actually consists of the n user tasks, an equal number (n) of instances of the IBM Parallel Environment for AIX pmd daemon (which is the parent task of the user's task), and the POE home node task in which the poe command runs. A pmd daemon is started by the POE home node on each machine on which each user task runs, and serves as the point of contact between the home node and the user's tasks.

The POE home node routes standard input, standard output and standard error streams between the home node and the user's tasks via the pmd daemon, using TCP/IP sockets for this purpose. The sockets are created when the POE home node starts the pmd daemon for each task of a parallel job. The POE home node and pmd also use the sockets to exchange control messages to provide task synchronization, exit status and signaling. These capabilities do not depend upon the message passing library and are available to control any parallel program run by the poe command.

Exit Status

Exit status is a value between 0 and 255 inclusive. It is returned from POE on the home node reflecting the composite exit status of your parallel application, as follows:

If MPI_ABORT(comm,nn>0,ierror) or MPI_Abort(comm,nn>0) is called, the exit status is nn (mod 256).
If MP_STOPALL(nn>=0) or mpc_stopall(nn>=0) is called, the exit status is nn (mod 256). This does not apply to threaded libraries.
If all tasks terminate via exit(MM>=0) or STOP MM>=0 and MM is not equal to 1 and is <128 for all nodes, then POE provides a synchronization barrier at the exit. The exit status is the largest value of MM from any task of the parallel job (mod 256).
If any task terminates via exit(MM=1) or STOP MM=1, then POE will immediately terminate the parallel job, as if MP_STOPALL(1) or MPI_Abort(MPI_COMM_WORLD,1) had been called. This may also occur if a Fortran I/O library error occurs.
If any task terminates via a signal (for example, a segment violation), the exit status is 128+signal and the entire job is immediately terminated.
If POE terminates before the start of the user's application, the exit status is 1.
If the user's application cannot be loaded or fails before the user's main() is called, the exit status is 255.
You should explicitly call exit(MM) or STOP MM to set the desired exit code. A program exiting without an explicit exit value returns unpredictable status, and may result in causing premature termination of the parallel application.
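
As a rough illustration of the rules above, a script on the home node might classify the status returned by the poe command along the following lines. This is a minimal sketch and not something POE provides: the function name describe_poe_status and its output labels are hypothetical. Several rules can produce the same number (for example, 255 can come from a load failure or from a value passed to MPI_Abort), so the labels are hints rather than a definitive diagnosis.

```sh
# describe_poe_status is a hypothetical helper, not part of POE.
# It labels a composite exit status according to the rules listed above.
describe_poe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "normal"
    elif [ "$status" -eq 1 ]; then
        # exit(1), STOP 1, a stopall/abort with 1, or POE ended before startup
        echo "terminated-early"
    elif [ "$status" -lt 128 ]; then
        # value passed to exit, STOP, MP_STOPALL or MPI_Abort
        echo "user-code:$status"
    elif [ "$status" -eq 255 ]; then
        # load failure, or a user-supplied value that is 255 (mod 256)
        echo "load-failure-or-255"
    else
        # assumes the task died from a signal (status = 128 + signal)
        echo "signal:$(( $status - 128 ))"
    fi
}

# Typical use on the home node (my_program is a placeholder name):
#   poe my_program; describe_poe_status $?
describe_poe_status 139
```

The final line shows the helper applied to a status of 139, which under the 128+signal rule corresponds to signal 11.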

POE Job Step Function

The POE job-step function is intended for the execution of a sequence of separate yet inter-related dependent programs. Therefore, it provides you with a job control mechanism that allows both job-step progression and job-step termination. The job control mechanism is the program's exit code.

Job-step progression:
POE continues the job-step sequence if the task exit code is 0 or in the range of 2 - 127.
Job-step termination:
POE terminates the parallel job, and does not execute any remaining user programs in the job-step list if the task exit code is 1 or greater than 127.
Default termination:
Any POE infrastructure detected failure (such as failure to open pipes to the child task or an exec failure to start the user's executable) terminates the parallel job, and does not execute any remaining user programs in the job-step queue.
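
The progression and termination rules can be restated as a small shell sketch. The helper names job_step_continues and last_step_run are hypothetical and only model the decision described above; they do not invoke POE, and the exit codes passed in are stand-ins for the codes that real job steps would return.

```sh
# job_step_continues is a hypothetical helper that restates the rule above:
# it succeeds when the exit code lets the job-step sequence continue.
job_step_continues() {
    code=$1
    if [ "$code" -eq 0 ]; then
        return 0
    elif [ "$code" -ge 2 ] && [ "$code" -le 127 ]; then
        return 0
    fi
    return 1    # 1 or greater than 127: terminate the sequence
}

# last_step_run takes one stand-in exit code per job step and prints the
# number of the last step that would be executed.
last_step_run() {
    n=0
    for c in "$@"; do
        n=$(( $n + 1 ))
        job_step_continues "$c" || break
    done
    echo "$n"
}

# Four steps whose codes are 0, 5, 1 and 0: the third step stops the sequence.
last_step_run 0 5 1 0
```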

POE Additions To The User Executable

POE links in the following routines when your executable is compiled with any of the POE compilation scripts (mpcc, mpcc_r, mpxlf, and so on).

Signal Handlers

POE installs signal handlers for most signals that cause program termination in order to notify the other tasks of termination and to complete the VT trace file, if enabled. POE then causes the program to exit normally with a code of (128+signal). When running non-threaded applications under POE, you may install a signal handler for any of these signals, and it should call the POE registered signal handler if the task decides to terminate. (See "Let POE Handle Signals When Possible".) When running threaded applications, any attempt to install a signal handler is ignored.

Signals that are specifically handled by POE or the message passing library follow:

SIGHUP
Caught and exits with an exit code of 128+SIGHUP.
SIGINT
Caught and exits with an exit code of 128+SIGINT.
Note: This signal may be caught by user or by dbx, in which case this usage is ignored.
SIGQUIT
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGQUIT. The exit handler dumps the user's context and takes the default signal action.
SIGFPE
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGFPE. The exit handler dumps the user's context and takes the default signal action.
SIGSEGV
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGSEGV. The exit handler dumps the user's context and takes the default signal action.
SIGBUS
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGBUS. The exit handler dumps the user's context and takes the default signal action.
SIGTERM
Caught and exits with an exit code of 128+SIGTERM. This is also used by POE to signal orderly termination of a parallel job. If it must be caught by the user, please read carefully the section on program termination (below).
SIGSTOP
Default action (cannot be caught)
SIGTSTP
Default action
SIGCONT
Default action
SIGPWR
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGPWR. The exit handler dumps the user's context and takes the default signal action.
SIGDANGER
Caught and exits with an exit code of 128+SIGDANGER.

The signal-handling library uses SIGIO, SIGALRM and SIGPIPE for its operations and it also handles these signals. For more information about the signal-handling library, see "MPI Signal-Handling Library Considerations". For more information about signals, see "Use of AIX Signals".

Replacement exit/atexit

POE requires its own versions of the library exit()/atexit() functions, and expects to load them dynamically from its own version of libc.a (or libc_r.a) in /usr/lpp/ppe.poe/lib; therefore, do not code your own exit function to override the library function. This is to synchronize profiling and to provide barrier synchronization upon exit.

Let POE Handle Signals When Possible

Programs that handle signals must coordinate with POE's handling of most of the common signals (see above).

DO NOT issue message passing calls from signal handlers. Also, many AIX library calls are not "signal safe", and should not be issued from signal handlers. Check the AIX Technical Reference (function sigaction()) for a list of AIX functions callable from signal handlers.

POE sets up signal handlers for all the signals that normally terminate program execution. It does this so that it can terminate the entire parallel job in an orderly fashion if one task terminates abnormally (via signal). A user program may install a handler for any or all of these signals, but should save the address of the POE signal handler. If the user program decides to terminate, it should call the POE signal handler. If the user program decides not to terminate, it should just return to the interrupted code. SIGTERM is used by POE to shut down the parallel job in a variety of abnormal circumstances, and should be allowed to terminate the job.

The POE home node converts a user's SIGTSTP signal (Ctrl-z) to a SIGSTOP signal to all the remote nodes, and passes the SIGCONT signal sent by the fg or bg command to all the remote nodes to restart the job.

Don't Hard Code File Descriptor Numbers

Do not use hard coded file descriptor numbers beyond those specified by STDIN, STDOUT and STDERR.

POE opens several files and uses file descriptors as message passing handles. These are allocated before the user gets control, so the first file descriptor allocated to a user is unpredictable.

Termination Of A Parallel Job

POE provides for orderly termination of a parallel job, so that all tasks terminate at the same time. This is accomplished in the atexit routine registered at program initialization. For normal exits (codes 0, 2-127), the atexit routine sends a control message to the POE home node, and waits for a positive response. For abnormal exits and those which don't go through the atexit routine, the pmd daemon catches the exit code and sends a control message to the POE home node.

For normal exits, when POE gets a control message for every task, it responds to each node, allowing that node to exit normally with its individual exit code. The pmd daemon monitors the exit code and passes it back to the POE home node for presentation to the user.

For abnormal exits and those detected by pmd, POE sends a message to each pmd asking that it send a SIGTERM signal to its task, thereby terminating the task. When the task finally exits, pmd sends its exit code back to the POE home node and exits itself.

User-initiated termination of the POE home node via SIGINT (Ctrl-c) and/or SIGQUIT (Ctrl-\) causes a message to be sent to pmd asking that the appropriate signal be sent to the parallel task. Again, pmd waits for the task to die then terminates itself.

Your Program Can't Run As Root

To prevent uncontrolled root access to the entire parallel job computation resource, POE checks to see that the user is not root as part of its authentication.

AIX Function Limitations

The use of the following AIX functions may be limited, but no formal testing has been done:

wide character sets
shared memory - the message passing library uses shared memory for adapter mapping. You can use the remaining data segments as desired.
getuinfo does not show terminal information, since the user program running in the parallel partition does not have an attached terminal.

Shell Execution

You can have POE run a shell script which is loaded and run on the remote nodes as if it were a binary file.

If the POE home node task is not started under the Korn shell, mounted file system names may not be mapped correctly to the names defined for the automount daemon or AIX equivalent running on the IBM RS/6000 SP. See the IBM Parallel Environment for AIX: Operation and Use, Volume 1 for a discussion of alternative name mapping techniques.

The program executed by POE on the parallel nodes does not run under a shell on those nodes. Redirection and piping of STDIO applies to the POE home node (poe binary), and not the user's code. If shell processing of a command line is desired on the remote nodes, invoke a shell script on the remote nodes to provide the desired preprocessing before the user's application is executed.

Do Not Rewind stdin, stdout Or stderr

The partition manager daemon uses pipes to direct stdin, stdout and stderr to the user's program; therefore, do not rewind these files.

Ensuring String Arguments Are Passed To Your Program Correctly

Quotation marks, either single or double, used as argument delimiters are stripped away by the shell and are never "seen" by poe. Therefore, the quotation marks must be escaped to allow the quoted string to be passed correctly to the remote task(s) as one argument. For example, if you want to pass the following string to the user program (including the imbedded blank)

a b

then you need to enter the following:

 
    poe user_program \"a b\"

user_program is passed the following argument as one token:

a b

Without the backslashes, the string would have been treated as two arguments (a and b).

POE behaves like rsh when arguments are passed to POE. Therefore, the following:

 
    poe user_program "a b"

is equivalent to:

 
    rsh some_machine user_program "a b"

In order to pass the string argument as one token, the quotes have to be escaped.
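
The effect of the second round of parsing can be reproduced without POE. The sketch below is only a model: reparse_count is a hypothetical helper that joins its arguments with blanks and lets a second shell parse the result, which is the rsh-like behavior described above. It does not run the poe command.

```sh
# reparse_count is a hypothetical stand-in for rsh-like argument handling:
# the arguments are joined with blanks and parsed again by a second shell,
# which then reports how many arguments it sees.
reparse_count() {
    sh -c "set -- $*; echo \$#"
}

# The calling shell strips the quotes, so the second parse sees two words.
reparse_count "a b"

# The escaped quotes survive the first parse and group the words in the second.
reparse_count \"a b\"
```

The first call prints 2 and the second prints 1, matching the behavior described for poe user_program "a b" and poe user_program \"a b\".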

Network Tuning Considerations

Programs generating large volumes of STDOUT or STDERR may overload the home node. As described previously, standard output and standard error files generated by a user's program are piped to pmd, then forwarded to the poe binary via a TCP/IP socket. It is possible to generate so much data that the IP message buffers on the home node are exhausted and the poe binary hangs (and possibly the entire node may hang). Note that the option -stdoutmode (environment variable MP_STDOUTMODE) controls which output stream is displayed by the poe binary, but does not limit the standard output traffic received from the remote nodes, even if set to display the output of just one node.

The POE environment variable MP_SNDBUF can be used to override the default network settings for the size of the TCP buffers used.

If you have large volumes of standard I/O, work with your network administrator to establish appropriate TCP/IP tuning parameters. You may also want to examine if using named pipes is appropriate for your application.

Standard I/O Requires Special Attention

When your program runs on the remote nodes, it has no controlling terminal. STDIN, STDOUT, and STDERR are always piped.

Programs that depend on piping standard input or standard output as part of a processing sequence may wish to bypass the home node poe binary. Running the poe command (or starting a program compiled with one of the POE compile scripts) causes the poe binary to be loaded on the machine on which you typed the command (the POE home node). The poe binary, in turn, starts a daemon named pmd on each parallel node assigned to run the job, and then requests pmd to run your executable (via fork and exec). The poe binary reads STDIN and passes it to each of the parallel tasks via a TCP/IP socket connection to the pmd daemon, which pipes it to the user. Similarly, STDOUT and STDERR from the user are piped to pmd and sent on the socket back to the home node, where it is written to the poe binary's STDOUT and STDERR descriptors. If you know that the task reading STDIN or writing STDOUT must be on the same node (processor) as the poe binary (the poe home node), named pipes can be used to bypass poe's reading and forwarding STDIN and STDOUT.

If STDIN is piped or redirected to the poe binary (via ordinary pipes), and your application is linked with the signal-handling message passing library (via mpcc, mpxlf, or mpCC), then set the environment variable MP_HOLD_STDIN to "yes". This lets poe initialize the signal-handling library before handling the STDIN file.

If your application is linked with the threaded library, see "Standard I/O Requires Special Attention" for more information.

STDIN/STDOUT Piping Example

The following two scripts show how STDIN and STDOUT can be piped directly between pre- and post-processing steps, bypassing the POE home node task. This example assumes that parallel task 0 is known or forced to be on the same node as the POE home node.

The script compute_home runs on the home node; the script compute_parallel runs on the parallel nodes (those running tasks 0 through n-1).

compute_home:
#! /bin/ksh
# Example script compute_home runs three tasks:
#    data_generator creates/gets data and writes to stdout
#    data_processor is a parallel program that reads data
#      from stdin, processes it in parallel, and writes
#      the results to stdout.
#    data_consumer reads data from stdin and summarizes it
#
mkfifo poe_in_$$
mkfifo poe_out_$$
export MP_STDOUTMODE=0
export MP_STDINMODE=0
data_generator >poe_in_$$ |
     poe compute_parallel poe_in_$$ poe_out_$$ data_processor |
     data_consumer <poe_out_$$
rc=$?
rm poe_in_$$
rm poe_out_$$
exit $rc

compute_parallel:
#! /bin/ksh
# Example script compute_parallel is a shell script that
#    takes the following arguments:
#    1) name of input named pipe (stdin)
#    2) name of output named pipe (stdout)
#    3) name of program to be run (and arguments)
#
poe_in=$1
poe_out=$2
shift 2
$*   <$poe_in   >$poe_out

Reserved Environment Variables

Environment variables starting with MP_ are intended for use by POE, and should be set only as instructed in the documentation. POE also uses a handful of MP_... environment variables for internal purposes, which should not be interfered with.

AIX Message Catalog Considerations

POE assumes that NLSPATH contains the appropriate POE message catalogs, even if LANG is set to "C" or is unset. Duplicate message catalogs are provided for languages "En_US", "en_US", and "C".

Language Bindings

The Fortran, C and C++ bindings for MPI are contained in the same library and can be freely intermixed.

libmpi.a for the signal-handling version
libmpi_r.a for the threaded version

Refer to "Fortran Considerations" for more information about the Fortran compiler.

The AIX compilers support the flag -qarch. This option allows you to target code generation to a particular processor architecture. While this option can provide performance enhancements on specific platforms, it inhibits portability, particularly between the Power and PowerPC machines. The MPI library is not targeted to a specific architecture and is the same on PowerPC and Power nodes.

The MPI-IO functions from MPI-2 are only available with the threaded library.

Available Virtual Memory Segments

AIX makes available up to 11 additional address segments for end user programs. The MPI libraries use some of these as listed in Table 16. The remaining segments are available to the user for either extended heap (-bmaxdata option) or shared memory (shmget). Very large jobs, which include all jobs with more than 1000 tasks, will need to use the -bmaxdata option to ensure a large enough heap.

Table 16. Memory Segments Used By the MPI and LAPI Libraries

Component	RS/6000 SP node with switch	RS/6000 workstation or no switch
MPI User Space	2	not available
MPI IP	1*	0
VT Trace Capture	1	0
LAPI User Space	2	not available

* If the environment variable MP_CLOCK_SOURCE=AIX, the value is 0.

Using the SP Switch Clock as a Time Source

The RS/6000 SP switch clock is a globally-synchronized counter that may be used as a source for the MPI_WTIME function, provided that all tasks are run on nodes of the same SP system. The environment variable MP_CLOCK_SOURCE provides additional control. Table 17 shows how the clock source is determined. MPI guarantees that MPI_WTIME_IS_GLOBAL has the same value at every task.

Table 17. How the Clock Source Is Determined

MP_CLOCK_SOURCE	Library Version	All Nodes SP?	Source Used	MPI_WTIME_IS_GLOBAL
not set	ip	yes	switch	false
		no	AIX	false
	us	yes	switch	true
		no	Error
SWITCH	ip	yes*	switch	false
		no	AIX	false
	us	yes	switch	true
		no	Error
AIX	ip	yes	AIX	false
		no	AIX	false
	us	yes	AIX	false
		no	AIX	false

* The user is responsible for ensuring all of the nodes are in the same SP system.
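
The decision table can also be read as a lookup. The following sketch encodes Table 17 as a shell function; the name clock_source and its argument convention are hypothetical, and the function only restates the table rather than querying POE or the switch.

```sh
# clock_source is a hypothetical helper that restates Table 17.
#   $1: value of MP_CLOCK_SOURCE ("" when not set, SWITCH, or AIX)
#   $2: library version (ip or us)
#   $3: whether all nodes are in the same SP system (yes or no)
# It prints "<source used> <MPI_WTIME_IS_GLOBAL>", or "Error".
clock_source() {
    setting=$1
    lib=$2
    all_sp=$3
    case "$setting" in
        AIX)
            echo "AIX false" ;;
        ""|SWITCH)
            case "$lib:$all_sp" in
                ip:yes) echo "switch false" ;;
                ip:no)  echo "AIX false" ;;
                us:yes) echo "switch true" ;;
                us:no)  echo "Error" ;;
                *)      return 2 ;;   # argument not covered by the table
            esac ;;
        *)
            return 2 ;;               # setting not covered by the table
    esac
}

# User Space library, MP_CLOCK_SOURCE not set, all nodes in one SP system.
clock_source "" us yes
```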

32-Bit and 64-Bit Support

POE compiles and runs all applications as 32-bit applications. 64-bit applications are not supported yet.

Running Applications With Large Numbers of Tasks

If you plan to run your parallel applications with a large number of tasks (more than 256), the following tips may improve stability and performance:

Use a host list file with the switch IP names, instead of the IP host name.
You may avoid a potential problem running out of memory by linking applications with a data buffer using data segment three (3), by specifying the -bD:0x3000000 loader option. The default is to use data segment zero.
To avoid potential problems opening sockets, increase the user resource limit for the number of open file descriptors (nofiles) to at least 10,000, using the ulimit command. For example:
```
ulimit -n 10000
```
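
A launch script can check the limit before starting a large job. The sketch below is one possible approach: the helper name nofiles_sufficient is hypothetical, and the threshold of 10000 is taken from the recommendation above.

```sh
# nofiles_sufficient is a hypothetical helper: it succeeds when the current
# limit (first argument) is "unlimited" or at least the required value.
nofiles_sufficient() {
    current=$1
    required=$2
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}

# Compare the limit in effect for this shell with the recommended minimum.
if nofiles_sufficient "$(ulimit -n)" 10000; then
    echo "nofiles limit is adequate"
else
    echo "nofiles limit is below 10000; raise it before starting poe"
fi
```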

MPI Signal-Handling Library Considerations

The information in this subsection provides you with specific additional programming considerations for when you are using POE and the MPL/MPI signal-handling library.

POE Gets Control First And Handles Task Initialization

POE sets up its environment via the entry point mp_main(). mp_main() initializes the message passing library, sets up signal handlers, sets up an atexit routine, and initializes VT trace data collection before calling your main program.

Using Message Passing Handlers

Only a subset of MPL message passing is allowed on handlers created by the MPL Receive and Call function (mpc_rcvncall or MP_RCVNCALL). MPI calls on these handlers are not supported.

POE Additions To The User Executable

POE links in the following routines when your executable is compiled with mpcc, mpxlf or mpCC. These are routines specific for the signal handling environment.

Message Passing Initialization Module

POE initializes the parallel message passing library and determines that all nodes can communicate successfully before the user main() program gains control. As a result, any program compiled with the POE compiler scripts must be run under the control of POE and is not suitable as a serial program.

If communication initialization fails, the parallel task is terminated with an appropriate exit code.

Signal Handlers

The message passing library sets up signal handlers for SIGALRM, SIGIO and SIGPIPE to manage message passing activity. A user program may install a handler for any or all of these signals, but should save the address of and invoke the POE signal handler before returning to the interrupted code. The sigaction() function returns the required structure. Also, set SA_RESTART as well as the mask so all signals are masked when the signal handler is running.

The following are the signals used and specifically handled by the message passing library in a signal handling environment:

SIGPIPE
Caught by the non-threaded User Space message passing library to manage the RS/6000 SP switch. If your application catches this signal, it should call the registered message passing signal handler before returning to the main code.
Do not block this signal for more than a few milliseconds.
SIGALRM
Caught by message passing library to manage message traffic. If you provide your own interval timing mechanism, then you should arrange to call the POE signal handler approximately every 200-800 milliseconds. Message passing calls from user programs may be blocked until the POE signal handler is called.
If the user application catches this signal but doesn't do interval timing, it should call the registered message passing signal handler before returning to the main code.
SIGIO
Caught by the user space message passing library to manage message traffic. If your application catches this signal, it should call the registered message passing signal handler before returning to the main code.

Interrupted System Calls

The message passing library uses an interval timer to manage message traffic, specifically to ensure that messages progress even when message passing calls are not being made. When this interval timer expires, a SIGALRM signal is sent to the program, interrupting whatever computation is in progress. The message passing library has a signal handler set, and normally handles the signal and returns to the user's program without the program's knowledge. However, the following library and system calls are interrupted and do not complete normally. The user is responsible for testing whether an interrupt occurred and recovering from the interrupt. In many cases, this is accomplished by just retrying the call.

sleep (see note below)/usleep/nsleep
select
open/close/fopen/fclose
pause
sigpause
accept
connect
recv/recvfrom/recvmsg
send/sendto/sendmsg
aio_read/aio_write/aio_suspend
fork
system
exec/execv/...
msem_lock/semop
AIX msg... routines
poll

Note: The normal timer interval is less than 500 milliseconds. As a result, an interrupted sleep call (with time specified in seconds) returns the original sleep interval, due to rounding, and can't be used to determine how much time remains in the interval. You should use the functions usleep and nsleep instead. See also the "Sample Replacement Sleep Program" in Appendix H. "Using Signals and the IBM PE Programs".

With the exception of sleep, system and exec, the routines listed above set the system error indicator (the variable errno) to EINTR, which can be tested by the user's program. See the "Sample Replacement Select Program" in Appendix H. "Using Signals and the IBM PE Programs".

Normal file read and write are restarted automatically by AIX, and should not require any special treatment by the user.

The system and fork calls create a new task in which the interval timer is still running. If a fork is followed by an exec (which is what system does), the signal handler for the timer is overlaid, and the task is terminated when the interval timer expires.

To handle this for the system call, temporarily turn the interval timer off (using the alarm(0) call) before the call, and turn it on again (ualarm(500000, 500000) will do) after the system call.

To handle the interval timer for a forked child, merely turn off the interval timer via alarm(0) in the child.

There are other restrictions on fork described below.

Forks Are Limited

As described earlier, if a task forks, the forked child inherits the running timer. The timer should be turned off before forking another program. If the forked child does not exec another program, it should be aware that an atexit routine has been registered for the parent which is also inherited by the child. In most cases, the atexit routine will request POE to terminate the task (parent). A forked child should terminate with an _exit(0) system call to prevent the atexit routine from being called. Also, if the forked parent terminates before the child, the child task will not be cleaned up by POE.

A forked child must not call the message passing library.

Checkpoint/Restart Limitations

A user may initiate a checkpoint sequence from within a parallel MPI program by calling the MP_CHKPT function. All tasks in the parallel job must issue the call, which does not return until the checkpoint files have been created for all tasks. If the job subsequently fails and is restarted, the restart returns from the MP_CHKPT function with an indication that the parallel job has been restarted.

Programs using the signal handling (non-threaded) MPI library may be linked as a checkpointable executable, which is run as a LoadLeveler batch job. LoadLeveler Version 2.1 or later is required. Restrictions on the program follow:

For some processes, it is impossible to obtain or recreate the state of the process. For this reason, you should only checkpoint programs with states that are simple to checkpoint and recreate. A program that is long-running, computation-intensive, and does not fork any processes is an example of a job that is well-suited for checkpointing.
In order to prevent unpredictable results from occurring, checkpointing jobs should not use the following system services:
- Administrative (audit and swapqry, for example)
- Dynamic loading
- Forks
- Internal timers
- Messages
- Semaphores
- Set user ID or group ID
- Shared memory
- Signals
- Threads
Another limitation of checkpointing jobs is file I/O. Because individual write calls are not traced, the file recovery scheme requires that all I/O operations, when repeated, must yield the same result. A job that opens all files as read-only can be checkpointed. A job that writes to a file and then reads the data back can also be checkpointed. An example of I/O that could cause unpredictable results is: reading an area of a file, writing to it, and then reading the same area of the file again.

MPI Threaded Library Considerations

Programming in a threaded environment requires specific skills and considerations. The information in this subsection provides you with specific programming considerations when using POE and the MPI threaded library. It assumes you are familiar with POSIX threads in general, including mutexes, thread condition waiting, thread-specific storage, thread creation and termination.

POE Gets Control First And Handles Task Initialization

POE sets up its environment via the entry point mp_main_r(). mp_main_r() sets up signal handlers, initializes VT, and sets up an atexit routine before calling your main program.

Note: In the threaded library, message passing initialization takes place when MPI_INIT is called and not by mp_main_r. The threaded library and the signal-handling library differ significantly in this regard.

Language Bindings

The Fortran, C and C++ bindings for MPI are contained in the same library (libmpi_r.a) and can be freely intermixed.

Refer to "Fortran Considerations" for more information about running Fortran programs in a threaded environment.

MPI-IO Requires GPFS To Be Used Effectively

The subset implementation of MPI-IO provided in the threaded library depends on all tasks running on a single file system. IBM Generalized Parallel File System (GPFS) is able to present a single file system to all nodes of an SP. Shared file systems (NFS and AFS, for example) do not have the same rigorous management of file consistency when updates occur from more than one node.

MPI-IO can be used with most file systems as long as all tasks are on a single node. This single node approach may be useful in learning to use MPI-IO, but is not likely to be worthwhile in any production context.

Any production use of MPI-IO must be based on GPFS.

Use of AIX Signals

The threaded POE run-time environment creates a thread to handle the following asynchronous signals:

SIGQUIT
SIGPWR
SIGDANGER
SIGTSTP
SIGTERM
SIGHUP
SIGINT

A user signal handler must not be invoked to handle the above signals, which are handled by sigwait.

The following signals, which are used by MPI in the non-threaded library, are handled as described below.

SIGALRM

The threaded library does not use SIGALRM and long system calls such as sleep are not interrupted by the message passing library. For example, sleep runs its entire duration unless interrupted by a user-generated event.

SIGIO

PE blocks SIGIO before calling your program. SIGIO is used in the IP version of the library to notify you of an I/O event or the arrival of a message packet. This notification is enabled via the environment variable MP_CSS_INTERRUPT. If this environment variable is set to YES, the message packet arrival dispatches the interrupt service thread to process the packet.

The User Space version of the library receives notification of an arriving packet via an AIX kernel event and does not use SIGIO. You may unblock it or use sigwait to process SIGIO signals.

If you've registered a signal handler (via sigaction) for SIGIO before MPI_INIT is called, the function is added to the interrupt service thread and is executed each time the service thread is dispatched. Although registered as a signal handler, the function is not required to be signal safe because it is executed on a thread. You can use pthread calls to communicate with other threads. You cannot call MPI functions in this handler.

After MPI_FINALIZE is called, your signal handler is restored but you need to unblock SIGIO in order to receive subsequent SIGIO signals.

If you register or change the SIGIO signal handler after calling MPI_INIT, your changes are ignored by the MPI library but your changes are not undone by MPI_FINALIZE.

SIGPIPE

Neither the threaded nor the non-threaded IP library uses SIGPIPE. The threaded User Space library polls a variable set by the AIX kernel to determine if the switch has faulted and needs to be restarted. As a result, it does not use SIGPIPE.

Limitations In Setting The Thread Stacksize

The main thread stacksize is the same as the stacksize used for non-threaded applications. If you write your own MPI reduce functions to use with nonblocking collective communications or a SIGIO handler that will be executed on one of the library service threads, you are limited to a stacksize of 96KB by default. To increase your thread stacksize, use the environment variable MP_THREAD_STACKSIZE. For more information about the default and your ability to change the default, see the manpage for AIX_PTHREAD_SET_STACKSIZE.

Forks Are Limited

If a task forks, only the thread that forked exists in the child task. Therefore, the message passing library will not operate properly. Also, if the forked child does not exec another program, it should be aware that an atexit routine has been registered for the parent which is also inherited by the child. In most cases, the atexit routine requests that POE terminate the task (parent). A forked child should terminate with an _exit(0) system call to prevent the atexit routine from being called. Also, if the forked parent terminates before the child, the child task will not be cleaned up by POE.

A forked child MUST NOT call the message passing library.
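The advice to end a forked child with _exit can be illustrated with a small sketch. The atexit routine here (parent_cleanup) is a hypothetical stand-in for the one registered on the parent's behalf, and the pipe exists only so that the parent can observe whether the routine ran in the child. No message passing calls are made, in keeping with the rule above.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int notify_fd = -1;   /* write end of a pipe; used only for observation */

/* Hypothetical stand-in for an atexit routine registered for the parent. */
static void parent_cleanup(void)
{
    if (notify_fd >= 0 && write(notify_fd, "x", 1) < 0) {
        /* nothing useful can be done here */
    }
}

/* Fork a child that ends with _exit(0). Returns the number of bytes the
 * atexit routine wrote from the child (0 means it did not run), or -1
 * on error. */
int fork_child_without_atexit(void)
{
    static int registered = 0;
    int fds[2];
    int status = 0;
    char buf[8];
    ssize_t n;
    pid_t pid;

    if (!registered) {
        if (atexit(parent_cleanup) != 0)
            return -1;
        registered = 1;
    }
    if (pipe(fds) != 0)
        return -1;
    notify_fd = fds[1];

    pid = fork();
    if (pid < 0) {
        notify_fd = -1;
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0)
        _exit(0);   /* child: skips the inherited atexit routine */

    /* Parent: close the write end so that read() reports EOF once the
     * child has gone, then wait so the child does not outlive us. */
    notify_fd = -1;
    close(fds[1]);
    n = read(fds[0], buf, sizeof buf);
    close(fds[0]);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (int)n;
}
```

Had the child called exit(0) instead, parent_cleanup would have run in the child as well. The waitpid call reflects the point above that the parent should not terminate before the child.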

Standard I/O Requires Special Attention

When your program runs on the remote nodes, it has no controlling terminal. STDIN, STDOUT, and STDERR are always piped.

If your threaded MPI program processes STDIN from a large file on the home node, you must do one of the following:

Invoke MPI_Init() before performing any STDIN processing, or
Ensure that all STDIN has been processed (EOF) before invoking MPI_Init().

This requirement also applies to programs that may not explicitly use MPI.

If STDIN is piped (or redirected) to the poe binary (via ordinary pipes) and your application is linked with the threaded library, then handle STDIN in the following way:

If all of STDIN is read by your program before MPI_Init is called, set the environment variable MP_HOLD_STDIN=NO.
If none of STDIN is read before MPI_Init is called, set the environment variable MP_HOLD_STDIN=YES.
If STDIN is less than approximately 4000 bytes in length, set MP_HOLD_STDIN=NO.
If none of the above applies, it may not be possible to run your program correctly, and you will have to devise some other mechanism for providing data to your program.
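For the case in which all of STDIN is consumed before MPI_Init is called, one possible approach is to read the stream to EOF into memory first. The helper below is a sketch under that assumption; read_to_eof is a hypothetical name, and the 4096-byte starting size is arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a stream to EOF into a malloc'd buffer. On success, returns the
 * buffer (the caller frees it) and stores the byte count in *len.
 * Returns NULL on a read or allocation error. */
char *read_to_eof(FILE *in, size_t *len)
{
    size_t cap = 4096, used = 0, n;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    while ((n = fread(buf + used, 1, cap - used, in)) > 0) {
        used += n;
        if (used == cap) {              /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *len = used;
    return buf;
}
```

A threaded MPI program could call read_to_eof(stdin, &len) at the top of main, before MPI_Init, and then work from the in-memory copy. With STDIN piped or redirected to poe, this corresponds to the MP_HOLD_STDIN=NO case listed above.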

Thread-Safe Libraries

AIX provides thread-safe versions of some libraries, such as libc_r.a. However, not all libraries have a thread-safe version. It is your responsibility to determine whether the libraries you use can be safely called by more than one thread.

Program And Thread Termination

MPI_FINALIZE terminates the MPI service threads but does not affect user-created threads. Use pthread_exit to terminate any user-created threads, and exit(m) to terminate the main program (initial thread). The value of m is used to set POE's exit status as explained in "Exit Status".

Other Thread-Specific Considerations

Order Requirement For System Includes

For threaded programs, AIX requires that the system include <pthread.h> come first, with <stdio.h> and other system includes following it. <pthread.h> defines some conditional compile variables that modify the code generation of subsequent includes, particularly <stdio.h>. Note that <pthread.h> is not required unless your file uses thread-related calls or data.
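A source file that follows this include order might be laid out as in the sketch below. The functions format_greeting and run_greeting_thread are hypothetical and exist only to give the includes something to do.

```c
#include <pthread.h>   /* first: defines the conditional compile variables */
#include <stdio.h>     /* other system includes follow */
#include <string.h>

/* Hypothetical worker: formats a message into the caller's buffer. */
static void *format_greeting(void *arg)
{
    char *out = arg;

    snprintf(out, 32, "hello from a thread");
    return NULL;
}

/* Run the worker on its own thread. The buffer must hold at least
 * 32 bytes. Returns the message length, or -1 on error. */
int run_greeting_thread(char *out)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, format_greeting, out) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return (int)strlen(out);
}
```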

MPI_INIT

Call MPI_INIT once per task, not once per thread. MPI_INIT does not have to be called on the main thread, but MPI_INIT and MPI_FINALIZE must be called on the same thread.

MPI calls on other threads must adhere to the MPI standard in regard to the following:

A thread cannot make MPI calls until MPI_INIT has been called.
A thread cannot make MPI calls after MPI_FINALIZE has been called.
Unless there is a specific thread protocol programmed, you cannot rely on any specific order or speed of thread processing.

Collective Communications

Collective communications must meet the MPI standard requirement that all participating tasks execute collective communications on any given communicator in the same order. If collective communications calls are made on multiple threads, it is your responsibility to ensure the proper sequencing or to use distinct communicators.

Support for M:N Threads

By default, user threads are created with process contention scope, and M user threads are mapped to N kernel threads. The values of the ratio M:N and the default contention scope are settable by AIX environment variables. The service threads created by MPI, POE, and LAPI have system contention scope, that is, they are mapped 1:1 to kernel threads.

For PSSP 2.3 and 2.4, you must create system contention scope threads. For PSSP 3.1, you can create process contention scope threads, but any such thread will be converted to a system contention scope thread when it makes its first MPI call.
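Requesting system contention scope for a user thread is done through the thread attribute object. The following is a minimal sketch using standard pthread calls; scope_worker and create_system_scope_thread are hypothetical names, and the worker body is left empty where a real thread would make its MPI calls.

```c
#include <pthread.h>   /* kept first, as AIX requires for threaded code */
#include <stddef.h>

/* Hypothetical worker; a real thread would make its MPI calls here. */
static void *scope_worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Create and join a thread that is mapped 1:1 to a kernel thread.
 * Returns the contention scope recorded in the attribute object,
 * or -1 on error. */
int create_system_scope_thread(void)
{
    pthread_attr_t attr;
    pthread_t tid;
    int scope = -1;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM) != 0 ||
        pthread_attr_getscope(&attr, &scope) != 0 ||
        pthread_create(&tid, &attr, scope_worker, NULL) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }
    pthread_attr_destroy(&attr);
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return scope;
}
```

Setting the scope explicitly avoids depending on the default contention scope, which the text above notes can be changed through AIX environment variables.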

Fortran Considerations

The information in this subsection provides you with some specific programming considerations for when you are using POE and the Fortran compiler.

Fortran 90 and MPI

Incompatibilities exist between Fortran 90 and MPI which may affect the ability to use such programs. Refer to the information in

/usr/lpp/ppe.poe/samples/mpif90/README.mpif90

for further details. PE, Version 2, Release 2 provided the header file mpif90.h for use with Fortran 90. The file is still available in PE, Version 2, Release 4, but should not be used by new code. The mpif.h header file is formatted to work with either mpxlf90 or mpxlf compilation.

Fortran and Threads

Version 5 of the AIX XLF Fortran compiler supports threads.

Version 4.1 of the AIX XLF Fortran compiler is not thread-safe. However, XLF Version 4.1.0.1 provides a partial thread-support XLF runtime library. It supports multi-threaded applications that have one Fortran thread. Be sure you thoroughly test such use.

The partial thread-support library is libxlf90_t.a and is installed as /usr/lib/libxlf90_t.a. When you use the mpxlf_r command, this library is included automatically.

Restrictions

When you use libxlf90_t.a, the following restrictions apply. Because of them, only one Fortran thread in a multi-threaded application may use the library.

Routines in the library are not thread-reentrant.
Use of routines in the math library (libm.a) by more than one thread may produce unpredictable results.
