This chapter describes the Parallel Operating Environment (POE). POE is a simple and friendly environment designed to ease the transition from serial to parallel application development and execution. POE lets you develop and run parallel programs using many of the same methods and mechanisms as you would for serial jobs, and allows you to continue to use the standard UNIX and AIX application development and execution techniques with which you are already familiar. For example, you can redirect input and output, pipe the output of programs into more or grep, write shell scripts to invoke parallel programs, and use shell tools such as history - all in just the same way you would for serial programs. So while the concepts and approach to writing parallel programs must necessarily be different, POE makes your working environment as familiar as possible.
This chapter describes the steps involved in compiling and executing your parallel C, C++, or Fortran programs using either an IBM RS/6000 SP, an RS/6000 network cluster, or a mixed system.
This section discusses how to compile and execute your parallel C, C++, or Fortran programs. It leaves out the first step in any application's life cycle - actually writing the program. For information on writing parallel programs, refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference, IBM Parallel Environment for AIX: MPL Programming and Subroutine Reference, IBM Parallel Environment for AIX: Hitchhiker's Guide, and IBM Parallel System Support Programs for AIX: Command and Technical Reference.
Note: | If you are using POE for the first time, check that you have authorized access. See IBM Parallel Environment for AIX: Installation for information on setting up users. |
In order to execute an MPI, MPL, or LAPI parallel program, you need to:
As with a serial application, you must compile a parallel C, C++, or Fortran program before you can run it. Instead of using the cc, xlC, or xlf commands, however, you use the commands mpcc, mpCC, or mpxlf. The mpcc, mpCC, and mpxlf commands not only compile your program, but also link in the Partition Manager and message passing interface libraries. When you later invoke the program, the subroutines in these libraries enable the home node Partition Manager to communicate with the parallel tasks, and the tasks with each other. To compile threaded C, C++, or Fortran programs, use the mpcc_r, mpCC_r, or mpxlf_r commands. These commands can also be used to compile non-threaded programs with the threaded libraries.
To compile programs with the checkpoint/restart capability, use the mpcc_chkpt, mpCC_chkpt, or mpxlf_chkpt commands. See IBM Parallel Environment for AIX: Hitchhiker's Guide for an overview of checkpointing and restarting POE programs. For specific details, see the section later in this chapter, "Checkpointing and Restarting Programs".
These compiler commands are actually shell scripts which call the appropriate compiler. You can use any of the cc, xlC, or xlf flags on these commands.
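For instance, because the standard compiler flags pass straight through to the underlying compiler, you might compile an optimized, non-threaded C program as follows (the -O optimization flag shown here is purely illustrative):

mpcc -O myprog.c -o myprog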
The following table shows what to enter to compile a program depending on
the language in which it is written. For more information on these
commands, see Appendix A. "Parallel Environment Commands".
To: | Enter: |
---|---|
Compile a C program. A communication subsystem library implementation will be dynamically linked when the executable is invoked. | mpcc program.c -o program |
Compile a C++ program. A communication subsystem library implementation will be dynamically linked when the executable is invoked. | mpCC program.C -o program |
Compile a Fortran program. A communication subsystem library implementation will be dynamically linked when the executable is invoked. | mpxlf program.f -o program |
Compile a C program which uses threaded MPI. A communication subsystem library implementation will be dynamically linked when the executable is invoked. | mpcc_r program.c -o program |
Compile a C++ program which uses threaded MPI. A communication subsystem library implementation will be dynamically linked when the executable is invoked. | mpCC_r program.C -o program |
Compile a Fortran program which uses threaded MPI. A communication subsystem library implementation will be dynamically linked when the executable is invoked. | mpxlf_r program.f -o program |
Notes:
In general, to create a static executable, first compile your source file against the POE include directory. For example:
cc -c myprog.c -I/usr/lpp/ppe.poe/include
The following table shows how you create a C, C++, or Fortran static executable for IP or US.
Note: | When you see ld, l represents a lower case L. When you see -bI, I represents an upper case i. |
To: | For IP, Enter: | For US (SP only), Enter: |
---|---|---|
Create a C or C++ static executable | ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -lmpci -lmpi -lvtd -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip | ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpci -lmpi -lvtd -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us |
Create a Fortran static executable | ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -lmpci -lmpi -lvtd -lxlf90 -lxlf -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip | ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpci -lmpi -lvtd -lxlf90 -lxlf -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us |
Create a C or C++ static executable which uses threaded MPI | ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -lmpi_r -lvtd_r -lc_r -lppe_r -lpthreads -lmpci_r -lc /usr/lib/libc.a -bI:/lib/threads.exp -bI:/lib/syscalls.exp -L/usr/lib/threads -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip | ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpi_r -lvtd_r -lc_r -lppe_r -lpthreads -lmpci_r -lc /usr/lib/libc.a -bI:/lib/threads.exp -bI:/lib/syscalls.exp -L/usr/lib/threads -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us |
Create a Fortran static executable which uses threaded MPI | ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -lmpci_r -lmpi_r -lvtd_r -lxlf90_r -lc_r -lppe_r -lc -lpthreads /usr/lib/libc.a -bI:/lib/syscalls.exp -bI:/lib/threads.exp -L/usr/lib/threads -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip | ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpci_r -lmpi_r -lvtd_r -lxlf90_r -lc_r -lppe_r -lc -lpthreads /usr/lib/libc.a -bI:/lib/syscalls.exp -bI:/lib/threads.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us |
Notes:
If the program you are running is in a shared file system, the Partition Manager loads a copy of your executable in each processor node in your partition when you invoke a program. If your executable is in a private file system, however, you must copy it to the nodes in your partition. If you plan to use the parallel debugger pdbx, you must copy your source files to all nodes as well. You can easily copy files to nodes using the mprcp command. All you do is pass mprcp the name of the host list file you are using to define your partition and the absolute path name of the file.
For example, to send a copy of program to all the processor nodes listed in host.list in your current directory:
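Assuming, for illustration, that the executable's absolute path name is /u/username/program (a hypothetical location), you would enter:

mprcp host.list /u/username/program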
* The mprcp command copies program to each of the nodes listed in host.list using the rcp command. If a program of the same name already exists, the mprcp command will overwrite it.
For more information on the rcp command, refer to IBM AIX Version 4 Commands Reference. For more information on the mprcp command, see Appendix A. "Parallel Environment Commands".
You can also copy your executable to each node with the mcp command. There is an advantage in using mcp over mprcp in that mcp copies large programs faster. mcp uses the message passing facilities of the Parallel Environment to copy a file from a file system on the home node to a remote node file system. For example, assume that your executable program is on a mounted file system (/u/edgar/somedir/myexecutable), and you want to make a private copy in /tmp on each node in host.list.
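A representative invocation might look like the following; the exact argument form and the use of POE flags such as -procs should be checked against the mcp entry in Appendix A. "Parallel Environment Commands", and the task count shown is illustrative:

mcp /u/edgar/somedir/myexecutable /tmp/myexecutable -procs 6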
Note: | If you load your executable from a mounted file system, you may experience an initial delay while the program is being initialized on all nodes. You may experience this delay even after the program begins executing, because individual pages of the program are brought in on demand. This is particularly apparent during initialization of the message passing interface; since individual nodes are synchronized, there are simultaneous demands on the network file transfer system. You can minimize this delay by copying the executable to a local file system on each node, using the mcp message passing file copy program. |
This step contains the following sections:
Before invoking your program, you need to set up your execution environment. There are a number of POE environment variables discussed throughout this book and summarized in Appendix B. "POE Environment Variables and Command-Line Flags". Any of these environment variables can be set at this time to later influence the execution of parallel programs. This step covers those environment variables most important for successful invocation of a parallel program. When you invoke a parallel program, your home node Partition Manager checks these environment variables to determine:
For specific node allocation, the Partition Manager reads an explicit list of nodes contained in a host list file you create. If you are using an RS/6000 network cluster, or if you are using a mixed system and want to include nodes not on the SP system, you must use this method of node allocation.
For non-specific node allocation, you give the Partition Manager the name or number of a LoadLeveler pool, or the number of an SP system pool. A pool name or number may also be provided in a host list file when using LoadLeveler, or a list of SP system pools may be provided if using the Resource Manager. The Partition Manager then connects to LoadLeveler or the SP system Resource Manager, which allocates nodes from the specified pool(s) for you.
There are five separate environment variables that, collectively, determine how nodes are allocated by the Partition Manager. While these are the only ones you must set to allocate nodes, keep in mind that there are many other environment variables you can set. These are summarized in Appendix B. "POE Environment Variables and Command-Line Flags", and control such things as standard I/O handling and VT trace file generation. The environment variables for node allocation are MP_HOSTFILE, MP_RESD, MP_EUILIB, MP_EUIDEVICE, and MP_RMPOOL.
Notes:
The remainder of this step consists of sub-steps describing how to set each of these environment variables, and how to create a host list file. Depending on the hardware and message passing library you are using, and the method of node allocation you want, some of the sub-steps that follow may not apply to you. For this reason, pay close attention to the task variant tables at the beginning of many of the sub-steps. They will tell you whether or not you need to perform the sub-step.
For further clarification, the following tables summarize the procedure for
determining how nodes are allocated. The tables describe the possible
methods of node allocation available to you, what each environment variable
must be set to, and whether or not you need to create a host list file.
To make the procedure of setting up the execution environment easier and less
prone to error, you may eventually wish to create a shell script which
automates some of the environment variable settings. To allocate nodes
of an SP system, see Table 1. If you are using an RS/6000 network cluster, or if you are using a
mixed system and want to allocate nodes not on the SP system, see Table 2.
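As a purely illustrative sketch (the values and the file name are placeholders, assuming non-specific node allocation from a single pool over the US library), such a script might contain:

#!/bin/ksh
# poeenv.ksh - sample POE execution environment setup (illustrative values)
export MP_HOSTFILE=NULL      # no host list file; use non-specific allocation
export MP_RESD=yes           # let the job management system allocate nodes
export MP_EUILIB=us          # User Space communication subsystem
export MP_EUIDEVICE=css0     # high performance switch (ignored when MP_EUILIB is us)
export MP_RMPOOL=1           # pool to allocate from (site-specific)
export MP_PROCS=6            # number of parallel tasks

Because the script must change the environment of your current shell, you would run it with the shell's dot command (. poeenv.ksh) rather than executing it as a separate process.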
Table 1. Execution Environment Setup Summary (for an SP system)
 | If you want to use the US communication subsystem library for communication among parallel tasks, and you want non-specific node allocation from a single pool: | If you want to use the US communication subsystem library, and you want specific node allocation or non-specific node allocation from more than one pool: | If you want to use the IP communication subsystem library for communication among parallel tasks, and you want non-specific node allocation from a single pool: | If you want to use the IP communication subsystem library, and you want specific node allocation or non-specific node allocation (using the Resource Manager) from more than one pool: |
---|---|---|---|---|
A host list file... | not required. | required. | not required. | required. |
MP_HOSTFILE | should be set to an empty string ("") or the word "NULL" | should be set to the name of your host list file. If not set, the host list file is assumed to be host.list in the current directory. | should be set to an empty string ("") or the word "NULL" | should be set to the name of your host list file. If not set, the host list file is assumed to be host.list in the current directory. |
MP_RESD | should be set to yes. If set to an empty string (""), or if not set, the Partition Manager assumes MP_RESD is yes. | should be set to yes. If set to an empty string (""), or if not set, the Partition Manager assumes MP_RESD is yes. | should be set to yes. If set to an empty string (""), or if not set, the Partition Manager assumes MP_RESD is yes. | should be set to yes. If set to an empty string (""), the Partition Manager assumes MP_RESD is no. |
MP_EUILIB | us | us | ip | ip |
MP_EUIDEVICE | css0 (the high performance switch). However, the actual value is ignored when MP_EUILIB is set to us. | css0 (the high performance switch). However, the actual value is ignored when MP_EUILIB is set to us. | should specify the adapter type. A valid, case-sensitive value is css0 (the high performance switch). The MP_EUIDEVICE value is only used when the value of MP_EUILIB is ip. | should specify the adapter type. A valid, case-sensitive value is css0 (the high performance switch). The MP_EUIDEVICE value is only used when the value of MP_EUILIB is ip. |
MP_RMPOOL | should be set to the name or number of a LoadLeveler pool, or the number of an SP system pool. It must be used if you are not using a host list file. | is ignored if you are using a host list file. | should be set to the name or number of a LoadLeveler pool, or the number of an SP system pool. It must be used if you are not using a host list file. | is ignored if you are using a host list file. |
Table 2. Execution Environment Setup Summary (for RS/6000 Network Cluster or Mixed System)
The following table shows how nodes will be allocated depending on the
value of the environment variables discussed in this step. It is
provided here for additional illustration. Refer to it in situations
when the environment variables are set in patterns other than those suggested
in Table 1 and Table 2.
Table 3. Node Allocation Summary
The value of MP_EUILIB is: | The value of MP_RESD is: | Your host list file contains a list of: | The allocation mode will be: | The communication subsystem library implementation used will be: | The message passing address used will be: |
---|---|---|---|---|---|
ip | - | nodes | Node_List | IP | Nodes |
ip | - | pools | RM_List | IP | MP_EUIDEVICE |
ip | - | NULL | RM | IP | MP_EUIDEVICE |
ip | yes | nodes | RM_List | IP | MP_EUIDEVICE |
ip | yes | pools | RM_List | IP | MP_EUIDEVICE |
ip | yes | NULL | RM | IP | MP_EUIDEVICE |
ip | no | nodes | Node_List | IP | Nodes |
ip | no | pools | Error | - | - |
ip | no | NULL | Error | - | - |
us | - | nodes | RM_List | US | N/A |
us | - | pools | RM_List | US | N/A |
us | - | NULL | RM | US | N/A |
us | yes | nodes | RM_List | US | N/A |
us | yes | pools | RM_List | US | N/A |
us | yes | NULL | RM | US | N/A |
us | no | nodes | Error | - | - |
us | no | pools | Error | - | - |
us | no | NULL | Error | - | - |
- | - | nodes | Node_List | IP | Nodes |
- | - | pools | RM_List | IP | MP_EUIDEVICE |
- | - | NULL | RM | US | N/A |
- | yes | nodes | RM_List | US | N/A |
- | yes | pools | RM_List | US | N/A |
- | yes | NULL | RM | US | N/A |
- | no | nodes | Node_List | IP | Nodes |
- | no | pools | Error | - | - |
- | no | NULL | Error | - | - |
Before you execute a program, you need to set the size of the
partition. To do this, use the MP_PROCS environment variable
or its associated command-line flag -procs. For example, say
you want to specify the number of task processes as 6. You
could:
Set the MP_PROCS environment variable: | Use the -procs flag when invoking the program: |
---|---|
export MP_PROCS=6 | poe program -procs 6 |
Invoking parallel programs is discussed in more detail in "Step 5: Invoke the Executable".
Notes:
See "Step 3i: Set the MP_RMPOOL Environment Variable" for more details.
If all nodes to be used for the parallel job exist in a PSSP 2.3.0 or 2.4.0 partition, the SP_NAME environment variable should be set to the name of the control workstation of the SP system on which these nodes exist. This is the only case that results in POE contacting the Resource Manager rather than LoadLeveler for node allocation requests.
You need to create a host list file if: | You do not need to create a host list file if: |
---|---|
you want specific node allocation, you want non-specific node allocation from more than one pool, or you are using an RS/6000 network cluster or a mixed system and want to allocate nodes not on the SP system. | you are using a LoadLeveler cluster or an SP system and want non-specific node allocation from a single pool. |
A host list file specifies the processor nodes on which the individual tasks of your program should run. When you invoke a parallel program, your Partition Manager checks to see if you have specified a host list file. If you have, it reads the file to allocate processor nodes. The procedure for creating a host list file differs depending on whether you are using an RS/6000 network cluster, a LoadLeveler cluster, an SP system, or a mixed system. If you are using an RS/6000 network cluster, see "Creating a Host List File to Allocate Nodes of a Cluster". If you are using a LoadLeveler cluster, an SP system, or a mixed system, see "Creating a Host List File to Allocate Nodes of an SP System".
If you are using an RS/6000 network cluster, a host list file simply lists a series of host names - one per line. These must be the names of remote nodes accessible from the home node. Lines beginning with an exclamation point (!) or asterisk (*) are comments. The Partition Manager ignores blank lines and comments. The host list file can list more names than are required by the number of program tasks. The additional names are ignored.
To understand how the Partition Manager uses a host list file to determine the nodes on which your program should run, consider the following example host list file:
! Host list file for allocating 6 tasks
* An asterisk may also be used to indicate a comment

host1_name
host2_name
host3_name
host4_name
host5_name
host6_name
The Partition Manager ignores the first two lines because they are comments, and the third line because it is blank. It then allocates host1_name to run task 0, host2_name to run task 1, host3_name to run task 2, and so on. If any of the processor nodes listed in the host list file are unavailable when you invoke your program, the Partition Manager returns a message stating this and does not run your program.
You can also have multiple tasks of a program share the same node by simply listing the same node multiple times in your host list file. For example, say your host list file contains the following:
host1_name
host2_name
host3_name
host1_name
host2_name
host3_name
Tasks 0 and 3 will run on host1_name, tasks 1 and 4 will run on host2_name, and tasks 2 and 5 will run on host3_name.
If you are using a LoadLeveler cluster or SP system, you can use a host list file for either specific node allocation or non-specific node allocation.
In either case, the host list file can contain a number of records - one per line. For specific node allocation, each record indicates a processor node. For non-specific node allocation using LoadLeveler, you can request nodes from one pool only; when using the Resource Manager, each record indicates an SP system pool. Your host list file cannot contain a mixture of node and pool requests, so you must use one method or the other. The host list file can contain more records than required by the number of program tasks. The additional records are ignored.
For specific node allocation, each record is either a host name or the IP adapter address of a specific processor node of the SP system. If you are using a mixed system and want to allocate nodes not on the SP system, you must request them by host name. Lines beginning with an exclamation point (!) or asterisk (*) are comments. The Partition Manager ignores blank lines and comments.
To understand how the Partition Manager uses a host list file to determine the SP system nodes on which your program should run, consider the following representation of a host list file.
! Host list file for allocating 6 tasks

host1_name
host2_name
host3_name
9.117.8.53
9.117.8.53
9.117.8.53
The Partition Manager ignores the first line because it is a comment, and the second because it is blank. It then allocates host1_name to run task 0, host2_name to run task 1, host3_name to run task 2, and so on. The last three nodes are requested by adapter IP address using dot decimal notation.
Notes:
After installation of a LoadLeveler cluster or SP system, your system administrator divides its processor nodes into a number of pools. With LoadLeveler, each pool has an identifying pool name or number. With an SP system, each pool has an identifying pool number. Using LoadLeveler for non-specific node allocation, you need to supply the appropriate pool name or number. LoadLeveler does not use more than one pool. Using Resource Manager for non-specific node allocation from a number of pools, you need to supply the appropriate pool numbers.
If you require information about LoadLeveler pools, use the command llstatus. To use llstatus on a workstation that is external to the LoadLeveler cluster, the LoadL.so fileset must be installed on the external node (see Using and Administering LoadLeveler for more information).
* LoadLeveler lists information about pools in the LoadLeveler cluster.
If you require information about SP system pools, use the command jm_status. To use jm_status on a workstation that is external to the SP system, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information).
* The Resource Manager lists information about all SP system pools.
With regard to LoadLeveler, in a host list file intended for non-specific node allocation, each record is a pool name or number preceded by an at symbol (@). Lines beginning with an exclamation point (!) or asterisk (*) are comments. The Partition Manager ignores blank lines and comments.
To understand how the Partition Manager uses a host list file for non-specific node allocation, consider the following example host list file:
! Host list file for allocating 3 tasks with LoadLeveler

@6
@6
@6
The Partition Manager ignores the first line because it is a comment, and the second line because it is blank. The at (@) symbols tell the Partition Manager that these are pool requests. It connects to LoadLeveler to request three nodes from pool 6.
With regard to the Resource Manager only, in a host list file intended for non-specific node allocation from a number of pools, each record is a pool number preceded by an at symbol (@). Lines beginning with an exclamation point (!) or asterisk (*) are comments. The Partition Manager ignores blank lines and comments.
To understand how the Partition Manager uses a host list file for non-specific node allocation from a number of pools, consider the following example host list file:
! Host list file for allocating 6 tasks with the Resource Manager

@6
@6
@6
@12
@12
@12
The Partition Manager ignores the first line because it is a comment, and the second line because it is blank. The at (@) symbols tell the Partition Manager that these are pool requests. It connects to the SP system Resource Manager to request three nodes from pool 6, and three nodes from pool 12.
Notes:
When requesting nodes of an SP system, you can optionally request how each node's resources - its adapter and CPU - should be used. You can specify how the node's adapter should be used (dedicated or shared), how the node's CPU should be used (unique or multiple), or both.
Note: | When using LoadLeveler, you can request how nodes are used with the MP_CPU_USE and/or MP_ADAPTER_USE environment variables, or their associated command line options. Usage specification in a host list file will be ignored when using LoadLeveler. |
With regard to the Resource Manager, on each record of the host list file, you can make either or both of the specifications listed above. For example, if you wanted your program task to have exclusive use of both the adapter and CPU, the host list record would be:
host1_name dedicated unique
or
host1_name d u
This is the same for pool requests:
@6 dedicated unique
or
@6 d u
The environment variables MP_ADAPTER_USE and MP_CPU_USE, or the associated command line options (-adapter_use and -cpu_use) can be used to make either or both of these specifications. These specifications will then affect the resource usage for each node allocated from the pool specified using MP_RMPOOL or -rmpool. For example, if you wanted nodes from Resource Manager pool 5, and you wanted your program to have exclusive use of both the adapter and CPU, the following command line could be used:
poe [program] -rmpool 5 -adapter_use d[edicated] -cpu_use u[nique] [more_poe_options]
Associated environment variables (MP_RMPOOL, MP_ADAPTER_USE, MP_CPU_USE) could also be used to specify any or all of the options in this example.
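The equivalent setup using the environment variables alone might look like the following; the values are the same illustrative ones as in the command line above, and the full-word settings follow the d[edicated] and u[nique] forms shown there:

export MP_RMPOOL=5
export MP_ADAPTER_USE=dedicated
export MP_CPU_USE=unique
poe [program] [more_poe_options]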
The following tables illustrate how node resources are used. Table 4 shows the default settings for adapter and CPU use, while Table 5 outlines how the two separate specifications determine how the allocated
node's resources are used.
Table 4. Adapter/CPU Default Settings
 | Adapter | CPU |
---|---|---|
If host list file contains non-specific pool requests: | Dedicated | Unique |
If host list file requests specific nodes: | Shared (1) | Multiple |
If a host list file is not used: | Dedicated (2) | Unique (3) |
1 For US jobs, adapter is dedicated. 2 For IP jobs, adapter is shared. 3 For IP jobs, CPU is multiple.
Table 5. Adapter/CPU Use under LoadLeveler
| If the Node's CPU is "Unique": | If the Node's CPU is "Multiple": |
---|---|---|
If the adapter use is "Dedicated": | Intended for production runs of high performance applications. Only the tasks of that parallel job use the adapter and CPU. | The adapter you specified with MP_EUIDEVICE is dedicated to the tasks of your parallel job. However, you and other users still have access to the CPU through another adapter. |
If the adapter use is "Shared": | Only your program tasks have access to the node's CPU, but other program's tasks can share the adapter. | Both the adapter and CPU can be used by a number of your program's tasks and other users. |
Table 6. Adapter/CPU Use under the Resource Manager
| If the Node's CPU is "Unique": | If the Node's CPU is "Multiple": |
---|---|---|
If the adapter use is "Dedicated": | Intended for production runs of high performance applications. Only one task uses the adapter and CPU. | The adapter you specified with MP_EUIDEVICE is dedicated to your program task. However, you and other users still have access to the CPU through another adapter. |
If the adapter use is "Shared": | Only you have access to the node's CPU, but a number of your program's tasks can share the adapter. | Both the adapter and CPU can be used by a number of your program's tasks and other users. |
Notes:
When running parallel programs in a LoadLeveler cluster or on an SP system, you can generate an output host list file of the
nodes allocated by LoadLeveler or the Resource Manager. When you have LoadLeveler or the Resource Manager perform non-specific node allocation from SP
system pools, this enables you to learn which nodes were allocated.
This information is vital if you want to perform some postmortem analysis or
file cleanup on those nodes, or if you want to rerun the program using the
same nodes. To generate a host list file, set the
MP_SAVEHOSTFILE environment variable to a file name.
You can specify this using a relative or full path name. As with most
POE environment variables, you can temporarily override the value of
MP_SAVEHOSTFILE using its associated command-line flag
-savehostfile. For example, to save LoadLeveler's or the Resource
Manager's node allocation into a file called
/u/hinkle/myhosts, you could:
Set the MP_SAVEHOSTFILE environment variable: | Use the -savehostfile flag when invoking the program: |
---|---|
export MP_SAVEHOSTFILE=/u/hinkle/myhosts | poe program -savehostfile /u/hinkle/myhosts |
Each record in the output host list file will be the original non-specific pool request. Following each record will be comments indicating the specific node that was allocated. The specific node is identified by:
For example, using LoadLeveler, say the input host list file contains the following records:
@mypool
@mypool
@mypool
The following is a representation of the output hostlist file.
host1_name ! 9.117.11.47 9.117.8.53 !@mypool
host1_name ! 9.117.11.47 9.117.8.53 !@mypool
host1_name ! 9.117.11.47 9.117.8.53 !@mypool
Using the Resource Manager, say the input host list file contains the following records:
@6
@6
@6
@12
@12
@12
The following is a representation of the output hostlist file.
host1_name dedicated unique ! 9.117.11.47 9.117.8.53 !@6
host2_name dedicated unique ! 9.117.11.47 9.117.8.53 !@6
host3_name dedicated unique ! 9.117.11.47 9.117.8.53 !@6
host4_name dedicated unique ! 9.117.11.47 9.117.8.53 !@12
host5_name dedicated unique ! 9.117.11.47 9.117.8.53 !@12
host6_name dedicated unique ! 9.117.11.47 9.117.8.53 !@12
Note: | The name of your output host list file can be the same as your input host list file. If a file of the same name already exists, it is overwritten by the output host list file. |
You need to set the MP_HOSTFILE environment variable if: | You do not need to set the MP_HOSTFILE environment variable if: |
---|---|
your host list file is not the default ./host.list, or you need to set MP_HOSTFILE to an empty string ("") or the word "NULL" because you are not using a host list file at all. | your host list file is the default ./host.list. |
The default host list file used by the Partition Manager to allocate nodes
is called host.list and is located in your current
directory. You can specify a file other than
host.list by setting the environment variable
MP_HOSTFILE to the name of a host list file, or by using either the
-hostfile
or -hfile flag when
invoking the program. In either case, you can specify the file using
its relative or full path name. For example, say you want to use the
host list file myhosts located in the directory
/u/hinkle. You could:
Set the MP_HOSTFILE environment variable: | Use the -hostfile flag when invoking the program: |
---|---|
export MP_HOSTFILE=/u/hinkle/myhosts | poe program -hostfile /u/hinkle/myhosts |
If you are using LoadLeveler or the SP system Resource Manager for non-specific node allocation
from a single pool specified by MP_RMPOOL, and a host list file exists in the current directory, you must set MP_HOSTFILE to an empty string or to the
word "NULL". Otherwise the Partition Manager uses the host list file. You can either:
Set the MP_HOSTFILE environment variable: | Use the -hostfile flag when invoking the program: |
---|---|
export MP_HOSTFILE=NULL | poe program -hostfile NULL |
To indicate whether a job management system should be used, you set the MP_RESD environment variable to yes or no. As specified in Table 1 and Table 2, MP_RESD controls whether or not the Partition Manager connects to LoadLeveler or the Resource Manager to allocate processor nodes.
If you are allocating nodes that are not part of a LoadLeveler cluster, MP_RESD should be set to no. If MP_RESD is set to yes, only nodes within the LoadLeveler cluster are allocated.
If you are allocating nodes of an RS/6000 network cluster, you do not have a job management system and should set MP_RESD to no. If you are using a mixed system, you may set MP_RESD to yes. However, the job management system only has knowledge of SP system nodes. To allocate any of the additional RS/6000 processors which supplement the SP system nodes in a mixed system, you must also use a host list file.
As with most POE environment variables, you can temporarily override the
value of MP_RESD using its associated command-line flag
-resd. For example, to specify that you want the
Partition Manager to connect to the Resource Manager, you could:
Set the MP_RESD environment variable: | Use the -resd flag when invoking the program: |
---|---|
export MP_RESD=yes | poe program -resd yes |
You can also set MP_RESD to an empty string. If set to an
empty string, or if not set, the default value of MP_RESD is
interpreted as yes or no depending on the
context. Specifically, the value of MP_RESD will be
determined by the value of MP_EUILIB and whether or not you are
using a host list file. The following table shows how the context
determines the value of MP_RESD.
MP_EUILIB setting | and you are using a host list file: | and you are not using a host list file: |
---|---|---|
If MP_EUILIB is set to ip, an empty string, the word "NULL", or if not set: | MP_RESD is interpreted as no by default, unless host list file includes pool requests. | MP_RESD is interpreted as yes by default. |
If MP_EUILIB is set to us: | MP_RESD is interpreted as yes by default. | MP_RESD is interpreted as yes by default. |
Notes:
During execution, the tasks of your program can communicate via calls to message passing routines. The message passing routines in turn call communication subsystem library routines which enable the processor nodes to exchange the message data. Before you invoke your program, you need to decide which communication subsystem library implementation you wish to use - the Internet Protocol (IP) communication subsystem or the User Space (US) communication subsystem.
The MP_EUILIB environment variable, or its associated
command-line flag -euilib, is used to indicate which
communication subsystem library implementation you are using. POE
needs to know which communication subsystem implementation to dynamically link
in as part of your executable when you invoke it. The following table
shows the appropriate setting for MP_EUILIB depending on the
communication subsystem library implementation you want and whether or not it
has already been statically linked.
 | and you want it dynamically linked when you invoke your program: |
---|---|
If you want the IP communication subsystem or US communication subsystem: | MP_EUILIB should be set to ip or us. This specification is case-sensitive. |
For example, say you want the US communication subsystem library dynamically linked in when you invoke your program. You could:
Set the MP_EUILIB environment variable: | Use the -euilib flag when invoking the program: |
---|---|
export MP_EUILIB=us | poe program -euilib us |
You can also use the MP_EUILIBPATH environment variable, or its associated command-line flag -euilibpath, to specify an alternate path to the communication subsystem libraries. For example, to have POE look for the library under a (hypothetical) directory /usr/altlib, you could:
Set the MP_EUILIBPATH environment variable: | Use the -euilibpath flag when invoking the program: |
---|---|
export MP_EUILIBPATH=/usr/altlib | poe program -euilibpath /usr/altlib |
The expected library for loading the communication subsystem library implementation is in directory /usr/lpp/ppe.poe/lib/$MP_EUILIB. Setting the MP_EUILIBPATH environment variable causes POE to try to load the communication subsystem library from the directory $MP_EUILIBPATH/$MP_EUILIB. If the communication subsystem library (libmpci.a) is not in the requested path, it will be loaded from the library path for the IP communication subsystem library implementation used when the program was compiled - $MP_PREFIX/ppe.poe/lib/ip. MP_PREFIX can also be set by the user, but is normally /usr/lpp. Thus the default library path is normally /usr/lpp/ppe.poe/lib/ip, provided the library is not specified by the MP_EUILIB and/or MP_EUILIBPATH environment variables.
You need to set the MP_EUIDEVICE environment variable if: | You do not need to set the MP_EUIDEVICE environment variable if: |
---|---|
you have set the MP_EUILIB environment variable to ip, and are using LoadLeveler or the Resource Manager. | you have set the MP_EUILIB environment variable to us. The Partition Manager assumes that MP_EUIDEVICE is css0 - the high performance communication adapter. |
If you are using the IP communication subsystem library implementation for
communication among parallel tasks on an SP system, you can specify which
adapter set to use for message passing - either Ethernet, FDDI,
token-ring, or a high performance switch. The MP_EUIDEVICE
environment variable and its associated command-line flag
-euidevice are used to select an
alternate adapter set for communication among processor nodes. If
neither the MP_EUIDEVICE environment variable nor the -euidevice flag is
set, the communication subsystem library uses the external IP address of each
remote node. The following table shows the possible, case-sensitive,
settings for MP_EUIDEVICE.
Setting the MP_EUIDEVICE environment variable to: | Selects: |
---|---|
en0 | The Ethernet adapter |
fi0 | The FDDI adapter |
tr0 | The token-ring adapter |
css0 | The high performance switch adapter |
For example, say you want to use IP over the high performance
switch. The nodes have been initialized for IP as described in IBM
Parallel System Support Programs for AIX: Installation and Migration
Guide, and you have already set the MP_EUILIB environment variable to
ip. To specify the high performance switch, you could:
Set the MP_EUIDEVICE environment variable: | Use the -euidevice flag when invoking the program: |
---|---|
export MP_EUIDEVICE=css0 | poe program -euidevice css0 |
Notes:
The MP_MSG_API environment variable, or its
associated command line option, is used to indicate to POE which message
passing API is being used by the parallel tasks.
You need to set the MP_MSG_API environment variable if: | You do not need to set the MP_MSG_API environment variable if: |
---|---|
A parallel task is using LAPI alone or in conjunction with MPI. | A parallel task is using MPI only. |
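For example, if the tasks of your job use LAPI alone, you might set the variable as shown below; the exact value strings accepted by MP_MSG_API should be confirmed in Appendix B. "POE Environment Variables and Command-Line Flags", and lapi here is an assumption:

export MP_MSG_API=lapi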
You need to set the MP_RMPOOL environment variable if: | You do not need to set the MP_RMPOOL environment variable if: |
---|---|
You are using a LoadLeveler cluster or an SP system and want non-specific node allocation from a single pool. | You are allocating nodes using a host list file. |
After installation of a LoadLeveler cluster or SP system, your system administrator divides its processor nodes into a number of pools. Each pool has an identifying pool name or number. When using LoadLeveler, and you want non-specific node allocation from a single pool, you need to set the MP_RMPOOL environment variable to the name or number of that pool. When using the Resource Manager, and you want non-specific node allocation from a single pool, you need to set the MP_RMPOOL environment variable to the number of that pool. The pool number you specify should consist of nodes configured for the appropriate communication subsystem library implementation. Check with your system administrator to learn which pools consist of nodes initialized for the US communication subsystem and which were initialized for the IP communication subsystem.
If you need information about available pools and are using LoadLeveler, use the command llstatus. To use llstatus on a workstation that is external to the LoadLeveler cluster, the LoadL.so fileset must be installed on the external node (see Using and Administering LoadLeveler and IBM Parallel Environment for AIX: Installation for more information).
* LoadLeveler lists information about all LoadLeveler pools and/or features.
If you need information about available pools and are using the Resource Manager, use the command jm_status to get job manager status. To use jm_status on a workstation that is external to the SP system, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information).
* The Resource Manager lists information about all SP system pools.
As with most POE environment variables, you can temporarily override the
value of MP_RMPOOL using its associated command-line flag
-rmpool. To specify pool 6, for example,
you could:
Set the MP_RMPOOL environment variable: | Use the -rmpool flag when invoking the program: |
---|---|
export MP_RMPOOL=6 | poe program -rmpool 6 |
Notes:
In conjunction with MP_RMPOOL, when using LoadLeveler, the MP_NODES or MP_TASKS_PER_NODE environment variables or associated command line options may be used.
Table 7. LoadLeveler Node Allocation
MP_PROCS set? | MP_TASKS_PER_NODE set? | MP_NODES set? | Conditions and Results |
---|---|---|---|
Yes | Yes | Yes | MP_TASKS_PER_NODE multiplied by MP_NODES must equal MP_PROCS, otherwise an error occurs. |
Yes | Yes | No | MP_TASKS_PER_NODE must divide evenly into MP_PROCS, otherwise an error occurs. |
Yes | No | Yes | MP_NODES (n) must be less than or equal to MP_PROCS (p). If less than, LoadLeveler will allocate one task to each node, from 0 to n - 1, and will then allocate a second task to each of the nodes from 0 to n - 1, etc., until there are p tasks allocated. For example, if n = 3 and p = 5, 2 tasks will run on node 0, 2 tasks will run on node 1, and 1 task will run on node 2. |
Yes | No | No | The parallel job will run with the indicated number of MP_PROCS (p) on p nodes. |
No | Yes | Yes | The parallel job will consist of MP_TASKS_PER_NODE multiplied by MP_NODES tasks. |
No | Yes | No | An error occurs. MP_NODES or MP_PROCS must be specified with MP_TASKS_PER_NODE. |
No | No | Yes | One parallel task will be run on each of n nodes. |
No | No | No | One parallel task will be run on one node. |
You need to set the MP_AUTH environment variable if: | You do not need to set the MP_AUTH environment variable if: |
---|---|
You are using DFS/DCE based user authorization and your system administrator has not defined the MP_AUTH value in /etc/poe.limits. | You are using AIX based user authorization defined by /etc/hosts.equiv or .rhosts entries, or your system administrator has defined the MP_AUTH value in /etc/poe.limits. |
POE allows two types of user authorization: AIX based user authorization, defined by /etc/hosts.equiv or .rhosts entries, and DFS/DCE based user authorization.
Note: | If POE is run under LoadLeveler, LoadLeveler handles the user authorization, and the POE user authorization steps are skipped. |
The type of user authorization is controlled by the MP_AUTH environment variable. The valid values are AIX (the default) or DFS.
The system administrator can also define the value for MP_AUTH in the /etc/poe.limits file. If MP_AUTH is specified in /etc/poe.limits, POE will override the value of the MP_AUTH environment variable, if different.
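For illustration, such an entry in /etc/poe.limits might look like the following; the exact file format should be confirmed in IBM Parallel Environment for AIX: Installation:

MP_AUTH=DFS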
For more information on running POE in a DFS environment, see "Running POE within a Distributed File System".
For more information on user authorization and on the /etc/poe.limits entries, see IBM Parallel Environment for AIX: Installation
If you wish to use either of the POE X-Windows analysis tools - the Program Marker Array or the System Status Array - you should start them before invoking the executable. For more information on these tools and how to start them, see Figure 1 and "Using the System Status Array".
Note: | In order to perform this step, you need to have a user account on, and be able to remotely login to, each of the processor nodes. This requires that you have an .rhosts file set up in your home directory on each of the remote processor nodes. Alternatively, your user id on the home node can be authorized in the /etc/hosts.equiv file on each remote node. For more information on the TCP/IP .rhosts file format, see IBM General Concepts and Procedures for RS/6000, and IBM AIX Version 4 Files Reference |
The poe command enables you to load and execute programs on remote nodes. You can use it to load and start an SPMD program on all nodes of your partition, to individually load the nodes of your partition with the programs of an MPMD application, to run a series of programs as job steps on the same partition, or to run non-parallel programs on the remote nodes of your partition.
When you invoke poe, the Partition Manager allocates processor nodes for each task and initializes the local environment. It then loads your program, and reproduces your local environment, on each processor node. The Partition Manager also passes the option list to each remote node. If your program is in a shared file system, the Partition Manager loads a copy of it on each node. If your program is in a private file system, you will have already manually copied your executable to the nodes using the mprcp or mcp command. If you are using the dynamic message passing interface, the appropriate communication subsystem library implementation (IP or US) is automatically loaded at this time.
Since the Partition Manager attempts to reproduce your local environment on each remote node, your current directory is important. When you invoke poe, the Partition Manager will, immediately before running your executable, issue the cd command to your current working directory on each remote node. If you are in a local directory that does not exist on remote nodes, you will get an error as the Partition Manager attempts to change to that directory on remote nodes. Typically, this will happen when you invoke poe from a directory under /tmp. We suggest that you invoke poe from a file system that is mounted across the system. If it is important that the current directory be under /tmp, make sure that directory exists on all the remote nodes. If you are running in the C shell, see "Running Programs Under the C Shell".
Note: | The Parallel Environment opens several file descriptors before passing control to the user. The Parallel Environment will not assign specific file descriptors other than standard in, standard out, and standard error. |
Before using the poe command, you can first specify which
programming model you are using by setting the MP_PGMMODEL
environment variable to either
spmd or mpmd. As with most POE environment
variables, you can temporarily override the value of MP_PGMMODEL
using its associated command-line flag -pgmmodel. For
example, if you want to run an MPMD
program, you could:
Set the MP_PGMMODEL environment variable: | Use the -pgmmodel flag when invoking the program: |
---|---|
export MP_PGMMODEL=mpmd | poe -pgmmodel mpmd |
Note: | If you do not set the MP_PGMMODEL environment variable or -pgmmodel flag, the default programming model is SPMD. |
Note: | If you load your executable from a mounted file system, you may experience an initial delay while the program is being initialized on all nodes. You may experience this delay even after the program begins executing, because individual pages of the program are brought in on demand. This is particularly apparent during initialization of the message passing interface; since individual nodes are synchronized, there are simultaneous demands on the network file transfer system. You can minimize this delay by copying the executable to a local file system on each node, using the mcp message passing file copy program. |
If you have an SPMD program, you want to load it as a separate task on each node of your partition. To do this, follow the poe command with the program name and any options. The options can be program options or any of the POE command-line flags shown in Appendix B. "POE Environment Variables and Command-Line Flags":
poe program [options]
or, because the executable is already linked with the Partition Manager, simply enter the program name and any options:
program [options]
You can also enter poe without a program name:
poe
* Once your partition is established, a prompt appears. Enter the name of the program you want to run, followed by any options.
Note: | For National Language Support, POE displays messages located in an
externalized message catalog.
POE checks the LANG and NLSPATH environment
variables, and if either is not set, it will set up the following
defaults:
For more information about the message catalog, see "National Language Support". |
Note: | You must set the MP_PGMMODEL environment variable or -pgmmodel flag to invoke an MPMD program. |
With an SPMD application, the name of the same executable is sent to, and runs on, each of the processor nodes of your partition. If you are invoking an MPMD application, you are dealing with more than one program and need to individually load the nodes of your partition.
For example, say you have two programs - master and workers - designed to run together and communicate via calls to message passing subroutines. The program master is designed to run on one processor node. The workers program is designed to run as separate tasks on any number of other nodes. The master program will coordinate and synchronize the execution of all the worker tasks. Neither program can run without the other, as master only does sends and the workers tasks only do receives.
You can establish a partition and load each node individually using either STDIN or a POE commands file.
To establish a partition and load each node individually using STDIN:
Enter the poe command without a program name, along with any POE command-line flags you need: poe [poe-options]
* The Partition Manager allocates the processor nodes of your partition. Once your partition is established, a prompt containing both the logical node identifier 0 and the actual host name it maps to, appears.
Enter the name of the program (and any options) you want to run on that node.
* A prompt for the next node in the partition displays. Enter a program name at each prompt in turn.
* When you have specified the program to run on the last node of your partition, the message "Partition loaded..." displays and execution begins.
For additional illustration, the following shows the command prompts that would appear, as well as the program names you would enter, to load the example master and workers programs. This example assumes that the MP_PROCS environment variable is set to 5.
% poe
0:host1_name> master [options]
1:host2_name> workers [options]
2:host3_name> workers [options]
3:host4_name> workers [options]
4:host5_name> workers [options]
Partition loaded...
Note: | You can use some POE command-line flags on individual program names, but not
those that are used to set up the partition. The flags you can use are
mainly those having to do with VT trace file collection. They
are:
|
The MP_CMDFILE environment variable, and its associated command-line flag -cmdfile, let you specify the name of a POE commands file. You can use such a file when individually loading a partition - thus freeing STDIN. The POE commands file simply lists the individual programs you want to load and run on the nodes of your partition. The programs are loaded in task order. For example, say you have a typical master/workers MPMD program that you want to run as 5 tasks. Your POE commands file would contain:
master [options]
workers [options]
workers [options]
workers [options]
workers [options]
Once you have created a POE commands file, you can specify it using a
relative or full path name on the MP_CMDFILE environment variable
or -cmdfile flag. For example, if your POE commands file is
/u/hinkle/mpmdprog, you could:
Set the MP_CMDFILE environment variable: | Use the -cmdfile flag on the poe command: |
---|---|
export MP_CMDFILE=/u/hinkle/mpmdprog | poe -cmdfile /u/hinkle/mpmdprog |
Once you have set the MP_CMDFILE environment variable to the name of the POE commands file, you can individually load the nodes of your partition. To do this, enter the poe command without a program name: poe [poe-options]
* The Partition Manager allocates the processor nodes of your partition. The programs listed in your POE commands file are run on the nodes of your partition.
By default, the Partition Manager releases your partition when your program completes its run. However, you can set the environment variable MP_NEWJOB, or its associated command-line flag -newjob, to specify that the Partition Manager should maintain your partition for multiple job steps.
For example, say you have three separate SPMD programs. The first one sets up a particular computation by adding some files to /tmp on each of the processor nodes on the partition. The second program does the actual computation. The third program does some postmortem analysis and file cleanup. These three parallel programs must run as job steps on the same processor nodes in order to work correctly. While specific node allocation using a host list file might work, the requested nodes might not be available when you invoke each program. The better solution is to instruct the Partition Manager to maintain your partition after execution of each program completes. You can then read multiple job steps from either STDIN or a POE commands file.
In either case, you must first specify that you want the Partition Manager
to maintain your partition for multiple job steps. To do this, you
could:
Set the MP_NEWJOB environment variable: | Use the -newjob flag on the poe command: |
---|---|
export MP_NEWJOB=yes | poe -newjob yes |
Notes:
Say you want to run three SPMD programs - setup, computation, and cleanup - as job steps on the same partition. Assuming STDIN is keyboard entry, MP_PGMMODEL is set to spmd, and MP_NEWJOB is set to yes, you would:
Enter the poe command without a program name: poe [poe-options]
* The Partition Manager allocates the processor nodes of your partition, and the following prompt displays:
0031-503 Enter program name (or quit):
Enter: setup [program-options]
* The program setup executes on all nodes of your partition. When execution completes, the following prompt displays:
0031-503 Enter program name (or quit):
Enter: computation [program-options]
* The program computation executes on all nodes of your partition. When execution completes, the following prompt displays:
0031-503 Enter program name (or quit):
Enter: cleanup [program-options]
* The program cleanup executes on all nodes of your partition. When execution completes, the following prompt displays:
0031-503 Enter program name (or quit):
Enter: quit
* The Partition Manager releases the nodes of your partition.
Notes:
POE's STDIN processing model allows redirected STDIN to be passed to all steps of a newjob sequence, when the redirection is from a file. If redirection is from a pipe, POE does not distribute the input to each step, only to the first step.
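For example, with MP_NEWJOB set to yes and the job steps themselves listed in a POE commands file (described next), you could make the same input data available to every step by redirecting it from a file; the data file name mydata.in is hypothetical:

poe -cmdfile /u/hinkle/jobsteps < mydata.in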
The MP_CMDFILE environment variable, and its associated command-line flag -cmdfile, lets you specify the name of a POE commands file. If MP_NEWJOB is yes, you can have the Partition Manager read job steps from a POE commands file. The commands file in this case simply lists the programs you want to run as job steps. For example, say you want to run the three SPMD programs setup, computation, and cleanup as job steps on the same partition. Your POE commands file would contain the following three lines:
setup [program-options]
computation [program-options]
cleanup [program-options]
Program-options represent the actual values you need to specify.
If you are loading a series of MPMD programs, the POE commands file is also responsible for individually loading the partition. For example, say you had three master/worker MPMD job steps that you wanted to run as 4 tasks on the same partition. The following is a representation of what your POE commands file would contain. Options represent the actual values you need to specify.
master1 [options]
workers1 [options]
workers1 [options]
workers1 [options]
master2 [options]
workers2 [options]
workers2 [options]
workers2 [options]
master3 [options]
workers3 [options]
workers3 [options]
workers3 [options]
While you could also redirect STDIN to read job steps from a file, a POE
commands file gives you more flexibility by not tying up STDIN. You can
specify a POE commands file using its relative or full path name. Say
your POE commands file is called /u/hinkle/jobsteps. To
specify that the Partition Manager should read job steps from this file rather
than STDIN, you could:
Set the MP_CMDFILE environment variable: | Use the -cmdfile flag on the poe command: |
---|---|
export MP_CMDFILE=/u/hinkle/jobsteps | poe -cmdfile /u/hinkle/jobsteps |
Once MP_NEWJOB is set to yes, and MP_CMDFILE is set to the name of your POE commands file, you would enter the poe command without a program name: poe [poe-options]
* The Partition Manager allocates the processor nodes of your partition, and reads job steps from your POE commands file. The Partition Manager does not release your partition until it reaches the end of your commands file.
You can also use POE to run non-parallel programs on the remote nodes of your partition. Any executable (binary file, shell script, UNIX utility) is suitable, and it does not need to have been compiled with mpcc, mpCC, or mpxlf. For example, if you wanted to check the process status (using the AIX command ps) for all remote nodes in your partition, you would enter: poe ps
* The process status for each remote node is written to standard output (STDOUT) at your home node. How STDOUT from all the remote nodes is handled at your home node depends on the output mode. See "Managing Standard Output (STDOUT)" for more information.
This section describes a number of additional POE environment variables for monitoring and controlling program execution. It describes how to use the MP_EUIDEVELOP, MP_RETRY, MP_RETRYCOUNT, MP_NOARGLIST, and MP_FENCE environment variables and their associated command-line flags.
For a complete listing of all POE environment variables, see Appendix B. "POE Environment Variables and Command-Line Flags".
You can run programs in one of two modes - develop mode or
run mode. In develop mode, intended for developing
applications, the message passing interface performs more detailed checking
during execution. Because of the additional checking it performs,
develop mode can significantly slow program performance. In run mode,
intended for completed applications, only minimal checking is done.
While run mode is the default, you can use the MP_EUIDEVELOP
environment variable to specify message passing develop mode. As with
most POE environment variables, MP_EUIDEVELOP has an associated
command-line flag -euidevelop. To specify MPI develop mode,
you could:
Set the MP_EUIDEVELOP environment variable: | Use the -euidevelop flag when invoking the program: |
---|---|
export MP_EUIDEVELOP=yes | poe program -euidevelop yes |
To later go back to run mode, set MP_EUIDEVELOP to no.
You can also use MP_EUIDEVELOP for pedb parameter
checking by specifying the DEB value, for "debug".
Set the MP_EUIDEVELOP environment variable: | Use the -euidevelop flag when invoking the program: |
---|---|
MP_EUIDEVELOP=deb; export MP_EUIDEVELOP | poe program -euidevelop deb |
To stop parameter checking, set MP_EUIDEVELOP to min, for "minimum".
If you are using an SP system, and there are not enough available nodes to run your program, the Partition Manager, by default, returns immediately with an error. Your program does not run. Using the MP_RETRY and MP_RETRYCOUNT environment variables, however, you can instruct the Partition Manager to repeat the node request a set number of times at set intervals. Each time the Partition Manager repeats the node request, it displays the following message:
Retry allocation ......press control-C to terminate
The MP_RETRY environment variable, and its associated
command-line flag -retry, specifies
the interval (in seconds) to wait before repeating the node request.
The MP_RETRYCOUNT environment variable, and its associated
command-line flag -retrycount, specifies
the number of times the Partition Manager should make the request before
returning. For example, if you wanted to retry the node request five
times at five minute (300 second) intervals, you could:
Set the MP_RETRY and MP_RETRYCOUNT environment variables: | Use the -retry and -retrycount flags when invoking the program: |
---|---|
MP_RETRY=300; MP_RETRYCOUNT=5; export MP_RETRY MP_RETRYCOUNT | poe program -retry 300 -retrycount 5 |
Note: | If the MP_RETRYCOUNT environment variable or the -retrycount command-line flag is used, the MP_RETRY environment variable or the -retry command-line flag must be set to at least one second. |
When you invoke a parallel executable, you can specify an argument list consisting of a number of program options and POE command-line flags. POE parses this argument list, removes the POE command-line flags, and passes the remainder of the list on to the program. If any of your program's arguments are identical to POE command-line flags, however, this can cause problems. For example, say you have a program that takes the argument -retry. You invoke the program with the -retry option, but it does not execute correctly. This is because there is also a POE command-line flag -retry; POE parses the argument list, so the -retry option is never passed on to your program. There are two ways to correct this sort of problem. You can either keep POE from parsing the argument list entirely (using the MP_NOARGLIST environment variable), or use a fence to mark where POE should stop parsing (using the MP_FENCE environment variable). Both methods are described below.
When you invoke a parallel executable, POE, by default, parses the argument list and removes all POE command-line flags before passing the rest of the list on to the program. Using the environment variable MP_NOARGLIST, you can prevent POE from parsing the argument list. To do this, set MP_NOARGLIST to yes.
When the MP_NOARGLIST environment variable is set to yes, POE does not examine the argument list at all; it simply passes the entire list on to the program. For this reason, you cannot use any POE command-line flags and must use the POE environment variables exclusively. While most POE environment variables have associated command-line flags, MP_NOARGLIST, for obvious reasons, does not. To specify that POE should again examine argument lists, either set MP_NOARGLIST to no, or unset it.
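The following is a minimal sketch of the -retry conflict described above. The program name myprog is hypothetical, and MP_PROCS (the environment variable counterpart of the -procs flag) is assumed here as an example of setting a POE option without using its command-line flag:

    # Tell POE not to parse the argument list at all.
    MP_NOARGLIST=yes; export MP_NOARGLIST
    # POE flags can no longer appear on the command line, so set any POE
    # options through environment variables instead (MP_PROCS assumed here).
    MP_PROCS=4; export MP_PROCS
    # The entire argument list, including -retry, is now passed to myprog.
    poe myprog -retry 3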
When you invoke a parallel executable, POE, by default, parses the entire argument list and removes all POE command-line flags before passing the rest of the list on to the program. You can use a fence, however, to prevent POE from parsing the remainder of the argument list. A fence is simply a character string you define using the MP_FENCE environment variable. Once defined, you can use the fence to separate the arguments you want parsed by POE from those you do not. For example, say you have a program that takes the argument -retry. Because there is also a POE command-line flag -retry, you need to put this argument after a fence. To do this, you could define a fence string and place the program's -retry option after it:
MP_FENCE=Q; export MP_FENCE
poe program [poe-options] Q -retry [program-options]
While this example defines Q as the fence, keep in mind that the fence can be any character string. Any arguments placed after the fence are passed by POE, unexamined, to the program. While most POE environment variables have associated command-line flags, MP_FENCE does not.
POE lets you control standard input (STDIN), standard output (STDOUT), and standard error (STDERR) in several ways. You can continue using traditional I/O manipulation techniques such as redirection and piping, and you can also use POE environment variables and command-line flags to manage how STDIN is distributed to the parallel tasks and how their output is handled at the home node.
STDIN is the primary source of data going into a command. Usually, STDIN refers to keyboard input. If you use redirection or piping, however, STDIN could refer to a file or the output from another command (see "Using MP_HOLD_STDIN"). How you manage STDIN for a parallel application depends on whether or not its parallel tasks require the same input data. Using the environment variable MP_STDINMODE or the command-line flag -stdinmode, you can specify which tasks, if any, receive STDIN.
Setting MP_STDINMODE to all indicates that all tasks should receive the same input data from STDIN. The home node Partition Manager sends STDIN to each task as it is read.
To specify multiple input mode so all tasks receive the same input data
from STDIN, you could:
Set the MP_STDINMODE environment variable: | Use the -stdinmode flag when invoking the program: |
---|---|
MP_STDINMODE=all; export MP_STDINMODE | poe program -stdinmode all |
Note: | If you do not set the MP_STDINMODE environment variable or use the -stdinmode command-line flag, multiple input mode is the default. |
There are times when you only want a single task to read from
STDIN. To do this, you set MP_STDINMODE to the appropriate
task id. For example, say you have an MPMD application consisting of
two programs - master and workers. The
program master is designed to run as a single task on one processor
node. The workers program is designed to run as separate
tasks on any number of other nodes. The master program
handles all I/O, so only its task needs to read STDIN. If
master is running as task 0, you need to specify that only task 0
should receive STDIN. To do this, you could:
Set the MP_STDINMODE environment variable: | Use the -stdinmode flag when invoking the program: |
---|---|
MP_STDINMODE=0; export MP_STDINMODE | poe program -stdinmode 0 |
The environment variable MP_HOLD_STDIN is used to defer the sending of STDIN from the home node to the remote node(s) until the message passing library has been initialized. The variable must be set to "yes" when using POE to invoke a program that: (1) has been compiled with mpcc, mpCC, or mpxlf (or their _r equivalents for the threaded environment), and (2) will be reading STDIN from somewhere other than the keyboard (redirection or piping). Failing to set this environment variable when running such programs is likely to result in the user program hanging.
In addition, if a program invoked using POE has not been compiled with mpcc, mpxlf, or mpCC, the environment variable must not be set (or set to "no") to ensure that STDIN is delivered to the remote node(s).
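For example, here is a minimal sketch of redirecting a data file into a program built with one of the POE compile scripts; the names myprog and input.dat are hypothetical:

    # myprog was compiled with mpcc (or mpCC, mpxlf, or an _r variant) and
    # reads STDIN from a file rather than the keyboard, so delivery of STDIN
    # is deferred until the message passing library has been initialized.
    MP_HOLD_STDIN=yes; export MP_HOLD_STDIN
    poe myprog < input.dat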
To set MP_HOLD_STDIN correctly, you need to know the relative order of your program's use of stdin data and initialization of the message passing library.
The discussion immediately below applies to the signal handling message passing library (MPI/MPL), which is initialized before the user's executable gets control.
The subsequent section addresses the question for the threaded MPI library.
Note: | Wherever the following description refers to a POE environment variable (starting with MP_), the use of the associated command line option produces the same effect, with the exception of MP_HOLD_STDIN, which has no associated command line option. |
A POE process can use its STDIN in two ways. First, if the program name is not supplied on the command line and no commands file (MP_CMDFILE) is specified, POE uses STDIN to resolve the names of the programs to be run as the remote tasks. Second, any "remaining" STDIN is then distributed to the remote tasks as indicated by the MP_STDINMODE and MP_HOLD_STDIN settings. In this dual STDIN model, redirected STDIN can pose two problems:
1. When MP_NEWJOB is yes and program names are read from redirected STDIN, each job step must read its program names from the beginning of that input.
2. STDIN that is intended only for program name resolution may be unintentionally distributed to the remote tasks as data.
The first problem is addressed in POE by performing a rewind of STDIN between job steps (only if STDIN is redirected from a file, for reasons beyond the scope of this document). The second problem is addressed by providing an additional setting for MP_STDINMODE of "none", which tells POE to use STDIN only for program name resolution; with this setting, no STDIN is ever delivered to the remote tasks. This provides an additional method of reliably specifying the program name to POE: redirect STDIN from a file or pipe, or use the shell's here-document syntax, in conjunction with the "none" setting. If MP_STDINMODE is not set to "none" when POE attempts program name resolution on redirected STDIN, program behavior is undefined.
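As a sketch of the "none" setting, the pipe method mentioned above can be used to supply the program name reliably; the program name myprog and the task count given with -procs are assumptions for illustration:

    # Use STDIN only for program name resolution; none of it is delivered
    # to the remote tasks.
    MP_STDINMODE=none; export MP_STDINMODE
    # Supply the program name on STDIN (here via a pipe) instead of on the
    # poe command line; myprog is a hypothetical program name.
    echo "myprog" | poe -procs 4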
The following scenarios describe in more detail the effects of using (or not using) an MP_STDINMODE of "none" when redirecting (or not redirecting) STDIN, as summarized in the following table:
Is MP_STDINMODE set to "none"? | Is STDIN redirected? Yes | Is STDIN redirected? No |
---|---|---|
Yes | Scenario A | Scenario B |
No | Scenario C | Scenario D |
Scenario A (STDIN redirected, MP_STDINMODE set to "none"): POE uses the redirected STDIN for program name resolution, but only if no program name is supplied on the command line (MP_CMDFILE is ignored when MP_STDINMODE=none). No STDIN is distributed to the remote tasks, and no rewind of STDIN is performed. If MP_HOLD_STDIN is set to "yes", it is ignored because no STDIN is being distributed.
Scenario B (STDIN not redirected, MP_STDINMODE set to "none"): POE uses keyboard STDIN for program name resolution, but only if no program name is supplied on the command line (MP_CMDFILE is ignored when MP_STDINMODE=none). No STDIN is distributed to the remote tasks, and no rewind of STDIN is performed (STDIN is not from a file in any case). If MP_HOLD_STDIN is set to "yes", it is ignored because no STDIN is being distributed.
Scenario C (STDIN redirected, MP_STDINMODE not set to "none"): POE uses the redirected STDIN for program name resolution, if required, and distributes any "remaining" STDIN to the remote tasks. If STDIN is intended to be used only for program name resolution, program behavior is undefined in this case, since POE was not informed of this by setting MP_STDINMODE to "none" (see Problem 2 above). If STDIN is redirected from a file, POE rewinds STDIN between job steps. If MP_HOLD_STDIN is set to "yes", delivery of the distributed STDIN is deferred until the message passing library has been initialized.
Scenario D (STDIN not redirected, MP_STDINMODE not set to "none"): POE uses keyboard STDIN for program name resolution, if required. Any "remaining" STDIN is distributed to the remote tasks. No rewind of STDIN is performed, since STDIN is not from a file. If MP_HOLD_STDIN is set to "yes", it is ignored because STDIN is not redirected.
If the user's executable is compiled with the threaded MPI library, message passing initialization occurs when MPI_Init is called, not before POE gives the user program control. If MPI_Init is called before any STDIN data is read, the discussions of the previous section apply. If, however, all STDIN is read before MPI_Init is called, then MP_HOLD_STDIN should be set to "no", to allow the STDIN data to be sent to the user's executable by POE.
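A corresponding sketch for the threaded case, assuming a hypothetical program my_threaded_prog built with mpcc_r that reads all of its STDIN before calling MPI_Init:

    # All STDIN is consumed before MPI_Init is called, so let POE deliver
    # the redirected STDIN immediately rather than holding it.
    MP_HOLD_STDIN=no; export MP_HOLD_STDIN
    poe my_threaded_prog < input.dat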
STDOUT is where the data coming from the command will eventually go. Usually, STDOUT refers to the display. If you use redirection or piping, however, STDOUT could refer to a file or another command. How you manage STDOUT for a parallel application depends on whether you want output data from one task or all tasks. If all tasks are writing to STDOUT, you can also specify whether or not output is ordered by task id. Using the environment variable MP_STDOUTMODE (or the -stdoutmode command-line flag), you can specify unordered, ordered, or single-task output mode.
Setting MP_STDOUTMODE to unordered
specifies that all tasks should write output data to STDOUT
asynchronously. To specify unordered output mode, you could:
Set the MP_STDOUTMODE environment variable: | Use the -stdoutmode flag when invoking the program: |
---|---|
MP_STDOUTMODE=unordered; export MP_STDOUTMODE | poe program -stdoutmode unordered |
Setting MP_STDOUTMODE to ordered specifies ordered output mode. In this mode, each task writes output data to its own buffer. The task buffers are later flushed, in order of task id, to STDOUT when a task's buffer becomes full and when the program completes.
Note: | When running the parallel application under pdbx with MP_STDOUTMODE set to ordered, there will be a difference in the ordering from when the application is run directly under poe. The buffer size available for the application's STDOUT is smaller because pdbx uses some of the buffer, so the task buffers fill up more often. |
To specify ordered output mode, you could:
Set the MP_STDOUTMODE environment variable: | Use the -stdoutmode flag when invoking the program: |
---|---|
MP_STDOUTMODE=ordered; export MP_STDOUTMODE | poe program -stdoutmode ordered |
You can specify that only one task should write its output
data to STDOUT. To do this, you set MP_STDOUTMODE to the
appropriate task id. For example, say you have an SPMD application in
which all the parallel tasks are sending the exact same output
messages. For easier readability, you would prefer output from only one
task - task 0. To specify this, you could:
Set the MP_STDOUTMODE environment variable: | Use the -stdoutmode flag when invoking the program: |
---|---|
MP_STDOUTMODE=0; export MP_STDOUTMODE | poe program -stdoutmode 0 |
Note: | You can also specify single output mode from your program by calling the MP_STDOUTMODE or mpc_stdoutmode Parallel Utility Function. Refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for more information. |
You can set the environment variable MP_LABELIO, or use the -labelio flag when invoking a program, so that the output from the parallel tasks of your program is labeled by task id. While not necessary when output is being generated in single mode, this ability can be useful in the ordered and unordered modes. For example, say the output mode is unordered. You are executing a program and receiving asynchronous output messages from all the tasks. This output is not labeled, so you do not know which task has sent which message. The unordered output would be clearer if it were labeled. For example:
7: Hello World
0: Hello World
3: Hello World
23: Hello World
14: Hello World
9: Hello World
To have the messages labeled with the appropriate task id, you
could:
Set the MP_LABELIO environment variable: | Use the -labelio flag when invoking the program: |
---|---|
MP_LABELIO=yes; export MP_LABELIO | poe program -labelio yes |
To no longer have message output labeled, set the MP_LABELIO environment variable to no.
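For instance, combining unordered output with labeling produces interleaved but attributable lines like those shown above; the program name hello and the task count are assumptions for illustration:

    MP_STDOUTMODE=unordered; export MP_STDOUTMODE
    MP_LABELIO=yes; export MP_LABELIO
    # Output lines arrive asynchronously, each prefixed with its task id.
    poe hello -procs 24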
You can set the environment variable MP_INFOLEVEL to specify the
level of messages you want from POE. You can set the value of
MP_INFOLEVEL to one of the integers shown in the following
table. The integers 0, 1, and 2 give
you different levels of informational, warning, and error messages. The
integers 3 through 6 indicate debug levels that provide
additional debugging and diagnostic information. Should you require
help from the IBM Support Center in resolving a PE-related problem, you will
probably be asked to run with one of the debug levels. As with most POE
environment variables, you can override MP_INFOLEVEL when you
invoke a program. This is done using either the -infolevel
or -ilevel flag followed by the appropriate integer.
This integer: | Indicates this level of message reporting: | In other words: |
---|---|---|
0 | Error | Only error messages from POE are written to STDERR. |
1 | Normal | Warning and error messages from POE are written to STDERR. This level of message reporting is the default. |
2 | Verbose | Informational, warning, and error messages from POE are written to STDERR. |
3 | Debug Level 1 | Informational, warning, and error messages from POE are written to STDERR. Also written is some high-level debugging and diagnostic information. |
4 | Debug Level 2 | Informational, warning, and error messages from POE are written to STDERR. Also written is some high- and low-level debugging and diagnostic information. |
5 | Debug Level 3 | Debug level 2 messages plus some additional loop detail. |
6 | Debug Level 4 | Debug level 3 messages plus other informational error messages for the greatest amount of diagnostic information. |
Let us say you want the POE message level set to verbose. The
following table shows the two ways to do this. You could:
Set the MP_INFOLEVEL environment variable: | Use the -infolevel flag when invoking the program: |
---|---|
MP_INFOLEVEL=2; export MP_INFOLEVEL | poe program -infolevel 2 |
As with most POE command-line flags, the -infolevel and -ilevel flags temporarily override their associated environment variable.
Using the MP_PMDLOG environment variable, you can also specify that diagnostic messages should be logged to a file in /tmp on each of the remote nodes of your partition. The log file is named mplog.pid.n, where pid is the AIX process id of the Partition Manager Daemon, and n is the task number. Should you require help from the IBM Support Center in resolving a PE-related problem, you will probably be asked to generate these diagnostic logs.
The ability to generate diagnostic logs on each node is particularly useful
for isolating the cause of abnormal termination, especially when the
connection between the remote node and the home node Partition Manager has
been broken. As with most POE environment variables, you can
temporarily override the value of MP_PMDLOG using its associated
command-line flag -pmdlog. For example, to generate a
pmd log file, you could:
Set the MP_PMDLOG environment variable: | Use the -pmdlog flag when invoking the program: |
---|---|
MP_PMDLOG=yes; export MP_PMDLOG | poe program -pmdlog yes |
Note: | By default, MP_PMDLOG is set to no, and no diagnostic logs are generated. Do not set MP_PMDLOG routinely; doing so significantly degrades performance and can fill up your file system space. |
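A short sketch of enabling the log for one run and then locating the logs on a remote node; the program name myprog is hypothetical, and the process id portion of the log file name differs from run to run:

    # Generate a Partition Manager Daemon log on each remote node.
    MP_PMDLOG=yes; export MP_PMDLOG
    poe myprog
    # On a remote node, the logs appear in /tmp as mplog.pid.n, where pid is
    # the pmd process id and n is the task number.
    ls /tmp/mplog.*
    # Turn logging back off afterward; it degrades performance and uses disk space.
    MP_PMDLOG=no; export MP_PMDLOG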
You can set the environment variables MP_CHECKFILE and MP_CHECKDIR to checkpoint or restart a program that was previously compiled with the mpcc_chkpt, mpCC_chkpt, or mpxlf_chkpt commands. Only POE/MPI applications submitted under LoadLeveler in batch mode can be checkpointed; checkpointing of interactive POE applications is not allowed.
The program's execution will be suspended when the mp_chkpt() function is reached. See IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for the description of the mp_chkpt function. At that point, the state of the application is captured, along with all data, and saved to the file pointed to by the MP_CHECKFILE and MP_CHECKDIR variables.
MP_CHECKFILE defines the base name of the checkpoint file. MP_CHECKDIR defines the directory where the checkpoint file will reside. If the MP_CHECKFILE variable is not specified, the program cannot be checkpointed. The file name specified by MP_CHECKFILE may include the full path, in which case the MP_CHECKDIR variable will be ignored. If MP_CHECKDIR is not defined and MP_CHECKFILE does not specify a full path name, then MP_CHECKFILE is used as a relative path name from the current working directory.
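To illustrate these naming rules, here is a sketch using a hypothetical GPFS directory /gpfs/ckpt and base name myprog.ckpt; the job itself would still need to be submitted under LoadLeveler in batch mode:

    # Base name plus separate directory: checkpoint files are written under
    # /gpfs/ckpt using the base name myprog.ckpt.
    MP_CHECKDIR=/gpfs/ckpt; export MP_CHECKDIR
    MP_CHECKFILE=myprog.ckpt; export MP_CHECKFILE
    # Alternatively, a full path in MP_CHECKFILE causes MP_CHECKDIR to be ignored.
    MP_CHECKFILE=/gpfs/ckpt/myprog.ckpt; export MP_CHECKFILE
    # With no full path and no MP_CHECKDIR, the base name is taken relative
    # to the current working directory.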
Only programs compiled with the checkpoint compile scripts (mpcc_chkpt, mpCC_chkpt, or mpxlf_chkpt) that call the mp_chkpt function can be checkpointed.
When the checkpoint file is created during the checkpointing phase, the task id and a version id are appended to the base file name to differentiate between checkpoint files from different instances of the program.
There are certain limitations associated with checkpointing an application. Please refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for specific details.
A program can be restarted by executing POE with MP_CHECKFILE and MP_CHECKDIR pointing to the checkpoint file from the previously checkpointed program. The checkpoint file must be valid and accessible to all tasks specified when POE is invoked. The application can be restarted on the same or a different set of nodes, but the number of tasks must remain the same.
During restart processing, the version and content of the checkpoint file are verified internally by POE to ensure consistency and accuracy. Any discrepancy, such as checkpoint file versions that are not the same across all tasks, is reported.
The checkpoint file will be read, and the program will be restored to an executing state, after retrieving the program state and data information from the file. When execution is completely restored, the checkpoint files are deleted.
If you are using the MP_BUFFER_MEM environment variable to change the maximum size of memory used by the communication subsystem while checkpointing a program, be aware that the amount of space needed for the checkpoint files is increased by the value of MP_BUFFER_MEM.
The ability to checkpoint or restart programs is controlled by the definition and availability of the checkpoint files, as specified by the MP_CHECKFILE environment variable.
The specified file may be defined on the local file system (JFS) of the node on which the instance of the program is running, or it may be defined in some shared file system (such as NFS, AFS, DFS, GPFS, etc.). When the file is in a local file system, then in order to perform process migration, the checkpoint file will have to be moved to the new system on which the process is to be restarted. If the old system crashed and is unavailable, it may not be possible to restart the program. It may be necessary, therefore, to use some kind of file management to avoid such a problem. If migration is not desired, it is sufficient to place checkpoint files in the local JFS file system.
Program checkpoint files can be large and numerous, so significant amounts of available disk space may be needed to maintain them. It is recommended that you do not use NFS, AFS, or DFS for managing checkpoint files, because these file systems take a very long time to write and read large files. The use of GPFS or JFS is recommended instead.
If a local JFS file system is used, the checkpoint file must be written to each remote task's local file system during checkpointing. Consequently, during a restart, each remote task's local file system must be able to access the checkpoint file from the previously checkpointed program. This is of special concern when you opt to restart a program on a different set of nodes from the ones on which it was checkpointed, because the local checkpoint files may need to be relocated to the new nodes. For these reasons, GPFS is the file system best suited for checkpoint and restart file management.
This section gives you instructions on how to run POE within a Distributed File System (DFS). Included is a description of the poeauth command, which allows you to copy DFS credentials to all nodes on which you want to run POE jobs.
Note: | When running POE under LoadLeveler, LoadLeveler handles all user authorization instead of POE. |
In order to run POE jobs from DFS, you need to copy the DFS/DCE credentials files to each node you wish to run on, using the poeauth command. You should be set up with a DFS account; after you log in, you access your DCE user credentials by doing a dce_login.
DCE credentials are defined on a per-user basis; therefore, each user must use poeauth to copy the credentials prior to running a POE job on a DFS/DCE system.
Before you can run the poeauth command, you need to make some initial file and directory changes: define, in your pool or host list file, all the nodes on which you want to run POE jobs, and change directories to a local non-DFS file system (for example, /tmp). Because the poeauth command sets up the DFS credentials for POE, you cannot run it with a DFS directory as the current directory.
The execution of the poeauth command is dependent upon the type of user authorization specified by the MP_AUTH environment variable - either AIX or DFS/DCE authorization.
When AIX user authorization is selected (either by setting MP_AUTH=AIX or allowing it as the default), and your home directory resides in DFS, your user name must be properly authorized to access those nodes in the /etc/hosts.equiv file on each node. You should remove all entries from the .rhosts files on each node, and allow the /etc/hosts.equiv file to authorize the users on each node. Otherwise, POE will not be able to authorize users properly. Once DFS credentials are established, you can use a .rhosts file.
The dce_login sets up a new shell. As a result, you should set up any environment variables needed to run poeauth or other POE applications (such as MP_AUTH) after doing the dce_login.
You should run the poeauth command from the home node (task 0), in the session from which you performed dce_login. Because poeauth is a POE application, you can use any POE command-line flag or environment variable with it. Each user must run poeauth before running any POE applications. Once the credentials are copied, there is no need to run poeauth again until the credentials expire (at which time you will need to copy them again with poeauth). After you run the poeauth command successfully, you can run POE from DFS. For more information on the poeauth command, see Appendix A. "Parallel Environment Commands".
Note: | Credentials files need to exist on the home node (task 0), that is, from where dce_login was performed. The poeauth command needs to be run from task 0. |
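The sequence below sketches the credential setup. The DCE principal, host list file, and task count are hypothetical; -procs and -hostfile are assumed here as the POE options naming how many and which nodes take part, and MP_AUTH is set to DFS following the AIX/DFS choice described above.

    # Work from a local, non-DFS directory and obtain DCE credentials first.
    cd /tmp
    dce_login my_dce_user
    # dce_login starts a new shell, so set POE variables such as MP_AUTH afterward.
    MP_AUTH=DFS; export MP_AUTH
    # Copy the credentials to every node listed in the (hypothetical) host list file.
    poeauth -procs 8 -hostfile /u/hinkle/host.list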
When POE returns error messages related to an inability to change to a DFS directory or a problem copying a file to a DFS directory, it most likely means there is a problem with the DFS credentials on that task or node. Check to see if the credentials were properly copied with poeauth, or if they have expired (use the klist command).
Since poeauth is a POE application, if you try to run it when the credentials have expired, POE will encounter an error accessing the expired credentials.
If the credentials have expired, you must do another dce_login and run the poeauth command again.
POE maintains a master control file in /tmp to keep track of the credentials. If /tmp is periodically cleaned out or the file is accidentally erased before your credentials expire, POE will not be able to access your DCE credentials and you may get errors related to the inability to access credentials. If this occurs, you will need to run the poeauth command again to redefine your credentials to POE.