IBM PE for AIX V2R4.0: Operation and Use, Vol. 1

Introduction

The IBM Parallel Environment for AIX program product (PE) is an environment designed for the development and execution of parallel Fortran, C, or C++ programs. PE consists of components and tools for developing, executing, debugging, profiling, and tuning parallel programs.

The PE is a distributed memory message passing system. It runs on the RS/6000 platform using the AIX operating system (Version 4.3.2 or later). Specifically, you can use the PE to execute parallel programs on:

any configuration of the IBM RS/6000 SP as described in the IBM Parallel System Support Programs for AIX: Installation and Migration Guide. Essentially, an SP system is a collection of RS/6000 processors grouped into a number of frames. Each frame of an SP system can contain from two to 16 RS/6000 processors.
a networked cluster of RS/6000 processors, including a single processor or a single workstation.
a mixed system. In a mixed system, additional RS/6000 processors supplement the processors of an SP system.

The RS/6000 processors of your system are called processor nodes. A parallel program executes as a number of individual, but related, parallel tasks on a number of your system's processor nodes. The group of parallel tasks is called a partition. The processor nodes are connected on the same LAN, so the parallel tasks of your partition can communicate to exchange data or synchronize execution. If you are using an SP system:

Your system may have an optional high performance switch for communication. The switch increases the speed of communication between nodes. It supports a high volume of message passing with increased bandwidth and low latency.
Your system administrator can divide its nodes into separate pools. An SP system pool is a subset of processor nodes and is given an identifying pool number. A LoadLeveler system pool is a subset of processor nodes and is given an identifying pool name or number.

PE supports the two basic parallel programming models - SPMD and MPMD. In the SPMD (Single Program Multiple Data) model, the programs running the parallel tasks of your partition are identical. The tasks, however, work on different sets of data. In the MPMD (Multiple Program Multiple Data) model, each node may be running a different program. A typical example of this is the master/worker MPMD program. In a master/worker program, one task - the master - coordinates the execution of all the others - the workers.
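
As a rough illustration of the SPMD model described above, the following C sketch shows how identical code can pick a different slice of the data depending on which task runs it. The function names block_range and partial_sum are hypothetical and are not part of PE. The task number and task count are plain parameters here; in an MPI program they would typically come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <stddef.h>

/*
 * Compute the half-open index range [*lo, *hi) that one task owns when
 * n items are split as evenly as possible across ntasks tasks.
 * task_id is assumed to be in the range 0..ntasks-1.
 */
void block_range(size_t n, int ntasks, int task_id, size_t *lo, size_t *hi)
{
    size_t base  = n / (size_t)ntasks;  /* items that every task gets */
    size_t extra = n % (size_t)ntasks;  /* the first `extra` tasks get one more */
    size_t id    = (size_t)task_id;

    *lo = id * base + (id < extra ? id : extra);
    *hi = *lo + base + (id < extra ? 1 : 0);
}

/* Every task runs this same function but sums only its own slice. */
double partial_sum(const double *data, size_t n, int ntasks, int task_id)
{
    size_t lo, hi;
    double sum = 0.0;

    block_range(n, ntasks, task_id, &lo, &hi);
    for (size_t i = lo; i < hi; i++)
        sum += data[i];
    return sum;
}
```

In a real message passing program, each task would then send its partial result to one task, which combines them. That step is omitted here so the sketch stays free of any library dependency.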
Note: While the remainder of this introduction describes each of the PE components and tools in relation to a specific phase of an application's life cycle, this does not imply that they are limited to one phase. They are ordered this way for descriptive purposes only; you will find many of the tools useful across an application's entire life cycle.

The application developer begins by creating a parallel program's source code. The application developer might create this program from scratch or could modify an existing serial program. In either case, the developer places calls to Message Passing Interface (MPI) or Low-level Application Programming Interface (LAPI) routines so that it can run as a number of parallel tasks. This is known as parallelizing the application. The MPI is similar to the Message Passing Library (MPL) from an earlier version of Parallel Environment. MPI provides message passing capabilities for the current version of PE. There are two libraries for MPI:

Signal handling - which uses UNIX signals and signal handlers
Threaded - which uses and supports POSIX user threads.

All tasks of a program must use either the signal handling library or the threaded library, not a combination of the two.

MPL programs are still supported for non-threaded applications.
Note: Throughout this book, when referring to anything non-specific for MPI and MPL, the term message passing will be used. For example:
message passing program
message passing routine
message passing call

The message passing calls enable the parallel tasks of your partition to communicate data and coordinate their execution. The message passing routines in turn call communication subsystem library routines which handle communication among the processor nodes. There are two separate implementations of the communication subsystem library - the Internet Protocol (IP) Communication Subsystem and the User Space (US) Communication Subsystem. While the message passing application interface remains the same, the communication subsystem libraries use different protocols for communication among processor nodes. The IP communication subsystem uses Internet Protocol, while the US communication subsystem is designed for the SP system's high performance switch feature. The communication subsystem library implementations are dynamically linked when you invoke the program. For more information on the message passing subroutine calls, refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference, IBM Parallel Environment for AIX: MPL Programming and Subroutine Reference, and IBM Parallel Environment for AIX: Hitchhiker's Guide.

In addition to message passing communication, the Parallel Environment supports a separate communication protocol known as the Low-level Application Programming Interface (LAPI). LAPI differs from MPI in that it is based on an "active message style" mechanism that provides a one-sided communications model. That is, one process initiates an operation and the completion of that operation does not require any other process to take a complementary action.

LAPI only runs with the US Communication Subsystem. For this reason, it is designed to run on the SP system's high performance communication adapter only. The RS/6000 workstation cluster does not support LAPI.

Although LAPI is used for data communication in conjunction with PE, it is actually part of the communication subsystem for IBM's Parallel System Support Programs (PSSP). For more information on LAPI, see IBM Parallel System Support Programs for AIX: Administration Guide and IBM Parallel System Support Programs for AIX: Command and Technical Reference.

After writing the parallel program, the application developer then begins a cycle of modification and testing. The application developer now compiles and runs his program from his home node using the Parallel Operating Environment (POE). The home node is any workstation on the LAN. POE is an execution environment designed to hide, or at least smooth, the differences between serial and parallel execution.

To assist with node allocation for job management, the role of IBM LoadLeveler has been expanded to work with POE for interactive jobs. LoadLeveler will now provide resource management function both on and off the SP system. You can run parallel programs on a cluster of processors running LoadLeveler, or on a mixed system of LoadLeveler processors that supplement an SP system. LoadLeveler not only provides SP node allocation for jobs using the US communication subsystem, but also provides management for non-SP nodes, or for SP nodes being used for jobs other than user space. LoadLeveler will still be used by POE for batch jobs as well. See the IBM LoadLeveler documentation for more information on this job management system.

In general, with POE, you invoke a parallel program from your home node and run its parallel tasks on a number of remote nodes. When you invoke a program on your home node, POE starts your Partition Manager which allocates the nodes of your partition and initializes the local environment. Depending on your hardware and configuration, the Partition Manager uses a host list file, LoadLeveler, or the SP system Resource Manager to allocate nodes. A host list file contains an explicit list of node requests, while LoadLeveler or the Resource Manager allocates nodes from one or more system pools implicitly based on their availability. On an SP system using the Resource Manager, you can also use a host list file to determine how an allocated node's resources - its SP switch adapter and CPU - are used. Your program task can either:

share or not share the node's SP switch adapter
share or not share the node's CPU.

With regard to the expanded LoadLeveler function, POE now provides an option to enable you to specify whether your program will use MPI, LAPI, or both. Using this option, POE ensures that each API initializes properly and informs LoadLeveler which APIs are used so each node is set up completely.

For Single Program Multiple Data (SPMD) applications, the Partition Manager executes the same program on all nodes. For Multiple Program Multiple Data (MPMD) applications, the Partition Manager prompts you for the name of the program to load on each node. The Partition Manager also connects standard I/O to each remote node so the parallel tasks can communicate with the home node. Although you are running tasks on remote nodes, POE allows you to continue using the standard UNIX and AIX execution techniques with which you are already familiar. For example, you can redirect input and output, pipe the output of programs, or use shell tools. The POE includes:

A number of parallel compiler scripts. These are shell scripts that call the C, C++, or Fortran compilers while also linking in an interface library to enable communication between your home node and the parallel tasks running on the remote nodes. You dynamically link in a communication subsystem implementation when you invoke the executable.
A number of POE Environment Variables you can use to set up your execution environment. These are AIX environment variables you can set to influence the operation of POE. These environment variables control such things as how processor nodes are allocated, what programming model you are using, and how standard I/O between the home node and the parallel tasks should be handled. Most of the POE environment variables also have associated command-line flags that enable you to temporarily override the environment variable value when invoking POE and your parallel program.
Two X-Windows analysis tools:
- The Program Marker Array. This is a programmable array of small boxes, or lights, which are associated with parallel tasks. Under program control, these lights can change color to provide you with immediate visual feedback as your program executes. See Figure 1 for a complete description of this tool.
- The System Status Array. This tool lets you quickly survey the utilization of your processor nodes. It is useful when listing nodes in a host list file for explicit node allocation, and is discussed in "Using the System Status Array".
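
The precedence rule described above for POE environment variables and their command-line flags can be sketched in C as follows. This is only an illustration of the rule, not POE's own code: the function resolve_setting is hypothetical, and the caller supplies the variable name (for example MP_CPU_USE) and a fallback value of its choosing.

```c
#include <stdlib.h>

/*
 * Resolve one setting. A value given with a command-line flag wins over
 * the environment variable; if neither is present, the fallback is used.
 * flag_value is NULL when the flag was not given on the command line.
 */
const char *resolve_setting(const char *flag_value, const char *env_name,
                            const char *fallback)
{
    const char *env_value;

    if (flag_value != NULL)
        return flag_value;              /* temporary override for this run */

    env_value = getenv(env_name);
    if (env_value != NULL && env_value[0] != '\0')
        return env_value;               /* value set in the environment */

    return fallback;                    /* neither source supplied a value */
}
```

Because the flag is consulted first and the environment is never modified, the override lasts only for the invocation that supplied the flag.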

The following tools are discussed in IBM Parallel Environment for AIX: Operation and Use, Volume 2, Tools Reference and allow you to debug, visualize, and tune parallel programs.

There are two parallel debugging facilities. The first - pdbx - is a line-oriented debugger based on the dbx debugger. The other - pedb - is a Motif-based debugger.

Once the parallel program is debugged, you now want to tune the program for optimal performance. To do this, you turn to the PE parallel profiling capability and Visualization Tool to analyze the program.

The parallel profiling capability enables you to use the PE Xprofiler graphical user interface, as well as the AIX commands prof and gprof on parallel programs. Xprofiler is a tool that helps you analyze your parallel application's performance quickly and easily. It uses procedure profiling information to construct a graphical display of the functions within your application. Xprofiler provides quick access to the profiled data, which lets you identify the functions that are the most CPU-intensive. The graphical user interface also lets you manipulate the display in order to focus on the application's critical areas.

The Visualization Tool (VT) contains a set of displays which allow you to visualize performance characteristics of your program and system. Each display presents specific, often complex, information in an easily-interpretable form such as a bar chart or a strip graph. You can use VT's displays for trace visualization and online performance monitoring.

In trace visualization, you play back statistical and event records - or trace records - generated during your program's execution. You can use VT to visualize information about the program as well as its use of the underlying system. This visualized information can help you tune the program to optimize its use of the underlying system.
In performance monitoring, you use VT as an online monitor to study the operational status and activity of each of the processor nodes in your SP system or RS/6000 network cluster. This mode of VT is similar to the System Status Array in that it only displays system statistics and not communication information. Like the System Status Array, it is useful when listing nodes in a host list file for explicit node allocation.

Note: Once the parallel program is tuned to your satisfaction, you might prefer to execute it using a job management system such as IBM LoadLeveler. If you do use a job management system, consult its documentation for information on its use.

What's New in PE 2.4?

AIX 4.3 Support

With PE 2.4, POE supports user programs developed with AIX 4.3. It also supports programs developed with AIX 4.2, intended for execution on AIX 4.3.

Parallel Checkpoint/Restart

This release of PE provides a mechanism for temporarily saving the state of a parallel program at a specific point (checkpointing), and then later restarting it from the saved state. When a program is checkpointed, the checkpointing function captures the state of the application as well as all data, and saves it in a file. When the program is restarted, the restart function retrieves the application information from the file it saved, and the program then starts running again from the place at which it was saved.
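
The save-then-restore idea behind checkpointing can be illustrated with a minimal C sketch. This is not the PE checkpoint/restart interface, which is not shown in this section; the structure app_state and the functions save_checkpoint and load_checkpoint are hypothetical names used only to show state being written to a file and read back later.

```c
#include <stdio.h>

/* Hypothetical application state: whatever is needed in order to resume. */
struct app_state {
    long   step;      /* last completed iteration */
    double value[4];  /* working data */
};

/* Write the state to a file. Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const struct app_state *s)
{
    FILE *f = fopen(path, "wb");
    size_t written;

    if (f == NULL)
        return -1;
    /* Raw struct bytes: only readable by the same program on the same platform. */
    written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1)
        return -1;
    return 0;
}

/* Read the state back. Returns 0 on success, -1 on failure. */
int load_checkpoint(const char *path, struct app_state *s)
{
    FILE *f = fopen(path, "rb");
    size_t got;

    if (f == NULL)
        return -1;
    got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

A program using this pattern would call load_checkpoint at startup and, if it succeeds, continue from the saved step instead of starting from the beginning.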

Enhanced Job Management Function

In earlier releases of PE, POE relied on the SP Resource Manager for performing job management functions. These functions included keeping track of which nodes were available or allocated and loading the switch tables for programs performing User Space communications. LoadLeveler, which had only been used for batch job submissions in the past, is now replacing the Resource Manager as the job management system for PE. One notable effect of this change is that LoadLeveler now allows you to run up to four User Space tasks per node.

MPI I/O

With PE 2.4, the MPI library now includes support for a subset of MPI I/O, described by Chapter 9 of the MPI-2 document, MPI-2: Extensions to the Message-Passing Interface, Version 2.0. MPI I/O provides a common programming interface, improving the portability of code that involves parallel I/O.

1024 Task Support

With regard to MPI/LAPI jobs, this release of PE supports a maximum of 2048 tasks for IP, and 1024 tasks for US, as opposed to the previous release, which supported a maximum of 512 tasks.

Enhanced Compiler Support

In this release, POE is adding support for the following compilers:

Fortran Version 5
C
C++
xlhpf

Message Queue Facility

The pedb debugger now includes a message queue facility. Part of the pedb debugger interface, the message queue viewing feature can help you debug Message Passing Interface (MPI) applications by showing internal message request queue information. With this feature, you can view:

A summary of the number of active messages for each task in the application. You can select criteria for the summary information based on message type and source, destination, and tag filters.
Message queue information for a specific task.
Detailed information about a specific message.

Xprofiler Enhancements

This release includes a variety of enhancements to Xprofiler, including:

Save Configuration and Load Configuration options for saving the names of the functions currently in the display and reloading them later in order to reconstruct the function call tree.
An Undo option that lets you undo operations that involve adding or removing nodes or arcs from the function call tree.

PE 2.4 Migration Information

This section is intended for customers migrating from earlier releases of PE to PE 2.4. It contains specific information on some differences between earlier releases and PE 2.4 that you need to consider prior to installing or using PE 2.4. To find out which release of PE you currently have installed, use lslpp.

AIX Compatibility

PE 2.4 commands and applications are compatible with AIX Version 4.3.2 or later only, not with earlier versions of AIX.

Existing Applications

Applications from previous versions of Parallel Environment are binary compatible with PE 2.4, with the following exceptions:

User applications created using PE 2.4 are not binary compatible with Version 1.
In order to run under PE 2.4, you must recompile existing applications that were developed under PE Version 1.
In order to run under PE 2.4, you must recompile any statically bound applications that were created with PE Version 2 Release 1.

Existing Host List Files

Host list files from previous releases that contained multiple pool or usage specifications will be affected as follows when using LoadLeveler:

Usage specification in a host list file will be ignored.
You can request how nodes are used with the MP_CPU_USE and/or MP_ADAPTER_USE environment variables, or their associated command-line flags.
LoadLeveler does not allow dissimilar pool entries.

LAPI Applications

LAPI programs must set the MP_MSGAPI environment variable.
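
As a hedged illustration, a program could check early that MP_MSGAPI is present in its environment before doing any LAPI work. The functions msgapi_is_set and require_msgapi below are hypothetical helpers, not part of PE or LAPI, and the sketch assumes the variable is visible in the task's environment. It deliberately checks only that the variable is set, because the accepted values are not listed in this section.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if MP_MSGAPI is set to a non-empty value, 0 otherwise. */
int msgapi_is_set(void)
{
    const char *v = getenv("MP_MSGAPI");

    return v != NULL && v[0] != '\0';
}

/*
 * Intended to be called before any LAPI setup.
 * Prints a hint and returns -1 when the variable is missing, else 0.
 */
int require_msgapi(void)
{
    if (!msgapi_is_set()) {
        fprintf(stderr, "MP_MSGAPI is not set; LAPI programs must set it.\n");
        return -1;
    }
    return 0;
}
```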

MPI and MPL Applications

The MPI function became available in Version 2.
The MPL message passing applications are source compatible between PE Version 1 Release 2 and PE 2.4, but must be recompiled.

Coexistence

All tasks within a partition or cluster must be running the same version of PE. You cannot mix versions of PE.

Therefore, for all processors within a workstation cluster, the same release level of the PE software is required.

When you use partitioning, you may have different levels of PE software installed on different partitions; however, within a partition, all the nodes must be at the same level of PE software.
Note: See IBM Parallel Environment for AIX: Installation for more information about software compatibility within a workstation cluster or partition, and for administrative and usage information about running different versions of POE in a partitioned environment.

Use of /usr/lib in LIBPATH

Users who previously set LIBPATH to include /usr/lib should no longer do so. Setting LIBPATH to include /usr/lib would cause the POE application not to include all of the POE libraries at execution time.

/usr/lib is included in the loader section search path of all POE applications at compile time, so there is no need to include it in LIBPATH.
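
When migrating, it can help to detect whether an existing LIBPATH setting still lists /usr/lib. The following C sketch scans a colon-separated search path for an exact entry. The function path_has_entry is a hypothetical helper, not a PE utility, and it matches entries literally, so a variant such as /usr/lib/ with a trailing slash is not detected.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the colon-separated search path `libpath` contains `dir`
 * as one complete entry, 0 otherwise. A NULL or empty path yields 0.
 */
int path_has_entry(const char *libpath, const char *dir)
{
    size_t dir_len = strlen(dir);
    const char *p = libpath;

    if (libpath == NULL)
        return 0;

    while (*p != '\0') {
        const char *end = strchr(p, ':');            /* end of this entry */
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len == dir_len && strncmp(p, dir, len) == 0)
            return 1;
        if (end == NULL)
            break;                                   /* last entry checked */
        p = end + 1;
    }
    return 0;
}
```

A typical call would be path_has_entry(getenv("LIBPATH"), "/usr/lib"), with getenv declared in stdlib.h. A result of 1 indicates a setting that should be revised as described above.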

Compiler -ip and -us options

The -ip and -us flags for PE Version 1 mpcc, mpCC, and mpxlf compiler scripts are no longer used or supported. All application programs are dynamically linked using these scripts.

Instructions are provided on how to create statically executable versions of your applications in IBM Parallel Environment for AIX: Operation and Use, Volume 1, Using the Parallel Operating Environment. User-written scripts that utilize these options need to be rewritten.

VT trace files are incompatible

VT trace files generated using Version 1 or Version 2, Release 2 will not be compatible with Version 2.4, and vice versa. Trace files must be regenerated.

If you want to write your own conversion program, refer to IBM Parallel Environment for AIX: Operation and Use, Volume 2, Tools Reference for information about the VT trace file format.

SP_NAME Environment Variable

Previous versions of POE allowed jobs using the SP Resource Manager to be submitted from a non-SP node by setting the SP_NAME environment variable. For POE Version 2 Release 2 or later, you must also install the ssp.clients fileset. Refer to IBM Parallel Environment for AIX: Installation for more information.

POSIX Threads Support

PE 2.4 supports the IEEE POSIX 1003.1-1996 level of POSIX threads (sometimes known as Draft 10), which is xpg5 compliant, as the default when compiling parallel applications. Existing applications from previous releases of PE were built with an earlier version of POSIX threads (Draft 7).

Existing threaded applications are supported in binary compatibility mode, without needing to recompile. However, these will run with the older objects from the previous version's threads library.

All new applications are compiled with the new draft of POSIX threads as the default. However, the POE threaded compiler scripts (mpcc_r, mpCC_r, mpxlf_r, mpxlf90_r) also provide an optional flag (-d7) to allow applications to be compiled with the older version of the threads library. See the appropriate compiler command description for further details.

32/64 Bit Applications

POE compiles and runs all applications as 32 bit applications. 64 bit applications are not yet supported.
