Guide and Reference

Coding and Running Your Program

This chapter explains the Parallel ESSL-specific procedures to follow when coding and running your program.

Coding Tips for Optimizing Parallel Performance

Performance has been the primary objective in the design of the Parallel ESSL subroutines. To achieve this performance goal, the Parallel ESSL subroutines use "state-of-the-art" algorithms tailored to specific operational characteristics of the hardware. In addition, Parallel ESSL will leverage the high performance provided by ESSL for AIX for processor computations.

XL HPF allows you to easily develop parallel software using the SPMD programming model. The XL HPF compiler, guided by HPF directives in your source code, handles the distribution of data and communication between programs on multiple processes. The HPF directives make developing an HPF program that calls Parallel ESSL easier than developing a message passing program that calls Parallel ESSL. However, the performance obtained when using a Parallel ESSL HPF subroutine is less than that obtained when using a Parallel ESSL message passing subroutine because there is a certain amount of overhead involved in supporting the extrinsic hpf_local interface.

Because the XL HPF compiler only supports CYCLIC(N) in the interface blocks for extrinsic hpf_local subroutines, a redistribution of data occurs whenever a Level 3 PBLAS, Dense Linear Algebraic Equations, Eigensystem Analysis or Singular Value Analysis subroutine is called. Also, data may be copied locally because the extrinsic hpf_local subroutines require the use of assumed-shape arrays while the Parallel ESSL message passing subroutines use assumed-size arrays.

The following techniques are used by most subroutines to optimize performance:

Minimizing the impact of communications by exchanging larger blocks of data
Blocking data to match the processor cache size

The following items also impact performance. They generally depend on the specific parallel routine being called. See the subroutine description in the reference section for any exceptions to these rules.

Number and types of processors (such as, POWER Thin, POWER2 Thin, POWER Wide)
Choosing the number of processors depends primarily on the problem size. It is reasonable to increase the number of processors, if the global problem size increases sufficiently to keep the amount of local data per process at a reasonable size. If, however, using more processes, such as 17 rather than 16, causes you to use a one-dimensional grid rather than a two-dimensional grid, performance may be degraded. See the next item.
Shape of process grid
For most subroutines, using a two dimensional (square or as close to square as possible) grid is suggested. For example, if sixteen processors were used, define a 4 by 4 process grid. For exceptions to this rule, see the subroutine descriptions in the reference section.

Block size(s) in the Message Passing subroutines

See the following table for suggested block sizes in your message passing program. The optimal block size depends on the underlying node computations, load balancing, communications, system buffering requirements, problem size, and dimension and shape of the process grid. To achieve optimal performance, generally requires experimentation, but the values specified in Table 29 should provide good performance for most cases. For exceptions to these rules, see the subroutine descriptions in the reference section.

Table 29. Suggested Block Sizes

Area POWER Nodes POWER2 Nodes
Level 2 PBLAS 24
24 (All subroutines, except PDTRSV) 64 (PDTRSV)

Level 3 PBLAS 40 70
Dense Linear Algebraic Equations 40
70 (Real subroutines) 30 (Complex subroutines)

Eigensystems Analysis and Singular Value Analysis 24 24
Random Number Generation .5 (data cache size) .5 (data cache size)

Note: The data cache size can be obtained with this command: lsattr -E -H -l sys0

PESSL_HPF module for the HPF subroutines
For all HPF subroutines, except GEBRD, data directives are included in the interface module PESSL_HPF; therefore, you can specify any data distribution for your vectors, matrices, and sequences, because the XL HPF compiler will, if necessary, redistribute the data prior to calling the HPF Parallel ESSL subroutine. Data directives for GEBRD cannot be included in the PESSL_HPF module, because the alignment requirements for some of the vectors depend on the size of the matrix. For details, see GEBRD--Reduce a General Matrix to Bidiagonal Form.
When using cyclic distribution in your HPF program, you can only specify CYCLIC(1) data distributions. However, the performance of the Level 3 PBLAS, Dense Linear Algebraic Equations, and Eigensystems Analysis and Singular Value Analysis subroutines is improved if a CYCLIC(N) data distribution is used. To accomplish this, PESSL_HPF contains CYCLIC(N) data directives for a two-dimensional process grid in the interface for these subroutines. The block sizes specified in the PESSL_HPF module are listed in Table 30.
There are CYCLIC(1) directives for a two-dimensional process grid in the interfaces in PESSL_HPF for the Level 2 PBLAS.
There are BLOCK data directives for a one-dimensional process grid in the interfaces in PESSL_HPF for the Banded Linear Algebraic Equations, Fourier transform and Random Number Generation subroutines.
For more information about the PESSL_HPF module, see "Using Extrinsic Procedures--The Parallel ESSL Subroutines".
Table 30. Block Sizes Specified in the PESSL_HPF Module
Area Block Size
Level 2 PBLAS 1
Level 3 PBLAS 70
Dense Linear Algebraic Equations
70 (Real) 30 (Complex)
Eigensystems Analysis and Singular Value Analysis 24
If you are using the MPI threaded library and only a single message passing thread, specify MP_SINGLE_THREAD=yes to minimize thread overhead.
You should be able to improve performance of production-level code by using the PESSL_ERROR_SYNC environment variable to disable error synchronization. For details, see "PESSL_ERROR_SYNC Environment Variable".

Area	Block Size
Level 2 PBLAS	1
Level 3 PBLAS	70
Dense Linear Algebraic Equations	70 (Real) 30 (Complex)
Eigensystems Analysis and Singular Value Analysis	24

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]