This chapter explains the Parallel ESSL-specific procedures to follow when coding and running your program.
Performance has been the primary objective in the design of the Parallel ESSL subroutines. To achieve this goal, the Parallel ESSL subroutines use "state-of-the-art" algorithms tailored to specific operational characteristics of the hardware. In addition, Parallel ESSL leverages the high performance provided by ESSL for AIX for the computations performed on each processor.
XL HPF allows you to develop parallel software easily using the SPMD programming model. Guided by HPF directives in your source code, the XL HPF compiler handles the distribution of data and the communication among the processes. The HPF directives make it easier to develop an HPF program that calls Parallel ESSL than a message passing program that calls Parallel ESSL. However, the performance obtained with a Parallel ESSL HPF subroutine is lower than that obtained with the corresponding Parallel ESSL message passing subroutine, because of the overhead involved in supporting the extrinsic hpf_local interface.
Because the XL HPF compiler only supports CYCLIC(N) in the interface blocks for extrinsic hpf_local subroutines, a redistribution of data occurs whenever a Level 3 PBLAS, Dense Linear Algebraic Equations, Eigensystems Analysis, or Singular Value Analysis subroutine is called. Data may also be copied locally, because the extrinsic hpf_local subroutines require assumed-shape arrays while the Parallel ESSL message passing subroutines use assumed-size arrays.
The following techniques are used by most subroutines to optimize performance:
The following items also impact performance. They generally depend on the specific parallel routine being called. See the subroutine description in the reference section for any exceptions to these rules.
Choosing the number of processes depends primarily on the problem size. It is reasonable to increase the number of processes if the global problem size increases enough to keep the amount of local data per process at a reasonable size. If, however, using more processes (for example, 17 rather than 16) forces you to use a one-dimensional grid rather than a two-dimensional grid, performance may be degraded. See the next item.
For most subroutines, a two-dimensional process grid (square, or as close to square as possible) is suggested. For example, with sixteen processes, define a 4 by 4 process grid, as shown in the sketch below. For exceptions to this rule, see the subroutine descriptions in the reference section.
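As an illustration, the following is a minimal Fortran sketch of setting up such a grid with the BLACS routines provided with Parallel ESSL. The row-major ordering and the 4 by 4 shape are assumptions of the example, and error handling and the BLACS exit calls are omitted.

```fortran
! Minimal sketch: set up a 4 by 4 process grid for 16 processes.
      INTEGER :: ICTXT, NPROW, NPCOL, MYROW, MYCOL
      NPROW = 4
      NPCOL = 4
! Obtain a default system context from the BLACS.
      CALL BLACS_GET(0, 0, ICTXT)
! Define the two-dimensional process grid (row-major ordering assumed).
      CALL BLACS_GRIDINIT(ICTXT, 'R', NPROW, NPCOL)
! Find out where this process sits in the grid.
      CALL BLACS_GRIDINFO(ICTXT, NPROW, NPCOL, MYROW, MYCOL)
```

The context ICTXT returned here is the value you place in the descriptor vectors passed to the Parallel ESSL message passing subroutines.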
See the following table for suggested block sizes in your message passing program. The optimal block size depends on the underlying node computations, load balancing, communication, system buffering requirements, problem size, and the dimension and shape of the process grid. Achieving optimal performance generally requires experimentation, but the values specified in Table 29 should provide good performance in most cases. For exceptions to these rules, see the subroutine descriptions in the reference section.
Table 29. Suggested Block Sizes
| Area | POWER Nodes | POWER2 Nodes |
|---|---|---|
| Level 2 PBLAS | 24 | 24 (all subroutines except PDTRSV); 64 (PDTRSV) |
| Level 3 PBLAS | 40 | 70 |
| Dense Linear Algebraic Equations | 40 | 70 (real subroutines); 30 (complex subroutines) |
| Eigensystems Analysis and Singular Value Analysis | 24 | 24 |
| Random Number Generation | 0.5 × (data cache size) | 0.5 × (data cache size) |
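To show where a block size from Table 29 enters a message passing program, here is a minimal sketch that builds descriptor vectors with row and column block sizes of 70 (the Level 3 PBLAS suggestion for POWER2 nodes) and calls PDGEMM. The matrix size of 2000, the 4 by 4 grid, and the use of the NUMROC utility to size the local arrays are assumptions of the sketch; see the PDGEMM description in the reference section for the complete calling sequence.

```fortran
! Minimal sketch, assuming a 4 by 4 process grid, 2000 x 2000 matrices,
! and the POWER2 block size of 70 from Table 29.  Error handling and the
! BLACS exit calls are omitted.
      PROGRAM BLOCK_SIZE_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 2000, NB = 70
      INTEGER            :: ICTXT, NPROW, NPCOL, MYROW, MYCOL
      INTEGER            :: LOCP, LOCQ
      INTEGER            :: DESCA(9), DESCB(9), DESCC(9)
      INTEGER, EXTERNAL  :: NUMROC
      DOUBLE PRECISION, ALLOCATABLE :: A(:,:), B(:,:), C(:,:)

! Set up the 4 by 4 process grid as in the previous sketch.
      CALL BLACS_GET(0, 0, ICTXT)
      CALL BLACS_GRIDINIT(ICTXT, 'R', 4, 4)
      CALL BLACS_GRIDINFO(ICTXT, NPROW, NPCOL, MYROW, MYCOL)

! Local rows and columns owned by this process; the use of NUMROC to
! size the local arrays is an assumption of this sketch.
      LOCP = NUMROC(N, NB, MYROW, 0, NPROW)
      LOCQ = NUMROC(N, NB, MYCOL, 0, NPCOL)
      ALLOCATE(A(LOCP,LOCQ), B(LOCP,LOCQ), C(LOCP,LOCQ))

! Descriptor vector: type, context, global rows and columns, row and
! column block sizes (the tuning value from Table 29), source process
! row and column, and local leading dimension.
      DESCA = (/ 1, ICTXT, N, N, NB, NB, 0, 0, MAX(1, LOCP) /)
      DESCB = DESCA
      DESCC = DESCA

! ... fill A and B with this process's pieces of the global matrices ...

! C <- A*B with the Level 3 PBLAS matrix multiply.
      CALL PDGEMM('N', 'N', N, N, N, 1.0D0, A, 1, 1, DESCA, &
                  B, 1, 1, DESCB, 0.0D0, C, 1, 1, DESCC)
      END PROGRAM BLOCK_SIZE_SKETCH
```

Changing NB is all that is needed to experiment with other block sizes, since the block size appears only in the descriptor vectors and in the local array sizing.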
For all HPF subroutines, except GEBRD, data directives are included in the interface module PESSL_HPF; therefore, you can specify any data distribution for your vectors, matrices, and sequences, because the XL HPF compiler will, if necessary, redistribute the data prior to calling the HPF Parallel ESSL subroutine. Data directives for GEBRD cannot be included in the PESSL_HPF module, because the alignment requirements for some of the vectors depend on the size of the matrix. For details, see GEBRD--Reduce a General Matrix to Bidiagonal Form.
When using cyclic distribution in your HPF program, you can only specify CYCLIC(1) data distributions. However, the performance of the Level 3 PBLAS, Dense Linear Algebraic Equations, and Eigensystems Analysis and Singular Value Analysis subroutines is improved if a CYCLIC(N) data distribution is used. To accomplish this, PESSL_HPF contains CYCLIC(N) data directives for a two-dimensional process grid in the interface for these subroutines. The block sizes specified in the PESSL_HPF module are listed in Table 30.
There are CYCLIC(1) directives for a two-dimensional process grid in the interfaces in PESSL_HPF for the Level 2 PBLAS.
There are BLOCK data directives for a one-dimensional process grid in the interfaces in PESSL_HPF for the Banded Linear Algebraic Equations, Fourier transform and Random Number Generation subroutines.
For more information about the PESSL_HPF module, see "Using Extrinsic Procedures--The Parallel ESSL Subroutines".
Table 30. Block Sizes Specified in the PESSL_HPF Module
| Area | Block Size |
|---|---|
| Level 2 PBLAS | 1 |
| Level 3 PBLAS | 70 |
| Dense Linear Algebraic Equations | 70 (real); 30 (complex) |
| Eigensystems Analysis and Singular Value Analysis | 24 |
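To benefit from these CYCLIC(N) interfaces without forcing a redistribution at the call, an HPF program can declare its matrices with the matching distribution. The following is a minimal sketch: it distributes three matrices CYCLIC(70) over a 4 by 4 processor arrangement, matching the Level 3 PBLAS block size in Table 30. The matrix size and the processor arrangement are assumptions of the example, and the subroutine call itself is left as a comment because the HPF calling sequences are given in the reference section.

```fortran
! Minimal HPF sketch, assuming a 4 by 4 processor arrangement and
! 2000 x 2000 matrices; the distribution directives are the point here.
      PROGRAM HPF_DISTRIBUTION_SKETCH
      USE PESSL_HPF
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 2000
      DOUBLE PRECISION   :: A(N,N), B(N,N), C(N,N)
!HPF$ PROCESSORS P(4,4)
!HPF$ DISTRIBUTE (CYCLIC(70), CYCLIC(70)) ONTO P :: A, B, C

! ... initialize A and B ...

! Call the Level 3 PBLAS HPF subroutine here; see the reference section
! for its name and calling sequence.
      END PROGRAM HPF_DISTRIBUTION_SKETCH
```

Because the declared distribution already matches the CYCLIC(N) directives in the PESSL_HPF interfaces, the compiler does not need to redistribute the data before the call.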