XL Fortran for AIX 8.1

User's Guide

Optimizing Loops and Array Language

The -qhot option performs the following transformations to improve the performance of loops, array language, and memory management:

Scalar replacement, loop blocking, distribution, fusion, interchange, reversal, skewing, and unrolling
Reducing generation of temporary arrays

The -qhot option requires an optimization level of at least -O2. Specifying the -C option inhibits it.

If you have SMP hardware, you can enable automatic parallelization of loops by specifying the -qsmp option. This optimization applies to explicitly coded DO loops as well as to DO loops that the compiler generates for array language (WHERE, FORALL, array assignment, and so on). The compiler can parallelize only loops that are independent, that is, loops in which each iteration can be computed independently of any other iteration. The compiler does not automatically parallelize loops that contain I/O, because doing so could lead to unexpected results. In this case, you can use the PARALLEL DO directive to advise the compiler that such a loop can be safely parallelized. However, the type of I/O must be one of the following:

Direct-access I/O where each iteration writes to or reads from a different record
Sequential I/O where each iteration writes to or reads from a different unit

For more details, refer to the description of the PARALLEL DO directive in the XL Fortran for AIX Language Reference.
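The direct-access case can be sketched as follows. This is a minimal illustration, not taken from the Language Reference: the module name record_io, the routine name write_and_read, unit number 20, and the !$OMP spelling of the directive are all assumptions. When the code is compiled without -qsmp, the directive line is treated as a comment and the loop runs serially.

```fortran
module record_io
  implicit none
contains
  ! Write i*i to record i of a direct-access scratch file, then read the
  ! records back. Each iteration of the first loop touches a different record.
  subroutine write_and_read(n, squares)
    integer, intent(in)  :: n
    integer, intent(out) :: squares(n)
    integer :: i, reclen

    inquire(iolength=reclen) n          ! record length for one default integer
    open(unit=20, status='scratch', access='direct', &
         form='unformatted', recl=reclen)

    ! Assumed directive spelling; check the Language Reference for exact syntax.
    !$OMP PARALLEL DO
    do i = 1, n
      write(20, rec=i) i*i              ! iteration i writes only record i
    end do

    do i = 1, n
      read(20, rec=i) squares(i)
    end do
    close(20)
  end subroutine write_and_read
end module record_io
```

Because no two iterations write the same record, the order in which the iterations execute does not change the contents of the file.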

You can use the -qhot and -qsmp options on:

Programs with performance bottlenecks that are caused by loops and structured memory accesses
Programs that contain significant amounts of array language (which can be optimized in the same ways as FORTRAN 77 loops for array operations)
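To illustrate how array language corresponds to FORTRAN 77 loops, the following sketch writes the same masked operation both ways. The module and routine names are hypothetical, and the example only shows that the two forms compute the same result; it does not show what the compiler actually generates.

```fortran
module array_language
  implicit none
contains
  ! Array-language form: masked assignment over whole arrays.
  subroutine safe_recip_where(x, y)
    real, intent(in)  :: x(:)
    real, intent(out) :: y(:)
    y = 0.0
    where (x /= 0.0) y = 1.0 / x
  end subroutine safe_recip_where

  ! FORTRAN 77-style form: an explicit DO loop with the same effect.
  subroutine safe_recip_loop(n, x, y)
    integer, intent(in)  :: n
    real,    intent(in)  :: x(n)
    real,    intent(out) :: y(n)
    integer :: i
    do i = 1, n
      if (x(i) /= 0.0) then
        y(i) = 1.0 / x(i)
      else
        y(i) = 0.0
      end if
    end do
  end subroutine safe_recip_loop
end module array_language
```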

Related Information:

See the following sections:

Unrolling Loops

Loop unrolling involves expanding the loop body to do the work of two, three, or more iterations, and reducing the iteration count proportionately. Benefits of loop unrolling for programs compiled for the POWER, POWER2, and PowerPC architectures include the following:

Data dependence delays may be reduced or eliminated
Loads and stores may be eliminated in successive loop iterations
Loop overhead may be reduced

Loop unrolling also increases the code size of the new loop body, which can increase the number of registers that must be allocated and possibly cause register spilling. For this reason, unrolling does not always improve performance.
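As an illustration of the transformation itself, the following sketch unrolls a simple loop by hand to a depth of four. The module and routine names are hypothetical, and the code the compiler generates may differ; the point is only that the unrolled loop does four iterations' worth of work per trip and needs a cleanup loop when the iteration count is not a multiple of four.

```fortran
module unroll_demo
  implicit none
contains
  ! Original loop: one element per iteration.
  subroutine scale_rolled(n, a, x, y)
    integer, intent(in)  :: n
    real,    intent(in)  :: a, x(n)
    real,    intent(out) :: y(n)
    integer :: i
    do i = 1, n
      y(i) = a * x(i)
    end do
  end subroutine scale_rolled

  ! Unrolled by hand to a depth of 4, with a cleanup loop for leftovers.
  subroutine scale_unrolled(n, a, x, y)
    integer, intent(in)  :: n
    real,    intent(in)  :: a, x(n)
    real,    intent(out) :: y(n)
    integer :: i, main_end
    main_end = n - mod(n, 4)            ! last index handled by the main loop
    do i = 1, main_end, 4
      y(i)     = a * x(i)
      y(i + 1) = a * x(i + 1)
      y(i + 2) = a * x(i + 2)
      y(i + 3) = a * x(i + 3)
    end do
    do i = main_end + 1, n              ! remaining 0 to 3 elements
      y(i) = a * x(i)
    end do
  end subroutine scale_unrolled
end module unroll_demo
```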

Related Information: See -qunroll Option.

Efficiency of Different Array Forms

In general, operations on arrays with constant or adjustable bounds, assumed-size arrays, and pointee arrays require less processing than those on automatic, assumed-shape, or deferred-shape arrays and are thus likely to be faster.
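For reference, the following sketch declares one array of most of the forms named above (pointee arrays are omitted). The module, routine, and variable names are hypothetical; the comments identify which form each declaration represents.

```fortran
module array_kinds
  implicit none
contains
  subroutine show_forms(n, adj, asz, shp, total)
    integer, intent(in)  :: n
    real,    intent(in)  :: adj(n)      ! adjustable bounds (dummy argument)
    real,    intent(in)  :: asz(*)      ! assumed-size
    real,    intent(in)  :: shp(:)      ! assumed-shape
    real,    intent(out) :: total
    real                 :: const(4)    ! constant bounds
    real                 :: auto(n)     ! automatic (local, sized by n)
    real, allocatable    :: defer(:)    ! deferred-shape

    const = 1.0
    auto  = adj
    allocate(defer(n))
    defer = shp
    total = sum(const) + sum(auto) + sum(asz(1:n)) + sum(defer)
    deallocate(defer)
  end subroutine show_forms
end module array_kinds
```

The first three dummy arguments can all be associated with the same actual array, which makes it straightforward to time the same operation on different forms.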

Reducing Use of Temporary Arrays

If your program uses array language but never performs array assignments where the array on the left-hand side of the expression overlaps the array on the right-hand side, specifying the option -qalias=noaryovrlp can improve performance by reducing the use of temporary array objects.

The -qhot option also eliminates many temporary arrays.
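The following sketch shows why an overlapping array assignment can call for a temporary. Fortran requires the right-hand side to be evaluated completely before anything is stored, so a naive element-by-element forward loop gives a different answer. The module and routine names are hypothetical, and a temporary is only one way a compiler might preserve the required result.

```fortran
module overlap_demo
  implicit none
contains
  ! Overlapping array assignment: the right-hand side is fully evaluated
  ! before any element of the left-hand side is stored.
  subroutine shift_array_language(a)
    real, intent(inout) :: a(:)
    integer :: n
    n = size(a)
    a(2:n) = a(1:n-1)
  end subroutine shift_array_language

  ! Naive forward loop: NOT equivalent, because a(i-1) has already been
  ! overwritten by the time it is read.
  subroutine shift_forward_loop(a)
    real, intent(inout) :: a(:)
    integer :: i
    do i = 2, size(a)
      a(i) = a(i-1)
    end do
  end subroutine shift_forward_loop
end module overlap_demo
```

Starting from (1, 2, 3, 4), the array assignment yields (1, 1, 2, 3), whereas the forward loop yields (1, 1, 1, 1). A program that contains an assignment like the first one does not meet the condition for -qalias=noaryovrlp.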

Cost Model for Loop Transformations

The loop transformations performed by the -qhot option are controlled by a set of assumptions about the characteristics of typical loops and the costs (in terms of registers used and potential delays introduced) of performing particular transformations.

The cost model takes into consideration:

The number of available registers and functional units that the processor has
The configuration of cache memory in the system
The number of iterations of each loop
The need to make conservative assumptions to ensure correct results

When the compiler can determine information precisely, such as the number of iterations of a loop, it uses this information to improve the accuracy of the cost model at that location in the program. If it cannot determine the information, the compiler relies on the default assumptions of the cost model. You can change these default assumptions, and thus influence how the compiler optimizes loops, by specifying compiler options:

-qassert=nodeps asserts that none of the loops in the files being compiled have dependencies that extend from one iteration to any other iteration within the same loop. This is known as a loop-carried dependency. If you can assert that the computations performed during iteration n do not require results that are computed during any other iteration, the compiler is better able to rearrange the loops for efficiency.
-qassert=itercnt=n asserts that a "typical" loop in the files that you are compiling will iterate approximately n times. If this is not specified, the assumption is that loops iterate approximately 1024 times. The compiler uses this information to assist in transformations such as putting a high-iteration loop inside a low-iteration one.
It is not crucial to get the value exactly right, and the value does not have to be accurate for every loop in the file. This value is not used if either of the following conditions is true:
- The compiler can determine the exact iteration count.
- You specified the ASSERT(ITERCNT(n)) directive.
Some of the loop transformations only speed up loops that iterate many times. For programs with many such loops or for programs whose hotspots and bottlenecks are high-iteration loops, specify a large value for n.
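As a sketch of a loop where such assertions matter, consider an indirectly indexed update. The compiler generally cannot prove that idx contains no repeated values, so it must assume a possible loop-carried dependency. The module and routine names are hypothetical, and the !IBM* ASSERT spelling and the value 100 are assumptions to be checked against the Language Reference; other compilers treat the directive line as a comment.

```fortran
module assert_demo
  implicit none
contains
  ! Add b(i) into a(idx(i)). The NODEPS assertion is valid only if the
  ! caller guarantees that idx contains no duplicate values.
  subroutine scatter_add(n, idx, b, a)
    integer, intent(in)    :: n, idx(n)
    real,    intent(in)    :: b(n)
    real,    intent(inout) :: a(:)
    integer :: i
    ! Assumed directive spelling; applies to the DO loop that follows.
    !IBM* ASSERT (NODEPS, ITERCNT(100))
    do i = 1, n
      a(idx(i)) = a(idx(i)) + b(i)
    end do
  end subroutine scatter_add
end module assert_demo
```

A directive of this kind applies to a single loop, whereas the -qassert options apply to every loop in the files being compiled.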

A program might contain a variety of loops, some of which are sped up by these options while others are unaffected or even slowed down. Therefore, you might want to determine which loops benefit most from which options, split some loops into different files, and compile each file with the set of options that suits it best.

Describing the Hardware Configuration

The -qtune setting determines the default assumptions about the number of registers and functional units in the processor. For example, when tuning loops, -qtune=pwr2 may cause the compiler to unroll most of the inner loops to a depth of two to take advantage of the extra arithmetic units.

The -qcache setting determines the blocking factor that the compiler uses when it blocks loops. The more cache memory that is available, the larger the blocking factor.
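To show what a blocking factor means, the following sketch blocks a matrix transpose by hand. The module name, routine name, and the bs argument (the blocking factor) are hypothetical; with -qhot the compiler chooses its own factor, and the code it generates may be organized differently.

```fortran
module blocking_demo
  implicit none
contains
  ! Transpose a into b in bs-by-bs tiles, so that the parts of a and b
  ! being worked on are more likely to remain in cache.
  subroutine transpose_blocked(n, bs, a, b)
    integer, intent(in)  :: n, bs
    real,    intent(in)  :: a(n, n)
    real,    intent(out) :: b(n, n)
    integer :: i, j, ii, jj
    do jj = 1, n, bs                        ! step over tiles
      do ii = 1, n, bs
        do j = jj, min(jj + bs - 1, n)      ! work within one tile
          do i = ii, min(ii + bs - 1, n)
            b(j, i) = a(i, j)
          end do
        end do
      end do
    end do
  end subroutine transpose_blocked
end module blocking_demo
```

The min calls handle the case where n is not a multiple of bs.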

Array Padding

Because of the implementation of the POWER, POWER2, POWER3, POWER4, and PowerPC cache architecture, array dimensions that are powers of 2 can lead to decreased cache utilization.

The optional arraypad suboption of the -qhot option permits the compiler to increase the dimensions of arrays where doing so might improve the efficiency of array-processing loops. If you have large arrays with some dimensions (particularly the first one) that are powers of 2 or if you find that your array-processing programs are slowed down by cache misses or page faults, consider specifying -qhot=arraypad or -qhot=arraypad=n rather than just -qhot.

The padding that -qhot=arraypad performs is conservative. It also assumes that there are no cases in the source code (such as those created by an EQUIVALENCE statement) where storage elements have a relationship that is broken by padding. You can also manually pad array dimensions if you determine that doing so does not affect the program's results.
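Manual padding can be sketched as follows. The module name, the array name grid, and the padding amount of one element are hypothetical choices for illustration; whether padding helps, and by how much, depends on the cache configuration.

```fortran
module padding_demo
  implicit none
  integer, parameter :: n = 1024        ! logical extent, a power of 2
  integer, parameter :: pad = 1         ! hypothetical padding amount
  real :: grid(n + pad, n)              ! first dimension padded to 1025
contains
  ! Sum along row i. Consecutive accesses are n + pad elements apart in
  ! storage rather than exactly n, so the stride is no longer a power of 2.
  function row_sum(i) result(s)
    integer, intent(in) :: i
    real :: s
    integer :: j
    s = 0.0
    do j = 1, n
      s = s + grid(i, j)
    end do
  end function row_sum
end module padding_demo
```

Only elements grid(1:n, :) carry data; the extra row exists solely to change the storage layout, so the program's results are unaffected as long as no code depends on the original layout.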

The additional storage taken up by the padding, especially for arrays with many dimensions, might increase the storage overhead of the program to the point where it slows down again or even runs out of storage. For more information, see -qhot Option.
