At run time, you can encounter a number of different types of errors that are specifically related to the use of the Parallel ESSL subroutines:
This section explains what causes these errors, what happens when they occur (all are terminating, except computational errors), and what you can do to fix them.
This section also explains what to do when you receive informational and attention messages (600-699).
The program exceptions you can encounter in Parallel ESSL are described in the RS/6000 architecture manuals. For details, see:
This section describes how Parallel ESSL implements input-argument error checking when error synchronization is enabled. For more information on the PESSL_ERROR_SYNC environment variable, which allows you to enable or disable error synchronization, see "PESSL_ERROR_SYNC Environment Variable".
Two types of input-argument error checking may be performed:
When all the input arguments in one stage are valid, Parallel ESSL checks the validity of the input arguments in the next stage, and so on. (The errors and stages that can occur for each subroutine are listed in its "Error Conditions" section, in Parts 2 and 3 of this book.)
When a message passing input argument is invalid on every participating process in the parallel environment, a single comprehensive error message is issued, rather than one for each process. (This is indicated in the error message by Process(-1,-1).) Otherwise, an error message is issued from each process where the argument is not valid. When an HPF input argument is not valid, an error message is issued from one or more processes.
Parallel ESSL then terminates your program on all processes, and any arguments in the stages that follow are not checked. When this occurs, you should use standard programming techniques to diagnose and fix the errors.
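The staged checking described above can be modeled as follows. This is a minimal Python sketch, not Parallel ESSL code: the function name check_in_stages, the argument names (m, n, lda), and the individual checks are hypothetical, and the sketch assumes that every check in a stage is evaluated before the stage is declared failed.

```python
def check_in_stages(stages, args):
    """Run validation stages in order and stop at the first stage with errors.

    `stages` is a list of stages; each stage is a list of
    (message, predicate) pairs. All checks in a stage are evaluated, so the
    errors of that stage are reported together. Later stages are skipped.
    Returns (failed_stage_number, messages), or (None, []) if all pass.
    """
    for number, checks in enumerate(stages, start=1):
        errors = [message for message, is_valid in checks if not is_valid(args)]
        if errors:
            return number, errors
    return None, []


# Hypothetical stages for a matrix routine (illustrative checks only).
stages = [
    [("m must be non-negative", lambda a: a["m"] >= 0),
     ("n must be non-negative", lambda a: a["n"] >= 0)],
    [("lda must be at least max(1, m)", lambda a: a["lda"] >= max(1, a["m"]))],
]

# Both stage 1 checks fail, so stage 2 (lda) is never examined.
failed_stage, errors = check_in_stages(stages, {"m": -1, "n": -2, "lda": 0})
```

In this model, fixing the stage 1 arguments and rerunning may expose further errors in stage 2, which is why a program can need more than one correction pass.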
If the value of the global scalar argument on every process other than P00 differs from the value of the argument at process P00, a single error message is issued. (This is indicated in the error message by Process(-1,-1).) Otherwise, an error message is issued from each process where the discrepancy occurred. Parallel ESSL then terminates your program on all processes, and you should use standard programming techniques to diagnose and fix the errors.
For all other Parallel ESSL subroutines, the global scalar arguments are not checked to ensure they are the same for all processes.
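The reporting rule for mismatched global scalar arguments can be sketched as below. This is an illustrative Python model only: the function name scalar_mismatch_reports and the dictionary keyed by (row, column) process coordinates are assumptions made for the example, not part of Parallel ESSL.

```python
def scalar_mismatch_reports(values_by_process):
    """Return the process coordinates that would appear in error messages.

    `values_by_process` maps (row, col) process-grid coordinates to the value
    each process passed for a global scalar argument. Process (0, 0), that is
    P00, is the reference.
    """
    reference = values_by_process[(0, 0)]
    others = [p for p in values_by_process if p != (0, 0)]
    mismatched = sorted(p for p in others if values_by_process[p] != reference)
    if not mismatched:
        return []                # all processes agree with P00
    if len(mismatched) == len(others):
        return [(-1, -1)]        # every other process disagrees: one message
    return mismatched            # one message per disagreeing process


# 2 x 2 process grid where only process (0, 1) passed a different value.
reports = scalar_mismatch_reports({(0, 0): 100, (0, 1): 99, (1, 0): 100, (1, 1): 100})
```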
How This Differs from ESSL for AIX:
The capabilities of ERRSET, ERRSAV, and ERRSTR, supported in ESSL for AIX, are not provided in Parallel ESSL.
Using the capabilities of ERRSET, ERRSAV, and ERRSTR with your ESSL for AIX subroutines does not affect the Parallel ESSL subroutines.
For the Fourier transform subroutines, an invalid transform length is not a recoverable error in Parallel ESSL, as it is in ESSL for AIX. Parallel ESSL checks the validity of the transform length you provide to the Fourier transform subroutine. If it is not an acceptable value, a Parallel ESSL input-argument error message is issued, containing the next larger acceptable transform length. See the appropriate subroutine description for additional constraints on valid transform lengths. Your program is then terminated on all processes. Correct the value and rerun your program.
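Because an invalid length terminates the program, it can help to pick an acceptable length before making the call. The Python sketch below assumes the acceptable lengths have the form n = (2**h)(3**i)(5**j)(7**k)(11**m) with h = 1 to 25, i = 0 to 2, j, k, m = 0 or 1, and n no greater than 37748736. That formula is an assumption made for this example; confirm it, and any additional constraints, against the description of the subroutine you call. The function names are hypothetical.

```python
from bisect import bisect_left
from itertools import product

# ASSUMPTION: upper bound and factor ranges as stated in the lead-in above.
MAX_LEN = 37748736


def acceptable_lengths(max_len=MAX_LEN):
    """Enumerate every length allowed by the assumed formula, in order."""
    lengths = set()
    for h, i, j, k, m in product(range(1, 26), range(3), range(2), range(2), range(2)):
        n = (2 ** h) * (3 ** i) * (5 ** j) * (7 ** k) * (11 ** m)
        if n <= max_len:
            lengths.add(n)
    return sorted(lengths)


_LENGTHS = acceptable_lengths()


def next_acceptable_length(n):
    """Return the smallest acceptable length >= n, or None if none exists."""
    idx = bisect_left(_LENGTHS, n)
    return _LENGTHS[idx] if idx < len(_LENGTHS) else None


padded = next_acceptable_length(1000)  # 1000 = 2**3 * 5**3 is not acceptable
```

Under the assumed formula, 1000 is rejected because it needs three factors of 5, and the next acceptable length is 1008 (2**4 * 3**2 * 7).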
Parallel ESSL computational errors are errors that occur in the computational data, such as in your vectors and matrices, during a computation (for example, the detection of a singular system during a factorization). The computational errors that can occur for each subroutine are listed under "Computational Errors". When a computational error occurs, Parallel ESSL issues an error message containing information key to the diagnosis of the error, such as the location in the input matrix where the singularity occurred. Any subroutine that issues a computational error has an info argument in its calling sequence. For all the Parallel ESSL subroutines except the tridiagonal subroutines, info is a global argument containing fullword integers. For the tridiagonal subroutines, info is a local argument containing fullword integers.
For message passing programs, when a computational error occurs, your program continues to execute. After each call where a computational error can occur, you should check the info output argument to see if an error occurred and take the appropriate action. When a computational error occurs, you should assume that the results are unpredictable. The result of the computation is valid only if no errors have occurred.
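The check-after-every-call pattern looks like this in outline. The Python sketch below uses a small stand-in factorization, lu_factor_in_place, rather than a Parallel ESSL subroutine. The convention that info is 0 on success and otherwise the 1-based position of the zero pivot is an assumption modeled on common LAPACK-style usage; see each subroutine's "Computational Errors" section for the actual meaning of info.

```python
def lu_factor_in_place(a):
    """Stand-in LU factorization without pivoting (not a Parallel ESSL call).

    Overwrites `a` (a list of rows of floats) with its L and U factors.
    Returns info: 0 on success, or the 1-based index of the first zero pivot.
    """
    n = len(a)
    for k in range(n):
        if a[k][k] == 0.0:
            return k + 1  # singular: stop and report where
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return 0


def factor_or_report(a):
    """Call the factorization, then inspect info before using the result."""
    info = lu_factor_in_place(a)
    if info != 0:
        # The contents of `a` must be treated as unpredictable here.
        return f"singular matrix: zero pivot at diagonal element {info}"
    return "ok"


status = factor_or_report([[2.0, 1.0], [4.0, 2.0]])  # second row = 2 * first row
```

The point of the wrapper is that the caller never reads the factored matrix without first testing info, because execution continues after a computational error.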
For HPF programs, when a computational error occurs and the info argument is present, your program continues to execute. After each call where a computational error can occur, you should check the info output argument to see if an error occurred and take the appropriate action. When a computational error occurs, you should assume that the results are unpredictable. The result of the computation is valid only if no errors have occurred. If the info argument is not present and a computational error occurs, Parallel ESSL issues an additional error message containing the value of info and the application program is terminated.
How This Differs from ESSL for AIX:
The way you handle computational errors for Parallel ESSL differs from how you handle them for ESSL for AIX. This is because the capabilities of ERRSET, ERRSAV, and ERRSTR, supported in ESSL for AIX for recoverable computational errors, are not supported in Parallel ESSL. This results in the following differences:
For HPF programs, if you choose not to specify the optional info argument and a computational error occurs, Parallel ESSL issues a computational error message and the application program is terminated.
Using the capabilities of ERRSET, ERRSAV, and ERRSTR with your ESSL for AIX subroutines does not affect the Parallel ESSL subroutines.
A resource error occurs when a buffer storage allocation request fails in a Parallel ESSL subroutine. In general, the Parallel ESSL subroutines allocate internal auxiliary storage dynamically as needed. Without sufficient storage, the subroutine cannot complete the computation.
When a buffer storage allocation request fails, a resource error message is issued, and the application program is terminated. You need to reduce the memory constraint on the system or increase the amount of memory available before rerunning the application program.
Ways to Reduce Memory Constraints:
The following ways may reduce memory constraints:
Communication errors are errors that occur when Parallel ESSL encounters problems in communicating between processes (sending and receiving data or synchronizing operations). When a communication error occurs, at least one communication message is issued and the application program is terminated. This is because communication errors usually indicate a serious problem from which it is not feasible to continue.
Be aware that, due to the nature of communication errors, some error messages, including communication error messages from various processes, may be lost.
When you receive an informational or attention message, check your application to determine why the condition was detected. You may decide to change your application so you do not receive the message. For example, if your application called a BLACS routine to send data from one process to the same process, you would receive an attention message.
Parallel ESSL does not terminate your application program, but performance may be degraded.
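One way to avoid the attention message in the BLACS example above is to separate same-process transfers from genuine sends before communicating. The Python sketch below only models that bookkeeping: split_transfers and the (source, destination, payload) tuples are hypothetical, and no BLACS routine is called.

```python
def split_transfers(transfers):
    """Separate planned transfers into remote sends and local copies.

    `transfers` is a list of (source, destination, payload) tuples, where
    source and destination are process identifiers. A transfer whose source
    and destination match should be handled as a local copy instead of a send.
    """
    remote = [t for t in transfers if t[0] != t[1]]
    local = [t for t in transfers if t[0] == t[1]]
    return remote, local


planned = [((0, 0), (0, 1), "block A"), ((0, 1), (0, 1), "block B")]
remote, local = split_transfers(planned)
```

Only the entries in remote would then be passed to the send routine; the entries in local can be copied in place on the owning process.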
A miscellaneous error is an error that does not fall under any of the other categories. Miscellaneous errors are checked in stages along with input-argument errors.
If no errors are detected in the first stage, Parallel ESSL checks the next stage, and so on. (The errors and stages that can occur for each subroutine are listed in its "Error Conditions" section.)
When Parallel ESSL detects a miscellaneous error, you receive an error message with information on how to correct the problem, your application program is terminated, and any arguments in the stages that follow are not checked.
For problems relating directly to ESSL for AIX, see the ESSL Version 3 Guide and Reference manual. If the ESSL for AIX error resulted from a Parallel ESSL subroutine, see "Getting Help from IBM Support" to find out how to report the problem.
If you receive an MPI error message while calling a BLACS routine, the cause is most likely one of the following: