Purpose
The SCHEDULE directive allows the user to specify the chunking method for parallelization. Work is assigned to threads in different manners depending on the scheduling type or chunk size used.
The SCHEDULE directive only takes effect if you specify the -qsmp compiler option.
Format
>>-SCHEDULE--(--sched_type--+------+--)------------------------><
                            '-,--n-'

where sched_type is one of AFFINITY, DYNAMIC, GUIDED, RUNTIME, or STATIC, as described below, and the optional n specifies a chunk size.
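For illustration, a minimal sketch of where the directive might be placed. The subroutine and variable names are hypothetical, and the !SMP$ trigger and the PARALLEL DO directive are assumed to be the usual XL Fortran SMP forms:

      SUBROUTINE SCALEX(N, A, X)
      INTEGER N, I
      REAL A, X(N)
!     The SCHEDULE directive appears in the specification part of the
!     scoping unit and selects the chunking method for the parallelized
!     loops in this subprogram.
!SMP$ SCHEDULE(DYNAMIC, 100)
!SMP$ PARALLEL DO
      DO I = 1, N
         X(I) = A * X(I)
      END DO
      END SUBROUTINE SCALEX

A file containing this subprogram would be compiled with the -qsmp option so that the SCHEDULE directive takes effect.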
AFFINITY (n)
The iterations of the loop are initially divided into number_of_threads partitions, each containing
CEILING(number_of_iterations / number_of_threads)
iterations. Each partition is initially assigned to a thread, and is then further subdivided into chunks containing n iterations, if n has been specified. If n has not been specified, then the chunks consist of
CEILING(number_of_iterations_remaining_in_partition / 2)
loop iterations.
When a thread becomes free, it takes the next chunk from its initially assigned partition. If there are no more chunks in that partition, then the thread takes the next available chunk from a partition that is initially assigned to another thread.
Threads that are active will complete the work in a partition that is initially assigned to a sleeping thread.
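For illustration, a minimal sketch (a hypothetical program, with counts chosen to match Example 2 below) of how the subdivision rule above produces the shrinking chunks within one partition when n is not specified:

      PROGRAM AFFINITY_CHUNKS
!     Sketch only: reproduces the per-partition chunk subdivision
!     described above for 100 iterations and 4 threads, with no chunk
!     size n specified.
      INTEGER, PARAMETER :: NITER = 100, NTHREADS = 4
      INTEGER :: PSIZE, REMAINING, CHUNK
!     Partition size = CEILING(NITER / NTHREADS)
      PSIZE = (NITER + NTHREADS - 1) / NTHREADS
      REMAINING = PSIZE
      DO WHILE (REMAINING > 0)
!        Chunk size = CEILING(REMAINING / 2)
         CHUNK = (REMAINING + 1) / 2
         PRINT *, 'chunk of', CHUNK, 'iterations'
         REMAINING = REMAINING - CHUNK
      END DO
      END PROGRAM AFFINITY_CHUNKS

For 100 iterations and 4 threads this prints chunk sizes of 13, 6, 3, 2, and 1, which matches the subdivision of each partition shown in Example 2.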
DYNAMIC (n)
The iterations of the loop are divided into chunks containing n iterations each, if n has been specified. If n has not been specified, then the chunks consist of
CEILING(number_of_iterations / number_of_threads)
iterations.
Threads are assigned these chunks on a "first-come, first-do" basis. Chunks of the remaining work are assigned to available threads until all work has been assigned.
If a thread is asleep, its assigned work will be taken over by an active thread once that thread becomes available.
GUIDED (n)
The iterations of the loop are divided into progressively smaller chunks until a minimum chunk size of n loop iterations is reached. If n has not been specified, the default value of n is 1 iteration.
The size of the initial chunk is
CEILING(number_of_iterations / number_of_threads)
iterations. Subsequent chunks consist of
CEILING(number_of_iterations_remaining / number_of_threads)
iterations. As each thread finishes a chunk, it dynamically obtains the next available chunk.
You can use guided scheduling in a situation in which multiple threads in a team might arrive at a DO work-sharing construct at varying times, and each iteration requires roughly the same amount of work. For example, if you have a DO loop preceded by one or more work-sharing SECTIONS or DO constructs with NOWAIT clauses, you can guarantee that no thread waits at the barrier longer than it takes another thread to execute its final iteration, or final k iterations if a chunk size of k is specified. The GUIDED schedule requires the fewest synchronizations of all the scheduling methods.
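The scenario above can be sketched with the OpenMP clause form of guided scheduling (a hypothetical subprogram; this uses the SCHEDULE clause on an OMP DO construct rather than the SCHEDULE directive itself):

      SUBROUTINE PHASES(N, A, B)
      INTEGER N, I
      REAL A(N), B(N)
!$OMP PARALLEL
!     First work-sharing loop; NOWAIT lets each thread move on as soon
!     as its own share of the iterations is done.
!$OMP DO
      DO I = 1, N
         A(I) = SQRT(REAL(I))
      END DO
!$OMP END DO NOWAIT
!     Threads arrive at this loop at different times; the progressively
!     smaller guided chunks limit how long any thread waits at the
!     barrier that ends the loop.
!$OMP DO SCHEDULE(GUIDED)
      DO I = 1, N
         B(I) = B(I) + 1.0
      END DO
!$OMP END DO
!$OMP END PARALLEL
      END SUBROUTINE PHASES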
The expression for n is evaluated outside of the context of the DO construct. Any function reference in the expression for n must not have side effects.
The value of the n parameter on the SCHEDULE clause must be the same for all of the threads in the team.
RUNTIME
The scheduling type is determined at run time, when it can be specified using the environment variable XLSMPOPTS. If no scheduling type is specified using that variable, then the default scheduling type used is STATIC.
STATIC (n)
The iterations of the loop are divided into chunks containing n iterations each, if n has been specified, and the chunks are assigned to the threads in a "round-robin" fashion. This is known as block cyclic scheduling.
If n has not been specified, the chunks will contain
CEILING(number_of_iterations / number_of_threads)
iterations. Each thread is assigned one of these chunks. This is known as block scheduling.
If a thread is asleep and it has been assigned work, it will be awakened so that it may complete its work.
STATIC is the default scheduling type if the user has not specified any scheduling type at compile time or run time.
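As a worked illustration of the block cyclic assignment described above, a minimal sketch with hypothetical iteration and thread counts:

      PROGRAM STATIC_MAP
!     Sketch only: shows which thread owns each iteration when a chunk
!     size N is specified (block cyclic scheduling); the counts here
!     are hypothetical.
      INTEGER, PARAMETER :: NITER = 12, NTHREADS = 4, N = 2
      INTEGER :: I, CHUNK, THREAD
      DO I = 1, NITER
!        Chunks of N iterations are dealt to the threads round-robin.
         CHUNK = (I - 1) / N
         THREAD = MOD(CHUNK, NTHREADS)
         PRINT *, 'iteration', I, '-> thread', THREAD
      END DO
      END PROGRAM STATIC_MAP

With no chunk size specified, each thread would instead receive a single block of CEILING(12 / 4) = 3 consecutive iterations.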
Rules
The SCHEDULE directive must appear in the specification part of a scoping unit.
Only one SCHEDULE directive may appear in the specification part of a scoping unit.
The SCHEDULE directive applies to one of the following:
Any dummy arguments appearing or referenced in the specification expression for the chunk size n must also appear in the SUBROUTINE or FUNCTION statement and in all ENTRY statements appearing in the given subprogram, as illustrated by the sketch following these rules.
If the specified chunk size n is greater than the number of iterations, the loop will not be parallelized and will execute on a single thread.
If you specify more than one method of determining the chunking algorithm, the compiler will follow, in order of precedence:
1. The scheduling type specified by the SCHEDULE directive.
2. The scheduling type specified at compile time with the -qsmp compiler option.
3. The scheduling type specified at run time with the XLSMPOPTS environment variable.
4. The run-time default scheduling type, STATIC.
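A sketch of the dummy-argument rule mentioned above (the subprogram and argument names are hypothetical; the !SMP$ trigger and the PARALLEL DO directive are assumed to be the usual XL Fortran SMP forms):

      SUBROUTINE WORK(M, CHUNK, X)
      INTEGER M, CHUNK, I
      REAL X(M)
!     CHUNK is referenced in the chunk-size expression of the SCHEDULE
!     directive, so it must appear in the SUBROUTINE statement (and in
!     any ENTRY statement) of this subprogram.
!SMP$ SCHEDULE(DYNAMIC, CHUNK)
!SMP$ PARALLEL DO
      DO I = 1, M
         X(I) = SQRT(REAL(I))
      END DO
      END SUBROUTINE WORK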
Examples
Example 1. Given the following information:
number of iterations = 1000
number of threads = 4
and using the GUIDED scheduling type, the chunk sizes would be as follows:
250 188 141 106 79 59 45 33 25 19 14 11 8 6 4 3 3 2 1 1 1 1
The iterations would then be divided into the following chunks:
chunk 1 = iterations 1 to 250
chunk 2 = iterations 251 to 438
chunk 3 = iterations 439 to 579
chunk 4 = iterations 580 to 685
chunk 5 = iterations 686 to 764
chunk 6 = iterations 765 to 823
chunk 7 = iterations 824 to 868
chunk 8 = iterations 869 to 901
chunk 9 = iterations 902 to 926
chunk 10 = iterations 927 to 945
chunk 11 = iterations 946 to 959
chunk 12 = iterations 960 to 970
chunk 13 = iterations 971 to 978
chunk 14 = iterations 979 to 984
chunk 15 = iterations 985 to 988
chunk 16 = iterations 989 to 991
chunk 17 = iterations 992 to 994
chunk 18 = iterations 995 to 996
chunk 19 = iterations 997 to 997
chunk 20 = iterations 998 to 998
chunk 21 = iterations 999 to 999
chunk 22 = iterations 1000 to 1000
A possible scenario for the division of work could be:
thread 1 executes chunks 1 5 10 13 18 20
thread 2 executes chunks 2 7 9 14 16 22
thread 3 executes chunks 3 6 12 15 19
thread 4 executes chunks 4 8 11 17 21
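The chunk sizes listed above follow directly from the formulas given for GUIDED scheduling; a minimal sketch (a hypothetical program, with counts taken from this example) that reproduces the sequence:

      PROGRAM GUIDED_CHUNKS
!     Sketch only: recomputes the chunk sizes of Example 1
!     (1000 iterations, 4 threads, minimum chunk size of 1).
      INTEGER, PARAMETER :: NITER = 1000, NTHREADS = 4
      INTEGER :: REMAINING, CHUNK
      REMAINING = NITER
      DO WHILE (REMAINING > 0)
!        Next chunk = CEILING(REMAINING / NTHREADS)
         CHUNK = (REMAINING + NTHREADS - 1) / NTHREADS
         PRINT *, CHUNK
         REMAINING = REMAINING - CHUNK
      END DO
      END PROGRAM GUIDED_CHUNKS

This prints 250, 188, 141, and so on, ending with the four chunks of 1 iteration shown above.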
Example 2. Given the following information:
number of iterations = 100
number of threads = 4
and using the AFFINITY scheduling type, the iterations would be divided into the following partitions:
partition 1 = iterations 1 to 25
partition 2 = iterations 26 to 50
partition 3 = iterations 51 to 75
partition 4 = iterations 76 to 100
The partitions would be divided into the following chunks:
chunk 1a = iterations 1 to 13
chunk 1b = iterations 14 to 19
chunk 1c = iterations 20 to 22
chunk 1d = iterations 23 to 24
chunk 1e = iterations 25 to 25
chunk 2a = iterations 26 to 38
chunk 2b = iterations 39 to 44
chunk 2c = iterations 45 to 47
chunk 2d = iterations 48 to 49
chunk 2e = iterations 50 to 50
chunk 3a = iterations 51 to 63
chunk 3b = iterations 64 to 69
chunk 3c = iterations 70 to 72
chunk 3d = iterations 73 to 74
chunk 3e = iterations 75 to 75
chunk 4a = iterations 76 to 88
chunk 4b = iterations 89 to 94
chunk 4c = iterations 95 to 97
chunk 4d = iterations 98 to 99
chunk 4e = iterations 100 to 100
A possible scenario for the division of work could be:
thread 1 executes chunks 1a 1b 1c 1d 1e 4d
thread 2 executes chunks 2a 2b 2c 2d
thread 3 executes chunks 3a 3b 3c 3d 3e 2e
thread 4 executes chunks 4a 4b 4c 4e
Note that in this scenario, thread 1 finished executing all the chunks in its partition and then grabbed an available chunk from the partition of thread 4. Similarly, thread 3 finished executing all the chunks in its partition and then grabbed an available chunk from the partition of thread 2.
Example 3. Given the following information:
number of iterations = 1000
number of threads = 4
and using the DYNAMIC scheduling type with a chunk size of 100, the chunk sizes would be as follows:
100 100 100 100 100 100 100 100 100 100
The iterations would be divided into the following chunks:
chunk 1 = iterations 1 to 100
chunk 2 = iterations 101 to 200
chunk 3 = iterations 201 to 300
chunk 4 = iterations 301 to 400
chunk 5 = iterations 401 to 500
chunk 6 = iterations 501 to 600
chunk 7 = iterations 601 to 700
chunk 8 = iterations 701 to 800
chunk 9 = iterations 801 to 900
chunk 10 = iterations 901 to 1000
A possible scenario for the division of work could be:
thread 1 executes chunks 1 5 9
thread 2 executes chunks 2 8
thread 3 executes chunks 3 6 10
thread 4 executes chunks 4 7
Example 4. Given the following information:
number of iterations = 100
number of threads = 4
and using the STATIC scheduling type, the iterations would be divided into the following chunks:
chunk 1 = iterations 1 to 25
chunk 2 = iterations 26 to 50
chunk 3 = iterations 51 to 75
chunk 4 = iterations 76 to 100
A possible scenario for the division of work could be:
thread 1 executes chunk 1
thread 2 executes chunk 2
thread 3 executes chunk 3
thread 4 executes chunk 4
Related Information