Managing Parallel Execution
The run-time library for each of the languages uses IRIX lightweight processes to implement parallel execution (see "Process-Level Parallelism").
When a parallel program starts, the run-time support creates a pool of lightweight processes using the sproc() function. Initially the extra processes are blocked, and one process executes the opening passage of the program. When execution reaches a parallel section, the run-time support unblocks as many processes as necessary. Each one begins to execute the same block of statements. The processes share global variables, while each has its own copy of variables that are local to one iteration of a loop, such as a loop index.
When a process completes its portion of the work of that section, it returns to the run-time library code, where it picks up another portion of work if any work remains, or simply blocks until the next time it is needed. At the end of the parallel section, all extra processes are blocked and the original process continues to execute the serial code following the parallel section.
Controlling the Degree of Parallelism
You can specify the number of lightweight processes that a program starts. In IRIS POWER C, you can use #pragma numthreads to specify the exact number of processes, but it is not a good idea to embed this number in a source program. In all implementations, the run-time library by default starts enough processes that there is one for each CPU in the system. That default is often too high, since at least one CPU, and often more, is usually dedicated to other work.
The run-time library checks an environment variable, MPC_SET_NUM_THREADS, for the number of processes to start. You can use this environment variable to choose the number of processes used by a particular run of the program, thereby tuning the program's requirements to the system load. You can even force a parallelized program to execute on a single CPU when necessary.
MIPSpro Fortran 77 and MIPSpro Fortran 90 also recognize additional environment variables that specify a range of process numbers, and use more or fewer processes within this range as system load varies. (See the Programmer's Guide for the language for details.)
At certain points the multiple processes must wait for one another before continuing. They do this by waiting in a busy loop for a certain length of time, then blocking until they are signaled. You can specify how long a process spins before it blocks, using either source directives or an environment variable (see the Programmer's Guide for the language for the directives and system functions provided for this purpose).
Choosing the Loop Schedule Type
Most parallel sections are loops. The benefit of parallelization is that some iterations of the loop are executed in one CPU, concurrent with other iterations of the same loop in other CPUs. But how are the different iterations distributed across processes? All three languages support four possible methods of scheduling loop iterations, as summarized in Table 3-3. The variables used in Table 3-3 are as follows:
N | Number of iterations in the loop, determined from the source or at run-time. |
P | Number of available processes, set by default or by environment variable (see "Controlling the Degree of Parallelism"). |
Q | Number of a process, from 0 to P-1. |
C | "Chunk" size, set by directive or by environment variable. |
Table 3-3. Loop Scheduling Types

Schedule | Purpose |
---|---|
SIMPLE | Each process executes ⌊N/P⌋ iterations starting at iteration Q*⌊N/P⌋. The first process to finish takes the remainder chunk, if any. |
DYNAMIC | Each process executes C iterations of the loop, starting with the next undone chunk, and returns for another chunk until none are left undone. |
INTERLEAVE | Each process executes C iterations starting at C*Q, then at C*(Q+P), C*(Q+2P), and so on. |
GSS | Each process executes chunks of decreasing size, N/(2P), N/(4P), and so on. |
The effects of the scheduling types depend on the nature of the loops being parallelized. For example:
- The SIMPLE method works well when N is relatively small. However, unless N is evenly divided by P, there will be a time at the end of the loop when fewer than P processes are working, and possibly only one.
- The DYNAMIC and INTERLEAVE methods allow you to set the chunk size so as to control the span of an array referenced by each process. You can use this to reduce cache effects. When N is very large so that not all data fits in memory, INTERLEAVE may reduce the amount of paging compared to DYNAMIC.
- The guided self-scheduling (GSS) method is good for triangular matrices and other algorithms where loop iterations become faster toward the end.
You can use source directives or pragmas within the program to specify the scheduling type and chunk size for particular loops. Where you do not specify the scheduling, the run-time library uses a default method and chunk size. You can establish this default scheduling type and chunk size using environment variables.