On Privatization of Variables for Data-Parallel Execution

Manish Gupta
IBM T. J. Watson Research Center
P. O. Box 218
Yorktown Heights, NY

Abstract

Privatization of data is an important technique that has been used by compilers to parallelize loops by eliminating storage-related dependences. When a compiler partitions computations based on the ownership of data, selecting a proper mapping of privatizable data is crucial to obtaining the benefits of privatization. This paper presents a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We show that there are numerous alternatives available for mapping privatized variables, and that the choice of mapping can significantly affect the performance of the program. We present an algorithm that attempts to preserve parallelism and minimize communication overheads. We also introduce the concept of partial privatization of arrays, which combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. An implementation of these ideas in the phpf prototype compiler for High Performance Fortran on the IBM SP2 machine has shown impressive results.

1 Introduction

Parallelizing compilers have traditionally used scalar privatization [3] to eliminate the storage-related anti and output dependences associated with scalar variables inside loops. Under this technique, private copies of the variable are created for each processor participating in the parallel execution of a loop, so that writes followed by read accesses of these variables in iterations executed by different processors do not interfere with each other.
More recent studies have pointed out the importance of applying this technique to arrays as well, in order to expose parallelism in loops that otherwise appear to be sequential [6]. A number of research compilers now aggressively perform array privatization [18, 10, 8], which has contributed significantly to the capability of these compilers to recognize coarse-grain parallelism in programs. Privatization assumes an even greater importance for parallel execution of programs on machines with high interprocessor communication costs, which include most distributed-memory machines. The penalty for not privatizing a variable amounts to much more than merely the loss of parallelism: it can lead to very high communication overheads.

Under control-based parallelization, in which the entire loop body corresponding to an iteration (or chunk of iterations) of a parallel loop is assigned as a unit to a processor, the actual steps of privatizing a variable, once it has been recognized as privatizable, are relatively straightforward. However, further analysis is needed under data-driven parallelization, in which the assignment of computation to processors is based on the ownership of data, so that all the statements corresponding to a loop iteration may not be executed by the same processor. An example of such a method of parallelization is the owner-computes rule [11], which assigns a computation to the processor that owns the data being modified in that computation. The owner-computes rule and its generalized variants (where the computation may be assigned to processors that own some other data not being modified) are followed by most compilers for languages like High Performance Fortran (HPF) [13, 11, 19, 14, 2, 9, 1]. Many of these compilers have not paid adequate attention to the problem of mapping privatizable scalar and array variables. This paper presents a framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization.
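To make the owner-computes rule concrete, the following minimal Python sketch models it for a one-dimensional BLOCK distribution; the helper names (block_owner, owner_computes_iters) are ours, not taken from any HPF compiler.

```python
# Toy model of the owner-computes rule for a 1-D BLOCK distribution.
# All names here are illustrative, not from an actual HPF compiler.

def block_owner(elem, n, nprocs):
    """Processor that owns element `elem` (0-based) of an n-element
    array distributed (BLOCK) over nprocs processors."""
    block = -(-n // nprocs)        # ceiling division gives the block size
    return elem // block

def owner_computes_iters(n, nprocs, me):
    """Iterations of `do i = 1, n: A(i) = ...` executed by processor
    `me`: exactly those whose lhs element A(i) it owns."""
    return [i for i in range(n) if block_owner(i, n, nprocs) == me]

# Processor 2 of 4 executes only the iterations touching its block
# of an 8-element array.
print(owner_computes_iters(8, 4, 2))   # → [4, 5]
```

Under this model, replicating a variable is equivalent to every processor "owning" it, which is why an assignment to a replicated variable is executed by all processors.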
We describe the different alternatives available for mapping privatized variables, and show how that choice influences parallelization and communication overheads. We present an algorithm to select the alignment of privatizable scalar and array variables that attempts to preserve parallelism and minimize communication overheads. Our algorithm is guided by a realistic communication cost model which takes into account the placement of communication, and hence optimizations like message vectorization. We introduce a novel concept of partial privatization of arrays that combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. The ideas presented in this work have been implemented in the phpf prototype compiler for HPF [9]. Our preliminary results are very encouraging.

2 Mapping of Privatized Scalars

2.1 Alignment Choices

We shall first illustrate the need for scalar privatization, and for different kinds of alignment, using the example shown in Figure 1.

!HPF$ Align (i) with A(i) :: B,C,D
!HPF$ Align (i) with A(*) :: E,F
!HPF$ Distribute (block) :: A
      m = 2
      do i = 2, n-1
S1:     m = m + 1         ! induction variable
S2:     x = B(i) + C(i)   ! align with consumer
S3:     y = A(i) + B(i)   ! align with producer
S4:     z = E(i) + F(i)   ! no alignment
S5:     A(i+1) = y/z
S6:     D(m) = x/z
      end do

Figure 1. Different alignments of privatized scalars

It is necessary to privatize each of the variables m, x, y, and z to achieve partitioned execution of the loop, as replication of any of these variables would force all processors to execute the assignment to that variable under the owner-computes rule. We refer to a statement in which a variable is defined as the producer statement for that variable, and to a statement which uses the value of that variable as a consumer statement.

Alignment with Consumer. Consider the variable x: replicating it would lead to each processor executing the first statement in the loop in every iteration. Furthermore, that would require the values of B(1:n) and C(1:n) to be unnecessarily broadcast to all the processors.
As part of privatization, if x is aligned with the producer reference B(i) or C(i) (i.e., owned by the same processor as the owner of B(i) and C(i)), no communication is needed to compute x. However, the value of x has to be communicated to the owner of D(m) (the value of m is known to be i+1 via induction variable analysis). This communication takes place inside the i-loop, because of a dependence from the definition of x to the use of x inside the loop. If x is aligned instead with the consumer reference D(m), communication is now needed for the two references B(i) and C(i), to the owner of D(m), in the computation of x. However, both of these communications can be moved outside the i-loop and carried out with a collective shift communication.

!HPF$ Align G(i,j) with H(i,j)
!HPF$ Align A(i) with H(i,*)
!HPF$ Distribute H(block,*)
      do i = 1, n
        p = B(i)   ! not needed on all processors
        q = C(i)   ! needed on all processors
        A(i) = H(i,p) + G(q,i)
      end do

Figure 2. Availability requirements for subscripts

The consumer reference for a read reference u is a reference r whose owner needs the value of u during the execution of that statement. Thus, in most cases, under the owner-computes rule, the consumer reference is the lhs (left-hand side) of the assignment statement. For special cases where a read reference, such as a subscript, is needed by all processors, the consumer reference is set to be a dummy replicated reference. As an optimization, for a reference which appears as a subscript of an rhs reference that does not need communication, phpf sets the consumer reference to be the lhs reference, because only the processor executing that statement needs to know the value of the subscript. Thus, for the example shown in Figure 2, the consumer reference for p is A(i), and for q it is the dummy replicated reference.
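The consumer-reference rules above can be sketched in a few lines of Python (a simplified model; the field names and the REPLICATED marker are ours, not phpf's actual data structures):

```python
# Sketch of consumer-reference selection for a read reference of a
# privatizable scalar, following the rules described in the text.
# All names are illustrative placeholders.

REPLICATED = "dummy-replicated-ref"

def consumer_ref(use):
    """use: dict describing one use of the scalar.
    - a value needed by all processors (e.g. a loop bound, or a
      subscript of a communicated rhs reference) -> dummy replicated ref
    - a subscript of an rhs reference needing no communication, or an
      ordinary rhs operand -> the lhs reference of the statement."""
    if use["needed_by_all"]:
        return REPLICATED
    return use["lhs"]

# Figure 2: p subscripts H(i,p), which needs no communication, so only
# the executing processor needs p; q subscripts G(q,i), which is
# communicated, so q is needed on all processors.
p_use = {"needed_by_all": False, "lhs": "A(i)"}
q_use = {"needed_by_all": True,  "lhs": "A(i)"}
print(consumer_ref(p_use))   # → A(i)
print(consumer_ref(q_use))   # → dummy-replicated-ref
```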
Alignment with Producer. The preferable alignment for the variable y is with the producer reference A(i) (or, equivalently, B(i)), which avoids the need for any communication in the computation of y, though communication is still needed for statement S5. Aligning y with the consumer reference A(i+1) would instead lead to an extra communication: both rhs references on S3 would have to be communicated to the owner of A(i+1), and the communication for A(i) would take place inside the i-loop.

Privatization without Alignment. The variable z uses the values of the replicated array elements E(i) and F(i) in its computation. The alignment of z with these producer references would amount to replication, and hence is not desirable. Its alignment with one of the consumer references on statement S5 or S6 would lead to communication being required for the other statement. Since the data needed for the computation of z is available on all processors, it can be privatized without explicit alignment with any other reference. Each processor that executes an iteration of the i-loop under the computation partitioning, as determined by the partitioning of the other statements in that loop, owns and computes a temporary value of z in that loop iteration.

Any scalar variable recognized as an induction variable, such as m in Figure 1, should be privatized without alignment. The phpf compiler replaces the rhs of that assignment statement by the closed-form expression for the value of the induction variable as a function of the surrounding loop indices. Thus, the expression m+1 on S1 is replaced by the expression i+1, which represents the closed-form value of the variable.

Privatization of a scalar without alignment impacts computation partitioning by ensuring that the statement assigning a value to that scalar is not forced to execute on all processors, as would happen if the scalar were replicated. There is no computation partitioning guard associated with the statement [9]. Hence, if that statement appears inside a loop, it is executed by the union of all processors executing any other statement inside that loop for a given iteration. If the statement appears outside any loop, it is executed by all processors. For the purpose of communication analysis, the scalar is viewed as if it had been replicated, and therefore no use of that scalar requires any communication. In fact, phpf selects this mapping only when the computation of the scalar value itself requires no communication.
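The closed-form rewrite for induction variables like m can be sketched as follows, assuming a unit-stride loop and a single unit increment at the top of the loop body, as in Figure 1 (the function name is ours):

```python
def closed_form(init, lower, i):
    """Closed-form value, in iteration i, of an induction variable m
    initialized to `init` before `do i = lower, upper` and updated by
    m = m + 1 at the top of the loop body (as at S1 in Figure 1)."""
    return init + (i - lower + 1)

# Figure 1: m = 2 before the loop, do i = 2, n-1.  Inside iteration i,
# the m used at S6 (the subscript of D) therefore equals i + 1.
print([closed_form(2, 2, i) for i in range(2, 6)])   # → [3, 4, 5, 6]
```

Once the rhs is replaced by this closed form, the assignment no longer carries a loop dependence, and the variable can be privatized without alignment.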
2.2 Algorithm to Determine Mapping

The mapping of scalar variables and privatizable array variables is determined in a separate, first pass during the communication analysis phase of the phpf compiler. It follows an earlier program analysis phase which constructs the static single assignment (SSA) [5] representation of the program and performs constant propagation and induction variable recognition. phpf uses the SSA representation to associate a separate mapping decision with each assignment to a scalar. This permits flexibility in choosing an appropriate mapping when logically different variables used in different segments of a procedure happen to reuse a name. However, for simplicity of communication analysis and code generation, the compiler imposes the restriction that, given a use (read reference) of a scalar variable, all reaching definitions are given an identical mapping. Thus, during the later phases of compilation, there is no scope for ambiguity regarding the mapping of a scalar value: it can be determined by obtaining the mapping information recorded with the first reaching definition of that reference.

DetermineMapping(def, stmt) {
    /* Default mapping of def is replication */
    if (IsPrivatizable(def) == TRUE) then
        RhsReplicated = IsRhsReplicated(stmt)
        if (RhsReplicated == TRUE and IsUniqueDef(def) == TRUE) then
            Add def to NoAlignExam list
        Traverse reached uses of def and select a consumer ref as AlignRef
        if (RhsReplicated == FALSE and
            (no consumer ref selected as AlignRef, or
             alignment of def with AlignRef leads to inner-loop commn.
             for some rhs ref on stmt)) then
            Select a partitioned rhs ref as AlignRef
        if (AlignRef has been selected and alignment valid inside current loop) then
            for each reached use of def do
                for each reaching def of use do
                    Align def with AlignRef
                end do
            end do
}

Figure 3. Pseudocode for determining the mapping of a scalar reference
Any other reaching definition of that use is guaranteed by our algorithm to have the same mapping associated with it. The scalar variables involved in reduction operations are treated in a special manner, which is described in the next subsection. For other scalar variables, Figure 3 gives an overview of the algorithm to determine the mapping for a given scalar definition. The default mapping of each scalar definition is set to replication. We now explain the different steps of our algorithm.

Privatization without alignment. If data flow analysis shows that a given definition is privatizable and not live outside the current loop (the compiler also takes advantage of the NEW clause in the INDEPENDENT directive of HPF to infer this), we first check whether all rhs references on that statement are to replicated data. If so, the compiler considers privatizing the scalar without alignment with any reference, provided the given definition is the only reaching definition of all the reached uses of that definition. This ensures that, in spite of the privatized execution of the statement S computing the scalar value, each reached use of the scalar variable sees the correct value. We note that at this stage, an eligible scalar definition is only added to the list of definitions being considered for privatization without alignment. The reason for this deferral is that there may be rhs references to privatizable scalar or array variables in statement S for which mapping decisions have not yet been made, so those variables appear to be replicated at this stage. At the end of the compiler pass making mapping decisions, the list is examined again, and if all rhs data on the corresponding statement continue to be replicated, the scalar definition is privatized without alignment.

Identification of Alignment Target. The next step in the algorithm is to examine each reached use of def and identify a consumer reference with which def could be aligned. The selection of a single alignment target is done using a heuristic algorithm. If any reached use appears inside a loop bound expression or a subscript that has to be broadcast to all processors (note that subscript values of rhs references not involved in communication need not be made available on all processors), the dummy replicated reference is returned as the selected consumer reference and the traversal through reached uses is terminated. Otherwise, phpf ignores any consumer reference that refers to replicated data, and selects a consumer reference (if any) to partitioned data. This selection process favors a reference in which a distributed array dimension is traversed in the innermost common loop enclosing the scalar definition and the reached use, since alignment with such a reference ensures that the scalar is mapped to different processors during different loop iterations.
For example, inside an i-loop, alignment with a reference A(i) would be preferred over alignment with a reference A(1), where A is a partitioned array. As described earlier, a consumer reference corresponding to a use of the scalar variable is usually the lhs reference of the assignment statement in which the use occurs. If this reference is to a privatizable variable, the compiler invokes the mapping algorithm recursively on that definition to determine the mapping of that variable before determining whether this consumer reference may serve as a suitable alignment target. If there are rhs references to partitioned data on the statement computing the scalar value, the compiler selects one of those producer references (similarly to the selection of the consumer reference) as another potential alignment target. Given a choice between producer and consumer references as the alignment target, our algorithm selects the consumer reference unless that selection leads to inner-loop communication for some rhs reference on the given assignment statement.

!HPF$ Distribute (block,block,*) :: A, B
      do i = 1, n
        do j = 1, n
          ...
          s = ...
          do k = 1, n
            A(i,j,k) = ...   ! AlignLevel = 2
            B(s,j,k) = ...   ! AlignLevel = 3
          end do
        end do
      end do

Figure 4. AlignLevel for array references

Scope of Validity of Alignment. We now describe how the compiler determines the program region in which the alignment information about a scalar variable is well-defined. Given a subscript s in an array reference r, let VarLevel(s) denote the innermost loop nesting level in which the subscript s varies in value. We define SubscriptAlignLevel(s) as:

    SubscriptAlignLevel(s) = VarLevel(s)      if s is an affine function of loop indices
                             VarLevel(s) + 1  otherwise

Thus, SubscriptAlignLevel(s) gives the nesting level of the outermost loop throughout which the value of the subscript s is well-defined. For example, in Figure 4, the k-loop is the outermost loop in which both subscripts s and k are well-defined.
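The SubscriptAlignLevel definition can be checked against the references of Figure 4 with a few lines of Python (the VarLevel values are supplied by hand; all names are ours):

```python
def subscript_align_level(var_level, is_affine):
    """SubscriptAlignLevel(s) = VarLevel(s) if s is an affine function
    of the loop indices, and VarLevel(s) + 1 otherwise."""
    return var_level if is_affine else var_level + 1

# Figure 4: loops i, j, k sit at nesting levels 1, 2, 3.  The
# subscript s is assigned inside the j-loop (VarLevel 2) and is not
# an affine function of the loop indices.
levels = {
    "i": subscript_align_level(1, True),    # 1
    "j": subscript_align_level(2, True),    # 2
    "k": subscript_align_level(3, True),    # 3
    "s": subscript_align_level(2, False),   # 3, i.e. the k-loop
}
print(levels)   # → {'i': 1, 'j': 2, 'k': 3, 's': 3}
```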
We now define AlignLevel(r) as the maximum of the SubscriptAlignLevel values over the subscripts in the partitioned dimensions of r. Therefore, in Figure 4, the AlignLevel of A(i,j,k) is 2, which corresponds to the j-loop, and the AlignLevel of B(s,j,k) is 3, corresponding to the k-loop, which is the outermost loop in which the subscript s is invariant. Given a reference r which is selected as an alignment target of a scalar definition, AlignLevel(r) indicates the outermost loop throughout which the alignment information is valid. Therefore, a scalar definition which is privatizable at nesting level l can be aligned unambiguously with the selected reference r if AlignLevel(r) <= l.

Marking alignment information. Finally, once a valid alignment target for the scalar definition def has been identified, the compiler records that alignment information for each reaching definition of every reached use of def. The mapping information at a use during communication analysis is obtained initially by accessing the information recorded with its first reaching definition, and is cached in a data structure associated with the use for subsequent inquiries. Our procedure ensures that consistent mapping information is seen by each reached use of def.

!HPF$ Align B(i) with A(i,*)
!HPF$ Distribute (block,block) :: A
      do i = 1, n
        s = 0
        do j = 1, n
          s = s + A(i,j)
        end do
        B(i) = s
      end do

Figure 5. Scalar variable involved in reduction

2.3 Mapping of Scalars Involved in Reductions

Any scalar computed in a reduction operation, such as a sum, carried out across a processor grid dimension is handled in a special manner. An additional privatized temporary copy of the scalar is created during code generation to hold the result of the local reduction computation initially performed by each processor. A global reduction operation combines the values of the local operations and stores the result into the variable which retains the original name of the scalar. This scalar variable is replicated in the dimensions in which the reduction takes place. However, it may be privatized with respect to the other processor grid dimensions. Given a statement assigning a value to a scalar variable which is recognized as a reduction, the compiler checks if the scalar definition is privatizable without copy-out with respect to the loop immediately surrounding the reduction loop. If so, the special array reference whose ownership governs the partitioning of the partial reduction operation [9] serves as the alignment target. However, in this case, the compiler constructs a new alignment mapping in which the scalar variable is replicated in each dimension over which the reduction takes place, and is aligned with the target array reference in only the remaining grid dimensions. This alignment information is propagated for each reaching definition of every reached use of the original definition. Finally, at code generation time, another privatized copy of the scalar variable is created, which differs from the original variable only in that it is private on each processor involved in the reduction rather than being replicated.
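As a toy illustration of the two-phase scheme just described, the following Python simulates a sum reduction over one array row distributed blockwise across a grid dimension: each processor reduces its local piece into a privatized temporary, and a global combine leaves the result replicated along that dimension (plain lists stand in for the processor grid; all names are ours):

```python
# Simulate a two-phase sum reduction along one processor grid
# dimension.  Each processor holds a (BLOCK) piece of the row.

def distributed_sum(row, nprocs):
    block = -(-len(row) // nprocs)                      # block size
    pieces = [row[p*block:(p+1)*block] for p in range(nprocs)]
    local = [sum(piece) for piece in pieces]            # privatized temporaries
    s = sum(local)                                      # global combine (e.g. an allreduce)
    return [s] * nprocs                                 # s replicated along the dimension

row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(distributed_sum(row, 4))   # every processor ends up holding 36.0
```

Because only the combine step crosses processors, no broadcast of the row itself is ever needed, which is exactly the benefit the alignment of the reduction scalar is designed to preserve.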
For example, in Figure 5, a sum reduction takes place in the j-loop. The definition of the variable s is verified as being privatizable with respect to the i-loop, and A(i,j) serves as the alignment target. Hence, s is replicated in the second grid dimension and is aligned with the ith row of A in the first grid dimension. As a result of this alignment, the reduction computation can proceed without the need to broadcast the ith row of A to other processors along the first grid dimension.

3 Mapping of Privatized Arrays

We now describe the procedure for mapping privatizable arrays. The phpf compiler currently relies on directives from the programmer to infer that arrays are privatizable. While the basic alternatives explored for the alignment of privatized arrays are the same as those for scalars, there are additional options available when dealing with arrays.

3.1 Basic Procedure

The INDEPENDENT directive for loops in HPF asserts that the iterations of the loop may be executed in any order without changing the program semantics. The NEW clause attached to such a directive supplies a list of variables (see the example in Figure 6) and modifies the INDEPENDENT directive to assert that the independence of different loop iterations holds if new objects are created for the named variables in each loop iteration. Thus, those variables can be regarded as privatizable with respect to the INDEPENDENT loop. phpf is also able to infer the privatizability of an array from a weaker form of a parallel loop directive which indicates that a loop has no true loop-carried value-based dependences. Any lhs array reference in which each subscript is either invariant with respect to the parallel loop or is an affine function of inner loop indices contributes to memory-based loop-carried dependences, which can be eliminated only by privatizing that array. The algorithm to determine the target alignment reference is identical to that used for scalar variables.
Similarly, once an alignment target has been selected and the AlignLevel for that reference determined, the compiler examines each reached use to ensure that it does not appear outside the loop corresponding to the AlignLevel. Any seemingly reached uses outside the loop associated with the NEW directive are assumed to be spurious and hence are ignored. The alignment information is kept in a data structure associated with the loop with respect to which the array has been privatized, and applies to all references to that array variable within that loop. Privatizable arrays used to hold the results of a reduction operation are handled in a manner similar to scalar variables in reduction computations.

3.2 Partial Privatization

The concept of privatization of variables has traditionally been associated with a single parallel loop at a time. With nested loop parallelism, which is enabled by the use of a

multi-dimensional processor grid in HPF, the idea of privatization can be trivially extended to apply to each grid dimension and to each loop. However, yet another alternative which can be considered in this scenario is one combining partitioning and privatization: the array may be partitioned in some grid dimensions and privatized with respect to the other dimensions. We refer to this as partial privatization.

!HPF$ Distribute (*,block,block) :: rsd
!HPF$ INDEPENDENT, NEW(c)
      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            c(i,j,1) = ...
          end do
        end do
        ...
        do j = 3, ny-1
          do i = 2, nx-1
            rsd(1,i,j,k) = ... c(i,j-1,1) ...
          end do
        end do
      end do

Figure 6. Need for partial privatization

!HPF$ Align (:) with A(:) :: B, C
!HPF$ Distribute (block) :: A
      do i = 1, n
        if (B(i) .ne. 0.0) then
          A(i) = A(i)/B(i)
          if (B(i) < 0.0) go to 100
        else
          A(i) = C(i)
          C(i) = C(i)
        endif
 100    continue
      end do

Figure 7. Privatized execution of control flow statements

Figure 6 shows a program segment adapted from the APPSP program of the NAS benchmarks, which illustrates the benefits of partial privatization. The array c is privatizable with respect to the k-loop, but not with respect to the j-loop. Correspondingly, the compiler will fail in its attempt to privatize the array in both grid dimensions. Clearly, replication in either dimension would lead to loss of parallelism and a great deal of extra communication, due to the owner-computes rule. In accordance with the HPF data distribution directives, the only way to exploit parallelism in both the k-loop and the j-loop is to partition the second dimension of c across the first grid dimension, and to privatize it along the second grid dimension. Under partial privatization, the procedure to determine the AlignLevel of a target reference is modified to consider subscripts only in those distributed dimensions in which the array is to be privatized.
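The per-grid-dimension decision behind partial privatization can be sketched as a toy model (the predicate and helper names are ours; the real analysis is the AlignLevel machinery described above):

```python
def map_array(privatizable_in, grid_dims):
    """For each processor grid dimension, choose to privatize the
    array when the parallel loop mapped to that dimension allows it,
    and otherwise to partition the corresponding array dimension.
    privatizable_in: dict loop -> bool (is the array privatizable
    with respect to that loop?); grid_dims: dict grid dim -> loop."""
    return {d: ("privatize" if privatizable_in[loop] else "partition")
            for d, loop in grid_dims.items()}

# Figure 6: c is privatizable with respect to the k-loop but not the
# j-loop, so it is partitioned across grid dimension 1 (which carries
# the j-loop) and privatized along grid dimension 2 (the k-loop's).
mapping = map_array({"k": True, "j": False}, {1: "j", 2: "k"})
print(mapping)   # → {1: 'partition', 2: 'privatize'}
```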
Thus, in the example shown in Figure 6, since c is to be privatized only in the second grid dimension, the AlignLevel for rsd(1,i,j,k) is obtained as 1 (corresponding to the k-loop) rather than 2 (corresponding to the j-loop). The information from the NEW clause indicates that the value of c can be discarded after the k-loop. Hence, the compiler is able to proceed with partial privatization of c, whereas complete privatization was not possible. It is well known that distribution of arrays on a multi-dimensional rather than a single-dimensional processor grid can lead to more scalable solutions for many stencil codes, due to the lower volume of interprocessor communication resulting from such a distribution. For the APPSP program specifically, a 3-D distribution of arrays has been known to outperform the 2-D distribution, which in turn has outperformed a 1-D distribution in a hand-tuned message-passing implementation [15].

4 Control Flow Statements

The handling of control flow statements is often left vague under the data-driven compilation model. A default strategy of executing these statements on all processors would lead to problems very similar to those encountered with indiscriminate replication of variables. For example, if all processors were forced to execute every control flow statement in Figure 7, the loop would not be parallelized effectively, and expensive communication would be required for the array B. On the other hand, privatized execution of these statements by the owner of A(i) (which also owns B(i) and C(i)) eliminates the need for any communication and allows the loop to be parallelized effectively, as the loop bounds can be shrunk [9] in the final SPMD code. The phpf compiler applies the following rules to privatize the execution of a control flow statement S (otherwise executed by all processors, by default), and, if necessary, to identify a reference which will serve as the alignment target for any data needed to execute that statement.
If the statement S cannot transfer control to a target statement outside the body of loop L, then S does not contribute to a computation partitioning guard [9] for the loop L. Essentially, S will be executed by the union of all processors executing any other statement inside loop L for a given iteration. Conceptually, this corresponds to the notion of privatization without alignment. Any data referenced in the control predicate of S has to be communicated to the union of all processors that participate in the execution of any statement that is control-dependent on S.

In the example shown in Figure 7, both of the if statements transfer control only to a statement inside the i-loop. Hence the execution of those statements can be privatized. Furthermore, following the owner-computes rule, only the owner of A(i) (or, equivalently, C(i)) needs to participate in the execution of any statement that is control-dependent on either of those control flow statements. Therefore, no communication is needed for the predicates of those if statements, as B(i) is owned by the same processor as A(i).

5 Experimental Results

The ideas presented in this paper have been implemented in the phpf prototype compiler [9]. In this section, we describe some preliminary experimental results to show the impact of this analysis. We present performance results on three benchmark programs, each of which illustrates different aspects of our procedure for mapping privatized variables. The first program is TOMCATV, a mesh generator with Thompson's solver. The program, originally from the SPEC92FP benchmark suite, has been augmented with HPF directives. The second program, DGEFA, performs Gaussian elimination on a matrix with partial pivoting. It is the HPF version of the original routine from LINPACK, in which we have applied procedure inlining by hand. The third program, APPSP from the NAS benchmarks, is a pseudo-application for performance evaluation of a solver for five coupled, nonlinear partial differential equations. Each of these programs was compiled with the -O3 option for optimizations. All measurements were done using 16 thin nodes of an IBM SP2.

5.1 TOMCATV

Table 1 shows the performance of TOMCATV obtained with three different levels of optimization.
The first version, which is the most naive version of the compiler, does not perform privatization and replicates all scalar variables. The second version performs privatization, but always aligns each scalar definition with a producer reference, i.e., with a partitioned array or scalar reference on that statement. The third version applies the algorithm described in Section 2.2 to determine the alignment of privatized scalar variables.

Table 1. Performance of TOMCATV ((*,block) distribution) on the IBM SP2: execution time (sec) per processor count, under replication, producer alignment, and selected alignment.

Table 2. Performance of DGEFA ((*,cyclic) distribution) on the IBM SP2: execution time (sec) per processor count, with the default mapping and with the selected alignment.

As expected, replication of all scalar variables leads to extremely poor performance. This can be attributed to the loss of parallelism and excessive communication in the main computational loop nest of the program. We find the performance figures in the second column even more interesting. They show that, in spite of privatization, there can be a substantial loss of performance if the scalar variables are not mapped carefully. The alignment of a privatizable scalar variable with a partitioned producer reference on the same statement is quite simple to support. In contrast, alignment with a consumer reference requires a more complex procedure that may be recursively invoked to deal with a privatizable consumer reference which in turn needs to be aligned with a target reference. However, alignment of scalar variables with producer references leads to a considerable amount of inner-loop communication in TOMCATV. Our algorithm is able to avoid that by selecting alignment with consumer references in the main computational loop of the program. With proper alignment, we obtain performance improvements of more than two orders of magnitude on 16 processors. In fact, it is only with the appropriate alignment of scalar variables that the program exhibits speedups.
5.2 DGEFA

The array on which Gaussian elimination is performed is partitioned column-wise in a cyclic manner. In each step of the elimination, partial pivoting involves a maxloc operation along a single array column, which is mapped to a single processor. Our optimization to align privatizable variables holding the results of a reduction operation in the dimensions not involved in the reduction leads to the computation for partial pivoting being confined to just the relevant processor in each step, and also helps avoid unnecessary communication. Table 2 shows the performance results of DGEFA without and with this optimization. The communication overhead incurred when the reduction variable is replicated across the columns remains roughly constant, but it accounts for an increasing percentage of the execution time as the number of processors is increased.

5.3 APPSP

We present performance results for two HPF versions of the program: one with a 1-D distribution of arrays and redistribution (transpose) of data in the sweepz subroutine, and the other employing a fixed 2-D distribution throughout the program. The first two columns of results in Table 3 show that the execution time of the program becomes prohibitively large if array privatization is disabled. In fact, in that case, we had to abort the parallel program after more than a day of execution. The remaining columns show that with a 2-D distribution of arrays, even regular array privatization does not help and the program performs extremely poorly. However, with partial privatization employed by the compiler, we obtain significantly better performance. The program version using the 2-D distribution starts out with better performance at smaller processor counts, mainly due to the absence of global transpose operations in the sweepz subroutine, but does not scale as well as the version using the 1-D distribution, unlike hand-tuned message-passing versions of APPSP [15]. An examination of the message-passing code produced by the HPF compiler showed that there is considerable scope for improving the performance of that version by global message combining across loop nests. The phpf compiler does not currently perform that optimization.
6 Related Work

There has been a great deal of work on techniques related to privatization for exposing more parallelism, such as scalar expansion [16], scalar privatization [3], array expansion [7], and array privatization [18, 10, 8]. Our work focuses on the additional analysis needed to apply privatization effectively to data-driven execution, and hence is complementary to previous work. Knobe and Dally present a subspace model and describe an algorithm, meant to be applied before data partitioning and scheduling, that attempts to resolve mismatches in the shape of various operands [12]. Their method achieves privatization by adding an expansion dimension that is indexed by a loop induction variable. They also apply the subspace model to optimize the execution of control flow statements. However, they do not discuss alternatives regarding the alignment of privatized data with other partitioned data, or the impact of such a mapping on the loop-level placement of communication involving privatized data. Chatterjee et al. present the concept of mobile alignment of arrays with respect to loops [4], which is similar to the idea of array privatization. Their work focuses on choosing between replication and mobile alignment of data. Their algorithm does not take into account information about the privatizability of arrays, which can make code generation difficult or expensive for arrays with mobile alignment that are not privatizable. The work of Palermo et al. [17] is the most closely related to ours. They use a simpler algorithm in which an assignment to a privatized scalar variable is executed by each processor that participates in the execution of any statement in the given loop iteration, which is similar to our notion of privatization without alignment with a specific reference. This can lead to more communication if fewer processors use the scalar value than are made to execute the assignment statement.
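The communication difference between executing a privatized assignment everywhere and aligning it with its use can be illustrated with a toy message count. The processor sets and operand owner below are hypothetical, chosen only to make the contrast concrete:

```python
# Toy message count contrasting two mappings of a privatized scalar
# assignment t = a(i) (processor sets and owner are hypothetical).

def messages_needed(executors, operand_owner):
    # Each executing processor other than the owner of the right-hand
    # side operand must receive that operand in a separate message.
    return len(executors - {operand_owner})

active_procs = set(range(8))   # processors active in the loop iteration
users_of_t   = {3}             # processors that actually read t
owner_of_a   = 0               # owner of the rhs operand a(i)

# "Execute everywhere" (privatization without alignment): every active
# processor runs the assignment, so every one of them needs the operand.
everywhere = messages_needed(active_procs, owner_of_a)   # 7 messages

# "Align with the use": only the processor that uses t runs the
# assignment, so only it needs the operand.
aligned = messages_needed(users_of_t, owner_of_a)        # 1 message

assert aligned <= everywhere
```

A realistic cost model must also account for message vectorization and for broadcasts, but the basic asymmetry shown here is what motivates aligning the privatized variable with a specific reference.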
An earlier implementation of the phpf compiler [9] employed a simpler and more limited version of our analysis for handling privatizable scalar variables. It did not privatize a scalar definition that was not the only reaching definition of the reached uses, and it did not deal with privatizable arrays. Privatization of variables is performed by many other HPF compilers as well; however, the method of determining the ownership of those variables has usually not been discussed.

7 Conclusions

In this paper, we have presented a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We have shown that there are numerous alternatives available for mapping privatized variables, and that the choice of mapping can significantly affect the performance of the program, by as much as two orders of magnitude in some cases on the IBM SP2. Our algorithm for selecting this mapping is guided by a realistic communication cost model that takes into account optimizations like message vectorization. We have also introduced the notion of partial privatization of arrays, which enables a compiler to exploit nested parallelism even when that nested form is incompatible with the conventional definition of array privatization. Our preliminary results, based on an implementation of these ideas in the phpf compiler, have been very encouraging. In the future, we plan to integrate our mapping techniques with automatic array privatization.

Acknowledgements

The author wishes to thank Sam Midkiff for his help in implementing the technique of partial privatization.

[Table 3. Performance of APPSP (n = 64) on IBM SP2: execution time in seconds by number of processors, for four versions: 1-D, No Array Priv.; 1-D, Priv.; 2-D, No Partial Priv.; 2-D, Partial Priv. The 1-D runs without array privatization were aborted after more than a day.]

References

[1] P. Banerjee, J. Chandy, M. Gupta, E. Hodges, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su. An overview of the PARADIGM compiler for distributed-memory multicomputers. IEEE Computer, October.

[2] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. A compilation approach for Fortran 90D/HPF compilers on distributed memory MIMD computers. In Proc. Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August.

[3] M. Burke, R. Cytron, J. Ferrante, and W. Hsieh. Automatic generation of nested, fork-join parallelism. Journal of Supercomputing, pages 71-88.

[4] S. Chatterjee, J. R. Gilbert, and R. Schreiber. Mobile and replicated alignment of arrays in data-parallel programs. In Proc. Supercomputing '94, Washington, D.C., November.

[5] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and F. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4), October.

[6] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua. Experience in the automatic parallelization of four Perfect-Benchmark programs. In Proc. 4th Workshop on Languages and Compilers for Parallel Computing. Pitman/MIT Press, August.

[7] P. Feautrier. Array expansion. In Proc. ACM International Conference on Supercomputing, July.

[8] J. Gu, Z. Li, and G. Lee. Symbolic array dataflow analysis for array privatization and program parallelization. In Proc. Supercomputing '95, San Diego, CA, December.

[9] M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December.

[10] M. Hall, S. Amarasinghe, B. Murphy, S.-W. Liao, and M. Lam. Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proc. Supercomputing '95, San Diego, CA, December.

[11] S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August.

[12] K. Knobe and W. Dally. The subspace model: A theory of shapes for parallel systems. In Proc. 5th Workshop on Compilers for Parallel Computers, Malaga, Spain, June.

[13] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. E. Zosel. The High Performance FORTRAN Handbook. The MIT Press, Cambridge, MA.

[14] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4), October.

[15] V. K. Naik. Scalability issues for a class of CFD applications. In Proc. Scalable High Performance Computing Conference, May.

[16] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12), December.

[17] D. Palermo, E. Su, E. Hodges IV, and P. Banerjee. Compiler support for privatization on distributed-memory machines. In Proc. 25th International Conference on Parallel Processing, Bloomingdale, IL, August.

[18] P. Tu and D. Padua. Automatic array privatization. In Proc. 6th Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August.

[19] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.


CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012

CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012 CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night

More information

Compiling High Performance Fortran for Distributedmemory

Compiling High Performance Fortran for Distributedmemory Compiling High Performance Fortran for Distributedmemory Systems Jonathan Harris John A. Bircsak M. Regina Bolduc Jill Ann Diewald Israel Gale Neil W. Johnson Shin Lee C. Alexander Nelson Carl D. Offner

More information

Data Dependency. Extended Contorol Dependency. Data Dependency. Conditional Branch. AND OR Original Control Flow. Control Flow. Conditional Branch

Data Dependency. Extended Contorol Dependency. Data Dependency. Conditional Branch. AND OR Original Control Flow. Control Flow. Conditional Branch Coarse Grain Task Parallel Processing with Cache Optimization on Shared Memory Multiprocessor Kazuhisa Ishizaka, Motoki Obata, Hironori Kasahara fishizaka,obata,kasaharag@oscar.elec.waseda.ac.jp Dept.EECE,

More information

Transforming Complex Loop Nests For Locality

Transforming Complex Loop Nests For Locality Transforming Complex Loop Nests For Locality Qing Yi Ken Kennedy Computer Science Department Rice University Abstract Because of the increasing gap between the speeds of processors and standard memory

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

Horizontal Aggregations for Mining Relational Databases

Horizontal Aggregations for Mining Relational Databases Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements Programming Languages Third Edition Chapter 9 Control I Expressions and Statements Objectives Understand expressions Understand conditional statements and guards Understand loops and variation on WHILE

More information

Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Using Analytic QP and Sparseness to Speed Training of Support Vector Machines Using Analytic QP and Sparseness to Speed Training of Support Vector Machines John C. Platt Microsoft Research 1 Microsoft Way Redmond, WA 9805 jplatt@microsoft.com Abstract Training a Support Vector Machine

More information

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:!

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:! Class Notes 18 June 2014 Tufts COMP 140, Chris Gregg Detecting and Enhancing Loop-Level Parallelism Loops: the reason we can parallelize so many things If the compiler can figure out if a loop is parallel,

More information

Qualifying Exam in Programming Languages and Compilers

Qualifying Exam in Programming Languages and Compilers Qualifying Exam in Programming Languages and Compilers University of Wisconsin Fall 1991 Instructions This exam contains nine questions, divided into two parts. All students taking the exam should answer

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming Outline OpenMP Shared-memory model Parallel for loops Declaring private variables Critical sections Reductions

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Software Testing. 1. Testing is the process of demonstrating that errors are not present.

Software Testing. 1. Testing is the process of demonstrating that errors are not present. What is Testing? Software Testing Many people understand many definitions of testing :. Testing is the process of demonstrating that errors are not present.. The purpose of testing is to show that a program

More information

Single-Pass Generation of Static Single Assignment Form for Structured Languages

Single-Pass Generation of Static Single Assignment Form for Structured Languages 1 Single-Pass Generation of Static Single Assignment Form for Structured Languages MARC M. BRANDIS and HANSPETER MÖSSENBÖCK ETH Zürich, Institute for Computer Systems Over the last few years, static single

More information

Data structures for optimizing programs with explicit parallelism

Data structures for optimizing programs with explicit parallelism Oregon Health & Science University OHSU Digital Commons CSETech March 1991 Data structures for optimizing programs with explicit parallelism Michael Wolfe Harini Srinivasan Follow this and additional works

More information

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Todd K. Moon and Jacob H. Gunther Utah State University Abstract The popular Sudoku puzzle bears structural resemblance to

More information