On Privatization of Variables for Data-Parallel Execution

Manish Gupta
IBM T. J. Watson Research Center
P. O. Box 218
Yorktown Heights, NY

Abstract

Privatization of data is an important technique that has been used by compilers to parallelize loops by eliminating storage-related dependences. When a compiler partitions computations based on the ownership of data, selecting a proper mapping of privatizable data is crucial to obtaining the benefits of privatization. This paper presents a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We show that there are numerous alternatives available for mapping privatized variables, and that the choice of mapping can significantly affect the performance of the program. We present an algorithm that attempts to preserve parallelism and minimize communication overheads. We also introduce the concept of partial privatization of arrays, which combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. An implementation of these ideas in the phpf prototype compiler for High Performance Fortran on the IBM SP2 machine has shown impressive results.

1 Introduction

Parallelizing compilers have traditionally used scalar privatization [3] to eliminate the storage-related anti and output dependences associated with scalar variables inside loops. Under this technique, private copies of the variable are created for each processor participating in the parallel execution of a loop, so that writes followed by read accesses of these variables in iterations executed by different processors do not interfere with each other.
More recent studies have pointed out the importance of applying this technique to arrays as well, in order to expose parallelism in loops that otherwise appear to be sequential [6]. A number of research compilers now aggressively perform array privatization [18, 10, 8], which has contributed significantly to the capability of these compilers to recognize coarse-grain parallelism in programs. Privatization assumes an even greater importance for parallel execution of programs on machines with high interprocessor communication costs, which include most distributed-memory machines. The penalty for not privatizing a variable amounts to much more than merely the loss of parallelism: it can lead to very high communication overheads.

Under control-based parallelization, in which the entire loop body corresponding to an iteration (or chunk of iterations) of a parallel loop is assigned as a unit to a processor, the actual steps of privatizing a variable, once it has been recognized as privatizable, are relatively straightforward. However, further analysis is needed under data-driven parallelization, in which the assignment of computation to processors is based on the ownership of data, so that all the statements corresponding to a loop iteration may not be executed by the same processor. An example of such a method of parallelization is the owner-computes rule [11], which assigns a computation to the processor that owns the data being modified in that computation. The owner-computes rule and its generalized variants (where the computation may be assigned to processors that own some other data not being modified) are followed by most compilers for languages like High Performance Fortran (HPF) [13, 11, 19, 14, 2, 9, 1]. Many of these compilers have not paid adequate attention to the problem of mapping privatizable scalar and array variables. This paper presents a framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization.
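To make the owner-computes rule concrete, the following minimal Python sketch models it for a one-dimensional BLOCK distribution; the helper names (block_owner, owner_computes_iters) are ours, not taken from any HPF compiler.

```python
# Toy model of the owner-computes rule for a 1-D BLOCK distribution.
# All names here are illustrative, not from an actual HPF compiler.

def block_owner(elem, n, nprocs):
    """Processor that owns element `elem` (0-based) of an n-element
    array distributed (BLOCK) over nprocs processors."""
    block = -(-n // nprocs)        # ceiling division gives the block size
    return elem // block

def owner_computes_iters(n, nprocs, me):
    """Iterations of `do i = 1, n: A(i) = ...` executed by processor
    `me`: exactly those whose lhs element A(i) it owns."""
    return [i for i in range(n) if block_owner(i, n, nprocs) == me]

# Processor 2 of 4 executes only the iterations touching its block
# of an 8-element array.
print(owner_computes_iters(8, 4, 2))   # → [4, 5]
```

Under this model, replicating a variable is equivalent to every processor "owning" it, which is why an assignment to a replicated variable is executed by all processors.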
We describe the different alternatives available for mapping privatized variables, and show how that choice influences parallelization and communication overheads. We present an algorithm to select the alignment of privatizable scalar and array variables that attempts to preserve parallelism and minimize communication overheads. Our algorithm is guided by a realistic communication cost model which takes into account the placement of communication, and hence optimizations like message vectorization. We introduce a novel concept of partial privatization of arrays that combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. The ideas presented in this work have been implemented in the phpf prototype compiler for HPF [9]. Our preliminary results are very encouraging.

2 Mapping of Privatized Scalars

2.1 Alignment Choices

We shall first illustrate the need for scalar privatization, and for different kinds of alignment, using the example shown in Figure 1.

!HPF$ Align (i) with A(i) :: B,C,D
!HPF$ Align (i) with A(*) :: E,F
!HPF$ Distribute (block) :: A
      m = 2
      do i = 2, n-1
S1:     m = m + 1         ! induction variable
S2:     x = B(i) + C(i)   ! align with consumer
S3:     y = A(i) + B(i)   ! align with producer
S4:     z = E(i) + F(i)   ! no alignment
S5:     A(i+1) = y/z
S6:     D(m) = x/z
      end do

Figure 1. Different alignments of privatized scalars

It is necessary to privatize each of the variables m, x, y, and z to achieve partitioned execution of the loop, as replication of any of these variables would force all processors to execute the assignment to that variable under the owner-computes rule. We refer to a statement in which a variable is defined as the producer statement for that variable, and to a statement which uses the value of that variable as a consumer statement.

Alignment with Consumer. Consider the variable x: replicating it would lead to each processor executing the first statement in the loop in every iteration. Furthermore, that would require the values of B(1:n) and C(1:n) to be unnecessarily broadcast to all the processors.
As part of privatization, if x is aligned with the producer reference B(i) or C(i) (i.e., owned by the same processor as the owner of B(i) and C(i)), no communication is needed to compute x. However, the value of x has to be communicated to the owner of D(m) (the value of m is known to be i+1 via induction variable analysis). This communication takes place inside the i-loop, because of a dependence from the definition of x to the use of x inside the loop. If x is aligned instead with the consumer reference D(m), communication is now needed for the two references B(i) and C(i), to the owner of D(m), in the computation of x. However, both of these communications can be moved outside the i-loop and carried out with a collective shift communication.

!HPF$ Align G(i,j) with H(i,j)
!HPF$ Align A(i) with H(i,*)
!HPF$ Distribute H(block,*)
      do i = 1, n
        p = B(i)   ! not needed on all processors
        q = C(i)   ! needed on all processors
        A(i) = H(i,p) + G(q,i)
      end do

Figure 2. Availability requirements for subscripts

The consumer reference for a read reference u is a reference r whose owner needs the value of u during the execution of that statement. Thus, in most cases, under the owner-computes rule, the consumer reference is the lhs (left-hand side) of the assignment statement. For special cases where a read reference, such as a subscript, is needed by all processors, the consumer reference is set to be a dummy replicated reference. As an optimization, for a reference which appears as a subscript of an rhs reference that does not need communication, phpf sets the consumer reference to be the lhs reference, because only the processor executing that statement needs to know the value of the subscript. Thus, for the example shown in Figure 2, the consumer reference for p is A(i), and for q it is the dummy replicated reference.
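The consumer-reference rules above can be sketched in a few lines of Python (a simplified model; the field names and the REPLICATED marker are ours, not phpf's actual data structures):

```python
# Sketch of consumer-reference selection for a read reference of a
# privatizable scalar, following the rules described in the text.
# All names are illustrative placeholders.

REPLICATED = "dummy-replicated-ref"

def consumer_ref(use):
    """use: dict describing one use of the scalar.
    - a value needed by all processors (e.g. a loop bound, or a
      subscript of a communicated rhs reference) -> dummy replicated ref
    - a subscript of an rhs reference needing no communication, or an
      ordinary rhs operand -> the lhs reference of the statement."""
    if use["needed_by_all"]:
        return REPLICATED
    return use["lhs"]

# Figure 2: p subscripts H(i,p), which needs no communication, so only
# the executing processor needs p; q subscripts G(q,i), which is
# communicated, so q is needed on all processors.
p_use = {"needed_by_all": False, "lhs": "A(i)"}
q_use = {"needed_by_all": True,  "lhs": "A(i)"}
print(consumer_ref(p_use))   # → A(i)
print(consumer_ref(q_use))   # → dummy-replicated-ref
```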
Alignment with Producer. The preferable alignment for the variable y is with the producer reference A(i) (or, equivalently, B(i)), which avoids the need for any communication in the computation of y, though communication is still needed for statement S5. Aligning y with the consumer reference A(i+1) would instead lead to an extra communication: both rhs references on S3 would have to be communicated to the owner of A(i+1), and the communication for A(i) would take place inside the i-loop.

Privatization without Alignment. The variable z uses the values of the replicated array elements E(i) and F(i) in its computation. The alignment of z with these producer references would amount to replication, and hence is not desirable. Its alignment with one of the consumer references on statement S5 or S6 would lead to communication being required for the other statement. Since the data needed for the computation of z is available on all processors, it can be privatized without explicit alignment with any other reference. Each processor that executes an iteration of the i-loop under the computation partitioning, as determined by the partitioning of the other statements in that loop, owns and computes a temporary value of z in that loop iteration.

Any scalar variable recognized as an induction variable, such as m in Figure 1, should be privatized without alignment. The phpf compiler replaces the rhs of that assignment statement by the closed-form expression for the value of the induction variable as a function of the surrounding loop indices. Thus, the expression m+1 on S1 is replaced by the expression i+1, which represents the closed-form value of the variable.

Privatization of a scalar without alignment impacts computation partitioning by ensuring that the statement assigning a value to that scalar is not forced to execute on all processors, as would happen if the scalar were replicated. There is no computation partitioning guard associated with the statement [9]. Hence, if that statement appears inside a loop, it is executed by the union of all processors executing any other statement inside that loop for a given iteration. If the statement appears outside any loop, it is executed by all processors. For the purpose of communication analysis, the scalar is viewed as if it had been replicated, and therefore no use of that scalar requires any communication. In fact, phpf selects this mapping only when the computation of the scalar value itself requires no communication.
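The closed-form rewrite for induction variables like m can be sketched as follows, assuming a unit-stride loop and a single unit increment at the top of the loop body, as in Figure 1 (the function name is ours):

```python
def closed_form(init, lower, i):
    """Closed-form value, in iteration i, of an induction variable m
    initialized to `init` before `do i = lower, upper` and updated by
    m = m + 1 at the top of the loop body (as at S1 in Figure 1)."""
    return init + (i - lower + 1)

# Figure 1: m = 2 before the loop, do i = 2, n-1.  Inside iteration i,
# the m used at S6 (the subscript of D) therefore equals i + 1.
print([closed_form(2, 2, i) for i in range(2, 6)])   # → [3, 4, 5, 6]
```

Once the rhs is replaced by this closed form, the assignment no longer carries a loop dependence, and the variable can be privatized without alignment.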
2.2 Algorithm to Determine Mapping

The mapping of scalar variables and privatizable array variables is determined in a separate, first pass during the communication analysis phase of the phpf compiler. It follows an earlier program analysis phase which constructs the static single assignment (SSA) [5] representation of the program and performs constant propagation and induction variable recognition. phpf uses the SSA representation to associate a separate mapping decision with each assignment to a scalar. This permits flexibility in choosing an appropriate mapping when logically different variables used in different segments of a procedure happen to reuse a name. However, for simplicity of communication analysis and code generation, the compiler imposes the restriction that, given a use (read reference) of a scalar variable, all reaching definitions are given an identical mapping. Thus, during the later phases of compilation, there is no scope for ambiguity regarding the mapping of a scalar value: it can be determined by obtaining the mapping information recorded with the first reaching definition of that reference.

DetermineMapping(def, stmt) {
    /* Default mapping of def is replication */
    if (IsPrivatizable(def) == TRUE) then
        RhsReplicated = IsRhsReplicated(stmt)
        if (RhsReplicated == TRUE and IsUniqueDef(def) == TRUE) then
            Add def to NoAlignExam list
        Traverse reached uses of def and select a consumer ref as AlignRef
        if (RhsReplicated == FALSE and
            (no consumer ref selected as AlignRef, or
             alignment of def with AlignRef leads to inner-loop commn.
             for some rhs ref on stmt)) then
            Select a partitioned rhs ref as AlignRef
        if (AlignRef has been selected and alignment valid inside current loop) then
            for each reached use of def do
                for each reaching def of use do
                    Align def with AlignRef
                end do
            end do
}

Figure 3. Pseudocode for determining the mapping of a scalar reference
Any other reaching definition of that use is guaranteed by our algorithm to have the same mapping associated with it. The scalar variables involved in reduction operations are treated in a special manner, which is described in the next subsection. For other scalar variables, Figure 3 gives an overview of the algorithm to determine the mapping for a given scalar definition. The default mapping of each scalar definition is set to replication. We now explain the different steps of our algorithm.

Privatization without alignment. If data flow analysis shows that a given definition is privatizable and not live outside the current loop (the compiler also takes advantage of the NEW clause in the INDEPENDENT directive of HPF to infer this), we first check whether all rhs references on that statement are to replicated data. If so, the compiler considers privatizing the scalar without alignment with any reference, provided the given definition is the only reaching definition of all the reached uses of that definition. This ensures that, in spite of the privatized execution of the statement S computing the scalar value, each reached use of the scalar variable sees the correct value. We note that at this stage, an eligible scalar definition is only added to the list of definitions being considered for privatization without alignment. The reason for this deferral is that there may be rhs references to privatizable scalar or array variables in statement S for which mapping decisions have not yet been made, so those variables appear to be replicated at this stage. At the end of the compiler pass making mapping decisions, the list is examined again, and if all rhs data on the corresponding statement continue to be replicated, the scalar definition is privatized without alignment.

Identification of Alignment Target. The next step in the algorithm is to examine each reached use of def and identify a consumer reference with which def could be aligned. The selection of a single alignment target is done using a heuristic algorithm. If any reached use appears inside a loop bound expression or a subscript that has to be broadcast to all processors (note that subscript values of rhs references not involved in communication need not be made available on all processors), the dummy replicated reference is returned as the selected consumer reference and the traversal through reached uses is terminated. Otherwise, phpf ignores any consumer reference that refers to replicated data, and selects a consumer reference (if any) to partitioned data. This selection process favors a reference in which a distributed array dimension is traversed in the innermost common loop enclosing the scalar definition and the reached use, since alignment with such a reference ensures that the scalar is mapped to different processors during different loop iterations.
For example, inside an i-loop, alignment with a reference A(i) would be preferred over alignment with a reference A(1), where A is a partitioned array. As described earlier, a consumer reference corresponding to a use of the scalar variable is usually the lhs reference of the assignment statement in which the use occurs. If this reference is to a privatizable variable, the compiler invokes the mapping algorithm recursively on that definition to determine the mapping of that variable before determining whether this consumer reference may serve as a suitable alignment target. If there are rhs references to partitioned data on the statement computing the scalar value, the compiler selects one of those producer references (similarly to the selection of the consumer reference) as another potential alignment target. Given a choice between producer and consumer references as the alignment target, our algorithm selects the consumer reference unless that selection leads to inner-loop communication for some rhs reference on the given assignment statement.

!HPF$ Distribute (block,block,*) :: A, B
      do i = 1, n
        do j = 1, n
          ...
          s = ...
          do k = 1, n
            A(i,j,k) = ...   ! AlignLevel = 2
            B(s,j,k) = ...   ! AlignLevel = 3
          end do
        end do
      end do

Figure 4. AlignLevel for array references

Scope of Validity of Alignment. We now describe how the compiler determines the program region in which the alignment information about a scalar variable is well-defined. Given a subscript s in an array reference r, let VarLevel(s) denote the innermost loop nesting level in which the subscript s varies in value. We define SubscriptAlignLevel(s) as:

    SubscriptAlignLevel(s) = VarLevel(s)      if s is an affine function of loop indices
                             VarLevel(s) + 1  otherwise

Thus, SubscriptAlignLevel(s) gives the nesting level of the outermost loop throughout which the value of the subscript s is well-defined. For example, in Figure 4, the k-loop is the outermost loop in which both subscripts s and k are well-defined.
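The SubscriptAlignLevel definition can be checked against the references of Figure 4 with a few lines of Python (the VarLevel values are supplied by hand; all names are ours):

```python
def subscript_align_level(var_level, is_affine):
    """SubscriptAlignLevel(s) = VarLevel(s) if s is an affine function
    of the loop indices, and VarLevel(s) + 1 otherwise."""
    return var_level if is_affine else var_level + 1

# Figure 4: loops i, j, k sit at nesting levels 1, 2, 3.  The
# subscript s is assigned inside the j-loop (VarLevel 2) and is not
# an affine function of the loop indices.
levels = {
    "i": subscript_align_level(1, True),    # 1
    "j": subscript_align_level(2, True),    # 2
    "k": subscript_align_level(3, True),    # 3
    "s": subscript_align_level(2, False),   # 3, i.e. the k-loop
}
print(levels)   # → {'i': 1, 'j': 2, 'k': 3, 's': 3}
```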
We now define AlignLevel(r) as the maximum of the SubscriptAlignLevel values over the subscripts in the partitioned dimensions of r. Therefore, in Figure 4, the AlignLevel of A(i,j,k) is 2, which corresponds to the j-loop, and the AlignLevel of B(s,j,k) is 3, corresponding to the k-loop, which is the outermost loop in which the subscript s is invariant. Given a reference r which is selected as an alignment target of a scalar definition, AlignLevel(r) indicates the outermost loop throughout which the alignment information is valid. Therefore, a scalar definition which is privatizable at nesting level l can be aligned unambiguously with the selected reference r if AlignLevel(r) <= l.

Marking alignment information. Finally, once a valid alignment target for the scalar definition def has been identified, the compiler records that alignment information for each reaching definition of every reached use of def. The mapping information at a use during communication analysis is obtained initially by accessing the information recorded with its first reaching definition, and is cached in a data structure associated with the use for subsequent inquiries. Our procedure ensures that consistent mapping information is seen by each reached use of def.

!HPF$ Align B(i) with A(i,*)
!HPF$ Distribute (block,block) :: A
      do i = 1, n
        s = 0
        do j = 1, n
          s = s + A(i,j)
        end do
        B(i) = s
      end do

Figure 5. Scalar variable involved in reduction

2.3 Mapping of Scalars Involved in Reductions

Any scalar computed in a reduction operation, such as a sum, carried out across a processor grid dimension is handled in a special manner. An additional privatized temporary copy of the scalar is created during code generation to hold the result of the local reduction computation initially performed by each processor. A global reduction operation combines the values of the local operations and stores the result into the variable which retains the original name of the scalar. This scalar variable is replicated in the dimensions in which the reduction takes place. However, it may be privatized with respect to the other processor grid dimensions. Given a statement assigning a value to a scalar variable which is recognized as a reduction, the compiler checks if the scalar definition is privatizable without copy-out with respect to the loop immediately surrounding the reduction loop. If so, the special array reference whose ownership governs the partitioning of the partial reduction operation [9] serves as the alignment target. However, in this case, the compiler constructs a new alignment mapping in which the scalar variable is replicated in each dimension over which the reduction takes place, and is aligned with the target array reference in only the remaining grid dimensions. This alignment information is propagated for each reaching definition of every reached use of the original definition. Finally, at code generation time, another privatized copy of the scalar variable is created, which differs from the original variable only in that it is private on each processor involved in the reduction rather than being replicated.
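As a toy illustration of the two-phase scheme just described, the following Python simulates a sum reduction over one array row distributed blockwise across a grid dimension: each processor reduces its local piece into a privatized temporary, and a global combine leaves the result replicated along that dimension (plain lists stand in for the processor grid; all names are ours):

```python
# Simulate a two-phase sum reduction along one processor grid
# dimension.  Each processor holds a (BLOCK) piece of the row.

def distributed_sum(row, nprocs):
    block = -(-len(row) // nprocs)                      # block size
    pieces = [row[p*block:(p+1)*block] for p in range(nprocs)]
    local = [sum(piece) for piece in pieces]            # privatized temporaries
    s = sum(local)                                      # global combine (e.g. an allreduce)
    return [s] * nprocs                                 # s replicated along the dimension

row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(distributed_sum(row, 4))   # every processor ends up holding 36.0
```

Because only the combine step crosses processors, no broadcast of the row itself is ever needed, which is exactly the benefit the alignment of the reduction scalar is designed to preserve.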
For example, in Figure 5, a sum reduction takes place in the j-loop. The definition of the variable s is verified as being privatizable with respect to the i-loop, and A(i,j) serves as the alignment target. Hence, s is replicated in the second grid dimension and is aligned with the ith row of A in the first grid dimension. As a result of this alignment, the reduction computation can proceed without the need to broadcast the ith row of A to other processors along the first grid dimension.

3 Mapping of Privatized Arrays

We now describe the procedure for mapping privatizable arrays. The phpf compiler currently relies on directives from the programmer to infer that arrays are privatizable. While the basic alternatives explored for the alignment of privatized arrays are the same as those for scalars, there are additional options available when dealing with arrays.

3.1 Basic Procedure

The INDEPENDENT directive for loops in HPF asserts that the iterations of the loop may be executed in any order without changing the program semantics. The NEW clause attached to such a directive supplies a list of variables (see the example in Figure 6) and modifies the INDEPENDENT directive to assert that the independence of different loop iterations holds if new objects are created for the named variables in each loop iteration. Thus, those variables can be regarded as privatizable with respect to the INDEPENDENT loop. phpf is also able to infer the privatizability of an array from a weaker form of a parallel loop directive which indicates that a loop has no true loop-carried value-based dependences. Any lhs array reference in which each subscript is either invariant with respect to the parallel loop or is an affine function of inner loop indices contributes to memory-based loop-carried dependences, which can be eliminated only by privatizing that array. The algorithm to determine the target alignment reference is identical to that used for scalar variables.
Similarly, once an alignment target has been selected and the AlignLevel for that reference determined, the compiler examines each reached use to ensure that it does not appear outside the loop corresponding to the AlignLevel. Any seemingly reached uses outside the loop associated with the NEW directive are assumed to be spurious and hence are ignored. The alignment information is kept in a data structure associated with the loop with respect to which the array has been privatized, and applies to all references to that array variable within that loop. Privatizable arrays used to hold the results of a reduction operation are handled in a manner similar to scalar variables in reduction computations.

3.2 Partial Privatization

The concept of privatization of variables has traditionally been associated with a single parallel loop at a time. With nested loop parallelism, which is enabled by the use of a

multi-dimensional processor grid in HPF, the idea of privatization can be trivially extended to apply to each grid dimension and to each loop. However, yet another alternative which can be considered in this scenario is one combining partitioning and privatization: the array may be partitioned in some grid dimensions and privatized with respect to the other dimensions. We refer to this as partial privatization.

!HPF$ Distribute (*,block,block) :: rsd
!HPF$ INDEPENDENT, NEW(c)
      do k = 2, nz-1
        do j = 2, ny-1
          do i = 2, nx-1
            c(i,j,1) = ...
          end do
        end do
        ...
        do j = 3, ny-1
          do i = 2, nx-1
            rsd(1,i,j,k) = ... c(i,j-1,1) ...
          end do
        end do
      end do

Figure 6. Need for partial privatization

!HPF$ Align (:) with A(:) :: B, C
!HPF$ Distribute (block) :: A
      do i = 1, n
        if (B(i) .ne. 0.0) then
          A(i) = A(i)/B(i)
          if (B(i) < 0.0) go to 100
        else
          A(i) = C(i)
          C(i) = C(i)
        endif
 100    continue
      end do

Figure 7. Privatized execution of control flow statements

Figure 6 shows a program segment adapted from the APPSP program of the NAS benchmarks, which illustrates the benefits of partial privatization. The array c is privatizable with respect to the k-loop, but not with respect to the j-loop. Correspondingly, the compiler will fail in its attempt to privatize the array in both grid dimensions. Clearly, replication in either dimension would lead to loss of parallelism and a great deal of extra communication, due to the owner-computes rule. In accordance with the HPF data distribution directives, the only way to exploit parallelism in both the k-loop and the j-loop is to partition the second dimension of c across the first grid dimension, and to privatize it along the second grid dimension. Under partial privatization, the procedure to determine the AlignLevel of a target reference is modified to consider subscripts only in those distributed dimensions in which the array is to be privatized.
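The per-grid-dimension decision behind partial privatization can be sketched as a toy model (the predicate and helper names are ours; the real analysis is the AlignLevel machinery described above):

```python
def map_array(privatizable_in, grid_dims):
    """For each processor grid dimension, choose to privatize the
    array when the parallel loop mapped to that dimension allows it,
    and otherwise to partition the corresponding array dimension.
    privatizable_in: dict loop -> bool (is the array privatizable
    with respect to that loop?); grid_dims: dict grid dim -> loop."""
    return {d: ("privatize" if privatizable_in[loop] else "partition")
            for d, loop in grid_dims.items()}

# Figure 6: c is privatizable with respect to the k-loop but not the
# j-loop, so it is partitioned across grid dimension 1 (which carries
# the j-loop) and privatized along grid dimension 2 (the k-loop's).
mapping = map_array({"k": True, "j": False}, {1: "j", 2: "k"})
print(mapping)   # → {1: 'partition', 2: 'privatize'}
```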
Thus, in the example shown in Figure 6, since c is to be privatized only in the second grid dimension, the AlignLevel for rsd(1,i,j,k) is obtained as 1 (corresponding to the k-loop) rather than 2 (corresponding to the j-loop). The information from the NEW clause indicates that the value of c can be discarded after the k-loop. Hence, the compiler is able to proceed with partial privatization of c, whereas complete privatization was not possible. It is well known that distribution of arrays on a multi-dimensional rather than a single-dimensional processor grid can lead to more scalable solutions for many stencil codes, due to the lower volume of interprocessor communication resulting from such a distribution. For the APPSP program specifically, a 3-D distribution of arrays has been known to outperform the 2-D distribution, which in turn has outperformed a 1-D distribution in a hand-tuned message-passing implementation [15].

4 Control Flow Statements

The handling of control flow statements is often left vague under the data-driven compilation model. A default strategy of executing these statements on all processors would lead to problems very similar to those encountered with indiscriminate replication of variables. For example, if all processors were forced to execute every control flow statement in Figure 7, the loop would not be parallelized effectively, and expensive communication would be required for the array B. On the other hand, privatized execution of these statements by the owner of A(i) (which also owns B(i) and C(i)) eliminates the need for any communication and allows the loop to be parallelized effectively, as the loop bounds can be shrunk [9] in the final SPMD code. The phpf compiler applies the following rules to privatize the execution of a control flow statement S (otherwise executed by all processors, by default), and, if necessary, to identify a reference which will serve as the alignment target for any data needed to execute that statement.
If the statement S cannot transfer control to a target statement outside the body of loop L, then S does not contribute to a computation partitioning guard [9] for the loop L. Essentially, S will be executed by the union of all processors executing any other statement inside loop L for a given iteration. Conceptually, this corresponds to the notion of privatization without alignment. Any data referenced in the control predicate of S has to be communicated to the union of all processors that participate in the execution of any statement that is control-dependent on S.

In the example shown in Figure 7, both of the if statements transfer control only to a statement inside the i-loop. Hence the execution of those statements can be privatized. Furthermore, following the owner-computes rule, only the owner of A(i) (or, equivalently, C(i)) needs to participate in the execution of any statement that is control-dependent on either of those control flow statements. Therefore, no communication is needed for the predicates of those if statements, as B(i) is owned by the same processor as A(i).

5 Experimental Results

The ideas presented in this paper have been implemented in the phpf prototype compiler [9]. In this section, we describe some preliminary experimental results to show the impact of this analysis. We present performance results on three benchmark programs, each of which illustrates different aspects of our procedure for mapping privatized variables. The first program is TOMCATV, a mesh generator with Thompson's solver. The program, originally from the SPEC92FP benchmark suite, has been augmented with HPF directives. The second program, DGEFA, performs Gaussian elimination on a matrix with partial pivoting. It is the HPF version of the original routine from LINPACK, in which we have applied procedure inlining by hand. The third program, APPSP from the NAS benchmarks, is a pseudo-application for performance evaluation of a solver for five coupled, nonlinear partial differential equations. Each of these programs was compiled with the -O3 option for optimizations. All measurements were done using 16 thin nodes of an IBM SP2.

5.1 TOMCATV

Table 1 shows the performance of TOMCATV obtained with three different levels of optimization.
The first version, which is the most naive version of the compiler, does not perform privatization and replicates all scalar variables. The second version performs privatization, but always aligns each scalar definition with a producer reference, i.e., with a partitioned array or scalar reference on that statement. The third version applies the algorithm described in Section 2.2 to determine the alignment of privatized scalar variables.

Table 1. Performance of TOMCATV ((*,block) distribution) on the IBM SP2: execution time (sec) per processor count, under replication, producer alignment, and selected alignment.

Table 2. Performance of DGEFA ((*,cyclic) distribution) on the IBM SP2: execution time (sec) per processor count, with the default mapping and with the selected alignment.

As expected, replication of all scalar variables leads to extremely poor performance. This can be attributed to the loss of parallelism and excessive communication in the main computational loop nest of the program. We find the performance figures in the second column even more interesting. They show that, in spite of privatization, there can be a substantial loss of performance if the scalar variables are not mapped carefully. The alignment of a privatizable scalar variable with a partitioned producer reference on the same statement is quite simple to support. In contrast, alignment with a consumer reference requires a more complex procedure that may be recursively invoked to deal with a privatizable consumer reference which in turn needs to be aligned with a target reference. However, alignment of scalar variables with producer references leads to a considerable amount of inner-loop communication in TOMCATV. Our algorithm is able to avoid that by selecting alignment with consumer references in the main computational loop of the program. With proper alignment, we obtain performance improvements of more than two orders of magnitude on 16 processors. In fact, it is only with the appropriate alignment of scalar variables that the program exhibits speedups.
5.2 DGEFA

The array on which Gaussian elimination is performed is partitioned column-wise in a cyclic manner. In each step of the elimination, partial pivoting involves a maxloc operation along a single array column, which is mapped to a single processor. Our optimization to align privatizable variables holding the results of a reduction operation in the dimensions not involved in the reduction leads to the computation for partial pivoting being confined to just the relevant processor in each step, and also helps avoid unnecessary communication. Table 2 shows the performance results of DGEFA without and with this optimization. The communication overhead incurred when the reduction variable is replicated across the columns remains roughly constant, but it accounts for an increasing percentage of the execution time as the number of processors is increased.

5.3 APPSP

We present performance results for two HPF versions of the program: one with a 1-D distribution of arrays and redistribution (transpose) of data in the sweepz subroutine, and the other employing a fixed 2-D distribution throughout the program. The first two columns of results in Table 3 show that the execution time of the program becomes prohibitively large if array privatization is disabled. In fact, in that case, we had to abort the parallel program after more than a day of execution. The remaining columns show that with a 2-D distribution of arrays, even regular array privatization does not help and the program performs extremely poorly. However, with partial privatization employed by the compiler, we obtain significantly better performance. The program version using the 2-D distribution starts out with better performance at smaller processor counts, mainly due to the absence of global transpose operations in the sweepz subroutine, but does not scale as well as the version using the 1-D distribution, unlike hand-tuned message-passing versions of APPSP [15]. An examination of the message-passing code produced by the HPF compiler showed that there is considerable scope for improving the performance of that version by global message combining across loop nests. The phpf compiler does not currently perform that optimization.
6 Related Work

There has been a great deal of work on techniques related to privatization for exposing more parallelism, such as scalar expansion [16], scalar privatization [3], array expansion [7], and array privatization [18, 10, 8]. Our work focuses on the additional analysis needed to apply privatization effectively to data-driven execution, and hence is complementary to previous work. Knobe and Dally present a subspace model and describe an algorithm, meant to be applied before data partitioning and scheduling, that attempts to resolve mismatches in the shape of various operands [12]. Their method achieves privatization by adding an expansion dimension that is indexed by a loop induction variable. They also apply the subspace model to optimize the execution of control flow statements. However, they do not discuss alternatives regarding the alignment of privatized data with other partitioned data, or the impact of such a mapping on the loop-level placement of communication involving privatized data. Chatterjee et al. present the concept of mobile alignment of arrays with respect to loops [4], which is similar to the idea of array privatization. Their work focuses on choosing between replication and mobile alignment of data. Their algorithm does not take into account information about the privatizability of arrays, which can make code generation difficult or expensive for arrays with mobile alignment that are not privatizable. The work of Palermo et al. [17] is the most closely related to ours. They use a simpler algorithm in which an assignment to a privatized scalar variable is executed by each processor that participates in the execution of any statement in the given loop iteration, which is similar to our notion of privatization without alignment with a specific reference. This can lead to more communication if fewer processors use the scalar value than are made to execute the assignment statement.
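The communication difference between executing a privatized assignment everywhere and aligning it with its use can be illustrated with a toy message count. The processor sets and operand owner below are hypothetical, chosen only to make the contrast concrete:

```python
# Toy message count contrasting two mappings of a privatized scalar
# assignment t = a(i) (processor sets and owner are hypothetical).

def messages_needed(executors, operand_owner):
    # Each executing processor other than the owner of the right-hand
    # side operand must receive that operand in a separate message.
    return len(executors - {operand_owner})

active_procs = set(range(8))   # processors active in the loop iteration
users_of_t   = {3}             # processors that actually read t
owner_of_a   = 0               # owner of the rhs operand a(i)

# "Execute everywhere" (privatization without alignment): every active
# processor runs the assignment, so every one of them needs the operand.
everywhere = messages_needed(active_procs, owner_of_a)   # 7 messages

# "Align with the use": only the processor that uses t runs the
# assignment, so only it needs the operand.
aligned = messages_needed(users_of_t, owner_of_a)        # 1 message

assert aligned <= everywhere
```

A realistic cost model must also account for message vectorization and for broadcasts, but the basic asymmetry shown here is what motivates aligning the privatized variable with a specific reference.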
An earlier implementation of the phpf compiler [9] employed a simpler and more limited version of our analysis for handling privatizable scalar variables. It did not privatize a scalar definition that was not the only reaching definition of the reached uses, and it did not deal with privatizable arrays. Privatization of variables is performed by many other HPF compilers as well; however, the method of determining the ownership of those variables has usually not been discussed.

7 Conclusions

In this paper, we have presented a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We have shown that there are numerous alternatives available for mapping privatized variables, and that the choice of mapping can significantly affect the performance of the program, by as much as two orders of magnitude in some cases on the IBM SP2. Our algorithm for selecting this mapping is guided by a realistic communication cost model that takes into account optimizations like message vectorization. We have also introduced the notion of partial privatization of arrays, which enables a compiler to exploit nested parallelism even when that nested form is incompatible with the conventional definition of array privatization. Our preliminary results, based on an implementation of these ideas in the phpf compiler, have been very encouraging. In the future, we plan to integrate our mapping techniques with automatic array privatization.

Acknowledgements

The author wishes to thank Sam Midkiff for his help in implementing the technique of partial privatization.

[Table 3. Performance of APPSP (n = 64) on IBM SP2: execution time in seconds by number of processors, for four versions: 1-D, No Array Priv.; 1-D, Priv.; 2-D, No Partial Priv.; 2-D, Partial Priv. The 1-D runs without array privatization were aborted after more than a day.]

References

[1] P. Banerjee, J. Chandy, M. Gupta, E. Hodges, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su. An overview of the PARADIGM compiler for distributed-memory multicomputers. IEEE Computer, October.

[2] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. A compilation approach for Fortran 90D/HPF compilers on distributed memory MIMD computers. In Proc. Sixth Annual Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August.

[3] M. Burke, R. Cytron, J. Ferrante, and W. Hsieh. Automatic generation of nested, fork-join parallelism. Journal of Supercomputing, pages 71-88.

[4] S. Chatterjee, J. R. Gilbert, and R. Schreiber. Mobile and replicated alignment of arrays in data-parallel programs. In Proc. Supercomputing '94, Washington, D.C., November.

[5] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and F. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4), October.

[6] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua. Experience in the automatic parallelization of four Perfect-Benchmark programs. In Proc. 4th Workshop on Languages and Compilers for Parallel Computing. Pitman/MIT Press, August.

[7] P. Feautrier. Array expansion. In Proc. ACM International Conference on Supercomputing, July.

[8] J. Gu, Z. Li, and G. Lee. Symbolic array dataflow analysis for array privatization and program parallelization. In Proc. Supercomputing '95, San Diego, CA, December.

[9] M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December.

[10] M. Hall, S. Amarasinghe, B. Murphy, S.-W. Liao, and M. Lam. Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proc. Supercomputing '95, San Diego, CA, December.

[11] S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August.

[12] K. Knobe and W. Dally. The subspace model: A theory of shapes for parallel systems. In Proc. 5th Workshop on Compilers for Parallel Computers, Malaga, Spain, June.

[13] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. E. Zosel. The High Performance FORTRAN Handbook. The MIT Press, Cambridge, MA.

[14] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4), October.

[15] V. K. Naik. Scalability issues for a class of CFD applications. In Proc. Scalable High Performance Computing Conference, May.

[16] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12), December.

[17] D. Palermo, E. Su, E. Hodges IV, and P. Banerjee. Compiler support for privatization on distributed-memory machines. In Proc. 25th International Conference on Parallel Processing, Bloomingdale, IL, August.

[18] P. Tu and D. Padua. Automatic array privatization. In Proc. 6th Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August.

[19] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.


CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012

CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012 CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night

More information

Compiling High Performance Fortran for Distributedmemory

Compiling High Performance Fortran for Distributedmemory Compiling High Performance Fortran for Distributedmemory Systems Jonathan Harris John A. Bircsak M. Regina Bolduc Jill Ann Diewald Israel Gale Neil W. Johnson Shin Lee C. Alexander Nelson Carl D. Offner

More information

Data Dependency. Extended Contorol Dependency. Data Dependency. Conditional Branch. AND OR Original Control Flow. Control Flow. Conditional Branch

Data Dependency. Extended Contorol Dependency. Data Dependency. Conditional Branch. AND OR Original Control Flow. Control Flow. Conditional Branch Coarse Grain Task Parallel Processing with Cache Optimization on Shared Memory Multiprocessor Kazuhisa Ishizaka, Motoki Obata, Hironori Kasahara fishizaka,obata,kasaharag@oscar.elec.waseda.ac.jp Dept.EECE,

More information

Transforming Complex Loop Nests For Locality

Transforming Complex Loop Nests For Locality Transforming Complex Loop Nests For Locality Qing Yi Ken Kennedy Computer Science Department Rice University Abstract Because of the increasing gap between the speeds of processors and standard memory

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano

Building a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

Horizontal Aggregations for Mining Relational Databases

Horizontal Aggregations for Mining Relational Databases Horizontal Aggregations for Mining Relational Databases Dontu.Jagannadh, T.Gayathri, M.V.S.S Nagendranadh. Department of CSE Sasi Institute of Technology And Engineering,Tadepalligudem, Andhrapradesh,

More information

Example of a Parallel Algorithm

Example of a Parallel Algorithm -1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software

More information

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution

Multi-Domain Pattern. I. Problem. II. Driving Forces. III. Solution Multi-Domain Pattern I. Problem The problem represents computations characterized by an underlying system of mathematical equations, often simulating behaviors of physical objects through discrete time

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements

Programming Languages Third Edition. Chapter 9 Control I Expressions and Statements Programming Languages Third Edition Chapter 9 Control I Expressions and Statements Objectives Understand expressions Understand conditional statements and guards Understand loops and variation on WHILE

More information

Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Using Analytic QP and Sparseness to Speed Training of Support Vector Machines Using Analytic QP and Sparseness to Speed Training of Support Vector Machines John C. Platt Microsoft Research 1 Microsoft Way Redmond, WA 9805 jplatt@microsoft.com Abstract Training a Support Vector Machine

More information

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:!

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:! Class Notes 18 June 2014 Tufts COMP 140, Chris Gregg Detecting and Enhancing Loop-Level Parallelism Loops: the reason we can parallelize so many things If the compiler can figure out if a loop is parallel,

More information

Qualifying Exam in Programming Languages and Compilers

Qualifying Exam in Programming Languages and Compilers Qualifying Exam in Programming Languages and Compilers University of Wisconsin Fall 1991 Instructions This exam contains nine questions, divided into two parts. All students taking the exam should answer

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming Outline OpenMP Shared-memory model Parallel for loops Declaring private variables Critical sections Reductions

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Software Testing. 1. Testing is the process of demonstrating that errors are not present.

Software Testing. 1. Testing is the process of demonstrating that errors are not present. What is Testing? Software Testing Many people understand many definitions of testing :. Testing is the process of demonstrating that errors are not present.. The purpose of testing is to show that a program

More information

Single-Pass Generation of Static Single Assignment Form for Structured Languages

Single-Pass Generation of Static Single Assignment Form for Structured Languages 1 Single-Pass Generation of Static Single Assignment Form for Structured Languages MARC M. BRANDIS and HANSPETER MÖSSENBÖCK ETH Zürich, Institute for Computer Systems Over the last few years, static single

More information

Data structures for optimizing programs with explicit parallelism

Data structures for optimizing programs with explicit parallelism Oregon Health & Science University OHSU Digital Commons CSETech March 1991 Data structures for optimizing programs with explicit parallelism Michael Wolfe Harini Srinivasan Follow this and additional works

More information

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Todd K. Moon and Jacob H. Gunther Utah State University Abstract The popular Sudoku puzzle bears structural resemblance to

More information