Idiom Recognition in the Polaris Parallelizing Compiler

Bill Pottenger and Rudolf Eigenmann
Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign
1308 West Main Street, Urbana, Illinois 61801

This work is supported by the National Security Agency and by U.S. Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government.

Abstract

The elimination of induction variables and the parallelization of reductions in FORTRAN programs have been shown to be integral to performance improvement on parallel computers [7, 8]. As part of the Polaris project [5], compiler passes that recognize these idioms have been implemented and evaluated. Developing these techniques to the point necessary to achieve significant speedups on real applications has prompted solutions to problems that have not been addressed in previous reports on idiom recognition techniques. These include analysis techniques capable of disproving zero-trip loops, symbolic handling facilities to compute closed forms of recurrences, and interfaces to other compilation passes such as the data-dependence test. In comparison, the recognition phase of solving induction variables, which has received the most attention so far, has in fact turned out to be relatively straightforward. This paper provides an overview of techniques described in more detail in [12].

1 Introduction

The translation of sequential programs into parallel form is a well-established discipline. An introduction to this topic can be found in [2]. Parallelizing translation techniques deal mostly with loops, whose iterations can be executed in parallel if no iteration produces a data value that is consumed by another iteration. In some circumstances such true dependences can be removed, for example if the value to be consumed can be expressed directly as a function of the loop index variable(s). We call this an induction variable and the direct expression its closed form. Another example of removing a true dependence involves an associative and commutative operation performed across loop iterations. Since the order of such operations can be changed, iterations need not wait for the previous iteration to perform its operation, as long as we ensure exclusive access to the operand. Important operations of this type are reductions, e.g., accumulating values across loop iterations.

Both induction and reduction idioms are recognized by available parallelizing compilers if they have simple forms. In our previous experiments [7, 8] we have seen the need for recognizing more general forms of these patterns, notably induction variables in triangular loops and in multiplicative expressions, and reduction operations on arrays in histogram form. We discuss these variants below. We have implemented new and powerful idiom recognition and solution techniques in the Polaris compiler [5, 6], which contribute towards substantial performance improvements over previous generations of parallelizing compilers.

2 Induction Variable Substitution

The following example shows a triangular induction variable that occurs inside a triangular loop nest (the inner loop bound depends on the outer loop index). The nest can be translated into parallel form as shown. Available compilers typically are able to substitute the induction variable in the inner loop only.
    iv = 0
    do i = 1, m                          do parallel i = 1, m
      do j = 1, i              ==>         do parallel j = 1, i
        a(iv) = ...                          a(j + (i**2-i)/2 - 1) = ...
        iv = iv + 1

Our generalized induction variable algorithm performs the complete translation in two steps. The first step recognizes induction variables that match the statement pattern

    iv = iv + inc_expression

where inc_expression can contain outer loop indices, other induction variables, and loop-invariant terms. Induction variables may also appear in more than one induction statement, at different levels of a loop nest, as exemplified in Section 2.2. The closed form of the induction variable is computed in the second step.

In the following discussion we consider a loop nest containing an induction variable iv. Loop L has n iterations and the variable iter denotes the current iteration. For the sake of simplicity we treat iteration spaces from 1 to n (this is easily derived from the actual loop bounds and step). The first and last statements of loop L are labeled loop_entry and loop_exit, respectively.

Definition 1: Given a loop L, the total increment Δiv(iter) of the additive induction variable iv in a single iteration of the body of L is defined as:

    \Delta iv(iter) = \sum_{s = loop\_entry}^{loop\_exit} (\text{increments to } iv \text{ in } s)

Here s iterates over the statements in loop L from loop_entry to loop_exit. The increment to iv is zero if s is not an induction statement. Inner loops are considered a single (compound) statement and their increment is determined using Definition 3.

Definition 2: The increment at the beginning of iteration i of the additive induction variable iv is defined as:

    \Delta iv@i = \sum_{iter = 1}^{i-1} \Delta iv(iter)

Here Δiv(iter) is summed over the iteration space of L from the first to the (i-1)th iteration.

Definition 3: The increment of the additive induction variable iv upon completion of loop L is defined as:

    \Delta iv@n = \sum_{iter = 1}^{n} \Delta iv(iter)

Here we are summing Δiv(iter) over the entire iteration space of L, from 1 to n.

Similar definitions hold for multiplicative induction variables. We refer the reader to the more detailed description [12].

Initially, Δiv(iter) is computed by a lexical scan which forms a symbolic sum of the right-hand-side increments at all induction sites of a given induction variable iv. Once Δiv(iter) has been computed for a given iv, the symbolic expressions Δiv@i and Δiv@n are computed using a symbolic sum function based on Bernoulli numbers [13].

The algorithm proceeds recursively when an inner loop is encountered. The current value of Δiv@i in the enclosing loop is used as the initial value of iv at entry to the nested loop. The sum is then determined for the inner loop (recursing again as necessary), and the expression Δiv@i is formed for the inner loop. In other words, when iv occurs inside a nested loop, Δiv(iter) can only be partially calculated for outer loops and becomes fully known in these outer loops only as Δiv@n is recursively returned from inner loop sums. As the recursion unwinds and loop exits are encountered, the expressions Δiv@n are calculated. These are the last values, which will be used outside the loop if the induction variable's live range extends beyond the loop exit. The overall picture is one of a statement-wise traversal of a given loop body, first summing increments, then multiplying across iteration spaces, all done recursively as necessary.

The final substitution step is straightforward. Briefly, traversing from outside in, each loop header is annotated with the value of the induction variable at entry to the loop. The same annotation is made at the site of each use of the induction variable, including any increments encountered in the body of the loop up to the point of use. Our current implementation does a direct substitution of the closed form.

The algorithm works similarly for multiplicative induction variables. Instead of sum operations, however, products are formed. Again, we refer the reader to [12] for additional detail.
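To make these two steps concrete, the following sketch recomputes the closed form for the triangular example above. It is illustrative only: it uses Python with sympy, whereas Polaris implements its own symbolic sum function based on Bernoulli numbers [13] over its internal program representation; sympy's summation merely plays the same role here.

    import sympy as sp

    i, j, it = sp.symbols('i j iter', positive=True, integer=True)

    # Step 1: Delta_iv(iter) for the outer loop. A lexical scan of the body finds
    # one increment of 1 per trip of the inner "do j = 1, i" loop, so the total
    # increment in outer iteration `iter` is a symbolic sum over j.
    delta_iv = sp.summation(sp.Integer(1), (j, 1, it))       # equals iter

    # Step 2: Delta_iv@i, the increment at entry to outer iteration i, is the
    # sum of Delta_iv(iter) over iterations 1 .. i-1.
    iv_at_entry = sp.summation(delta_iv, (it, 1, i - 1))     # equals (i**2 - i)/2

    # At the j-th trip of the inner loop, iv has been incremented j-1 more times
    # past its initial value 0, so the substituted subscript is:
    subscript = sp.expand(iv_at_entry + (j - 1))
    print(subscript)    # i**2/2 - i/2 + j - 1, i.e., j + (i**2 - i)/2 - 1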
2.1 Zero Trip Test

If the upper bound UB of a loop is less than the lower bound LB, then the loop is not executed, and as a result the computation of the closed form of an induction variable must include analysis techniques capable of detecting zero-trip loops. Our algorithm handles this problem with new techniques termed Symbolic Range Propagation [4]. In all cases save one of our application test suite, we prove that loops do not have zero trips and hence that the transformation is correct. The remaining case requires the use of interprocedural range propagation, which has not yet been implemented.

In addition, we have determined that it is in fact sufficient that loops meet the weaker requirement UB ≥ LB - 1. Expressed in terms of Definition 3 above, for example,

    UB \geq LB - 1 \;\Rightarrow\; \Delta iv@n \text{ is correct.}

In words, this simply means that the last value iv_last of induction variable iv in loop L is correct given UB ≥ LB - 1. Consider the following example:

    iv = 0
    do i = 1, n                          do parallel i = 1, n
      do j = 1, m              ==>         a(i + 2*i*m) = ...
        iv = iv + 2
      iv = iv + 1
      a(iv) = ...

Intuitively, one concludes that the condition m ≥ 1 ensures the correctness of the transformation; in fact, the transformation is still correct for m = 0 (the inner loop then contributes nothing, and the closed form i + 2im reduces to i, which is still the correct value). We have termed this an exact zero trip and have found it important in our application test suite.
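A minimal sketch of this weaker test (again Python with sympy; illustrative only, since Polaris performs the proof with its Symbolic Range Propagation pass [4]) for the inner loop of the example above:

    import sympy as sp

    m = sp.symbols('m', integer=True)

    def last_value_correct(lb, ub, facts):
        """True if UB >= LB - 1 is provable from the given range facts, so the
        closed form of the induction variable is valid even if the loop may
        execute zero trips."""
        return sp.ask(sp.Q.nonnegative(ub - lb + 1), facts)

    # For the inner loop "do j = 1, m": knowing only m >= 0, the strict
    # no-zero-trip condition m >= 1 cannot be proven, but the weaker
    # requirement UB >= LB - 1 already holds (the exact zero trip case).
    print(last_value_correct(sp.Integer(1), m, sp.Q.nonnegative(m)))   # True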

2.2 Wrap-Around Variables

A wrap-around variable is classically defined as a variable that takes on the value of an induction variable after one iteration of a loop [11]. There is one important case in our application test suite where the recognition of wrap-around loop bounds is a necessary precursor to the solution of an induction variable:

    m = 0
    do j = 1, i
      lb = j
      ub = i
      do k = i, n
        do l = lb, ub
          m = m + 1
          a(m) = ...
        lb = 1
        ub = k + 1
      m = m + i

Note the presence of the loop-variant wrap-around variables lb and ub on the l loop, which must be solved in order to determine the closed form of the induction variable m. After the induction transformation we have:

    m = 0
    do parallel j = 1, i
      do parallel l = j, i
        private m
        m = l + (i + (-9*i**2 - 3*i**4 + 6*i + 6*i**3 - 6*i*n - 6*i*n**2
                + 6*n*i**2 + 6*i**2*n**2)/4 - 3*n - 3*j**2 - n**2 + 2*i**3
                + 3*j + 3*i**2 - 3*i*j - 3*j*i**2 + 3*j*n + 3*j*n**2)/6
                - 2*i + 2*i*j
        a(m) = ...
      do parallel k = 1+i, n
        do parallel l = 1, k
          private m
          m = l + ((-9*i**2 - 3*i**4 + 6*i + 6*i**3 - 6*i*n - 6*i*n**2
                  + 6*n*i**2 + 6*i**2*n**2)/4 - 3*k - 3*n - 3*j**2 - 3*n**2
                  - 2*i + 2*i**3 + 3*j + 3*k**2 - 3*i*j - 3*j*i**2 + 3*j*n
                  + 3*j*n**2)/6 - i + 2*i*j
          a(m) = ...

Employing powerful symbolic manipulation, the induction pass first recognized and then removed the wrap-around loop bounds lb and ub by peeling the first iteration of the k loop, allowing the solution of the induction on m within the outermost i loop. Note that the k loop now executes an exact zero trip when i = n, and that the correctness of this transformation has been proven using the techniques described in Section 2.1. As noted in Section 2, the value of m has been substituted directly on the right-hand side of each remaining induction site, and m is now a loop-private variable (note also that the third induction site, just inside the outermost i loop, has been automatically removed). As is clear from the example, direct substitution introduces unnecessary overhead in inner loops. For this reason we are currently implementing an on-demand substitution algorithm which incorporates strength reduction of complex expressions of this nature.

3 Reduction Recognition

The following example represents an important pattern of reduction operations that we have found in the analysis of real application programs:

    do i = 1, n
      do j = 1, n - i
        fx(i) = fx(i) + exp1
        fx(i+j) = fx(i+j) + exp2

The loop nest accumulates into the array FX. Different iterations accumulate into different array elements. This is what we term a histogram reduction. The Polaris-transformed code is shown below:

    do parallel i = 1, n
      do j = 1, cpus
        fx0(i, j) = 0

    do parallel i = 1, n
      private p
      p = procid
      do j = 1, n - i
        fx0(i, p) = fx0(i, p) + exp1
        fx0(i+j, p) = fx0(i+j, p) + exp2

    do parallel i = 1, n
      do j = 1, cpus
        fx(i) = fx(i) + fx0(i, j)

3.1 Recognition Pass

The algorithm for recognizing reductions searches for assignment statements within a given loop of the form:

    A(\sigma_1, \sigma_2, \ldots) = A(\sigma_1, \sigma_2, \ldots) + \otimes

where ⊗ represents an arbitrary expression and A may be a multi-dimensional array with subscript vector {σ1, σ2, ...}, which may contain both loop-variant and loop-invariant terms. Neither σi nor ⊗ may contain a reference to A, and A must not be referenced elsewhere in the loop outside other reduction statements. Of course {σ1, σ2, ...} may be null (i.e., A is a scalar variable). The algorithm recognizes potential reductions which fall into the two classes of histogram reductions (where one of the subscript dimensions, e.g., σ1, is loop-variant) and single-address reductions (where all dimensions are loop-invariant).

The reduction recognition pass of Polaris is based on powerful pattern matching primitives that are part of the Polaris FORBOL environment [14]. In the recognition pass, variables that match the above pattern are flagged as potential reduction variables.

3.2 Data Dependence Pass

The data dependence pass analyzes candidate reduction variables. If it can prove independence, then it removes the reduction flag. This situation occurs if all loop iterations accumulate into separate elements of an array.
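The statement pattern of Section 3.1 can be sketched as follows. This is plain Python over a toy tuple representation of statements, purely illustrative: Polaris performs the match with its pattern-matching primitives over its real intermediate representation [14], and all names here are invented for the sketch.

    # A statement is ('assign', (array, subscripts), rhs); expressions are nested
    # tuples or plain strings. fx(i) = fx(i) + exp1 becomes:
    #   ('assign', ('fx', ('i',)), ('+', ('fx', ('i',)), 'exp1'))
    def mentions(e, name):
        """Structural search for an array name inside a toy expression tree."""
        if isinstance(e, tuple):
            return any(mentions(x, name) for x in e)
        return e == name

    def reduction_candidate(stmt):
        """Return (array, subscripts) if stmt matches A(s1,...) = A(s1,...) + expr
        with no other reference to A; otherwise None."""
        kind, (array, subs), rhs = stmt
        if kind != 'assign' or not (isinstance(rhs, tuple) and rhs[0] == '+'):
            return None
        base, expr = rhs[1], rhs[2]
        if base != (array, subs):       # must read and write the same element
            return None
        if mentions(expr, array) or mentions(subs, array):
            return None                 # A may not appear in expr or subscripts
        return (array, subs)

    print(reduction_candidate(('assign', ('fx', ('i',)),
                               ('+', ('fx', ('i',)), 'exp1'))))  # ('fx', ('i',))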
3.3 Transformation Pass

In the transformation stage, three different types of parallel reductions can be generated. They are termed blocked, privatized, and expanded. All three schemes take advantage of the fact that the sum operation is mathematically commutative and associative, and thus the accumulation statement sequence can be reordered. Although it is well known that this can lead to roundoff errors, we have not found this to be a problem in our experiments. Polaris includes a switch that allows the user to disable the transformation, which is the common way of dealing with this issue in many compilers.

The first scheme, termed blocked, involves the insertion of synchronization primitives around each reduction statement, making the sum operation atomic. In our example the reduction statements could simply be enclosed by lock/unlock pairs, allowing the loop to be executed in parallel. This is an elegant solution where the architecture provides fast synchronization primitives. However, on the machines used in our experiments the synchronization overhead was high, and as a result we focused our efforts on the remaining methods.

In privatized reductions, duplicates of the reduction variable that are private to each processor are created and used in the reduction statements in place of the original variable. The partial sums accumulated by these variables are initialized to zero at loop entry. In this way, the loop can be executed in a parallel doall fashion without the need for any synchronization other than the sum across processors executed prior to final loop exit.
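The structure of the privatized scheme, reduced to a single-address sum, can be seen in the following sketch (plain Python, with threads standing in for the parallel doall; the function and all names are illustrative, not Polaris output):

    import threading

    def parallel_sum(values, nthreads=4):
        """Privatized reduction: each thread accumulates into its own private
        partial sum; the only cross-thread step is the final combine."""
        partial = [0.0] * nthreads            # one private copy per processor

        def work(p):
            s = 0.0                           # initialized to zero at loop entry
            for k in range(p, len(values), nthreads):  # doall over iterations
                s += values[k]                # reduction statement, no locking
            partial[p] = s

        threads = [threading.Thread(target=work, args=(p,)) for p in range(nthreads)]
        for t in threads: t.start()
        for t in threads: t.join()
        return sum(partial)                   # sum across processors at loop exit

    print(parallel_sum(list(range(1, 101))))  # 5050.0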

Scheme three, used in our example, is termed expanded reductions. It is similar to the second scheme, but collects the partial sums in a shared array that has an additional dimension equal to the number of processors executing in parallel. All reduction variables are replaced by references to this new, global array, and the newly created dimension is indexed by the number of the processor executing the current iteration. The transformed loop may be executed fully in parallel. The partial sums are combined in a separate loop following the original one. For histogram reductions this loop can be executed in parallel as well, as shown in the example above.

One thorny issue dealt with in implementing methods two and three was the determination of the size of the partial-sum variables for histogram reductions. Often it is not possible to determine this size from the code surrounding the loop. Because of this we again took advantage of Symbolic Range Propagation [4] to determine these sizes.

4 Performance Results

Table 1 shows the speedups we have obtained on an 8-processor SGI Challenge (R4400), relative to the serial execution speed, for three codes in our application suite. We measured the overall speedup due to Polaris optimizations, and the diminished speedups after disabling the induction and reduction transformations, respectively. For comparison purposes, the benchmarks under discussion were also compiled and executed using the commercially available SGI parallelizer, Power Fortran Accelerator (PFA), with the additional non-default optimization options -ag=a and -r=3.

    [Table 1: Overall Program Speedups. Columns: MDG, TRFD, TURB3D; rows: serial, PFA, Polaris, Polaris (no reduction), Polaris (no induction).]

In all programs the Polaris compiler achieves significant speedups. In TURB3D this is about 9% greater than PFA, and in the other programs PFA achieves little or no speedup (in fact, results from more recent runs of Polaris which incorporate full inlining bring the speedup figure for TURB3D up to 6). The speedups are a result of the transformations discussed in this paper in conjunction with additional techniques described in [6]. As shown by the table entries (no reduction) and (no induction) above, in all cases the performance drops substantially when these compiler capabilities are switched off.

The performance figures become clearer when considering the most time-consuming loops of the test suite. Table 2 summarizes the idiom recognition techniques applied to these loops and the resulting loop speedups. These results were obtained using a loop-by-loop instrumentation facility built into the Polaris compiler. Column "% serial" shows the percentage of time that each loop takes in the serial program execution. All of these loops would be serial without the new capabilities described in this paper. In Table 2, HR means a histogram reduction was solved in this loop; R, a reduction; IV, an induction variable; GIV, a generalized induction variable; and W, a wrap-around variable.

    [Table 2: Key Loop Timings. Columns: loop, transformation, speedup, % serial. Loops and transformations: MDG interf do1000 (IV, HR); MDG poteng do2000 (IV, R); TRFD olda do100 (GIV); TRFD olda do300 (GIV, W); TURB3D enr do2 (IV, R).]

In the program MDG, subroutine interf, loop do1000 contains multiple induction variables as well as histogram reductions. Polaris first solved the inductions, then applied expanded reductions to generate parallel code for the loop. Interf do1000 is by far the most time-consuming loop in the program.
Another loop, poteng do2000, is similar in nature; however, as noted, it contains single-address reduction operations. Polaris flagged these reductions with directives, which were then processed by the back-end compiler on the SGI.

In TRFD subroutine olda, two loops dominate program execution, do100 and do300. Both loops exercise Polaris' generalized induction variable substitution capabilities. Do300 contains an induction variable that is dependent on a second, triangular induction variable. We have termed this a coupled induction [13]. The coupled induction variable occurs at several nesting levels of nested triangular loops (the innermost site is in a quadruply nested, doubly triangular loop). As a result, the subscript expressions introduced by the substitution pass are non-linear and cannot be handled by previously known data-dependence tests. The new Polaris data dependence test that handles such expressions is described in [3]. Olda do300 further contains two wrap-around variables that need to be recognized in order to solve the inductions properly. This transformation was described in Section 2.2. Both loops also contain reduction operations; however, they occur in inner loops that were not parallelized due to the single-level parallelism supported by the SGI FORTRAN compiler.

Loop enr do2 contains a single-address reduction in a quadruply nested loop. Similar to poteng do2000, this reduction is flagged by Polaris and solved via privatization by the back-end SGI compiler.

5 Related Work

Three notable related contributions to idiom recognition techniques deal with the recognition of induction variables. Haghighat and Polychronopoulos [10] symbolically execute loops and inspect the sequence of values assumed by variables. By taking the difference of consecutive values, the difference of the difference, etc., they can interpolate a polynomial representing the closed form of an induction variable. Harrison and Ammarguellat [1] do abstract interpretation of loops to capture the effect of a single iteration on each variable assigned in the loop. The effect is then matched with pre-defined templates that express recurrence relations, including induction variables of various forms. Gerlek et al. [9] recognize induction variables by finding certain patterns in the Static Single Assignment representation of the program. They compute the closed form of an induction variable based on a classification of the specific SSA pattern.

All three approaches provide some of the functionality necessary to handle the generalized forms of induction variables that we have seen in real programs. Potentially, they can recognize complex control structures, such as induction variables assigned in different branches of IF statements, which we have not implemented in our algorithm. However, Polaris is the only compiler that has delivered the proof of concept, in that it has demonstrated that new induction and reduction variable techniques can contribute to real speedups of real programs. In doing this we have identified necessary parts of both induction and reduction variable transformation techniques that others have ignored or missed: symbolic manipulation tools for generating closed forms of induction variables, program analysis capabilities for determining and disproving zero-trip loops, combining induction variable recognition with wrap-around variable removal, symbolically determining ranges of histogram reductions, and complementing non-linear data-dependence tests. Some of these capabilities are supported by other passes available in the Polaris compiler. Defining proper interfaces between induction and reduction variable handling and these passes is another distinguishing feature of the compiler.

6 Conclusion

The new Polaris compiler is able to gain significant speedups in real programs, and demonstrates the parallelization technology necessary to do so. Two crucial techniques have been described in this paper: the recognition and transformation of induction variables and the parallelization of reduction operations. Our implementation and measurements confirm the great importance of these idiom recognition techniques. We have demonstrated that the implementation of the necessary generalized forms of induction variable and reduction handling techniques is feasible in a parallelizing compiler, and that it is possible to integrate them in such a way that the resulting compiler actually achieves substantial performance results on real programs. This demonstration has not been made previously, and it is particularly significant in that some of the transformations introduce complexities in the generated code (i.e., non-linear expressions) that cannot be readily processed by other compilation passes.

In our evaluation we have also determined that the issue of recognizing complex recurrence patterns and control-flow structures, which was given much attention in related work, was not of primary concern. Instead we have found other issues to be of critical importance, namely, the provision of symbolic computation capabilities for handling closed forms of recurrences, determining iteration counts and variable ranges, and the integration of the idiom recognition pass with other compilation techniques.

References

[1] Zahira Ammarguellat and Luddy Harrison. Automatic Recognition of Induction & Recurrence Relations by Abstract Interpretation. Proceedings of SIGPLAN 1990, Yorktown Heights, 25(6):283-295, June 1990.

[2] Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua. Automatic Program Parallelization. Proceedings of the IEEE, 81(2), February 1993.

[3] William Blume and Rudolf Eigenmann. The Range Test: A Dependence Test for Symbolic, Non-linear Expressions. Proceedings of Supercomputing '94, Washington D.C., pages 528-537, November 1994.

[4] William Blume and Rudolf Eigenmann. Symbolic Range Propagation. Proceedings of the 9th International Parallel Processing Symposium, April 1995.

[5] William Blume, Rudolf Eigenmann, Keith Faigin, John Grout, Jay Hoeflinger, David Padua, Paul Petersen, Bill Pottenger, Lawrence Rauchwerger, Peng Tu, and Stephen Weatherford. Polaris: Improving the Effectiveness of Parallelizing Compilers. Proceedings of the Seventh Workshop on Languages and Compilers for Parallel Computing, Ithaca, New York; also: Lecture Notes in Computer Science 892, Springer-Verlag, pages 141-154, August 1994.

[6] William Blume, Rudolf Eigenmann, Jay Hoeflinger, David Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing. IEEE Parallel and Distributed Technology, 2(3):37-47, Fall 1994.

[7] Rudolf Eigenmann, Jay Hoeflinger, Zhiyuan Li, and David Padua. Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs. Lecture Notes in Computer Science 589: Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, pages 65-83, August 1991.

[8] Rudolf Eigenmann, Jay Hoeflinger, and David Padua. On the Automatic Parallelization of the Perfect Benchmarks. Technical Report 1392, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, December 1994.

[9] Michael P. Gerlek, Eric Stoltz, and Michael Wolfe. Beyond Induction Variables: Detecting and Classifying Sequences Using a Demand-Driven SSA Form. To appear in TOPLAS, 1995.

[10] Mohammad R. Haghighat and Constantine D. Polychronopoulos. Symbolic Program Analysis and Optimization for Parallelizing Compilers. Presented at the 5th Annual Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 3-5, 1992.

[11] D. Padua and M. Wolfe. Advanced Compiler Optimizations for Supercomputers. CACM, 29(12):1184-1201, December 1986.

[12] Bill Pottenger and Rudolf Eigenmann. Parallelization in the Presence of Generalized Induction and Reduction Variables. Technical Report 1396, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, January 1995.

[13] William Morton Pottenger. Induction Variable Substitution and Reduction Recognition in the Polaris Parallelizing Compiler. Master's thesis, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, December 1994.

[14] Stephen Weatherford. High-Level Pattern-Matching Extensions to C++ for Fortran Program Manipulation in Polaris. Master's thesis, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, May 1994.


More information

In C.-H. Huang, P.Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua

In C.-H. Huang, P.Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua In C.-H. Huang, P.Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua (Editors), Languages and Compilers for Parallel Computing, pages 269-288. Springer-Verlag, August 1995. (8th International

More information

Null space basis: mxz. zxz I

Null space basis: mxz. zxz I Loop Transformations Linear Locality Enhancement for ache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a matrix of the loop nest. dependence

More information

UMIACS-TR December, CS-TR-3250 Revised March, Static Analysis of Upper and Lower Bounds. on Dependences and Parallelism

UMIACS-TR December, CS-TR-3250 Revised March, Static Analysis of Upper and Lower Bounds. on Dependences and Parallelism UMIACS-TR-94-40 December, 1992 CS-TR-3250 Revised March, 1994 Static Analysis of Upper and Lower Bounds on Dependences and Parallelism William Pugh pugh@cs.umd.edu Institute for Advanced Computer Studies

More information

Compiler Construction 2010/2011 Loop Optimizations

Compiler Construction 2010/2011 Loop Optimizations Compiler Construction 2010/2011 Loop Optimizations Peter Thiemann January 25, 2011 Outline 1 Loop Optimizations 2 Dominators 3 Loop-Invariant Computations 4 Induction Variables 5 Array-Bounds Checks 6

More information

Lecture Notes on Cache Iteration & Data Dependencies

Lecture Notes on Cache Iteration & Data Dependencies Lecture Notes on Cache Iteration & Data Dependencies 15-411: Compiler Design André Platzer Lecture 23 1 Introduction Cache optimization can have a huge impact on program execution speed. It can accelerate

More information

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor

The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.

More information

Convolutional Neural Networks for Object Classication in CUDA

Convolutional Neural Networks for Object Classication in CUDA Convolutional Neural Networks for Object Classication in CUDA Alex Krizhevsky (kriz@cs.toronto.edu) April 16, 2009 1 Introduction Here I will present my implementation of a simple convolutional neural

More information

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T. Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips

More information

To appear: Proceedings of Supercomputing '92 1. Scott A. Mahlke William Y. Chen John C. Gyllenhaal Wen-mei W. Hwu

To appear: Proceedings of Supercomputing '92 1. Scott A. Mahlke William Y. Chen John C. Gyllenhaal Wen-mei W. Hwu To appear: Proceedings of Supercomputing '9 Compiler Code Transformations for Superscalar-Based High-Performance Systems Scott A. Mahlke William Y. Chen John C. Gyllenhaal Wen-mei W. Hwu Center for Reliable

More information

Principle of Complier Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Principle of Complier Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Principle of Complier Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Lecture - 20 Intermediate code generation Part-4 Run-time environments

More information

of Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial

of Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial Accelerated Learning on the Connection Machine Diane J. Cook Lawrence B. Holder University of Illinois Beckman Institute 405 North Mathews, Urbana, IL 61801 Abstract The complexity of most machine learning

More information

Searching Algorithms/Time Analysis

Searching Algorithms/Time Analysis Searching Algorithms/Time Analysis CSE21 Fall 2017, Day 8 Oct 16, 2017 https://sites.google.com/a/eng.ucsd.edu/cse21-fall-2017-miles-jones/ (MinSort) loop invariant induction Loop invariant: After the

More information

Compiler Construction 2016/2017 Loop Optimizations

Compiler Construction 2016/2017 Loop Optimizations Compiler Construction 2016/2017 Loop Optimizations Peter Thiemann January 16, 2017 Outline 1 Loops 2 Dominators 3 Loop-Invariant Computations 4 Induction Variables 5 Array-Bounds Checks 6 Loop Unrolling

More information

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore

Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No. # 10 Lecture No. # 16 Machine-Independent Optimizations Welcome to the

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

Algebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee. The Chinese University of Hong Kong.

Algebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee. The Chinese University of Hong Kong. Algebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR, China fyclaw,jleeg@cse.cuhk.edu.hk

More information

Single-Pass Generation of Static Single Assignment Form for Structured Languages

Single-Pass Generation of Static Single Assignment Form for Structured Languages 1 Single-Pass Generation of Static Single Assignment Form for Structured Languages MARC M. BRANDIS and HANSPETER MÖSSENBÖCK ETH Zürich, Institute for Computer Systems Over the last few years, static single

More information