Idiom Recognition in the Polaris Parallelizing Compiler

Bill Pottenger and Rudolf Eigenmann
Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign
1308 West Main Street, Urbana, Illinois

Abstract

The elimination of induction variables and the parallelization of reductions in FORTRAN programs have been shown to be integral to performance improvement on parallel computers [7, 8]. As part of the Polaris project [5], compiler passes that recognize these idioms have been implemented and evaluated. Developing these techniques to the point necessary to achieve significant speedups on real applications has prompted solutions to problems that have not been addressed in previous reports on idiom recognition techniques. These include analysis techniques capable of disproving zero-trip loops, symbolic handling facilities to compute closed forms of recurrences, and interfaces to other compilation passes such as the data-dependence test. In comparison, the recognition phase of solving induction variables, which has received most attention so far, has in fact turned out to be relatively straightforward. This paper provides an overview of techniques described in more detail in [12].

1 Introduction

The translation of sequential programs into parallel form is a well-established discipline. An introduction to this topic can be found in [2]. Parallelizing translation techniques deal mostly with loops, whose iterations can be executed in parallel if no iteration produces a data value that is consumed by another iteration. In some circumstances such true dependences can be removed, for example if the value to be consumed can be expressed directly as a function of the loop index variable(s). We call such a variable an induction variable and its direct expression the closed form. Another example of removing a true dependence involves an associative and commutative operation performed across loop iterations.
Since the order of such operations can be changed, iterations need not wait for the previous iteration to perform its operation, as long as we ensure exclusive access to the operand. Important operations of this type are reductions, e.g., accumulating values across loop iterations. Both induction and reduction idioms are recognized by available parallelizing compilers if they have simple forms.

(This work is supported by the National Security Agency and by U.S. Army contract #DABT63-92-C. This work is not necessarily representative of the positions or policies of the Army or the Government.)

In our previous experiments [7, 8] we have seen the need for recognizing more general forms of these patterns, notably induction variables in triangular loops and in multiplicative expressions, and reduction operations on arrays in histogram form. We discuss these variants below. We have implemented new and powerful idiom recognition and solution techniques in the Polaris compiler [5, 6], which contribute towards substantial performance improvements over previous generations of parallelizing compilers.

2 Induction Variable Substitution

The following example shows an induction variable that occurs inside a triangular loop nest (the inner loop bound depends on the outer loop index). The nest can be translated into parallel form as shown. Available compilers typically are able to substitute the induction variable in the inner loop only.

      iv = 0                            do parallel i = 1, n
      do i = 1, n                         do parallel j = 1, i
        do j = 1, i             ==>         a(j + (i**2 - i)/2 - 1) = ...
          a(iv) = ...
          iv = iv + 1

Our generalized induction variable algorithm performs the complete translation in two steps. The first step recognizes induction variables that match the statement pattern

      iv = iv + inc_expression

where inc_expression can contain outer loop indices, other induction variables, and loop-invariant terms. Induction variables may also appear in more than one induction statement, at different levels of a loop nest, as exemplified in section 2.2.
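The triangular example above can be checked with a short script. The following sketch is in Python rather than the paper's Fortran, and all names are ours; it verifies that the closed form j + (i**2 - i)/2 - 1 reproduces the subscript sequence produced by the serial induction variable:

```python
# Hypothetical illustration (our names, not the paper's): compare the
# subscripts generated by the serial induction variable with those given
# by the closed form j + (i**2 - i)//2 - 1 for a triangular loop nest.
def serial_subscripts(n):
    iv = 0
    subs = []
    for i in range(1, n + 1):          # do i = 1, n
        for j in range(1, i + 1):      # do j = 1, i  (triangular bound)
            subs.append(iv)            # a(iv) = ...
            iv += 1                    # iv = iv + 1
    return subs

def closed_form_subscripts(n):
    # Each use is now a function of the loop indices alone, so both
    # loops can execute in parallel.
    return [j + (i * i - i) // 2 - 1
            for i in range(1, n + 1)
            for j in range(1, i + 1)]

assert serial_subscripts(8) == closed_form_subscripts(8)
```

Because the subscript no longer depends on values carried from earlier iterations, the cross-iteration true dependence on iv disappears.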
The closed form of the induction variable is computed in the second step. In the following discussion we consider a loop nest ℓ containing an induction variable iv. Loop ℓ has n iterations, and the variable iter denotes the current iteration. For the sake of simplicity we treat iteration spaces from 1 to n (this is easily derived from the actual loop bounds and step). The first and last statements of loop ℓ are labeled loop_entry and loop_exit, respectively.

Definition 1  Given a loop ℓ, the total increment Δ_iv(iter) of the additive induction variable iv in a single iteration of the body of ℓ is defined as:
      Δ_iv(iter) = Σ_{s = loop_entry}^{loop_exit} (increments to iv in s)

Here s iterates over the statements in loop ℓ from loop_entry to loop_exit. The increment to iv is zero if s is not an induction statement. Inner loops are considered a single (compound) statement, and their increment is determined using Definition 3.

Definition 2  The increment at the beginning of iteration i of the additive induction variable iv is defined as:

      Δ@i_iv = Σ_{iter=1}^{i-1} Δ_iv(iter)

Here Δ_iv(iter) is summed over the iteration space of ℓ from the first to the (i−1)th iteration.

Definition 3  The increment of the additive induction variable iv upon completion of loop ℓ is defined as:

      Δ@n_iv = Σ_{iter=1}^{n} Δ_iv(iter)

Here we are summing Δ_iv(iter) over the entire iteration space of ℓ, from 1 to n.

Similar definitions hold for multiplicative induction variables. We refer the reader to the more detailed description [12]. Initially, Δ is computed by a lexical scan which forms a symbolic sum of the right-hand-side increments at all induction sites of a given induction variable iv. Once Δ has been computed for a given iv, the symbolic expressions Δ@i and Δ@n are computed using a symbolic sum function based on Bernoulli numbers [13]. The algorithm proceeds recursively when an inner loop is encountered. The current value of Δ@i in the enclosing loop is used as the initial value of iv at entry to the nested loop. The sum Δ is then determined for the inner loop (recursing again as necessary), and the expression Δ@i is formed for the inner loop. In other words, when iv occurs inside a nested loop, Δ can only be partially calculated for outer loops and becomes fully known in these outer loops only as Δ@n is recursively returned from inner loop sums. As the recursion unwinds and loop exits are encountered, the expressions Δ@n are calculated. These are the last values, which will be used outside the loop if the induction variable's live range extends beyond the loop exit.
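The three definitions can be exercised numerically. Below is a small Python sketch with names of our choosing; the paper computes these sums symbolically (via Bernoulli numbers) rather than numerically, so this is only a model of what the sums mean:

```python
# Sketch of Definitions 1-3 (names are ours). delta(iter) plays the role
# of Delta_iv(iter), the total increment of iv in one iteration.
def inc_at_entry(delta, i):
    # Definition 2: increment accumulated before iteration i begins.
    return sum(delta(it) for it in range(1, i))

def inc_at_exit(delta, n):
    # Definition 3: increment upon completion of the n-iteration loop.
    return sum(delta(it) for it in range(1, n + 1))

# Example: an induction statement iv = iv + iter gives Delta_iv(iter) = iter,
# whose symbolic sum (what the compiler would derive) is n*(n+1)/2.
n = 10
assert inc_at_exit(lambda it: it, n) == n * (n + 1) // 2
assert inc_at_entry(lambda it: it, 4) == 6      # 1 + 2 + 3
```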
The overall picture is one of a statement-wise traversal of a given loop body, first summing increments, then multiplying across iteration spaces, all done recursively as necessary. The final substitution step is straightforward. Briefly, traversing from outside in, each loop header is annotated with the value of the induction variable at entry to the loop. The same annotation is made at the site of each use of the induction variable, including any increments encountered in the body of the loop up to the point of use. Our current implementation does a direct substitution of the closed form.

The algorithm works similarly for multiplicative induction variables. Instead of sum operations, however, products are formed. Again, we refer the reader to [12] for additional detail.

2.1 Zero Trip Test

If the upper bound UB of a loop ℓ is less than the lower bound LB, then the loop is not executed, and as a result the computation of the closed form of an induction variable must include analysis techniques capable of detecting zero-trip loops. Our algorithm handles this problem with new techniques termed Symbolic Range Propagation [4]. In all cases save one of our application test suite we prove that loops do not have zero trips, and hence that the transformation is correct. The remaining case requires the use of interprocedural range propagation, which has not yet been implemented.

In addition, we have determined that it is in fact sufficient that loops meet the weaker requirement UB ≥ LB − 1. Expressed in terms of Definition 3 above, for example, UB ≥ LB − 1 implies that Δ@n_iv is correct. In words, this means simply that the last value of induction variable iv in loop ℓ is correct given UB ≥ LB − 1. Consider the following example:

      iv = 0                            do parallel i = 1, n
      do i = 1, n                         do parallel j = 1, m
        do j = 1, m             ==>         a(2*j + 2*(i-1)*m) = ...
          iv = iv + 2
          a(iv) = ...

Intuitively, one concludes that the condition m ≥ 1 insures the correctness of the transformation; in fact, the transformation is still correct for m = 0.
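The exact-zero-trip claim is easy to test numerically. The following Python sketch (our construction, using the iv = iv + 2 increment of the example) shows that the Definition-3-style last value 2*n*m agrees with serial execution even when the inner loop executes zero trips:

```python
# Sketch (our names): serial last value of iv versus the closed-form
# last value 2*n*m, for the increment iv = iv + 2 in the example above.
def last_value_serial(n, m):
    iv = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):   # executes zero trips when m = 0
            iv += 2
    return iv

def last_value_closed(n, m):
    return 2 * n * m

# Correct for m >= 1, and still correct for the exact zero trip m = 0.
for m in (0, 1, 5):
    assert last_value_serial(4, m) == last_value_closed(4, m)
```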
We have termed this an exact zero trip and have found it important in our application test suite.

2.2 Wrap-Around Variables

A wrap-around variable is classically defined as a variable that takes on the value of an induction variable after one iteration of a loop [11]. There is one important case in our application test suite where the recognition of wrap-around loop bounds is a necessary precursor to the solution of an induction variable:

      m = 0
      do j = 1, i
        lb = j
        ub = i
        do k = i, n
          do l = lb, ub
            m = m + 1
            a(m) = ...
          lb = 1
          ub = k + 1
        m = m + i

Note the presence of the loop-variant wrap-around variables lb and ub on the l loop, which must be solved in order to determine the closed form for the induction variable m. After the induction transformation we have:

      m = 0
      do parallel j = 1, i
        do parallel l = j, i
          private m
          m = l + (i + (-9*i**2 - 3*i**4 + 6*i + 6*i**3 - 6*i*n - 6*i*n**2
                 + 6*n*i**2 + 6*i**2*n**2)/4 - 3*n - 3*j**2 - n**2 + 2*i**3
                 + 3*j + 3*i**2 - 3*i*j - 3*j*i**2 + 3*j*n + 3*j*n**2)/6
                 - 2*i + 2*i*j
          a(m) = ...
        do parallel k = 1+i, n
          do parallel l = 1, k
            private m
            m = l + ((-9*i**2 - 3*i**4 + 6*i + 6*i**3 - 6*i*n - 6*i*n**2
                   + 6*n*i**2 + 6*i**2*n**2)/4 - 3*k - 3*n - 3*j**2 - 3*n**2
                   - 2*i + 2*i**3 + 3*j + 3*k**2 - 3*i*j - 3*j*i**2
                   + 3*j*n + 3*j*n**2)/6 - i + 2*i*j
            a(m) = ...

Employing powerful symbolic manipulation, the induction pass first recognized and then removed the wrap-around loop bounds lb and ub by peeling the first iteration of the k loop, allowing the solution of the induction on m within the outermost i loop. Note that the k loop now executes an exact zero trip when i = n, and that the correctness of this transformation has been proven using the techniques described in section 2.1. As noted in section 2, the value of m has been substituted directly on the right-hand side of each remaining induction site, and m is now a loop-private variable.¹ As is clear from the example, direct substitution introduces unnecessary overhead in inner loops. For this reason we are currently implementing an on-demand substitution algorithm which incorporates strength reduction of complex expressions of this nature.

3 Reduction Recognition

The following example represents an important pattern of reduction operations that we have found in the analysis of real application programs:

      do i = 1, n
        do j = 1, n - i
          fx(i) = fx(i) + exp1
          fx(i+j) = fx(i+j) + exp2

The loop nest accumulates into the array FX. Different iterations accumulate into different array elements. This is what we term a histogram reduction.
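Rendered in Python (our sketch; exp1 and exp2 stand for the arbitrary expressions in the pattern), the histogram reduction reads as follows. Note that distinct (i, j) iterations can update the same element of fx, so the accumulations cannot simply run as independent parallel iterations:

```python
# Sketch of the histogram-reduction pattern (our names; exp1/exp2 are
# arbitrary expressions). Iteration (i, j) updates fx[i] and fx[i + j],
# so several iterations may accumulate into the same element.
def histogram_serial(n, exp1, exp2):
    fx = [0] * (n + 1)                   # slot 0 unused (1-based)
    for i in range(1, n + 1):
        for j in range(1, n - i + 1):    # do j = 1, n - i
            fx[i] += exp1(i, j)
            fx[i + j] += exp2(i, j)
    return fx

# With exp1 = exp2 = 1 the total count is 2 * sum_{i=1}^{n} (n - i) = n*(n-1).
fx = histogram_serial(5, lambda i, j: 1, lambda i, j: 1)
assert sum(fx) == 5 * 4
```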
The Polaris-transformed code is shown below:

      do i = 1, n
        do j = 1, cpus
          fx0(i,j) = 0
      do parallel i = 1, n
        private p
        p = procid
        do j = 1, n - i
          fx0(i,p) = fx0(i,p) + exp1
          fx0(i+j,p) = fx0(i+j,p) + exp2
      do parallel i = 1, n
        do j = 1, cpus
          fx(i) = fx(i) + fx0(i,j)

¹ Note also that the third induction site just inside the outermost i loop has been automatically removed.

3.1 Recognition Pass

The algorithm for recognizing reductions searches for assignment statements within a given loop of the form:

      A(σ1, σ2, ...) = A(σ1, σ2, ...) + ⊕

where ⊕ represents an arbitrary expression, and A may be a multi-dimensional array with subscript vector {σ1, σ2, ...}, which may contain both loop-variant and loop-invariant terms. Neither σi nor ⊕ may contain a reference to A, and A must not be referenced elsewhere in the loop outside other reduction statements. Of course, {σ1, σ2, ...} may be null (i.e., A is a scalar variable). The algorithm recognizes potential reductions, which fall into the two classes of histogram reductions (where one of the subscript dimensions, e.g., σ1, is loop-variant) and single-address reductions (where all dimensions are loop-invariant).

The reduction recognition pass of Polaris is based on powerful pattern matching primitives that are part of the Polaris FORBO environment [14]. In the recognition pass, variables that match the above pattern are flagged as potential reduction variables.

3.2 Data Dependence Pass

The data dependence pass analyzes candidate reduction variables. If it can prove independence, then it removes the reduction flag. This situation occurs if all loop iterations accumulate into separate elements of an array.

3.3 Transformation Pass

In the transformation stage, three different types of parallel reductions can be generated. They are termed blocked, privatized, and expanded. All three schemes take advantage of the fact that the sum operation is mathematically commutative and associative, and thus the accumulation statement sequence can be reordered.
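The equivalence these schemes rely on can be illustrated with a small Python model of the fx0 transformation shown above. All names are ours, procid is simulated by a round-robin assignment of iterations, and the "processors" run sequentially here; the point is only that per-processor partial sums combined after the loop reproduce the serial result, using nothing beyond associativity and commutativity of addition:

```python
# Model of an expanded reduction (our names). Each "processor" p owns
# column p of fx0; the main loop accumulates into fx0 without
# synchronization, and a final combine loop sums the columns into fx.
def histogram_serial(n, exp1, exp2):
    fx = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(1, n - i + 1):
            fx[i] += exp1(i, j)
            fx[i + j] += exp2(i, j)
    return fx

def histogram_expanded(n, exp1, exp2, cpus=4):
    fx0 = [[0] * (cpus + 1) for _ in range(n + 1)]   # extra processor dim
    for i in range(1, n + 1):            # each i could run on any processor
        p = (i - 1) % cpus + 1           # stand-in for procid
        for j in range(1, n - i + 1):
            fx0[i][p] += exp1(i, j)
            fx0[i + j][p] += exp2(i, j)
    fx = [0] * (n + 1)
    for i in range(1, n + 1):            # combine: parallelizable over i
        for p in range(1, cpus + 1):
            fx[i] += fx0[i][p]
    return fx

assert histogram_expanded(9, lambda i, j: i + j, lambda i, j: i * j) == \
       histogram_serial(9, lambda i, j: i + j, lambda i, j: i * j)
```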
Although it is well known that this can lead to roundoff errors, we have not found this to be a problem in our experiments. Polaris includes a switch that allows the user to disable the transformation, which is the common way of dealing with this issue in many compilers.

The first scheme, termed blocked, involves the insertion of synchronization primitives around each reduction statement, making the sum operation atomic. In our example the reduction statements could simply be enclosed by lock/unlock pairs, allowing the loop to be executed in parallel. This is an elegant solution where the architecture provides fast synchronization primitives. However, in the machines used in our experiments the synchronization overhead was high, and as a result we focused our efforts on the remaining methods.

In privatized reductions, duplicates of the reduction variable that are private to each processor are created and used in the reduction statements in place of the original variable. The partial sums accumulated by these variables are initialized to zero at loop entry. In this way, the loop can now be executed in a parallel doall fashion without the need for synchronization other than the sum across processors executed prior to final loop exit.
Scheme three, termed expanded reductions, is used in our example. It is similar to the second scheme, but collects the partial sums in a shared array that has an additional dimension equal to the number of processors executing in parallel. All reduction variables are replaced by references to this new, global array, and the newly created dimension is indexed by the number of the processor executing the current iteration. The transformed loop may be executed fully in parallel. The partial sums are combined in a separate loop following the original one. For histogram reductions this loop can be executed in parallel as well, as shown in the example above.

One thorny issue dealt with in implementing methods two and three was the determination of the size of the partial-sum variables for histogram reductions. Often it is not possible to determine this size from the code surrounding the loop. Because of this we again took good advantage of Symbolic Range Propagation [4] to determine these sizes.

4 Performance Results

Table 1 shows the speedups we have obtained on an 8-processor set of an SGI Challenge (R4400), relative to the serial execution speed, for three codes in our application suite. We measured the overall speedup due to Polaris optimizations, and the diminished speedups after disabling the induction and reduction transformations, respectively. For comparison purposes, the benchmarks under discussion were also compiled and executed using the commercially available SGI parallelizer, Power Fortran Accelerator (PFA), with the additional non-default optimization options -ag=a and -r=3.

                         MDG      TRFD     TURB3D
      serial
      PFA
      Polaris
        (no reduction)
        (no induction)

      Table 1: Overall Program Speedups

In all programs the Polaris compiler achieves significant speedups. In TURB3D this is about 9% greater than PFA, and in the other programs PFA achieves little or no speedup.²
The speedups are a result of the transformations discussed in this paper in conjunction with additional techniques described in [6]. As shown by the table entries (no reduction) and (no induction) above, in all cases the performance drops substantially when these compiler capabilities are switched off.

The performance figures become clearer when considering the most time-consuming loops of the test suite. Table 2 summarizes the idiom recognition techniques applied to these loops and the resulting loop speedups. These results were obtained using a loop-by-loop instrumentation facility built into the Polaris compiler. Column "% serial" shows the percentage of time that each loop takes in the serial program execution. All of these loops would be serial without the new capabilities described in this paper. In Table 2, HR means a histogram reduction was solved in this loop; R, a reduction; IV, an induction variable; GIV, a generalized induction variable; and W, a wrap-around variable.

² In fact, results from more recent runs of Polaris which incorporate full inlining bring the speedup figures for TURB3D up to 6.

      loop                       transformation    speedup    % serial
      MDG     interf do1000      IV, HR
              poteng do2000      IV, R
      TRFD    olda do100         GIV
              olda do300         GIV, W
      TURB3D  enr do2            IV, R

      Table 2: Key Loop Timings

In the program MDG, subroutine INTERF, loop do1000 contains multiple induction variables as well as histogram reductions. Polaris first solved the inductions, then applied expanded reductions to generate parallel code for the loop. Interf do1000 is by far the most time-consuming loop in the program. Another loop, poteng do2000, is similar in nature; however, as noted, it contains single-address reduction operations. Polaris flagged these reductions with directives, which were then processed by the back-end compiler on the SGI.

In TRFD subroutine OLDA, two loops dominate program execution, do100 and do300. Both loops exercise Polaris' generalized induction variable substitution capabilities.
Do300 contains an induction variable that is dependent on a second, triangular induction variable. We have termed this a coupled induction [13]. The coupled induction variable occurs at several nesting levels of nested triangular loops (the innermost site is in a quadruply nested, doubly triangular loop). As a result, the subscript expressions introduced by the substitution pass are non-linear and cannot be handled by previously known data-dependence tests. The new Polaris data-dependence test that handles such expressions is described in [3]. Olda do300 further contains two wrap-around variables that need to be recognized in order to solve the inductions properly. This transformation was described in section 2.2. Both loops also contain reduction operations; however, they occur in inner loops that were not parallelized due to the single-level parallelism supported by the SGI FORTRAN compiler.

In TURB3D, loop ENR do2 contains a single-address reduction in a quadruply nested loop. Similar to poteng do2000, this reduction is flagged by Polaris and solved via privatization by the back-end SGI compiler.

5 Related Work

Three notable related contributions to idiom recognition techniques deal with the recognition of induction variables. Haghighat and Polychronopoulos [10] symbolically execute loops and inspect the sequence of values assumed by variables. By taking the difference of consecutive values, the difference of the difference, etc., they can interpolate a polynomial representing the closed form of an induction variable. Harrison and Ammarguellat [1] perform abstract interpretation of loops to capture the effect of a single iteration on each variable assigned in the loop. The effect is then matched with pre-defined templates that express recurrence relations, including induction variables of various forms. Gerlek et al. recognize induction variables by finding certain patterns in the Static Single Assignment representation of the program [9].
They compute the closed form of an induction variable based on a classification of the specific SSA pattern.

All three approaches provide some of the functionality necessary to handle the generalized forms of induction variables that we have seen in real programs. Potentially, they can recognize complex control structures, such as induction variables assigned in different branches of IF statements, which we have not implemented in our algorithm. However, Polaris is the only compiler that has delivered the proof of concept, in that it has demonstrated that new induction and reduction variable techniques can contribute to real speedups of real programs. In doing this we have identified necessary parts of both induction and reduction variable transformation techniques that others have ignored or missed: symbolic manipulation tools for generating closed forms of induction variables, program analysis capabilities for determining and disproving zero-trip loops, combining induction variable recognition with wrap-around variable removal, symbolically determining ranges of histogram reductions, and complementing non-linear data-dependence tests. Some of these capabilities are supported by other passes available in the Polaris compiler. Defining proper interfaces between induction and reduction variable handling and these passes is another distinguishing feature of the compiler.

6 Conclusion

The new Polaris compiler is able to gain significant speedups on real programs, and demonstrates the parallelization technology necessary to do so. Two crucial techniques have been described in this paper: the recognition and transformation of induction variables and the parallelization of reduction operations. Our implementation and measurements confirm the great importance of these idiom recognition techniques.
We have demonstrated that the implementation of the necessary generalized forms of induction variable and reduction handling techniques is feasible in a parallelizing compiler, and that it is possible to integrate them in such a way that the resulting compiler actually achieves substantial performance results on real programs. This demonstration has not been made previously, and it is particularly significant in that some of the transformations introduce complexities in the generated code (i.e., non-linear expressions) that cannot be readily processed by other compilation passes. In our evaluation we have also determined that the issue of recognizing complex recurrence patterns and control-flow structures, which was given much attention in related work, was not of primary concern. Instead we have found other issues to be of critical importance, namely, the provision of symbolic computation capabilities for handling closed forms of recurrences, determining iteration counts and variable ranges, and the integration of the idiom recognition pass with other compilation techniques.

References

[1] Zahira Ammarguellat and Luddy Harrison. Automatic Recognition of Induction & Recurrence Relations by Abstract Interpretation. Proceedings of SIGPLAN 1990, Yorktown Heights, 25(6):283-295, June 1990.

[2] Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua. Automatic Program Parallelization. Proceedings of the IEEE, 81(2), February 1993.

[3] William Blume and Rudolf Eigenmann. The Range Test: A Dependence Test for Symbolic, Non-linear Expressions. Proceedings of Supercomputing '94, Washington D.C., pages 528-537, November 1994.

[4] William Blume and Rudolf Eigenmann. Symbolic Range Propagation. Proceedings of the 9th International Parallel Processing Symposium, April 1995.

[5] William Blume, Rudolf Eigenmann, Keith Faigin, John Grout, Jay Hoeflinger, David Padua, Paul Petersen, Bill Pottenger, Lawrence Rauchwerger, Peng Tu, and Stephen Weatherford. Polaris: Improving the Effectiveness of Parallelizing Compilers. Proceedings of the Seventh Workshop on Languages and Compilers for Parallel Computing, Ithaca, New York; also: Lecture Notes in Computer Science 892, Springer-Verlag, pages 141-154, August 1994.

[6] William Blume, Rudolf Eigenmann, Jay Hoeflinger, David Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing. IEEE Parallel and Distributed Technology, 2(3):37-47, Fall 1994.

[7] Rudolf Eigenmann, Jay Hoeflinger, Zhiyuan Li, and David Padua. Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs. Lecture Notes in Computer Science 589; Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, pages 65-83, August 1991.

[8] Rudolf Eigenmann, Jay Hoeflinger, and David Padua. On the Automatic Parallelization of the Perfect Benchmarks. Technical Report 1392, Univ. of Illinois at Urbana-Champaign, Cntr. for Supercomputing Res. & Dev., December 1994.

[9] Michael P. Gerlek, Eric Stoltz, and Michael Wolfe. Beyond Induction Variables: Detecting and Classifying Sequences Using a Demand-Driven SSA Form. To appear in TOPLAS, 1995.

[10] Mohammad R. Haghighat and Constantine D. Polychronopoulos. Symbolic Program Analysis and Optimization for Parallelizing Compilers. Presented at the 5th Annual Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 3-5, 1992.

[11] D. Padua and M. Wolfe. Advanced Compiler Optimization for Supercomputers. CACM, 29(12):1184-1201, December 1986.

[12] Bill Pottenger and Rudolf Eigenmann. Parallelization in the Presence of Generalized Induction and Reduction Variables. Technical Report 1396, Univ. of Illinois at Urbana-Champaign, Cntr. for Supercomputing Res. & Dev., January 1995.

[13] William Morton Pottenger. Induction Variable Substitution and Reduction Recognition in the Polaris Parallelizing Compiler. Master's thesis, Univ. of Illinois at Urbana-Champaign, Cntr. for Supercomputing Res. & Dev., December 1994.

[14] Stephen Weatherford. High-Level Pattern-Matching Extensions to C++ for Fortran Program Manipulation in Polaris. Master's thesis, Univ. of Illinois at Urbana-Champaign, Cntr. for Supercomputing Res. & Dev., May 1994.
Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Technical Report ANU-TR-CS-92- November 7, 992 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer
More informationNon-Linear and Symbolic Data Dependence Testing. April 19, Abstract
Non-Linear and Symbolic Data Dependence Testing William Blume Hewlett-Packard, California bblume@hpclopt.cup.hp.com Rudolf Eigenmann Purdue University, Indiana eigenman@ecn.purdue.edu April 19, 1996 Abstract
More informationInterprocedural Symbolic Range Propagation for Optimizing Compilers
Interprocedural Symbolic Range Propagation for Optimizing Compilers Hansang Bae and Rudolf Eigenmann School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 47907 {baeh,eigenman}@purdue.edu
More informationThe Value Evolution Graph and its Use in Memory Reference Analysis
The Value Evolution Graph and its Use in Memory Reference Analysis Silvius Rus Texas A&M University silviusr@cs.tamu.edu Dongmin Zhang Texas A&M University dzhang@cs.tamu.edu Lawrence Rauchwerger Texas
More informationHybrid Analysis and its Application to Thread Level Parallelization. Lawrence Rauchwerger
Hybrid Analysis and its Application to Thread Level Parallelization Lawrence Rauchwerger Thread (Loop) Level Parallelization Thread level Parallelization Extracting parallel threads from a sequential program
More informationEgemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for
Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and
More informationInternational Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERA
International Journal of Foundations of Computer Science c World Scientic Publishing Company DFT TECHNIQUES FOR SIZE ESTIMATION OF DATABASE JOIN OPERATIONS KAM_IL SARAC, OMER E GEC_IO GLU, AMR EL ABBADI
More informationFinding a winning strategy in variations of Kayles
Finding a winning strategy in variations of Kayles Simon Prins ICA-3582809 Utrecht University, The Netherlands July 15, 2015 Abstract Kayles is a two player game played on a graph. The game can be dened
More informationCenter for Supercomputing Research and Development. were identied as time-consuming: The process of creating
Practical Tools for Optimizing Parallel Programs Rudolf Eigenmann Patrick McClaughry Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign Abstract This paper describes
More informationDynamically Adaptive Parallel Programs
Dynamically Adaptive Parallel Programs Michael Voss and Rudolf Eigenmann Purdue University, School of Electrical and Computer Engineering Abstract. Dynamic program optimization is the only recourse for
More informationLecture Notes on Loop Optimizations
Lecture Notes on Loop Optimizations 15-411: Compiler Design Frank Pfenning Lecture 17 October 22, 2013 1 Introduction Optimizing loops is particularly important in compilation, since loops (and in particular
More informationAN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1
AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 Virgil Andronache Richard P. Simpson Nelson L. Passos Department of Computer Science Midwestern State University
More informationA Source-to-Source OpenMP Compiler
A Source-to-Source OpenMP Compiler Mario Soukup and Tarek S. Abdelrahman The Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto Toronto, Ontario, Canada M5S 3G4
More informationBriki: a Flexible Java Compiler
Briki: a Flexible Java Compiler Michał Cierniak Wei Li Department of Computer Science University of Rochester Rochester, NY 14627 fcierniak,weig@cs.rochester.edu May 1996 Abstract We present a Java compiler
More informationAnalysis and Transformation in an. Interactive Parallel Programming Tool.
Analysis and Transformation in an Interactive Parallel Programming Tool Ken Kennedy Kathryn S. McKinley Chau-Wen Tseng ken@cs.rice.edu kats@cri.ensmp.fr tseng@cs.rice.edu Department of Computer Science
More informationNovember 8, Abstract. Shared-Memory Parallel computers (SMPs) have become a major force in the market of parallel
Compiling for the New Generation of High-Performance SMPs Insung Park Michael J. Voss Rudolf Eigenmann November 8, 1996 Abstract Shared-Memory Parallel computers (SMPs) have become a major force in the
More informationINTERPROCEDURAL PARALLELIZATION USING MEMORY CLASSIFICATION ANALYSIS BY JAY PHILIP HOEFLINGER B.S., University of Illinois, 1974 M.S., University of I
cflcopyright by Jay Philip Hoeflinger 2000 INTERPROCEDURAL PARALLELIZATION USING MEMORY CLASSIFICATION ANALYSIS BY JAY PHILIP HOEFLINGER B.S., University of Illinois, 1974 M.S., University of Illinois,
More informationTHEORY, TECHNIQUES, AND EXPERIMENTS IN SOLVING RECURRENCES IN COMPUTER PROGRAMS BY WILLIAM MORTON POTTENGER M.S., University of Illinois at Urbana-Cha
ccopyright by WILLIAM MORTON POTTENGER 1997 THEORY, TECHNIQUES, AND EXPERIMENTS IN SOLVING RECURRENCES IN COMPUTER PROGRAMS BY WILLIAM MORTON POTTENGER M.S., University of Illinois at Urbana-Champaign,
More informationInterprocedural Dependence Analysis and Parallelization
RETROSPECTIVE: Interprocedural Dependence Analysis and Parallelization Michael G Burke IBM T.J. Watson Research Labs P.O. Box 704 Yorktown Heights, NY 10598 USA mgburke@us.ibm.com Ron K. Cytron Department
More informationA Compiler-Directed Cache Coherence Scheme Using Data Prefetching
A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer
More informationValue Range Analysis of Conditionally Updated Variables and Pointers
Value Range Analysis of Conditionally Updated Variables and Pointers Johnnie Birch Robert van Engelen Kyle Gallivan Department of Computer Science and School of Computational Science and Information Technology
More informationDepartment of. Computer Science. Uniqueness and Completeness. Analysis of Array. Comprehensions. December 15, Colorado State University
Department of Computer Science Uniqueness and Completeness Analysis of Array Comprehensions David Garza and Wim Bohm Technical Report CS-93-132 December 15, 1993 Colorado State University Uniqueness and
More informationExtra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987
Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is
More informationA Portable Parallel N-body Solver 3. Abstract. We present parallel solutions for direct and fast n-body solvers written in the ZPL
A Portable Parallel N-body Solver 3 E Christopher Lewis y Calvin Lin y Lawrence Snyder y George Turkiyyah z Abstract We present parallel solutions for direct and fast n-body solvers written in the ZPL
More informationDepartment of. Computer Science. Uniqueness Analysis of Array. Omega Test. October 21, Colorado State University
Department of Computer Science Uniqueness Analysis of Array Comprehensions Using the Omega Test David Garza and Wim Bohm Technical Report CS-93-127 October 21, 1993 Colorado State University Uniqueness
More informationThe LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization
The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization Lawrence Rauchwerger and David Padua University of Illinois at Urbana-Champaign Abstract Current
More informationAutomatic Array Alignment for. Mitsuru Ikei. Hitachi Chemical Company Ltd. Michael Wolfe. Oregon Graduate Institute of Science & Technology
Automatic Array Alignment for Distributed Memory Multicomputers Mitsuru Ikei Hitachi Chemical Company Ltd. Michael Wolfe Oregon Graduate Institute of Science & Technology P.O. Box 91000 Portland OR 97291
More informationAn Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst
An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst
More informationc Copyright by Yunheung Paek 1997
ccopyright by Yunheung Paek 1997 COMPILING FOR DISTRIBUTED MEMORY MULTIPROCESSORS BASED ON ACCESS REGION ANALYSIS BY YUNHEUNG PAEK B.S., Seoul National University, 1988 M.S., Seoul National University,
More informationAppears in the proceedings of the ACM/IEEE Supercompuing 94
Appears in the proceedings of the ACM/IEEE Supercompuing 94 A Compiler-Directed Cache Coherence Scheme with Improved Intertask Locality Lynn Choi Pen-Chung Yew Center for Supercomputing R & D Department
More informationProgram Behavior Characterization through Advanced Kernel Recognition
Program Behavior Characterization through Advanced Kernel Recognition Manuel Arenaz, Juan Touriño, and Ramón Doallo Computer Architecture Group Department of Electronics and Systems University of A Coruña,
More informationThis broad range of applications motivates the design of static analysis techniques capable of automatically detecting commuting operations commutativ
On the Complexity of Commutativity Analysis Oscar Ibarra y,pedro Diniz z and Martin Rinard x Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fibarra,pedro,marting@cs.ucsb.edu
More informationUniversity of Ghent. St.-Pietersnieuwstraat 41. Abstract. Sucient and precise semantic information is essential to interactive
Visualizing the Iteration Space in PEFPT? Qi Wang, Yu Yijun and Erik D'Hollander University of Ghent Dept. of Electrical Engineering St.-Pietersnieuwstraat 41 B-9000 Ghent wang@elis.rug.ac.be Tel: +32-9-264.33.75
More informationRearrangement of DNA fragments: a branch-and-cut algorithm Abstract. In this paper we consider a problem that arises in the process of reconstruction
Rearrangement of DNA fragments: a branch-and-cut algorithm 1 C. E. Ferreira 1 C. C. de Souza 2 Y. Wakabayashi 1 1 Instituto de Mat. e Estatstica 2 Instituto de Computac~ao Universidade de S~ao Paulo e-mail:
More informationSimone Campanoni Loop transformations
Simone Campanoni simonec@eecs.northwestern.edu Loop transformations Outline Simple loop transformations Loop invariants Induction variables Complex loop transformations Simple loop transformations Simple
More informationCHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song
CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed
More informationExtending CRAFT Data-Distributions for Sparse Matrices. July 1996 Technical Report No: UMA-DAC-96/11
Extending CRAFT Data-Distributions for Sparse Matrices G. Bandera E.L. Zapata July 996 Technical Report No: UMA-DAC-96/ Published in: 2nd. European Cray MPP Workshop Edinburgh Parallel Computing Centre,
More informationAllowing Cycle-Stealing Direct Memory Access I/O. Concurrent with Hard-Real-Time Programs
To appear in: Int. Conf. on Parallel and Distributed Systems, ICPADS'96, June 3-6, 1996, Tokyo Allowing Cycle-Stealing Direct Memory Access I/O Concurrent with Hard-Real-Time Programs Tai-Yi Huang, Jane
More informationUnconstrained Optimization
Unconstrained Optimization Joshua Wilde, revised by Isabel Tecu, Takeshi Suzuki and María José Boccardi August 13, 2013 1 Denitions Economics is a science of optima We maximize utility functions, minimize
More informationZhiyuan Li. University of Minnesota. [PER89] and WHAMS3D [BT82], a program for structural. writing conicts result when the iterations are executed
Array Privatization for Parallel Execution of Loops Zhiyuan Li Department of Computer Science University of Minnesota 200 Union St. SE, Minneapolis, Minnesota 55455 li@cs.umn.edu Abstract In recent experiments,
More informationIdentifying Parallelism in Construction Operations of Cyclic Pointer-Linked Data Structures 1
Identifying Parallelism in Construction Operations of Cyclic Pointer-Linked Data Structures 1 Yuan-Shin Hwang Department of Computer Science National Taiwan Ocean University Keelung 20224 Taiwan shin@cs.ntou.edu.tw
More informationProfiling Dependence Vectors for Loop Parallelization
Profiling Dependence Vectors for Loop Parallelization Shaw-Yen Tseng Chung-Ta King Chuan-Yi Tang Department of Computer Science National Tsing Hua University Hsinchu, Taiwan, R.O.C. fdr788301,king,cytangg@cs.nthu.edu.tw
More informationtime using O( n log n ) processors on the EREW PRAM. Thus, our algorithm improves on the previous results, either in time complexity or in the model o
Reconstructing a Binary Tree from its Traversals in Doubly-Logarithmic CREW Time Stephan Olariu Michael Overstreet Department of Computer Science, Old Dominion University, Norfolk, VA 23529 Zhaofang Wen
More informationPPS : A Pipeline Path-based Scheduler. 46, Avenue Felix Viallet, Grenoble Cedex, France.
: A Pipeline Path-based Scheduler Maher Rahmouni Ahmed A. Jerraya Laboratoire TIMA/lNPG,, Avenue Felix Viallet, 80 Grenoble Cedex, France Email:rahmouni@verdon.imag.fr Abstract This paper presents a scheduling
More informationExtended SSA with factored use-def chains to support optimization and parallelism
Oregon Health & Science University OHSU Digital Commons CSETech June 1993 Extended SSA with factored use-def chains to support optimization and parallelism Eric Stoltz Michael P. Gerlek Michael Wolfe Follow
More informationNetwork. Department of Statistics. University of California, Berkeley. January, Abstract
Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,
More informationICC++ Language Denition. Andrew A. Chien and Uday S. Reddy 1. May 25, 1995
ICC++ Language Denition Andrew A. Chien and Uday S. Reddy 1 May 25, 1995 Preface ICC++ is a new dialect of C++ designed to support the writing of both sequential and parallel programs. Because of the signicant
More informationA Parallel Intermediate Representation based on. Lambda Expressions. Timothy A. Budd. Oregon State University. Corvallis, Oregon.
A Parallel Intermediate Representation based on Lambda Expressions Timothy A. Budd Department of Computer Science Oregon State University Corvallis, Oregon 97331 budd@cs.orst.edu September 20, 1994 Abstract
More informationRevised version, February 1991, appeared in Information Processing Letters 38 (1991), 123{127 COMPUTING THE MINIMUM HAUSDORFF DISTANCE BETWEEN
Revised version, February 1991, appeared in Information Processing Letters 38 (1991), 123{127 COMPUTING THE MINIMUM HAUSDORFF DISTANCE BETWEEN TWO POINT SETS ON A LINE UNDER TRANSLATION Gunter Rote Technische
More informationUniversity oftoronto. Queens University. Abstract. This paper gives an overview of locality enhancement techniques
Locality Enhancement for Large-Scale Shared-Memory Multiprocessors Tarek Abdelrahman 1, Naraig Manjikian 2,GaryLiu 3 and S. Tandri 3 1 Department of Electrical and Computer Engineering University oftoronto
More informationB.S., University of Nebraska{Lincoln, 1986 THESIS. Submitted in partial fulllment of the requirements. Urbana, Illinois
EVALUATION OF PROGRAMS AND PARALLELIZING COMPILERS USING DYNAMIC ANALYSIS TECHNIQUES BY PAUL MARX PETERSEN B.S., University of Nebraska{Lincoln, 1986 M.S., University of Illinois at Urbana{Champaign, 1989
More informationData Dependences and Parallelization
Data Dependences and Parallelization 1 Agenda Introduction Single Loop Nested Loops Data Dependence Analysis 2 Motivation DOALL loops: loops whose iterations can execute in parallel for i = 11, 20 a[i]
More informationParallelizing While Loops for Multiprocessor Systems
CSRD Technical Report 1349 Parallelizing While Loops for Multiprocessor Systems Lawrence Rauchwerger and David Padua Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign
More informationDraft. Debugging of Optimized Code through. Comparison Checking. Clara Jaramillo, Rajiv Gupta and Mary Lou Soa. Abstract
Draft Debugging of Optimized Code through Comparison Checking Clara Jaramillo, Rajiv Gupta and Mary Lou Soa Abstract We present a new approach to the debugging of optimized code through comparison checking.
More informationMachine-Independent Optimizations
Chapter 9 Machine-Independent Optimizations High-level language constructs can introduce substantial run-time overhead if we naively translate each construct independently into machine code. This chapter
More informationLecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1
CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the
More informationParallelizing While Loops for Multiprocessor Systems y
Submitted to IEEE Transactions on Parallel and Distributed Systems. (CSRD Technical Report 1349.) Parallelizing While Loops for Multiprocessor Systems y Lawrence Rauchwerger and David Padua Center for
More informationSYSTEMS MEMO #12. A Synchronization Library for ASIM. Beng-Hong Lim Laboratory for Computer Science.
ALEWIFE SYSTEMS MEMO #12 A Synchronization Library for ASIM Beng-Hong Lim (bhlim@masala.lcs.mit.edu) Laboratory for Computer Science Room NE43-633 January 9, 1992 Abstract This memo describes the functions
More informationIn C.-H. Huang, P.Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua
In C.-H. Huang, P.Sadayappan, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua (Editors), Languages and Compilers for Parallel Computing, pages 269-288. Springer-Verlag, August 1995. (8th International
More informationNull space basis: mxz. zxz I
Loop Transformations Linear Locality Enhancement for ache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a matrix of the loop nest. dependence
More informationUMIACS-TR December, CS-TR-3250 Revised March, Static Analysis of Upper and Lower Bounds. on Dependences and Parallelism
UMIACS-TR-94-40 December, 1992 CS-TR-3250 Revised March, 1994 Static Analysis of Upper and Lower Bounds on Dependences and Parallelism William Pugh pugh@cs.umd.edu Institute for Advanced Computer Studies
More informationCompiler Construction 2010/2011 Loop Optimizations
Compiler Construction 2010/2011 Loop Optimizations Peter Thiemann January 25, 2011 Outline 1 Loop Optimizations 2 Dominators 3 Loop-Invariant Computations 4 Induction Variables 5 Array-Bounds Checks 6
More informationLecture Notes on Cache Iteration & Data Dependencies
Lecture Notes on Cache Iteration & Data Dependencies 15-411: Compiler Design André Platzer Lecture 23 1 Introduction Cache optimization can have a huge impact on program execution speed. It can accelerate
More informationThe Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor
IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 6, June 1994, pp. 573-584.. The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared-Memory Multiprocessor David J.
More informationConvolutional Neural Networks for Object Classication in CUDA
Convolutional Neural Networks for Object Classication in CUDA Alex Krizhevsky (kriz@cs.toronto.edu) April 16, 2009 1 Introduction Here I will present my implementation of a simple convolutional neural
More informationDocument Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.
Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips
More informationTo appear: Proceedings of Supercomputing '92 1. Scott A. Mahlke William Y. Chen John C. Gyllenhaal Wen-mei W. Hwu
To appear: Proceedings of Supercomputing '9 Compiler Code Transformations for Superscalar-Based High-Performance Systems Scott A. Mahlke William Y. Chen John C. Gyllenhaal Wen-mei W. Hwu Center for Reliable
More informationPrinciple of Complier Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore
Principle of Complier Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Lecture - 20 Intermediate code generation Part-4 Run-time environments
More informationof Perceptron. Perceptron CPU Seconds CPU Seconds Per Trial
Accelerated Learning on the Connection Machine Diane J. Cook Lawrence B. Holder University of Illinois Beckman Institute 405 North Mathews, Urbana, IL 61801 Abstract The complexity of most machine learning
More informationSearching Algorithms/Time Analysis
Searching Algorithms/Time Analysis CSE21 Fall 2017, Day 8 Oct 16, 2017 https://sites.google.com/a/eng.ucsd.edu/cse21-fall-2017-miles-jones/ (MinSort) loop invariant induction Loop invariant: After the
More informationCompiler Construction 2016/2017 Loop Optimizations
Compiler Construction 2016/2017 Loop Optimizations Peter Thiemann January 16, 2017 Outline 1 Loops 2 Dominators 3 Loop-Invariant Computations 4 Induction Variables 5 Array-Bounds Checks 6 Loop Unrolling
More informationCompiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore
Compiler Design Prof. Y. N. Srikant Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No. # 10 Lecture No. # 16 Machine-Independent Optimizations Welcome to the
More informationCache-Oblivious Traversals of an Array s Pairs
Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious
More informationAlgebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee. The Chinese University of Hong Kong.
Algebraic Properties of CSP Model Operators? Y.C. Law and J.H.M. Lee Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR, China fyclaw,jleeg@cse.cuhk.edu.hk
More informationSingle-Pass Generation of Static Single Assignment Form for Structured Languages
1 Single-Pass Generation of Static Single Assignment Form for Structured Languages MARC M. BRANDIS and HANSPETER MÖSSENBÖCK ETH Zürich, Institute for Computer Systems Over the last few years, static single
More information