1 Linear Loop Transformations for Locality Enhancement
2 Story so far: Cache performance can be improved by tiling and permutation. Permutation of a perfectly nested loop can be modeled as a linear transformation on the iteration space of the loop nest. Legality of permutation can be determined from the dependence matrix of the loop nest. Transformed code can be generated using an ILP calculator.
3 Theory for permutations applies to other loop transformations that can be modeled as linear transformations: skewing, reversal, scaling. Transformation matrix: T (a non-singular matrix). Dependence matrix: D, in which each column is a distance/direction vector. Dependence matrix of transformed program: T*D. Legality: every column of T*D must be lexicographically positive. Small complication with code generation if scaling is included.
4 Overview of lecture: Linear loop transformations: permutation, skewing, reversal, scaling. Compositions of linear loop transformations: matrix viewpoint. Locality matrix: model for temporal and spatial locality. Two key ideas: 1. Height reduction. 2. Making a loop nest fully permutable. A unified algorithm for improving cache performance. Lecture based on work we did for HP's compiler product line.
5 Key algorithm: finding a basis for the null space of an integer matrix. Definition: a vector x is in the null space of a matrix M if Mx = 0. It is easy to find a basis for the null space of a matrix in column-echelon form. General (m x z) matrix M: find a unimodular (z x z) matrix U so that MU = L is in column-echelon form. If x is a vector in the null space of L, Ux is in the null space of M. If B is a basis for the null space of L, UB is a basis for the null space of M.
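A minimal sketch of this idea in Python. The 1 x 2 matrix M below is illustrative (the slide's own example is garbled): reduce M to column-echelon form with a unimodular U, read off the null-space basis of L, and map it back through U.

```python
# Illustrative example: M = [1 -1]; any x with x1 = x2 is in its null space.
M = [[1, -1]]

# Unimodular column operation: add column 1 to column 2, i.e. U = [[1,1],[0,1]].
U = [[1, 1],
     [0, 1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

L = matmul(M, U)          # column-echelon form
assert L == [[1, 0]]

# Null space of L is spanned by e2 = (0,1); map it back through U.
e2 = [[0], [1]]
basis = matmul(U, e2)     # (1,1): a null-space basis vector for M
assert matmul(M, basis) == [[0]]
```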
6 Example: reduce M to column-echelon form L via M*U = L; a basis for the null space of M is obtained by applying U to the null-space basis of L. (The matrix entries of this slide were lost in transcription.)
7 Some observations: We have used reduction to column-echelon form: MU = L. We can also view this as a matrix decomposition: M = LQ, where Q is the inverse of the U from the previous step. Analogy: QR factorization of real matrices. Unimodular matrices are like orthogonal matrices in the reals. Special form of this decomposition: M = LQ where all diagonal elements of L are non-negative. This is called a pseudo-Hermite normal form of the matrix.
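Continuing the illustrative example above, the same reduction read backwards is the decomposition M = LQ with Q = U^{-1}:

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

M = [[1, -1]]
L = [[1, 0]]              # column-echelon form of M
Q = [[1, -1],             # Q = U^{-1}, the inverse of [[1,1],[0,1]]
     [0,  1]]

# The reduction M*U = L, read backwards, is the decomposition M = L*Q.
assert matmul(L, Q) == M

# Q is unimodular: |det Q| = 1.
det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
assert det in (1, -1)
```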
8 Motivating example: wavefront code.
DO I = 1,N
  DO J = 1,N
    X(I,J) = X(I-1,J+1)...
Dependence between two iterations: the iterations touch the same location. Dependence matrix = (1,-1)^T => potential for exploiting data reuse! There are N iteration points between executions of dependent iterations. Can we schedule dependent iterations closer together?
9 For now, focus only on reuse opportunities between dependent iterations. This exploits only one kind of temporal locality. There are other kinds of locality that are important: iterations that read from the same location (input dependences), and spatial locality. Both are easy to add to the basic model, as we will see later.
10 Exploiting temporal locality in wavefront: we have studied two transformations, permutation and tiling. Permutation is illegal! Tiling is illegal!
11 Height Reduction
12 One solution: schedule iterations along 45-degree planes! The transformation is legal, and dependent iterations are scheduled close together, so it is good for locality. Can we view this in terms of loop transformations? Loop skewing followed by loop reversal.
13 Loop skewing: a linear loop transformation. Skewing of inner loop by outer loop: (U,V)^T = T*(I,J)^T with T = [[1,0],[k,1]] (k is some fixed integer). Skewing of the inner loop by an outer loop is always legal. New dependence vectors: compute T*D. In this example, D = (1,-1)^T and T*D = (1,0)^T. This skewing has changed the dependence vector, but it has not brought dependent iterations closer together...
14 Skewing outer loop by inner loop: (U,V)^T = T*(I,J)^T with T = [[1,k],[0,1]]. Skewing of the outer loop by the inner loop is not necessarily legal. In this example, D = (1,-1)^T and T*D = (0,-1)^T, which is lexicographically negative: incorrect. Dependent iterations are closer together (good), but the program is illegal (bad). How do we fix this?
15 Loop reversal: a linear loop transformation. [U] = [-1][I]:
DO I = 1,N          DO U = -N,-1
  X(I) = I+2    vs    X(-U) = -U+2
Transformation matrix = [-1]. Another example: 2-D loop, reverse the inner loop: (U,V)^T = [[1,0],[0,-1]]*(I,J)^T. Legality of loop reversal: apply the transformation matrix to all dependences and verify that each remains lexicographically positive. Code generation: easy.
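The legality check can be sketched in a few lines of Python, using the wavefront dependence (1,-1) from the earlier slide as the example:

```python
def lex_positive(v):
    """True iff the first non-zero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def apply(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

d = [1, -1]                       # wavefront dependence vector

inner_reversal = [[1, 0],
                  [0, -1]]
assert lex_positive(apply(inner_reversal, d))      # (1,1): legal here

outer_reversal = [[-1, 0],
                  [0, 1]]
assert not lex_positive(apply(outer_reversal, d))  # (-1,-1): illegal
```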
16 Need for composite transformations. Transformation: skewing followed by reversal. In the final program, dependent iterations are close together! Composition of linear transformations = another linear transformation! The composite transformation matrix is [[1,1],[0,-1]] = [[1,0],[0,-1]] * [[1,1],[0,1]]. How do we synthesize this composite transformation?
17 Some facts about permutation/reversal/skewing: Transformation matrices for permutation/reversal/skewing are unimodular. Any composition of these transformations can be represented by a unimodular matrix. Any unimodular matrix can be decomposed into a product of permutation/reversal/skewing matrices. Legality of a composite transformation T: check that every column of T*D is lexicographically positive. (Proof: T3(T2(T1*D)) = (T3*T2*T1)*D.) Code generation algorithm: original bounds A*I <= b; transformation U = T*I; new bounds computed from A*T^{-1}*U <= b.
18 Synthesizing composite transformations using matrix-based approaches: rather than reason about sequences of transformations, we can reason about the single matrix that represents the composite transformation. Enabling abstraction: the dependence matrix.
19 Height reduction: move reuse into inner loops! The dependence vector is (1,-1)^T. We prefer not to have dependent iterations in different outer-loop iterations, so the dependence vector of the transformed program should look like (0,+)^T. So T*(1,-1)^T = (0,+)^T. This says the first row of T is orthogonal to (1,-1)^T, so the first row of T can be chosen to be (1 1). What about the second row?
20 The second row of T (call it s) should satisfy the following properties: s should be linearly independent of the first row of T (non-singular matrix); T*D should be a lexicographically positive vector; the determinant of the matrix should be +1 or -1 (unimodularity). One choice: (0 -1), giving T = [[1,1],[0,-1]].
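All three conditions on the chosen T can be checked directly; a small Python sketch for the wavefront:

```python
# Wavefront: d = (1,-1); first row (1,1) orthogonal to d, second row (0,-1).
d = [1, -1]
T = [[1, 1],
     [0, -1]]

def apply(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

Td = apply(T, d)
assert Td[0] == 0                  # first row is orthogonal to d
assert Td == [0, 1]                # T*d is lexicographically positive

det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
assert det in (1, -1)              # unimodular, so the rows are independent
```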
21 General questions: 1. How do we choose the first few rows of the transformation matrix to push reuses down? 2. How do we fill in the rest of the matrix to get a unimodular matrix?
22 Choosing the first few rows of the transformation matrix: for now, assume the dependence matrix D has only distances in it. Algorithm: let B be a basis for the null space of D^T; use the transposes of the basis vectors as the first few rows of the transformation matrix. Example: study the running example in the previous slides.
23 Extension to direction vectors: easy. Consider D = (+, +, 2, -2)^T, whose first two rows contain direction entries. Let the first row of the transformation be (r1 r2 r3 r4). Since this row must be orthogonal to D, it is clear that r1 and r2 must be zero. Ignore the 1st and 2nd rows of D and find a basis for the null space of the remaining rows: (1 1). Pad with zeros to get the partial transformation: (0 0 1 1). General picture: knock out the rows of D with direction entries, use the algorithm for distance vectors on what remains, and pad the resulting null-space vectors with 0s.
24 Filling in the rest of the matrix: a completion procedure generates a non-singular matrix T given the first few rows of the matrix. It turns out to be easier if we drop the unimodularity requirement. Problem: how do we generate code if the transformation matrix is a general non-singular matrix?
25 Completion procedure. Goal: generate a complete transformation from a partially specified one. Given: a dependence matrix D and a partial transformation P (the first few rows). Precondition: the rows of P are linearly independent, and P*D violates no dependences yet (no column of P*D is lexicographically negative). Generate a non-singular matrix T that extends P and that satisfies all the dependences in D (every column of T*D lexicographically positive).
26 Completion algorithm (iterative): 1. From D, delete all dependences d for which P*d is lexicographically positive. 2. If D is now empty, use null-space basis determination to add more rows to P to generate a non-singular matrix. 3. Otherwise, let k be the first row of (reduced) D that has a non-zero entry. Use e_k (the unit row vector with 1 in column k and zero everywhere else) as the next row of the transformation. 4. Repeat from step (1). Proof: need to show that e_k is linearly independent of the existing rows of the partial transformation and does not violate any dependences. Easy (see the Li/Pingali paper).
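A runnable sketch of this completion loop, under one simplifying assumption: when the reduced D becomes empty, we extend P with unit rows that raise its rank, standing in for the slide's null-space-basis step (the result is still non-singular, though the specific rows may differ from a true null-space completion).

```python
from fractions import Fraction

def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False

def rank(rows):
    """Rank via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        piv = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def complete(P, D, n):
    """Extend partial transformation P (list of rows) to a non-singular
    n x n matrix satisfying the dependences D (list of column vectors)."""
    P = [list(r) for r in P]
    D = [list(d) for d in D]
    while len(P) < n:
        # Step 1: delete dependences already satisfied by P.
        D = [d for d in D
             if not lex_positive([sum(p[i] * d[i] for i in range(n)) for p in P])]
        if not D:
            # Step 2: fill out with independent unit rows (see note above).
            for k in range(n):
                e = [int(i == k) for i in range(n)]
                if rank(P + [e]) > rank(P):
                    P.append(e)
                    if len(P) == n:
                        break
            break
        # Step 3: first row of the reduced D with a non-zero entry.
        k = next(i for i in range(n) if any(d[i] != 0 for d in D))
        P.append([int(i == k) for i in range(n)])
    return P

# Wavefront: D = {(1,-1)}, partial transformation (1 1).
T = complete([[1, 1]], [[1, -1]], 2)
assert T == [[1, 1], [1, 0]]
assert lex_positive([sum(r[i] * [1, -1][i] for i in range(2)) for r in T])
```

Note the filter keeps a dependence only while P*d is not yet lexicographically positive, which is exactly the deletion rule of step 1.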
27 Example: partial transformation = (0 0 1 1). A dependence is satisfied by the partial transformation when its product with the rows chosen so far is lexicographically positive; deleting the satisfied dependences leaves a new, smaller partial dependence matrix. (The matrix entries of this slide were lost in transcription.)
28 All dependences are now satisfied, so we are left with the problem of completing the transformation with rows independent of the existing ones. A basis for the null space of the partial transformation supplies the remaining rows of T.
29 Problem: the transformation matrix T we obtain is non-singular, but it may not be unimodular. Two solutions, based on the pseudo-Hermite normal form decomposition T = LQ, where L is lower-triangular with positive diagonal elements: 1. Use Q as the transformation matrix. 2. Develop code-generation technology for non-singular matrices. Question: how do we know this is legal, and is good for locality? First, let us understand what transformations are modeled by non-singular matrices that are not unimodular.
30 Loop scaling: change the step size of a loop.
DO I = 1,N        DO U = 2,2*N,2
  Y(I) = I    vs    Y(U/2) = U/2
Scaling matrix: [2], a non-unimodular transformation!
31 Non-singular linear loop transformations: permutation/skewing/reversal/scaling. Scaling, like reversal, is not important by itself. Any non-singular integer matrix can be decomposed into a product of permutation/skewing/reversal/scaling matrices. Standard decompositions: (pseudo-)Hermite form T = L*Q, where Q is unimodular and L is lower triangular with positive diagonal elements; (pseudo-)Smith normal form T = U*D*V, where U and V are unimodular and D is diagonal with positive elements. Including scaling in the repertoire of loop transformations makes it easier to synthesize transformations but complicates code generation.
32 Code generation with a non-singular transformation T. Code generation for a unimodular matrix U: original bounds A*I <= b; transformation J = U*I; new bounds computed from A*U^{-1}*J <= b. Key problem here: T^{-1} is not necessarily an integer matrix.
33 Difficulties. Example, with T = [[-2,4],[1,1]] (i.e. U = -2I+4J, V = I+J):
DO I = 1,3                DO U = -2,10,2
  DO J = 1,3         =>     DO V = -U/2 + 3*max(1,ceil((U/2+1)/2)), -U/2 + 3*min(3,floor((U/2+3)/2)), 3
    A(4J-2I+3,I+J) = J        A(U+3,V) = (U+2V)/6
Key problems: how do we determine integer lower and upper bounds? (Rounding up from rational bounds is not correct.) How do we determine step sizes? Solution: use the Hermite normal form decomposition T = L*Q.
34 Running example: factorize T = L*Q, with T = [[-2,4],[1,1]], L = [[2,0],[-1,3]], Q = [[-1,2],[0,1]].
Initial iteration space:
DO I = 1,3
  DO J = 1,3
    A(4J-2I+3,I+J) = J
Auxiliary iteration space (transform by Q, i.e. P = -I+2J, Q = J):
DO P = -1,5
  DO Q = max(1,ceil((P+1)/2)), min(3,floor((P+3)/2))
    A(2P+3,-P+3Q) = Q
Final iteration space (transform by L, i.e. U = 2P, V = -P+3Q):
DO U = -2,10,2
  DO V = -U/2 + 3*max(1,ceil((U/2+1)/2)), -U/2 + 3*min(3,floor((U/2+3)/2)), 3
    A(U+3,V) = (U+2V)/6
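The factorization and the generated bounds can be checked mechanically. The concrete L and Q below are one valid pseudo-Hermite factorization of T = [[-2,4],[1,1]] (an assumption spelled out here, since the slide's numbers are partly garbled); the test enumerates the final loop nest and compares it against the image of the original 3 x 3 iteration space under T.

```python
import math

T = [[-2, 4], [1, 1]]          # (U,V) = (4J-2I, I+J)
L = [[2, 0], [-1, 3]]          # lower triangular, positive diagonal
Q = [[-1, 2], [0, 1]]          # unimodular (det = -1)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

assert matmul(L, Q) == T

# Iteration points of the transformed space, computed directly from T.
direct = {(-2 * i + 4 * j, i + j) for i in range(1, 4) for j in range(1, 4)}

# Iteration points enumerated by the generated loop nest, with bounds read
# off from the auxiliary space and from L.
generated = set()
for u in range(-2, 11, 2):                   # DO U = -2,10,2
    p = u // 2
    qlo = max(1, math.ceil((p + 1) / 2))
    qhi = min(3, math.floor((p + 3) / 2))
    for q in range(qlo, qhi + 1):
        generated.add((u, -p + 3 * q))       # V = -U/2 + 3Q
assert generated == direct
```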
35 Using a non-singular matrix T as the transformation matrix: given T = L*Q, where L is lower-triangular with positive diagonal elements and Q is unimodular, use Q as the transformation to generate bounds for the auxiliary space, then read off the bounds for the final iteration space from L. Running example: [[-2,4],[1,1]] = [[2,0],[-1,3]] * [[-1,2],[0,1]]. Auxiliary bounds: -1 <= p <= 5, max(1, ceil((p+1)/2)) <= q <= min(3, floor((p+3)/2)). Final bounds, via (u,v)^T = L*(p,q)^T: -2 <= u <= 10 with step 2, and -u/2 + 3*max(1, ceil((u/2+1)/2)) <= v <= -u/2 + 3*min(3, floor((u/2+3)/2)) with step 3.
36 Change of variables in the body: (i,j)^T = T^{-1}*(u,v)^T = [[-1/6, 4/6],[1/6, 2/6]] * (u,v)^T. It is good to use strength reduction to replace the integer divisions and multiplications by additions. Step sizes of the loops: the diagonal entries of L. Bottom line: the transformation synthesis procedure can focus on producing a non-singular matrix; unimodularity is not important. A complete transformation with a non-singular matrix is also needed for generating code for distributed-memory parallel machines and similar problems (see Wolfe's book).
37 Solution 2 to code generation with non-singular transformation matrices. Lemma: if L is an n x n lower-triangular matrix with positive diagonal elements, L maps lexicographically positive vectors into lexicographically positive vectors. Proof: consider d = L*v where v is a lexicographically positive vector, and show that d is lexicographically positive as well. Intuition: L does not change the order in which iterations are done; it just renumbers them! Auxiliary-space iterations are performed in the same order as the corresponding iterations in the final space, so the memory system behavior of the auxiliary program is the same as the memory system behavior of the final program!
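The lemma is easy to check exhaustively on a small grid (the 2 x 2 matrix below is illustrative, not from the slides):

```python
def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False

# An arbitrary lower-triangular matrix with positive diagonal elements.
L = [[2, 0],
     [1, 3]]

def apply(L, v):
    return [sum(L[i][j] * v[j] for j in range(len(v))) for i in range(len(L))]

# Every lexicographically positive vector in a small grid stays
# lexicographically positive under L.
for a in range(-3, 4):
    for b in range(-3, 4):
        v = [a, b]
        if lex_positive(v):
            assert lex_positive(apply(L, v))
```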
38 For memory system optimization, we can use a non-singular transformation matrix T as follows: compute the Hermite normal form decomposition T = L*Q and use Q as the transformation matrix!
39 Putting it all together for the wavefront example:
DO I = 1,N
  DO J = 1,N
    X(I,J) = X(I-1,J+1)...
Dependence matrix = (1,-1)^T. Find rows orthogonal to the space spanned by the dependence vectors, and use them as the partial transformation. In this example, the orthogonal subspace is spanned by the vector (1 1). Apply the completion procedure to the partial transformation to generate the full transformation: we would add the row (1 0). So the transformation is [[1,1],[1,0]].
40 General picture: exploit dependences + spatial locality + read-only data reuse. 1. A dependence matrix D with dependence vectors. 2. A locality matrix L (containing D) with locality vectors. The locality matrix has additional vectors to account for spatial locality and read-only data reuse. Intuitively, we would like to scan along the directions in L in the innermost loops, provided the dependences permit it.
41 Modeling spatial locality: scan the iteration space points so that we get unit-stride accesses (or some multiple thereof). Example from Assignment 1:
DO I = ...
  DO J = ...
    C[I,J] = A[I,J] + C[J,I]
If the arrays are stored in column-major order, we should interchange the two loops. How do we determine this in general?
42 Spatial locality matrix: a matrix containing a basis for the null space of the truncated access matrix. Example: array stored in column-major order, reference X[A*i + a]. The array element accessed in iteration i1 is X[A*i1 + a]; the element accessed in iteration i2 is X[A*i2 + a]. Suppose we do iteration i1 and then iteration i2. We want these two accesses to be within a column of the array. So A'*(i1 - i2) = 0, where A' is A with its first row deleted. So we want to solve A'*R = 0, where R is the "reuse direction". Solution: R is any vector in the null space of A'.
43 In what direction(s) can we move so that we get accesses along the first dimension for the most part? Example: for a 2 x 2 access matrix A, delete the first row to get A' and take a basis of its null space; that basis is the spatial locality matrix. (The entries of this slide's example were lost in transcription.)
44 Example from Assignment 1: exploiting spatial locality in the MVM accesses to A.
FOR I = 1,N
  FOR J = 1,N
    Y(I) = Y(I) + A(I,J)*X(J)
The reference matrix for A is [[1,0],[0,1]]. The truncated matrix is (0 1). So the spatial locality matrix is (1 0)^T.
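This computation can be sketched directly:

```python
# Reference A(I,J) in the MVM loop: the access matrix is the 2x2 identity.
A_ref = [[1, 0],
         [0, 1]]

# Column-major storage: two accesses fall in the same column of A when all
# subscripts except the first agree, so delete the first row of A_ref.
A_trunc = A_ref[1:]            # the truncated matrix (0 1)

# A reuse direction R must satisfy A_trunc * R = 0; here R = (1,0) works.
R = [1, 0]
assert all(sum(row[j] * R[j] for j in range(2)) == 0 for row in A_trunc)
# Moving along I (with J fixed) walks down a column of A: unit stride.
```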
45 For spatial locality: perform height reduction on the locality matrix. For our example, we get the transformation [[0,1],[1,0]]. That is, we would permute the loops, which is correct.
46 Read-only data reuse: add reuse directions to the locality matrix.
DO I = 1,N
  DO J = 1,N
    DO K = 1,N
      C[I,J] = C[I,J] + A[I,K]*B[K,J]
Consider the reuse vector for the reference A[I,K]: iterations (I1,J1,K1) and (I2,J2,K2) touch the same element of A when I1 = I2 and K1 = K2, i.e. when I2 - I1 = 0 and K2 - K1 = 0, while J2 - J1 is unconstrained.
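A tiny brute-force check of this reuse condition on a 2 x 2 x 2 iteration space:

```python
from collections import defaultdict

# C[I,J] += A[I,K]*B[K,J]: which iterations read the same element of A?
readers = defaultdict(list)
n = 2
for i in range(n):
    for j in range(n):
        for k in range(n):
            readers[(i, k)].append((i, j, k))   # A[I,K] depends only on I and K

# Any two iterations touching the same A element differ only in J.
for its in readers.values():
    (i1, j1, k1), (i2, j2, k2) = its[0], its[-1]
    assert (i2 - i1, k2 - k1) == (0, 0) and j2 - j1 == 1
# So the reuse direction for A[I,K] is along J: the vector (0,1,0).
```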
47 The reuse vector is (0,1,0)^T. Considering all the references (A[I,K], B[K,J], C[I,J]), we get a locality matrix that contains (0,1,0)^T, (1,0,0)^T, and (0,0,1)^T.
48 General algorithm for height reduction: compute the locality matrix L, which contains all vectors along which we would like to exploit locality (dependences + read-only data reuse + spatial locality). Determine a basis for the null space of L^T and use the basis vectors as the first few rows of the transformation. Call the completion algorithm to generate a complete transformation T. Do a pseudo-Hermite form decomposition of the transformation matrix and use the unimodular part as the transformation. Flexibility: if the null space of L^T has only the zero vector in it, it may be good to drop some of the "less important" read-only reuse and spatial locality vectors from consideration.
49 In some codes, height reduction may fail since the locality vectors span the entire space. Example: MMM as shown on the previous slide (actually even MVM, if we want to exploit locality in both the matrix A and the vectors x and y). We can still exploit most of the locality provided we can tile the loops. Tiling is not always legal... Solution: apply linear loop transformations to enable tiling.
50 Linear loop transformations to enable tiling
51 In general, tiling is not legal: for the wavefront, D = (1,-1)^T and tiling is illegal! Tiling is legal if the loops are fully permutable (all permutations of the loops are legal); in particular, tiling is legal if all entries in the dependence matrix are non-negative. Can we always convert a perfectly nested loop nest into a fully permutable loop nest? When we can, how do we do it?
52 Theorem: if all dependence vectors are distance vectors, we can convert the entire loop nest into a fully permutable loop nest. General idea: skew inner loops by outer loops sufficiently to make all negative entries non-negative. Example: wavefront. The dependence matrix is (1,-1)^T, and the dependence matrix of the transformed program must have all non-negative entries. The first row of the transformation can be (1 0), and the second row of the transformation can be (m 1) (for any m >= 1).
53 Transformation to make the first row with negative entries into a row with non-negative entries: (a) for each negative entry in the first row with negative entries, find the first positive number in the corresponding column; call the rows containing these positive entries a, b, etc. (b) skew the row with negative entries by appropriate multiples of rows a, b, ... For the example on the slide (guard entry p2 above -n in one column, and guards p1 and p3 above -m and -k in the others): multiple of row a = ceiling(n/p2), multiple of row b = ceiling(max(m/p1, k/p3)). Transformation: the identity matrix with ceiling(n/p2) and ceiling(max(m/p1, k/p3)) placed in the skewed row, in the columns of rows a and b.
54 General algorithm for making a loop nest fully permutable: 1. If all entries in the dependence matrix are non-negative, done. 2. Otherwise, apply the algorithm on the previous slide to the first row with negative entries. 3. Generate the new dependence matrix. 4. If there are no negative entries, done; otherwise, go to step (2).
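A compact sketch of this skewing loop for the distance-only case. It assumes each negative entry has a positive guard entry above it in its column, which holds because every dependence column is a lexicographically positive distance vector:

```python
import math

def make_fully_permutable(D):
    """Skew each row with negative entries by multiples of earlier rows with
    positive guard entries in the same columns (distance vectors only).
    D is the dependence matrix as a list of rows; returns (T, new D)."""
    n = len(D)
    D = [row[:] for row in D]
    T = [[int(i == j) for j in range(n)] for i in range(n)]
    for r in range(n):
        for c, entry in enumerate(D[r]):
            if entry < 0:
                g = next(i for i in range(r) if D[i][c] > 0)   # guard row
                f = math.ceil(-entry / D[g][c])                # skew factor
                D[r] = [x + f * y for x, y in zip(D[r], D[g])]
                T[r] = [x + f * y for x, y in zip(T[r], T[g])]
    return T, D

# Wavefront: single dependence column (1,-1), i.e. rows [1] and [-1].
T, D2 = make_fully_permutable([[1], [-1]])
assert T == [[1, 0], [1, 1]]                 # skew inner loop by outer loop
assert all(x >= 0 for row in D2 for x in row)
```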
55 Result of tiling the transformed wavefront: tiling generates a 4-deep loop nest. Not as nice as the height reduction solution, but it will work fine for locality enhancement except at tile boundaries (and the number of boundary points is small compared to the number of interior points).
56 What happens with direction vectors? In general, we cannot make the loop nest fully permutable; the best we can do is to make some of the loops fully permutable. Example: D = (+, -, +)^T. We try to make the outermost loops fully permutable, so we would interchange the second and third loops, and then tile the first two loops only. Idea: the algorithm will find bands of fully permutable loop nests.
57 Example for the general algorithm: a 5-deep loop nest. Begin the first band of fully permutable loops with the first loop. The second row has a -ve direction entry which cannot be knocked out, but we can interchange the fifth row with the second row to continue the band. (The dependence matrix entries of this slide were lost in transcription.)
58 After the interchange, the new second loop can be part of the first fully permutable band. Knock out the -ve distances in the third row by adding 2 * the second row to the third row.
59 We cannot make the band any bigger, so terminate the band, drop all satisfied dependences, and start a second band. In this case, all dependences are satisfied, so the last two loops form a second band of fully permutable loops.
60 How do we determine the transformation matrix for performing these operations? If you have n loops, start with T = I (n x n). Every time you apply a transformation to the dependence matrix, apply the same transformation to T. At the end, T is your transformation matrix. (Proof sketch: U2*(U1*D) = (U2*(U1*I))*D.)
61 For the previous example, we start with the 5 x 5 identity; interchanging the second and fifth rows gives
1 0 0 0 0
0 0 0 0 1
0 0 1 0 0
0 0 0 1 0
0 1 0 0 0
62 Adding 2 * the second row to the third row then gives
1 0 0 0 0
0 0 0 0 1
0 0 1 0 2
0 0 0 1 0
0 1 0 0 0
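The same bookkeeping, replayed in a few lines of Python:

```python
# Start with the 5x5 identity and replay the two row operations from the
# example: interchange rows 2 and 5, then add 2*(new row 2) to row 3.
n = 5
T = [[int(i == j) for j in range(n)] for i in range(n)]

T[1], T[4] = T[4], T[1]                        # interchange rows 2 and 5
T[2] = [x + 2 * y for x, y in zip(T[2], T[1])]  # row 3 += 2 * row 2

assert T == [[1, 0, 0, 0, 0],
             [0, 0, 0, 0, 1],
             [0, 0, 1, 0, 2],
             [0, 0, 0, 1, 0],
             [0, 1, 0, 0, 0]]
```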
63 Conversion to fully permutable bands: 1. Compute the dependence matrix D. Current-row = row 1. 2. Begin a new fully permutable band. 3. If all entries in D are non-negative, add all remaining loops to the current band and exit. 4. Otherwise, find the first row r with one or more negative entries. Each negative entry must be guarded by a positive entry above it in its dependence vector. Add all loops between current-row and row (r-1) to the current band. 5. If row r does not have both positive and negative directions, fix it as follows: if all direction entries are negative, negate the row. At this point, all direction entries must be positive, and any remaining negative entries must be distances. If negative entries remain, determine the appropriate multiples of the guard
64 rows to be added to row r (treating "+" guard entries as 1) to convert all negative entries in row r to strictly positive. Update the transformation matrix and add row r to the current band. Current-row = r+1. Go to step 3. 6. Otherwise, see if a row below the current row that does not have both positive and negative directions can be interchanged legally with the current row. If so, perform the interchange, update the transformation matrix, and go to step 4. Otherwise, terminate the current fully permutable band, drop all satisfied dependence vectors, and go to step 2.
65 ompute locality matrix L.. Perform height reduction by nding a basis for null space of 2. important locality vectors. less ompute resulting dependence matrix after height reduction. 3. all entries are non-negative, declare loop nest to be fully If permutable. Otherwise, apply algorithm to convert to bands of fully 4. loop nests. permutable Perform pseudo-hermite normal form decomposition of 5. transformation matrix, and use unimodular extract to resulting code transformation. perform Tile fully permutable bands choosing appropriate tile sizes. 6. Ye Grande Locality Enhancement Algorithm L T. If null space is too small, you have the option of dropping 65
66 Summary: Height reduction can also be viewed as outer-loop parallelization: if no dependences are carried by the outer loops, the outer loops are parallel! First paper on transformation synthesis in this style: Leslie Lamport (1969) on multiprocessor scheduling. The algorithm here is based on Wei Li's Cornell PhD thesis (Wei is now a compiler hot-shot at Intel!). A version of this algorithm was implemented by us in HP's compiler product line. It is easy to combine height reduction and conversion to fully permutable band form into a single transformation matrix synthesis, but you can work that out for yourself.
More informationi=1 i=2 i=3 i=4 i=5 x(4) x(6) x(8)
Vectorization Using Reversible Data Dependences Peiyi Tang and Nianshu Gao Technical Report ANU-TR-CS-94-08 October 21, 1994 Vectorization Using Reversible Data Dependences Peiyi Tang Department of Computer
More informationArray Optimizations in OCaml
Array Optimizations in OCaml Michael Clarkson Cornell University clarkson@cs.cornell.edu Vaibhav Vaish Cornell University vaibhav@cs.cornell.edu May 7, 2001 Abstract OCaml is a modern programming language
More informationLARP / 2018 ACK : 1. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates
Triangular Factors and Row Exchanges LARP / 28 ACK :. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates Then there were three
More informationMultiple View Geometry in Computer Vision
Multiple View Geometry in Computer Vision Prasanna Sahoo Department of Mathematics University of Louisville 1 Projective 3D Geometry (Back to Chapter 2) Lecture 6 2 Singular Value Decomposition Given a
More informationLecture 11 Loop Transformations for Parallelism and Locality
Lecture 11 Loop Transformations for Parallelism and Locality 1. Examples 2. Affine Partitioning: Do-all 3. Affine Partitioning: Pipelining Readings: Chapter 11 11.3, 11.6 11.7.4, 11.9-11.9.6 1 Shared Memory
More informationData-centric Transformations for Locality Enhancement
Data-centric Transformations for Locality Enhancement Induprakas Kodukula Keshav Pingali September 26, 2002 Abstract On modern computers, the performance of programs is often limited by memory latency
More information(Sparse) Linear Solvers
(Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert
More information: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016
ETH login ID: (Please print in capital letters) Full name: 263-2300: How to Write Fast Numerical Code ETH Computer Science, Spring 2016 Midterm Exam Wednesday, April 20, 2016 Instructions Make sure that
More informationExact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer Science The Australian National University Canberra ACT 26
Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Technical Report ANU-TR-CS-92- November 7, 992 Exact Side Eects for Interprocedural Dependence Analysis Peiyi Tang Department of Computer
More informationCOMP 558 lecture 19 Nov. 17, 2010
COMP 558 lecture 9 Nov. 7, 2 Camera calibration To estimate the geometry of 3D scenes, it helps to know the camera parameters, both external and internal. The problem of finding all these parameters is
More informationAn Improved Measurement Placement Algorithm for Network Observability
IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 16, NO. 4, NOVEMBER 2001 819 An Improved Measurement Placement Algorithm for Network Observability Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper
More informationComputational Geometry
Lecture 12: Lecture 12: Motivation: Terrains by interpolation To build a model of the terrain surface, we can start with a number of sample points where we know the height. Lecture 12: Motivation: Terrains
More informationFinding two vertex connected components in linear time
TO 1 Finding two vertex connected components in linear time Guy Kortsarz TO 2 ackward edges Please read the DFS algorithm (its in the lecture notes). The DFS gives a tree. The edges on the tree are called
More informationComputing and Informatics, Vol. 36, 2017, , doi: /cai
Computing and Informatics, Vol. 36, 2017, 566 596, doi: 10.4149/cai 2017 3 566 NESTED-LOOPS TILING FOR PARALLELIZATION AND LOCALITY OPTIMIZATION Saeed Parsa, Mohammad Hamzei Department of Computer Engineering
More information2. Mathematical Notation Our mathematical notation follows Dijkstra [5]. Quantication over a dummy variable x is written (Q x : R:x : P:x). Q is the q
Parallel Processing Letters c World Scientic Publishing Company ON THE SPACE-TIME MAPPING OF WHILE-LOOPS MARTIN GRIEBL and CHRISTIAN LENGAUER Fakultat fur Mathematik und Informatik Universitat Passau D{94030
More informationParallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
Spring 2 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Outline Why Parallelism Parallel Execution Parallelizing Compilers
More informationAN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1
AN EFFICIENT IMPLEMENTATION OF NESTED LOOP CONTROL INSTRUCTIONS FOR FINE GRAIN PARALLELISM 1 Virgil Andronache Richard P. Simpson Nelson L. Passos Department of Computer Science Midwestern State University
More informationInteger Programming Theory
Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x
More informationLecture 5: Duality Theory
Lecture 5: Duality Theory Rajat Mittal IIT Kanpur The objective of this lecture note will be to learn duality theory of linear programming. We are planning to answer following questions. What are hyperplane
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationLecture 18 Restoring Invariants
Lecture 18 Restoring Invariants 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning In this lecture we will implement heaps and operations on them. The theme of this lecture is reasoning
More informationOutline. Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities
Parallelization Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Moore s Law From Hennessy and Patterson, Computer Architecture:
More informationc Nawaaz Ahmed 000 ALL RIGHTS RESERVED
LOCALITY ENHANCEMENT OF IMPERFECTLY-NESTED LOOP NESTS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulllment of the Requirements for the Degree of Doctor
More informationExploring Parallelism At Different Levels
Exploring Parallelism At Different Levels Balanced composition and customization of optimizations 7/9/2014 DragonStar 2014 - Qing Yi 1 Exploring Parallelism Focus on Parallelism at different granularities
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationLecture 3: Camera Calibration, DLT, SVD
Computer Vision Lecture 3 23--28 Lecture 3: Camera Calibration, DL, SVD he Inner Parameters In this section we will introduce the inner parameters of the cameras Recall from the camera equations λx = P
More informationAn Ecient Approximation Algorithm for the. File Redistribution Scheduling Problem in. Fully Connected Networks. Abstract
An Ecient Approximation Algorithm for the File Redistribution Scheduling Problem in Fully Connected Networks Ravi Varadarajan Pedro I. Rivera-Vega y Abstract We consider the problem of transferring a set
More informationEssential constraints: Data Dependences. S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2
Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 Essential constraints: Data Dependences S1: a = b + c S2: d = a * 2 S3: a = c + 2 S4: e = d + c + 2 S2
More informationComputational Methods CMSC/AMSC/MAPL 460. Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science
Computational Methods CMSC/AMSC/MAPL 460 Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science Some special matrices Matlab code How many operations and memory
More informationThe Polyhedral Model (Transformations)
The Polyhedral Model (Transformations) Announcements HW4 is due Wednesday February 22 th Project proposal is due NEXT Friday, extension so that example with tool is possible (see resources website for
More information15.082J and 6.855J. Lagrangian Relaxation 2 Algorithms Application to LPs
15.082J and 6.855J Lagrangian Relaxation 2 Algorithms Application to LPs 1 The Constrained Shortest Path Problem (1,10) 2 (1,1) 4 (2,3) (1,7) 1 (10,3) (1,2) (10,1) (5,7) 3 (12,3) 5 (2,2) 6 Find the shortest
More informationDual-fitting analysis of Greedy for Set Cover
Dual-fitting analysis of Greedy for Set Cover We showed earlier that the greedy algorithm for set cover gives a H n approximation We will show that greedy produces a solution of cost at most H n OPT LP
More informationLecture Notes for Chapter 2: Getting Started
Instant download and all chapters Instructor's Manual Introduction To Algorithms 2nd Edition Thomas H. Cormen, Clara Lee, Erica Lin https://testbankdata.com/download/instructors-manual-introduction-algorithms-2ndedition-thomas-h-cormen-clara-lee-erica-lin/
More informationRanking Functions. Linear-Constraint Loops
for Linear-Constraint Loops Amir Ben-Amram 1 for Loops Example 1 (GCD program): while (x > 1, y > 1) if x
More informationColumn and row space of a matrix
Column and row space of a matrix Recall that we can consider matrices as concatenation of rows or columns. c c 2 c 3 A = r r 2 r 3 a a 2 a 3 a 2 a 22 a 23 a 3 a 32 a 33 The space spanned by columns of
More informationTilings of the Euclidean plane
Tilings of the Euclidean plane Yan Der, Robin, Cécile January 9, 2017 Abstract This document gives a quick overview of a eld of mathematics which lies in the intersection of geometry and algebra : tilings.
More informationIntroduction to Software Engineering
Introduction to Software Engineering (CS350) Lecture 17 Jongmoon Baik Testing Conventional Applications 2 Testability Operability it operates cleanly Observability the results of each test case are readily
More informationCalibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland
Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland New Zealand Tel: +64 9 3034116, Fax: +64 9 302 8106
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationAM 121: Intro to Optimization Models and Methods Fall 2017
AM 121: Intro to Optimization Models and Methods Fall 2017 Lecture 10: Dual Simplex Yiling Chen SEAS Lesson Plan Interpret primal simplex in terms of pivots on the corresponding dual tableau Dictionaries
More informationThe Matrix-Tree Theorem and Its Applications to Complete and Complete Bipartite Graphs
The Matrix-Tree Theorem and Its Applications to Complete and Complete Bipartite Graphs Frankie Smith Nebraska Wesleyan University fsmith@nebrwesleyan.edu May 11, 2015 Abstract We will look at how to represent
More informationNotes for Lecture 20
U.C. Berkeley CS170: Intro to CS Theory Handout N20 Professor Luca Trevisan November 13, 2001 Notes for Lecture 20 1 Duality As it turns out, the max-flow min-cut theorem is a special case of a more general
More informationSegmentation: Clustering, Graph Cut and EM
Segmentation: Clustering, Graph Cut and EM Ying Wu Electrical Engineering and Computer Science Northwestern University, Evanston, IL 60208 yingwu@northwestern.edu http://www.eecs.northwestern.edu/~yingwu
More informationOptimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
More informationReducing Memory Requirements of Nested Loops for Embedded Systems
Reducing Memory Requirements of Nested Loops for Embedded Systems 23.3 J. Ramanujam Λ Jinpyo Hong Λ Mahmut Kandemir y A. Narayan Λ Abstract Most embedded systems have limited amount of memory. In contrast,
More informationIncreasing Parallelism of Loops with the Loop Distribution Technique
Increasing Parallelism of Loops with the Loop Distribution Technique Ku-Nien Chang and Chang-Biau Yang Department of pplied Mathematics National Sun Yat-sen University Kaohsiung, Taiwan 804, ROC cbyang@math.nsysu.edu.tw
More informationLecture 57 Dynamic Programming. (Refer Slide Time: 00:31)
Programming, Data Structures and Algorithms Prof. N.S. Narayanaswamy Department of Computer Science and Engineering Indian Institution Technology, Madras Lecture 57 Dynamic Programming (Refer Slide Time:
More informationTheory and Algorithms for the Generation and Validation of Speculative Loop Optimizations
Theory and Algorithms for the Generation and Validation of Speculative Loop Optimizations Ying Hu Clark Barrett Benjamin Goldberg Department of Computer Science New York University yinghubarrettgoldberg
More informationChapter 3. Quadric hypersurfaces. 3.1 Quadric hypersurfaces Denition.
Chapter 3 Quadric hypersurfaces 3.1 Quadric hypersurfaces. 3.1.1 Denition. Denition 1. In an n-dimensional ane space A; given an ane frame fo;! e i g: A quadric hypersurface in A is a set S consisting
More informationCluster algebras and infinite associahedra
Cluster algebras and infinite associahedra Nathan Reading NC State University CombinaTexas 2008 Coxeter groups Associahedra and cluster algebras Sortable elements/cambrian fans Infinite type Much of the
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationCatalan Numbers. Table 1: Balanced Parentheses
Catalan Numbers Tom Davis tomrdavis@earthlink.net http://www.geometer.org/mathcircles November, 00 We begin with a set of problems that will be shown to be completely equivalent. The solution to each problem
More informationCOT 6936: Topics in Algorithms! Giri Narasimhan. ECS 254A / EC 2443; Phone: x3748
COT 6936: Topics in Algorithms! Giri Narasimhan ECS 254A / EC 2443; Phone: x3748 giri@cs.fiu.edu http://www.cs.fiu.edu/~giri/teach/cot6936_s12.html https://moodle.cis.fiu.edu/v2.1/course/view.php?id=174
More informationLinear Equations in Linear Algebra
1 Linear Equations in Linear Algebra 1.2 Row Reduction and Echelon Forms ECHELON FORM A rectangular matrix is in echelon form (or row echelon form) if it has the following three properties: 1. All nonzero
More information/ Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang
600.469 / 600.669 Approximation Algorithms Lecturer: Michael Dinitz Topic: Linear Programming Date: 2/24/15 Scribe: Runze Tang 9.1 Linear Programming Suppose we are trying to approximate a minimization
More informationModule 18: Loop Optimizations Lecture 35: Amdahl s Law. The Lecture Contains: Amdahl s Law. Induction Variable Substitution.
The Lecture Contains: Amdahl s Law Induction Variable Substitution Index Recurrence Loop Unrolling Constant Propagation And Expression Evaluation Loop Vectorization Partial Loop Vectorization Nested Loops
More information