ADJUSTING A PROGRAM TRANSFORMATION FOR LEGALITY


Parallel Processing Letters
© World Scientific Publishing Company

CÉDRIC BASTOUL
Laboratoire PRiSM, Université de Versailles Saint Quentin
45 avenue des États-Unis, 78035 Versailles Cedex, France
cedric.bastoul@prism.uvsq.fr

and

PAUL FEAUTRIER
LIP, École Normale Supérieure de Lyon
46 Allée d'Italie, 69364 Lyon, France
paul.feautrier@ens-lyon.fr

Received (August th 24)
Revised (December 2th 24)
Communicated by (Marco Danelutto, Domenico Laforenza, Marco Vanneschi)

ABSTRACT

Program transformations are one of the most valuable compiler techniques to improve parallelism or data locality. However, restructuring compilers have a hard time coping with data dependences. A typical solution is to focus on program parts where the dependences are simple enough to enable any transformation. For more complex problems, only the question of checking whether a transformation is legal or not is addressed. In this paper we propose to go further. Starting from a transformation with no guarantee on legality, we show how we can correct it for dependence satisfaction. Two directions are explored: first when transformation properties can be explicitly expressed and second when they are implicit as in the data locality transformation case. Generating code having the best properties is a direct application of this result.

Keywords: Program transformations, legality, dependences, polyhedral model, locality.

1. Introduction

The task of optimizing compute-bound programs is crucial for present day supercomputers if we notice that most of these machines run at a few percent of their peak performance. The problem can be stated as a combinatorial optimization problem, but due to the complexity of real-life programs and computers, this approach is not practical. Most of the time, we start from a first implementation, and try to improve its performance by successive transformations. Besides improving performance, a transformation must be legal, i.e. must not change the final results of the program. This is usually enforced by using only transformations that respect dependences [22]. While selecting an optimizing transformation is not too difficult for an experienced programmer, adjusting this transformation for legality is a tedious and error-prone process.

To bypass the dependence problem, most of the existing methods apply only to perfect loop nests in which dependences are non-existent or have a special form (fully permutable loop nests) [27]. To enlarge their application domain some preprocessing, e.g. loop skewing or code sinking, may be applied [27,,4]. More ambitious techniques do not lay down any requirement on dependences, but are limited to proposing solution candidates and then checking them for legality [7,8]. If a candidate is proved to violate dependences, the proposed transformation is discarded and another candidate, perhaps having less interesting properties, is studied.

In this paper, we present a method that goes beyond checking by adjusting, if possible, a transformation for dependence satisfaction, without modifying its optimizing properties. This technique can be used to correct a transformation candidate as well as to replace preprocessing. The technique has been designed in the context of locality-improving transformations, but can be applied in many other cases. On the other hand, only transformations which can be represented as affine transformations in iteration space can be corrected in this way.

This paper is organized as follows. In section 2 we outline the background of this work. Section 3 deals with the transformations in the polyhedral model and focuses on their dependence constraints. Section 4 shows how it is possible to correct a transformation for legality. Section 5 compares our proposal to previous work in the field of locality enhancement, then section 6 concludes and discusses future work.

2. Background and Notations

A loop in an imperative language like C or FORTRAN can be represented using an n-entry column vector called its iteration vector: $\vec{x} = (i_1, i_2, \ldots, i_n)^T$, where $i_k$ is the k-th loop index and n is the innermost loop. The surrounding loops and conditionals of a statement define its iteration domain. The statement is executed once for each element of the iteration domain.

When loop bounds and conditionals only depend on surrounding loop counters, formal parameters and constants, the iteration domain can be specified by a set of linear inequalities defining a polyhedron [8]. The term polyhedron will be used in a broad sense to denote a convex set of points in a lattice (also called Z-polyhedron or lattice-polyhedron), i.e. a set of points in a Z vector space bounded by affine inequalities [24]. A maximal set of consecutive statements in a program with such polyhedral iteration domains is called a static control part (SCoP) [7]. Figure 1 illustrates the correspondence between static control and polyhedral domains. Each integral point of the polyhedron corresponds to an operation, i.e. an instance of the statement. The notation $S(\vec{x})$ refers to the instance of the statement S with iteration vector $\vec{x}$.

The execution of the operations follows lexicographic order. This means that in an n-dimensional polyhedron, the operation corresponding to the integral point defined by the coordinates $(a_1 \ldots a_n)$ is executed before those corresponding to the coordinates $(b_1 \ldots b_n)$ iff

\[ \exists i,\ 0 \le i < n,\quad (a_1 \ldots a_i) = (b_1 \ldots b_i) \ \wedge\ a_{i+1} < b_{i+1}. \]

We will use $\prec$ and $\preceq$ for the strict and non strict lexicographic order, respectively.
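The lexicographic execution order defined above can be sketched in a few lines of Python. This is only an illustration: the helper names `lex_before` and `lex_before_or_equal` and the 3 x 3 iteration domain are assumptions made for the example, not notation from the paper.

```python
from itertools import product
from typing import Sequence


def lex_before(a: Sequence[int], b: Sequence[int]) -> bool:
    """Return True when the operation at a runs strictly before the one at b."""
    if len(a) != len(b):
        raise ValueError("iteration vectors must have the same dimension")
    for a_k, b_k in zip(a, b):
        if a_k != b_k:
            # The first differing coordinate decides; the prefix before it is equal.
            return a_k < b_k
    return False  # identical vectors: neither runs before the other


def lex_before_or_equal(a: Sequence[int], b: Sequence[int]) -> bool:
    """Non strict variant of the same order."""
    return tuple(a) == tuple(b) or lex_before(a, b)


# Hypothetical 2-deep loop nest with both counters ranging over 1..3.
points = list(product(range(1, 4), repeat=2))
# Python compares tuples lexicographically, so sorting gives the execution order.
execution_order = sorted(points)
```

Sorting the integral points of a domain with the built-in tuple comparison therefore reproduces the sequential execution order of the corresponding loop nest.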

    do i=1, n
      do j=1, n
        if (i<=n+2-j)
    S:    B[i+j][2*i+1] ...

(a) surrounding control of S

\[ \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \\ -1 & -1 \end{bmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ n \\ -1 \\ n \\ n+2 \end{pmatrix} \ge \vec{0} \]

(b) iteration domain of S (the polyhedron bounded by i >= 1, i <= n, j >= 1, j <= n and j <= n+2-i)

Figure 1: Static control and corresponding iteration domain

Each statement may include one or several references to arrays (scalars are zero-dimensional arrays). When the subscript function $f(\vec{x})$ of a reference is affine, we can write it $f(\vec{x}) = F\vec{x} + \vec{a}$ where F is called the subscript matrix and $\vec{a}$ is a constant vector. For instance, the reference to the array B in figure 1(a) is $B[f(\vec{x})]$ with

\[ f\begin{pmatrix} i \\ j \end{pmatrix} = \begin{bmatrix} 1 & 1 \\ 2 & 0 \end{bmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \]

In this paper, matrices are always denoted by capital letters, vectors and functions in vector spaces are not. When an element is statement-specific, it is subscripted like $A_S$; the subscript may be omitted when it is clear from the context.

3. Affine Transformations

3.1. Formulation

The goal of a transformation is to modify the original execution order of the operations. A convenient way to express the new order is to give for each operation an execution date. However, defining all the execution dates separately would usually require very large scheduling systems. Thus optimizing compilers build schedules at the statement level by finding a function specifying an execution time for each instance of the corresponding statement. These functions are chosen affine for multiple reasons: this is the only case where we are able to decide exactly the transformation legality and where we know how to generate the target code. Thus, scheduling functions have the following shape:

\[ \theta_S(\vec{x}_S) = T_S \vec{x}_S + \vec{t}_S, \tag{1} \]

where $\vec{x}_S$ is the iteration vector, $T_S$ is a constant transformation matrix and $\vec{t}_S$ is a constant vector (possibly including structure parameters). It has been extensively shown that linear transformations can express most of the useful transformations. In particular, loop transformations (such as loop reversal, permutation or skewing) can be modeled as a simple particular case called unimodular transformations (the $T_S$ matrix has to be square and has determinant ±1) [5,25]. Complex transformations such as tiling [26] can be achieved using linear transformations as well [28].

These transformations modify the source polyhedra into target polyhedra containing the same points, but with a new lexicographic order. Considering an original polyhedron defined by the system of affine constraints $A\vec{x} + \vec{c} \ge \vec{0}$ and the transformation function $\theta$ leading to the target index $\vec{y} = T\vec{x}$, we deduce that the transformed polyhedron can be defined by $(AT^{-1})\vec{y} + \vec{c} \ge \vec{0}$ (there exist more convenient ways to describe the target polyhedron, as discussed in [6]). For instance, let us consider the polyhedron in figure 2(a) and the transformation function $\theta$ of figure 2(b). The corresponding transformation is a well-known iteration space skewing and the resulting polyhedron is shown in figure 2(c).

[Figure 2: A skewing transformation. Panels: (a) original polyhedron, $A\vec{x} + \vec{c} \ge \vec{0}$; (b) transformation function, $\vec{y} = T\vec{x}$; (c) target polyhedron, $(AT^{-1})\vec{y} + \vec{c} \ge \vec{0}$.]

3.2. Legality

In general, applying an arbitrary transformation to a program will change its semantics. Two operations are said to be in dependence if they share a variable (memory cell) and at least one of the operations modifies it. This definition was suggested by Bernstein [9] and is the most widely used one in works about program transformation. The absence of such a dependence is a sufficient condition for parallelism, but is by no means necessary, as the well-known case of reductions shows. Many tests have been designed for dependence checking. Most of these are based on sufficient conditions for independence. They give an approximate, conservative answer. The best known examples are the GCD-test [] and the Banerjee test [4].
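The GCD-test is only named here, not spelled out. As a reminder, a minimal sketch of its textbook form is given below, under the assumption that the two references yield a single linear equation a_1 x_1 + ... + a_n x_n = c in the loop counters: an integer solution exists iff the gcd of the coefficients divides c. The function name `gcd_test_may_depend` is hypothetical. The test ignores loop bounds, which is why its answer is conservative.

```python
from functools import reduce
from math import gcd


def gcd_test_may_depend(coeffs, constant):
    """Conservative check of a1*x1 + ... + an*xn = constant over the integers.

    Returns False only when no integer solution exists, i.e. when the two
    references can never touch the same cell. Loop bounds are ignored, so
    True means "a dependence cannot be ruled out", not "there is one".
    """
    g = reduce(gcd, (abs(c) for c in coeffs), 0)
    if g == 0:
        # All coefficients are zero: the equation reduces to 0 = constant.
        return constant == 0
    return constant % g == 0


# Hypothetical pair A[2*i] (written) and A[2*i' + 1] (read): 2*i - 2*i' = 1.
independent = not gcd_test_may_depend([2, -2], 1)
# Hypothetical pair A[i + 1] (written) and A[i'] (read): i - i' = -1.
maybe_dependent = gcd_test_may_depend([1, -1], -1)
```

In the first pair the gcd is 2 and does not divide 1, so independence is proved; in the second pair the gcd is 1, so the test cannot exclude a dependence.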
On the other hand, one can use classical algorithms from Linear Integer Programming to get exact answers, as in the Omega-test [2] and the Simplex-Gomory test [2]. In the same way many dependence representations are possible, from the simplest ones such as dependence levels [2] to the most precise such as dependence polyhedra [6]. We chose in this paper to use the most precise representation of dependences: the dependence polyhedra. However, many authors have noticed that approximate dependences are special cases of dependence polyhedra. Hence, our method applies whatever representation is chosen, provided the approximation is conservative. In section 3.2.1, we recall how dependences in a SCoP can be expressed exactly using linear (in)equalities. Then we show in section 3.2.2 how to build the legal transformation space.

3.2.1. Dependence Graph

A convenient way to represent the scheduling constraints is the dependence graph. In this directed graph, each program statement is represented using a unique vertex, and the existing dependence relations are represented using edges. Each vertex is labelled with the iteration domain of the corresponding statement and the edges with the dependence polyhedron describing the dependence. The dependence relation can be defined in the following way:

Definition 1. A statement R depends on a statement S (written $S\delta R$) if there exists an operation $S(\vec{x}_1)$, an operation $R(\vec{x}_2)$ and a memory location m such that:

1. $S(\vec{x}_1)$ and $R(\vec{x}_2)$ refer to the same memory location m, and at least one of them writes to that location;
2. $\vec{x}_1$ and $\vec{x}_2$ respectively belong to the iteration domain of S and R;
3. in the original sequential order, $S(\vec{x}_1)$ is executed before $R(\vec{x}_2)$.

From this definition follows the description of the dependence polyhedron by affine (in)equalities. The constraint systems have the following components:

1. Same memory location: assuming that m is an array location, this constraint is the equality of the subscript functions of a pair of references to the same array: $F_S\vec{x}_S + \vec{a}_S = F_R\vec{x}_R + \vec{a}_R$.
2. Iteration domains: both S and R iteration domains can be described using affine inequalities, respectively $A_S\vec{x}_S + \vec{c}_S \ge \vec{0}$ and $A_R\vec{x}_R + \vec{c}_R \ge \vec{0}$.
3. Precedence order: this constraint can be separated into a disjunction of as many parts as there are common loops to both S and R. Each case corresponds to a common loop depth and is called a dependence level. For each dependence level l, the precedence constraints are the equality of the loop index variables at depth less than l: $x_{R,i} = x_{S,i}$ for $i < l$, and $x_{R,l} > x_{S,l}$ if l is less than the common nesting level. Otherwise, there are no additional constraints and the dependence only exists if S is textually before R. Such constraints can be written using linear inequalities: $P_S\vec{x}_S - P_R\vec{x}_R + \vec{b} \ge \vec{0}$.
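The three components above can be illustrated by enumerating, on a tiny domain, the pairs of operations they select. The sketch below is a brute-force stand-in for the dependence polyhedron, not the integer-programming formulation used in the paper; the function `dependence_pairs`, the 3 x 3 domain and the references A[j+1] and A[j] are assumptions chosen for the example. It only covers levels that correspond to a common loop, not the purely textual case.

```python
from itertools import product


def dependence_pairs(domain_s, domain_r, f_s, f_r, level):
    """Enumerate (x_S, x_R) pairs satisfying the three constraint groups.

    domain_s, domain_r: iterables of iteration vectors (tuples of ints).
    f_s, f_r: subscript functions of the two references, one of them a write.
    level: 1-based dependence level, at most the common nesting depth.
    """
    pairs = []
    for x_s, x_r in product(domain_s, domain_r):       # iteration domains
        same_cell = f_s(x_s) == f_r(x_r)               # same memory location
        prefix_equal = x_s[:level - 1] == x_r[:level - 1]
        strictly_later = x_r[level - 1] > x_s[level - 1]
        if same_cell and prefix_equal and strictly_later:  # precedence order
            pairs.append((x_s, x_r))
    return pairs


# Hypothetical nest: i and j both in 1..3, S writes A[j+1], R reads A[j].
domain = list(product(range(1, 4), repeat=2))
write_ref = lambda x: (x[1] + 1,)
read_ref = lambda x: (x[1],)

level_1 = dependence_pairs(domain, domain, write_ref, read_ref, level=1)
level_2 = dependence_pairs(domain, domain, write_ref, read_ref, level=2)
```

A non-empty list at a given level plays the role of a non-empty dependence polyhedron at that level.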

Thus, the dependence polyhedron for $S\delta R$ at a given level l and for a given pair of references p can be described using the following system of (in)equalities, where the first block row is an equality and the remaining rows are inequalities:

\[
D_{S\delta R,l,p}:\quad
D \begin{pmatrix} \vec{x}_S \\ \vec{x}_R \end{pmatrix} + \vec{d} =
\left[\begin{array}{cc} F_S & -F_R \\ \hline A_S & 0 \\ 0 & A_R \\ P_S & -P_R \end{array}\right]
\begin{pmatrix} \vec{x}_S \\ \vec{x}_R \end{pmatrix} +
\begin{pmatrix} \vec{a}_S - \vec{a}_R \\ \vec{c}_S \\ \vec{c}_R \\ \vec{b} \end{pmatrix}
\begin{array}{l} = \vec{0} \\ \ge \vec{0} \end{array}
\tag{2}
\]

There is a dependence $S\delta R$ if there exists an integral point inside $D_{S\delta R,l,p}$. This can be easily checked with some linear integer programming tool like PipLib^a []. If this polyhedron is not empty, there is an edge in the dependence graph from the vertex corresponding to S up to the one corresponding to R, labelled with $D_{S\delta R,l,p}$. For the sake of simplicity we will ignore subscripts l and p and refer in the following to $D_{S\delta R}$ as the only dependence polyhedron describing $S\delta R$.

3.2.2. Legal Transformation Space

Considering the transformations as scheduling functions, the time interval in the target program between the executions of two operations $R(\vec{x}_R)$ and $S(\vec{x}_S)$ is

\[ \Delta_{R,S}\begin{pmatrix} \vec{x}_S \\ \vec{x}_R \end{pmatrix} = \theta_R(\vec{x}_R) - \theta_S(\vec{x}_S). \tag{3} \]

If there exists a dependence $S\delta R$, i.e. if $D_{S\delta R}$ is not empty, then $\Delta_{R,S}$ must be lexicopositive in $D_{S\delta R}$ (intuitively, the time interval between two operations $R(\vec{x}_R)$ and $S(\vec{x}_S)$ such that $R(\vec{x}_R)$ depends on $S(\vec{x}_S)$ must be at least $(0, \ldots, 0, 1)^T$, the smallest time interval: this guarantees that the operation $R(\vec{x}_R)$ is executed after $S(\vec{x}_S)$ in the target program). This condition represents as many constraints as there are points in $D_{S\delta R}$. Fortunately, all these constraints can be compacted in a small set of affine constraints with the help of Farkas Lemma [].

Lemma 1 (Affine form of Farkas Lemma [24]). Let D be a nonempty polyhedron defined by the inequalities $A\vec{x} + \vec{b} \ge \vec{0}$. Then any affine function $f(\vec{x})$ is nonnegative everywhere in D iff it is a positive affine combination:

\[ f(\vec{x}) = \lambda_0 + \lambda^T (A\vec{x} + \vec{b}), \quad \text{with } \lambda_0 \ge 0 \text{ and } \lambda^T \ge \vec{0}. \]

$\lambda_0$ and $\lambda^T$ are called Farkas multipliers.

$\Delta_{R,S}$ is a vector. For it to be lexicopositive, some of its components must be constrained to be either non negative or strictly positive. Let us apply Farkas Lemma to one of the constrained components. We can find a non-negative scalar $\lambda_0$ and a non-negative vector $\lambda^T$ such that:

\[ T_{R,i}\vec{x}_R + t_{R,i} - (T_{S,i}\vec{x}_S + t_{S,i}) - \delta = \lambda_0 + \lambda^T \left( D \begin{pmatrix} \vec{x}_S \\ \vec{x}_R \end{pmatrix} + \vec{d} \right) \tag{4} \]

In this formula, $T_{R,i}$ and $T_{S,i}$ are corresponding rows in the $T_R$ and $T_S$ matrices, and $\delta$ is zero or one according to the position of the rows.

^a PipLib is freely available at http://www.prism.uvsq.fr/~cedb
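The lexicopositivity condition on the time interval of formula (3) can be checked by brute force on a small instance. The sketch below is not the Farkas-based construction described above: it simply enumerates dependent pairs of operations and tests each time interval. The stencil-like statement (one write to a[j+1], reads of a[j], a[j+1], a[j+2]), the domain size and all function names are assumptions made for the illustration.

```python
from itertools import product


def lexicopositive(v):
    """True when the first non-zero component of v is positive."""
    for c in v:
        if c != 0:
            return c > 0
    return False


def dependence_pairs(domain, writes, reads):
    """Pairs (x_s, x_r), x_s executed first, touching a common cell with a write."""
    pairs = []
    for x_s, x_r in product(domain, repeat=2):
        if not x_s < x_r:  # original order is lexicographic on tuples
            continue
        w_s, r_s = {w(x_s) for w in writes}, {r(x_s) for r in reads}
        w_r, r_r = {w(x_r) for w in writes}, {r(x_r) for r in reads}
        if w_s & (w_r | r_r) or r_s & w_r:
            pairs.append((x_s, x_r))
    return pairs


def is_legal(theta, pairs):
    """Check that theta(x_r) - theta(x_s) is lexicopositive for every pair."""
    return all(
        lexicopositive(tuple(r - s for r, s in zip(theta(x_r), theta(x_s))))
        for x_s, x_r in pairs
    )


# Hypothetical statement inside loops i, j in 1..4.
DOMAIN = list(product(range(1, 5), repeat=2))
WRITES = [lambda x: x[1] + 1]
READS = [lambda x: x[1], lambda x: x[1] + 1, lambda x: x[1] + 2]
PAIRS = dependence_pairs(DOMAIN, WRITES, READS)

identity = lambda x: (x[0], x[1])
interchange = lambda x: (x[1], x[0])
skew_then_interchange = lambda x: (x[0] + x[1], x[0])
```

On this instance the identity schedule and the skewed schedule pass the check while the plain interchange fails it, for example on the pair ((1, 2), (2, 1)) whose time interval becomes (-1, 1). Enumeration only works for fixed, small bounds, which is precisely what the Farkas formulation avoids.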

This formula can be split into as many equalities as there are independent variables ($\vec{x}_S$ and $\vec{x}_R$ components and parameters) by equating their coefficients in both sides of the formula. The Farkas multipliers can be eliminated by using the Fourier-Motzkin projection algorithm [24]. The result is a system of affine constraints on the coefficients of the transformation (the elements of $T_{R,i}$ and $T_{S,i}$). The important point is that this system is the same for all rows of the scheduling matrices and depends only on the dependence to be satisfied. Furthermore, it depends linearly on the value of $\delta$. These systems completely characterize the legal transformations of a program, and can be computed once and for all as soon as the dependences are known.

4. Correcting Transformations

Both optimizing compilers and programmers have a tendency to think of performance first and to check legality afterward. The basic framework is first to find the best transformation (e.g. in the case of data locality improvement, which references carry the most reuse and necessitate new access patterns, which rank constraints should be respected by the corresponding transformation functions, etc.), then to check if a candidate transformation is legal or not^b. If the check fails, build and test another candidate, and so on. The major advantage of such a framework is to focus firstly on the most interesting properties, and the main drawback is to forsake these properties if a legal transformation is not directly found after a simple check of a candidate solution. In this section we will show how it is possible to correct a candidate transformation for dependences, firstly when it can be described using explicit constraints, as discussed in section 4.1. Then in section 4.2 we study the special case of data locality improvement where the transformation properties are hidden.

4.1. Transformations With Explicit Properties

Experts or optimizing compilers have a wide choice of optimizing transformations for a given program. Each transformation has a more or less precise cost model which helps in deciding whether to apply the transformation or not. In the polyhedral framework, many transformations are related to well chosen scheduling functions [5,,]. For instance, generalized loop interchange is associated to schedules whose matrix is a permutation matrix [5]. Trying to use these transformations as they are may result in a negative dependence test. Let us consider the code in Figure 3. An expert or an optimizing compiler may decide that moving the outer loop innermost would result in better locality. But because of complex dependences, using directly the loop interchange transformation $\theta_S$, whose matrix is the permutation matrix $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, is not legal. This will usually lead to rejecting the transformation.

^b This can be done easily by instantiating the transformation functions in the space of all affine transformations as defined in section 3.2, then checking whether it belongs to the legal subset using any linear algebra tool.

    do i=1, n
      do j=1, n
    S:  a(j+1) = 1/3 * (a(j)+a(j+1)+a(j+2))

Dependence graph: a single vertex S, labelled $D_S$, with self-edges labelled
    $D_{S\delta S,1,<a(j),a(j'+1)>}$
    $D_{S\delta S,1,<a(j+1),a(j')>}$
    $D_{S\delta S,1,<a(j+1),a(j'+1)>}$
    $D_{S\delta S,1,<a(j+1),a(j'+2)>}$
    $D_{S\delta S,1,<a(j+2),a(j'+1)>}$
    $D_{S\delta S,2,<a(j+1),a(j')>}$
    $D_{S\delta S,2,<a(j+2),a(j'+1)>}$

Figure 3: Original Hyperbolic-PDE program and dependence graph

The polyhedral model allows more flexibility when defining such transformations. Moreover, one can work with an incompletely specified transformation and use the legality constraints as a way of solving for the missing coefficients. The method consists in stating the constraints the transformation has to satisfy, then solving these constraints and the legality constraints (4), using a linear algebra tool such as PipLib. If the system does not have a solution, we conclude that there is no legal instance of the proposed transformation. For example, moving the outer loop of the code in Figure 3 innermost means that we are looking for a transformation function

\[ \theta_S\begin{pmatrix} i \\ j \end{pmatrix} = \begin{bmatrix} T_{1,1} & T_{1,2} \\ T_{2,1} & T_{2,2} \end{bmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} \]

with as only constraints $T_{2,1} = 1$ and $T_{2,2} = 0$. By solving the system we find the solution

\[ \theta_S\begin{pmatrix} i \\ j \end{pmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{pmatrix} i \\ j \end{pmatrix}. \]

Expressed using classical transformation techniques, it is a combination of loop skewing and loop interchange. It leads to the target program in Figure 4 and as expected to a better cache behavior (on a 86 GHz system with 28KB L cache memory and n the number of cache misses of the original program is 68M but 4M for the target one).

    do ii=2, 2*n
      do jj=max(ii-n,1), min(ii-1,n)
        j=ii-jj ; i=jj ;
    S:  a(j+1) = 1/3 * (a(j)+a(j+1)+a(j+2))

Figure 4: Final Hyperbolic-PDE program

4.2. Transformations With Implicit Properties: Data Locality

Caches are used in most computer systems to compensate for the mismatch between processor and memory performance (at the time of writing, factors of to are commonplace). Caches work by exploiting locality, i.e. the fact that accesses to each memory cell and its close neighbors have a tendency to cluster in the program code. While this is found to work very well for ordinary programs, it fails for compute-bound codes where large datasets are accessed according to very regular patterns. The basic framework for increasing cache hit rates is to move references to a given memory cell (or cache line) to neighboring iterations of some innermost loop. This reduces the elapsed time between two accesses and hence decreases the probability that the cell has been evicted from the cache. Such a transformation usually changes the execution order of the program, hence it must be checked for legality before being applied.

Another way of expressing the same intuition is to assign an execution date (a schedule) to each operation, and to take care that the dates of accesses to the same memory cell are almost equal. This is usually obtained by requiring that the outer components of the schedule are equal. The method continues by applying a completion procedure to achieve an invertible transformation function (see [27] for references).

For instance, let us consider self-temporal locality and a reference $B[f(\vec{x})]$ to an array B with the affine subscript function $f(\vec{x}) = F\vec{x} + \vec{a}$. Two instances of this reference, $B[f(\vec{x}_1)]$ and $B[f(\vec{x}_2)]$, refer to the same memory location iff $f(\vec{x}_1) = f(\vec{x}_2)$, that is when $F\vec{x}_1 + \vec{a} = F\vec{x}_2 + \vec{a}$, i.e. iff $F\vec{x}_r = \vec{0}$ with $\vec{x}_r = \vec{x}_1 - \vec{x}_2$. Thus there is self-temporal reuse when $\vec{x}_r \in \ker F$. The basis vectors of $\ker F$ give the reuse directions for the reference $B[f(\vec{x})]$; if $\ker F$ is trivial, there is no self-temporal reuse for the corresponding reference. Reuse can be exploited if the transformed iteration order follows one of the reuse directions. Then we have to find a vector orthogonal to the chosen reuse direction to be the first part of the transformation matrix T.
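The kernel condition for self-temporal reuse is easy to test numerically. The following sketch checks whether a given vector is a reuse direction of a subscript matrix; the reference C[i+j], its matrix F = [1 1] and the helper names are hypothetical, chosen only to illustrate the condition F x_r = 0.

```python
def mat_vec(F, x):
    """Product of an integer matrix (list of rows) with a vector."""
    return tuple(sum(f * v for f, v in zip(row, x)) for row in F)


def is_reuse_direction(F, r):
    """True when r is a non-zero vector of ker F."""
    return any(c != 0 for c in r) and all(v == 0 for v in mat_vec(F, r))


def subscript(F, a, x):
    """Affine subscript function f(x) = F x + a."""
    return tuple(v + c for v, c in zip(mat_vec(F, x), a))


# Hypothetical reference C[i+j] in a 2-deep nest over (i, j).
F = [[1, 1]]
A_CONST = (0,)
r = (1, -1)                            # candidate reuse direction
x1 = (2, 3)
x2 = (x1[0] - r[0], x1[1] - r[1])      # so that x_r = x1 - x2 = r
```

With these values `is_reuse_direction(F, r)` holds and both iterations x1 and x2 access the same cell, whereas (1, 0) is not a reuse direction for this reference.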
If this partial transformation does not violate dependences, we have many choices for the completion procedure in order for the transformation function to be one-to-one, either by considering artificial dependences [2,5] or not [6]. As an example, consider the following pseudo-code:

    do i=1, n
      do j=1, n
    S:  ... B[j] ...

The subscript function of the reference B[j] is $f\begin{pmatrix} i \\ j \end{pmatrix} = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$, and the kernel of the subscript matrix is then $\ker F = \mathrm{span}\{(1, 0)\}$. Thus there is reuse generated by the reference B[j], and we can exploit it thanks to a transformation matrix built with a vector orthogonal to the reuse direction, e.g. $[0\ 1]$, and its completion to a unimodular transformation matrix as described in [5]: $T = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$. The transformation function would be $\theta\begin{pmatrix} i \\ j \end{pmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{pmatrix} i \\ j \end{pmatrix}$, i.e. a loop interchange (the reader may care to verify that this solution does exploit the reuse of the reference B[j]). It is easy to generalize this method to several references by considering not only a reuse direction vector, but a reuse direction space (built with one basis vector per reference).

It appears that there are a lot of degrees of freedom when looking for a transformation improving self-temporal locality, since it is possible to choose the reuse direction space, the completion method and the constant vector of the transformation function. Let us consider self-temporal locality and a transformation candidate before completion $\theta_{Sc}(\vec{x}_S) = T_{Sc}\vec{x}_S$. This function has the property that, when it is modified in the following way:

\[ \theta_S(\vec{x}_S) = C_S T_{Sc} \vec{x}_S + \vec{t}_S, \tag{5} \]

where $C_S$ is an invertible matrix and $\vec{t}_S$ is a constant vector, the locality properties are left unmodified for each time step. Intuitively, if $\theta_{Sc}$ gives the same execution date for $\vec{x}_1$ and $\vec{x}_2$, then the transformed function $\theta_S$ does so as well. In the same way, if the dates are different with $\theta_{Sc}$, then the transformed function $\theta_S$ returns different dates. But while the values of $C_S$ and $\vec{t}_S$ do not change the self-temporal locality properties^c, they can change the transformation from an illegal to a legal one. It is clearly not possible to check all these transformations for legality. In the following we study another way: we show how to find, when possible, the unknown components $C_S T_{Sc}$ and $\vec{t}_S$ of formula (5) in order to construct a legal transformation. Correcting formulae similar to (5) and having the same type of degrees of freedom can be used to achieve every type of locality (self or group, temporal or spatial) [25,8].

The challenge is, considering the candidate transformation matrices $T_{Sc}$, to find the corrected matrices $C_S T_{Sc}$ and the constant vectors $\vec{t}_S$ in order for the transformation system to be legal for dependences. This problem can be solved in an iterative way, each dimension being considered as a stand-alone transformation. Each row of $C_S T_{Sc}$ is a linear combination of the rows of $T_{Sc}$.
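The claim that an invertible $C_S$ leaves the "same date / different date" relation untouched can be checked exhaustively on a small example. The sketch below does so under stated assumptions: the 2 x 3 candidate matrix, the two matrices C and the shift vector are hypothetical values picked for the illustration, and the check is a brute-force enumeration rather than a proof.

```python
from itertools import product


def mat_vec(M, x):
    """Product of an integer matrix (list of rows) with a vector."""
    return tuple(sum(m * v for m, v in zip(row, x)) for row in M)


def schedule(C, T_c, t, x):
    """theta(x) = C * T_c * x + t, the shape of formula (5)."""
    return tuple(d + s for d, s in zip(mat_vec(C, mat_vec(T_c, x)), t))


def preserves_equal_dates(C, T_c, t, domain):
    """True when two points share a date under T_c iff they do under theta."""
    for x1, x2 in product(domain, repeat=2):
        same_before = mat_vec(T_c, x1) == mat_vec(T_c, x2)
        same_after = schedule(C, T_c, t, x1) == schedule(C, T_c, t, x2)
        if same_before != same_after:
            return False
    return True


# Hypothetical candidate before completion on a 3-deep nest: dates (i, j + k).
T_C = [[1, 0, 0], [0, 1, 1]]
DOMAIN = list(product(range(3), repeat=3))
T_SHIFT = (0, 5)
C_INVERTIBLE = [[1, 1], [0, 1]]   # determinant 1
C_SINGULAR = [[1, 1], [1, 1]]     # determinant 0, shown for contrast
```

With the invertible matrix the property holds on the whole domain; with the singular one it fails, since for instance the points (0, 1, 0) and (1, 0, 0) have different candidate dates but receive the same transformed date.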
Thus, the unknown n the th algorthm teraton are, for each statement, the lnear combnaton coeffcents buldng the th row of C S T Sc from T Sc and the constant factor of the correspondng t S entry. After each teraton, we have to update the dependence graph snce, by a property of lexcographc order, there s no need to consder the already satsfed dependences. Thus, to fnd a soluton s easer as the algorthm terates. The algorthm s shown n fgure 5. Let us llustrate how the algorthm works usng the example n fgure 6. Suppose that an optmzng compler would lke to explot the data reuse generated by the references to the array A of the program n fgure 6(a) and that t suggests the transformaton canddates n fgure 6(b). As shown by the graph descrbng the resultng operaton executon order, where each arrow represents a dependence relaton and each backward arrow s a dependence volaton, the transformaton system s not c Ths amount to notcng that the amount of localty n the transformed program s lnked to the rank of T Sc. For a more formal dscusson, see [8].

Adustng a Program Transformaton for Legalty Correcton Algorthm: adust a transformaton system to respect dependences Input: a dependence graph DG, the transformaton canddates θ Sc ( x S ) T Sc x S. Output: the legal transformatons θ S ( x S ) C S T Sc x S + t S.. for dmenson to maxmum dmenson of T Sc (a) buld the legal transformaton space wth: for each edge n DG, the constrants of (4) for the th row of T Rc and T Sc the constrants equatng the th row entres of each C S T Sc wth a lnear combnaton of T Sc entres whose coeffcents are unknown (b) for each statement, remove from the soluton space the trval soluton where the lnear combnaton coeffcent of the th row of T Sc s null (c) f the soluton space s empty, return, else. pck the soluton gvng for each statement the mnmum values for the entres of the th row of C S T Sc and the th element of t S. update DG: for each edge n DG, add to the dependence polyhedron the constrant equatng the th dmenson of C S T Sc x S + t S of the statements labellng the source and destnaton vertces (ths may empty the polyhedron for ntegral solutons). f every dependence polyhedra n DG are empty, goto 2 v. for each statement, update the canddate transformaton T Sc : replace a row such that the correspondng lnear combnaton coeffcent s not null wth the th row replace the th row wth the th row of C S T Sc 2. return the transformaton functons θ S ( x S ) C S T Sc x S + t S. Fgure 5: Algorthm to correct the transformaton functons legal. The correcton algorthm modfes successvely each transformaton dmenson. Each stand-alone transformaton splts up the operatons nto sets such that there are no backward arrows between sets. The algorthm stops when there are no more backward arrows or when every dmenson has been corrected. Then any polyhedral code generator, lke CLooG d [6], can generate the target code. Choosng transformaton coeffcents as small as possble (step (c)) s a heurstc helpng code generators to avod control overhead. 
The correctness of the algorithm comes from two properties: (1) the target transformations are legal, (2) the $C_S$ matrices are invertible. Legality is achieved because each transformation part is chosen in the legal transformation space (step 1a). The second property follows from the updating policy (step 1(c)iv): at the start, the $C_S$ matrices are identities. During each iteration, we exchange their rows, multiply some rows by non-null constants (as guaranteed by step 1b) and add to these rows a linear combination of the other rows. None of these operations modifies the invertibility property.

Footnote d: CLooG is freely available at http://www.prism.uvsq.fr/~cedb

5. Related Work

Since they cannot deal with (complex) dependences, the earliest works on locality improvement discuss enabling transformations to modify the program in such a way that the proposed method can apply. Wolf and Lam [25] proposed in their seminal data locality optimizing algorithm to use skewing and reversal to enable tiling, as in previous works on automatic parallelization. McKinley et al. [21] proposed a technique based on a detailed cost model that drives the use of fusion and distribution, mainly to enable loop permutation. Such methods are limited by the set of directives they use (like fuse or skew) and because they have to apply them in a definite order. We claim that proposing (and correcting) scheduling functions is more complete and has better compositionality properties.

A significant step on preprocessing techniques to produce fully permutable loop nests has been achieved by Ahmed et al. [1]. They use Farkas' Lemma to find a valid code sinking-like transformation if it exists. But this transformation is still independent from the optimization itself and it is limited to producing a fully permutable loop nest. The method proposed in this paper may find solutions even when it is not possible to extract such a loop nest.

The method of Griebl et al. [14] is quite different. Their aim is to minimize the amount of communication in a distributed program, which is indeed a kind of locality optimization. They first take care of dependences by finding a legal space-time transformation (i.e. a schedule and a placement) and then tile in space-time to achieve the optimal granularity. Adapting these ideas to cache optimization seems by no means obvious, although it is an interesting subject for further research.

Reasoning directly on scheduling functions, Li and Pingali proposed a completion algorithm to build a non-unimodular transformation function from a partial matrix, such that starting from a legal transformation, the completed transformation stays legal for dependences [20]. In the same spirit, Griebl et al. [15] extended an arbitrary matrix describing a legal transformation to a square invertible matrix. In contrast, we show in this paper how to find the valid functions before completion.

6. Conclusion and Future Work

In this paper we presented a general method correcting a program transformation for legality with no consequence on its properties. It can be applied either

Figure 6: Iterative transformation correction principle (n = 2 for graphs)
[The transformation matrices and the operation order graphs of this figure did not survive text extraction; only the panel contents below are recoverable.]
(a) Original program: three nested do loops running up to n, with the k loop innermost, enclosing statement S1, which updates an element of A in column k from the corresponding element of B and the element of A in column k-1; statement S2 then computes c from A(n,n).
(b) Transformation function candidates $\theta_{S1_c}$ and $\theta_{S2_c}$, with the resulting operation order graph.
(c) First correction iteration.
(d) Second and last correction iteration.
(e) Target program: the same statements S1 and S2, with the k loop moved to the middle position of the loop nest.

when the properties can be explicitly expressed as affine constraints, or when they are carried implicitly as data locality properties. It has been implemented in the Chunky prototype [8], advantageously replacing usual enabling preprocessing techniques and saving a significant number of interesting transformations from being ignored. It could be used in combination with a wide range of existing optimizing techniques, in particular data locality improvement methods, for the single processor case as well as for parallel systems using space-time mappings [19]. Further implementation work is necessary to handle real-life benchmarks in our prototype and to provide full statistics on corrected transformations. Moreover, the question of scalability is left open since, for several tens of deeply nested statements, the number of unknowns in the constraint systems can become embarrassingly large. Splitting up the problem according to the dependence graph is a solution under investigation.

Acknowledgements

The authors would like to thank the CC International Conference on Compiler Construction anonymous reviewers for having inspired this paper by manifesting their interest in this part of our work. We also wish to thank the Euro-Par anonymous reviewers for their help in improving the quality of the paper.

References

[1] N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. In SC'2000 High Performance Networking and Computing, Dallas, November 2000.
[2] J. Allen and K. Kennedy. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.
[3] U. Banerjee. Data dependence in ordinary programs. Master's thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, November 1976.
[4] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic, 1988.
[5] U. Banerjee. Unimodular transformations of double loops. In Advances in Languages and Compilers for Parallel Processing, pages 192-219, Irvine, August 1990.
[6] C. Bastoul. Efficient code generation for automatic parallelization and optimization. In ISPDC'03 IEEE International Symposium on Parallel and Distributed Computing, pages 23-30, Ljubljana, October 2003.
[7] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam. Putting polyhedral transformations to work. In LCPC'16 International Workshop on Languages and Compilers for Parallel Computers, LNCS 2958, pages 209-225, College Station, October 2003.
[8] C. Bastoul and P. Feautrier. Improving data locality by chunking. In CC'12 International Conference on Compiler Construction, LNCS 2622, pages 320-335, Warsaw, April 2003.
[9] A. Bernstein. Analysis of programs for parallel processing. IEEE Transactions on Electronic Computers, 15(5):757-763, October 1966.
[10] A. Cohen, S. Girbal, and O. Temam. A polyhedral approach to ease the composition of program transformations. In Euro-Par'10 International Euro-Par Conference, Pisa, August 2004.
[11] P. Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22(3):243-268, 1988.
[12] P. Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23-53, February 1991.
[13] P. Feautrier. Some efficient solutions to the affine scheduling problem: one dimensional time. International Journal of Parallel Programming, 21(5):313-348, October 1992.
[14] M. Griebl, P. Faber, and C. Lengauer. Space-time mapping and tiling: a helpful combination. Concurrency and Computation: Practice and Experience, 16(3):221-246, March 2004.
[15] M. Griebl, C. Lengauer, and S. Wetzel. Code generation in the polytope model. In PACT'98 International Conference on Parallel Architectures and Compilation Techniques, pages 106-111, 1998.
[16] F. Irigoin and R. Triolet. Computing dependence direction vectors and dependence cones with linear systems. Technical Report ENSMP-CAI-87-E94, Ecole des Mines de Paris, Fontainebleau (France), 1987.
[17] I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In ACM SIGPLAN'97 Conference on Programming Language Design and Implementation, pages 346-357, Las Vegas, June 1997.
[18] D. Kuck. The Structure of Computers and Computations. John Wiley & Sons, Inc., 1978.
[19] C. Lengauer. Loop parallelization in the polytope model. In International Conference on Concurrency Theory, LNCS 715, pages 398-416, Hildesheim, August 1993.
[20] W. Li and K. Pingali. A singular loop transformation framework based on non-singular matrices. International Journal of Parallel Programming, 22(2):183-205, April 1994.
[21] K. McKinley, S. Carr, and C. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424-453, July 1996.
[22] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.
[23] W. Pugh. The omega test: a fast and practical integer programming algorithm for dependence analysis. In Proceedings of the third ACM/IEEE conference on Supercomputing, pages 4-13, Albuquerque, August 1991.
[24] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1986.
[25] M. Wolf and M. Lam. A data locality optimizing algorithm. In ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 30-44, New York, June 1991.
[26] M. Wolfe. Iteration space tiling for memory hierarchies. In 3rd SIAM Conference on Parallel Processing for Scientific Computing, pages 357-361, December 1987.
[27] M. Wolfe. High performance compilers for parallel computing. Addison-Wesley Publishing Company, 1995.
[28] J. Xue. On tiling as a loop transformation. Parallel Processing Letters, 7(4):409-424, 1997.