Interconnect Optimization for High-Level Synthesis of SSA Form Programs

Philip Brisk, Ajay K. Verma, Paolo Ienne
Processor Architecture Laboratory
Swiss Federal Institute of Technology (EPFL)
Lausanne, Switzerland
{philip.brisk, ajaykumar.verma, paolo.ienne}@epfl.ch

Majid Sarrafzadeh
Department of Computer Science
University of California, Los Angeles (UCLA)
Los Angeles, CA 90095
majid@cs.ucla.edu

ABSTRACT
Register allocation for programs in Static Single Assignment (SSA) Form has a polynomial-time solution because interference graphs for procedures in this representation are chordal graphs. This paper explores a complementary problem which is NP-Complete: the assignment of registers to variables in order to minimize interconnect costs. In particular, we attempt to minimize the size of the multiplexers placed on the input to each register. This is particularly important for FPGA-based design flows, where multiplexers have a high cost in terms of both area and delay. An efficient greedy heuristic for color assignment is presented and compared against simulated annealing.

Categories and Subject Descriptors
B.5.2 [Hardware]: Design Aids: automatic synthesis, optimization.

General Terms
Algorithms, Performance, Design.

Keywords
Register Allocation and Assignment, Connectivity Binding, Interconnect Allocation, Static Single Assignment (SSA) Form.

1. INTRODUCTION
In high-level synthesis, register allocation determines how many registers should be allocated to the design. If $G = (V, E)$ is the interference graph of the program, and $\chi(G)$ is the chromatic number, then at least $\chi(G)$ registers must be allocated. Determining $\chi(G)$ is NP-Complete for general graphs; however, interference graphs for Static Single Assignment (SSA) form programs belong to the class of chordal graphs [2, 4, 9], for which $\chi(G)$ can be computed in $O(|V| + |E|)$ time [8].

For graph $G$, there may be many different $\chi(G)$-colorings. Given $G$ and $\chi(G)$, register assignment is the problem of determining the best $\chi(G)$-coloring of $G$. Given a color assignment, the allocation of interconnect resources (wires and multiplexers) is then derived deterministically.
The goal of register assignment is to minimize the overall cost of the allocated interconnect resources. This problem, connectivity binding, has been proven NP-Complete for applications whose interference graphs are interval graphs [6]; since interval graphs are a subset of the chordal graphs, the problem remains NP-Complete for applications that are synthesized directly from SSA Form.

To the best of our knowledge, this is the first paper to study register assignment for SSA-form programs in the context of synthesis. We have developed two heuristics for this problem. The first is a greedy heuristic that modifies Gavril's optimal algorithm for chordal coloring [8], which does not optimize interconnect resources. The second approach uses simulated annealing to produce locally optimal solutions. The annealing heuristic that we have developed is tailored specifically to chordal graphs.

The results of these heuristics are compared on a set of large chordal interference graphs taken from real-world benchmarks. The results show that traditional methods for chordal coloring generally perform quite poorly; however, when the chordal coloring algorithm is modified to account for ϕ-functions, the area of multiplexers allocated to the design is reduced by 24.8%, on average. Simulated annealing, in contrast, reduces the area by 25.9%, on average, but at a significant runtime cost.

2. SSA FORM
SSA Form [3, 6-7] is a compiler intermediate representation that has been used for numerous analyses and optimizations. More recently, several techniques for direct synthesis of SSA-form programs have been proposed [4, 2-3].

Any operation of the form $x \leftarrow \ldots$ is a definition of $x$; any operation of the form $\ldots \leftarrow x$ is a use of $x$. A procedure $P$ is in SSA Form if: (1) every variable is defined exactly once; and (2) every use of a variable corresponds to its definition.

Ensuring unique definitions of each variable is trivial: each definition of $x$ is replaced with definitions of $x_1$, $x_2$, etc. Ensuring that each use corresponds to one definition, however, is a bit more complicated. In Fig. 1(a), variable $x$ is defined on both sides of a condition, and then used at the join point following the condition. In Fig. 1(b), the definitions of $x$ are replaced with definitions of $x_1$ and $x_2$; however, the use of $x$ cannot be changed to a use of $x_1$ or $x_2$ without changing the semantics of the application. To rectify this situation, a ϕ-function, $x_3 \leftarrow \varphi(x_1, x_2)$, is introduced at the join point, and the use of $x$ is replaced with a use of $x_3$, as shown in Fig. 1(c). The semantics of the ϕ-function are as follows: if the path on the left is taken, then $x_3$ receives its value from $x_1$; if the path on the right is taken, then $x_3$ receives its value from $x_2$.

ϕ-functions are placed at confluence points in the procedure, where multiple control flow paths merge; the description above is easily generalized to an arbitrary number of converging paths. At a confluence point, there may be multiple ϕ-functions, each corresponding to a different variable in the procedure. Whenever a block containing ϕ-functions is executed, it is assumed that all of the ϕ-functions execute concurrently. This is not a problem for direct synthesis of SSA Form programs, but presents a significant challenge to compilers, which must translate the program out of SSA Form before the program can be executed.
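The ϕ-function semantics described above can be made concrete with a small sketch. The fragment below is only an illustration: the helper name `phi`, the path labels, and the constant values are assumptions introduced here and do not come from the paper.

```python
def phi(pred, values):
    """Return the phi-function operand selected by the incoming path.

    `pred` names the control-flow path that was taken (hypothetical labels),
    and `values` maps each path label to the value of its SSA variable.
    """
    return values[pred]


def example(cond):
    # Each side of the condition defines its own version of x.
    if cond:
        x1 = 10                      # left path (arbitrary value)
        pred, values = "left", {"left": x1}
    else:
        x2 = 20                      # right path (arbitrary value)
        pred, values = "right", {"right": x2}
    # Join point: x3 <- phi(x1, x2) takes the value from the path taken.
    x3 = phi(pred, values)
    return x3
```

Calling `example(True)` follows the left path, so `x3` receives the value of `x1`; `example(False)` yields the value of `x2`.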

[Figure 1. Illustration of SSA Form: (a) x defined on both sides of a condition and used at the join point; (b) the definitions renamed to $x_1$ and $x_2$, leaving the use of x unresolved; (c) $x_3 \leftarrow \varphi(x_1, x_2)$ inserted at the join point.]

Techniques to construct SSA Form have been described by Cytron et al. [6] and Briggs et al. [3]; throughout this paper, we assume that Pruned SSA Form [7] is always used. One important advantage of SSA is that all copy operations can be eliminated during its construction [3].

3. RELATED WORK
Connectivity binding subsumes the process of assigning registers to the different variables represented in an application to be synthesized. Typically, the application has already been scheduled, and the operations have already been bound to a set of functional resources that have been allocated. If a value produced by resource A is assigned to register R, then it is necessary to allocate a wire that connects the output of A to the input of R. Since there may be many resources that produce values written to R, a multiplexer must be placed on R's input. The overall goal of connectivity binding is to minimize an objective function that encompasses both the number of wires and the cost of the multiplexers that are allocated during this stage. The problem has been proven NP-Complete by Pangrle [6].

Connectivity binding has been studied in the past, but not for SSA-form programs. Huang et al. [] used a bipartite weighted matching heuristic to improve an initial assignment of registers. Rim et al. [7] formulated the problem as an integer-linear program and solved it optimally, but in exponential worst-case time. Kim and Liu [4] allocated extra registers in order to reduce the cost of allocating multiplexers, thereby sacrificing an initially optimal allocation of registers. Zhu and Jong [9] developed an approach using network flows, but never compared their technique to any other work. Most recently, Chen and Cong [5] used a k-cofamily based algorithm, and optimized port assignment as a post-processing phase. In many, but not all, cases, Chen and Cong's technique outperformed that of Huang et al.

The ϕ-functions in SSA Form create a new challenge for connectivity binding. Consider a ϕ-function $y \leftarrow \varphi(\ldots, x, \ldots)$.
If $x$ and $y$ are bound to registers $r_x$ and $r_y$ respectively, then a wire from the output of $r_x$ to the input of $r_y$ is required to facilitate the data transfer. On the other hand, if $x$ and $y$ are assigned to the same register $r$, then no data transfer is required. This issue does not affect synthesis of non-SSA-form applications.

Avakian and Ouaiss [] studied register binding for FPGAs using simulated annealing. Unlike our work, they did not use SSA Form for synthesis, so their objective function did not consider the effects of ϕ-functions when swapping the registers bound to each variable. Another difference is that they focus on optimizing multiplexers on the inputs to functional units, whereas our work focuses on minimizing the cost of register-to-register transfers.

Their implementation of simulated annealing is limited in two respects. First, it assumes that the interference graph is an interval graph. Interval graphs are the interference graphs for straight-line applications with no control flow; interval graphs also arise for applications with conditionals that have been eliminated via if-conversion, but no loops. Since any procedure can be converted to SSA Form, chordal graphs are much more general. The second limitation is that it only swaps the colors of two variables whose lifetimes are identical. Our implementation of simulated annealing, in contrast, does not impose any restriction on the variables whose colors are swapped.

4. PROBLEM STATEMENT
4.1 Overview
Variable $v$ is live at point $p$ in a program if there is a path from the definition of $v$ to $p$ and a path from $p$ to a use of $v$. Two variables interfere with one another if there is at least one point in the application where both are live. Two interfering variables cannot reside in the same register.

In SSA Form, an interference graph is a graph $G = (V, E, E_\varphi)$, where $V$ is a set of vertices, each corresponding to an SSA variable; an edge $(u, v) \in E$ is placed between every pair of variables $u$ and $v$ that interfere; finally, $E_\varphi = \{(u, v) \notin E \mid u \ne v$ and there is a ϕ-function $v \leftarrow \varphi(\ldots, u, \ldots)\}$.
Edges belonging to $E_\varphi$ are called ϕ-edges. Edges in $E$ are undirected; ϕ-edges in $E_\varphi$ are directed because the data transfer always originates at the parameter of a ϕ-function.

Let $f(x)$ be the color (i.e. register) assigned to variable $x$. A color assignment is legal if for each edge $(u, v) \in E$, $f(u) \ne f(v)$. Given a legal color assignment, interconnect allocation is deterministic, as discussed in the next section.

4.2 Interconnect Allocation
In SSA Form, the interference graph $G$ is chordal, so $\chi(G)$ can be computed in polynomial time and $\chi = \chi(G)$ registers are allocated. Let the set of registers allocated to the design be $R = \{R_1, \ldots, R_\chi\}$. Each register $R_i$ has a multiplexer on its input. Let $m_i$ be the number of inputs to this multiplexer. If $m_i = 0$, then there are no connections to $R_i$; if $m_i = 1$, then the multiplexer has only one input, and it can be replaced by a wire; if $m_i > 1$, then a multiplexer with $s_i = \lceil \log_2 m_i \rceil$ selection bits is required. The result of the color assignment is the set $M = \{m_1, \ldots, m_\chi\}$.

Next, we derive $M$ from a legal color assignment. Let $\Phi_{ij} = \{(u, v) \in E_\varphi \mid f(u) = i, f(v) = j\}$. $\Phi_{ij}$ is the set of ϕ-edges that necessitate a connection from register $R_i$ to $R_j$. A connection is required if $c_{ij} = |\Phi_{ij}| > 0$. Let $b_{ij} = 0$ if $c_{ij} = 0$ and $b_{ij} = 1$ if $c_{ij} > 0$. Let $B = \{b_{ij}\}$ be a bit-matrix representing all of the $b_{ij}$ values. The $j$th column of $B$ represents all of the connections from other registers to register $R_j$. From the $j$th column, the value of $m_j$ is:

$$m_j = \sum_{i=1}^{j-1} b_{ij} + \sum_{i=j+1}^{\chi} b_{ij} \qquad (1)$$
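The derivation of $M$ from a color assignment can be sketched in a few lines. The code below is an illustrative reading of Section 4.2, not the authors' implementation; the function names, the 0-based register indices, and the dictionary-based inputs are assumptions made for the example.

```python
from math import ceil, log2


def mux_inputs(num_regs, color, phi_edges):
    """Derive C, B and M (Section 4.2) from a legal color assignment.

    color:     dict mapping each variable to a register index 0..num_regs-1
    phi_edges: iterable of directed pairs (u, v) meaning v <- phi(..., u, ...)
    """
    # c[i][j] = number of phi-edges needing a transfer from R_i to R_j.
    c = [[0] * num_regs for _ in range(num_regs)]
    for u, v in phi_edges:
        c[color[u]][color[v]] += 1
    # b[i][j] = 1 iff at least one such phi-edge exists.
    b = [[1 if c[i][j] > 0 else 0 for j in range(num_regs)]
         for i in range(num_regs)]
    # Eq. (1): column sum of B for register j, skipping the diagonal b[j][j].
    m = [sum(b[i][j] for i in range(num_regs) if i != j)
         for j in range(num_regs)]
    return c, b, m


def selection_bits(m_j):
    """Selection bits of an m_j-input multiplexer; none needed if m_j <= 1."""
    return ceil(log2(m_j)) if m_j > 1 else 0
```

For instance, with `color = {"a": 0, "b": 1, "c": 2, "d": 0}` and ϕ-edges `[("a", "c"), ("b", "c"), ("d", "c"), ("c", "d")]`, two ϕ-edges share the connection from register 0 to register 2, so that connection is counted only once in `m`.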

[Figure 2. Example chordal extended interference graph (a) with legal 4-color assignment (b) and $c_{ij}$ and $m_j$ values (c). Legend: interference edge, ϕ-edge.]

There is no need to count $b_{jj}$, the connection from $R_j$ to itself, in Eq. (1). Fig. 2 shows an example chordal interference graph (a) with color assignment (b). Based on the color assignment, the matrix $C = \{c_{ij}\}$ and vector $M = \{m_j\}$ are shown in Fig. 2(c).

4.3 Evaluating the Color Assignment
In this paper, the goal is to minimize the aggregate area of multiplexers allocated to the design. A secondary goal could be to minimize the delay of the largest multiplexer; this goal is somewhat dubious because there is no way to tell whether or not that multiplexer will actually lie on the critical path of the final circuit. Let $A(m_i)$ and $D(m_i)$ be the area and delay, respectively, of a multiplexer with $m_i$ inputs. Then the overall area and delay through the multiplexers are as follows:

$$Area = \sum_{i=1}^{\chi} A(m_i) \qquad (2)$$

$$Delay = \max_{1 \le i \le \chi} \{D(m_i)\} \qquad (3)$$

5. REGISTER ASSIGNMENT HEURISTIC
5.1 Optimal Chordal Coloring
Let $G = (V, E)$ be an undirected graph. An elimination order (EO) of $G$ is a one-to-one and onto function $\alpha: V \rightarrow \{1, \ldots, |V|\}$. Let $v_i \in V$ be the vertex such that $\alpha(v_i) = i$, and $V_i = \{v_j \in V \mid 0 < j \le i\}$. $V_0$ is the empty set, $V_{|V|} = V$, and let $G_i = (V_i, E_i)$ be the subgraph of $G$ induced by $V_i$.

The basic idea of an elimination order is that it can be used to incrementally construct $G$ starting with an empty graph. Inductively, if we have computed $G_i$, we construct $G_{i+1}$ by adding vertex $v_{i+1}$ to $V_i$, and adding all edges to $E_i$ connecting $v_{i+1}$ to vertices in $V_i$. The result is the sets $V_{i+1}$ and $E_{i+1}$.

Let $N(v)$ be the set of vertices adjacent to $v$. $N_i(v) = N(v) \cap V_i$ is the set of vertices adjacent to $v$ with EO indices at most $i$. An EO is a perfect elimination order (PEO) if $N_i(v_i)$ is a clique for all $i$.
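The PEO condition above can be checked mechanically. The sketch below pairs such a check with maximum cardinality search, one standard way to obtain a candidate order for a chordal graph. It uses a simple quadratic-time vertex selection rather than a linear-time implementation, and the function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    with the most already-visited neighbors (ties: dictionary order).

    adj: dict mapping each vertex to the set of its neighbors.
    """
    weight = {v: 0 for v in adj}
    order = []
    while weight:
        v = max(weight, key=weight.get)   # simple O(|V|) selection
        order.append(v)
        del weight[v]
        for w in adj[v]:
            if w in weight:
                weight[w] += 1
    return order


def is_peo(adj, order):
    """Check that the neighbors of each vertex that precede it in `order`
    (the set N_i(v_i)) form a clique."""
    position = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [w for w in adj[v] if position[w] < position[v]]
        for a in range(len(earlier)):
            for b in range(a + 1, len(earlier)):
                if earlier[b] not in adj[earlier[a]]:
                    return False
    return True
```

On a chordal graph the visit order returned by `mcs_order` passes `is_peo`; on a chordless 4-cycle no order can pass, so the check fails.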
$G$ is defined to be a chordal graph if and only if $G$ has a PEO; there are several other provably equivalent definitions of chordal graphs, but they are not needed here. A PEO can be computed in $O(|V| + |E|)$ time using an algorithm called Maximum Cardinality Search (MCS).

Given a PEO, a minimum coloring of a chordal graph can be computed optimally in $O(|V| + |E|)$ time using a greedy algorithm by Gavril [8]. Inductively, assume that an optimal coloring has been computed for $G_i$. Now, consider $G_{i+1}$, and in particular, vertex $v_{i+1}$. Since $G$ has a PEO, $N_{i+1}(v_{i+1})$ is a clique. Therefore, it suffices to assign the smallest color not assigned to a vertex in $N_{i+1}(v_{i+1})$ to $v_{i+1}$. The maximal color assigned among all vertices is the chromatic number: $\chi = \chi(G)$.

We should also note that $N_i[v_i] = N_i(v_i) \cup \{v_i\}$ is also a clique. The clique $N_i[v_i]$ such that $|N_i[v_i]| \ge |N_j[v_j]|$ for all $1 \le j \le |V|$ is the maximum clique in the interference graph. The cardinality of the maximum clique is equal to $\chi$. The maximum independent set and the minimum clique partition of a chordal graph can be computed in $O(|V| + |E|)$ time [8].

5.2 Interconnect Optimization
The chordal coloring algorithm described in Section 5.1 computes a minimal coloring of an interference graph, but does not try to optimize Eq. (2) or (3). In this section, we extend the algorithm to optimize area (Eq. (2)).

We begin with an interference graph $G = (V, E, E_\varphi)$, as described in Section 4.1. $G$ is chordal since we are synthesizing an application in SSA Form. First, we run Gavril's algorithm (Section 5.1) to compute $\chi$ and we allocate a set $R$ of registers to the design, such that $|R| \ge \chi$; there is no absolute requirement that a minimum register allocation be used, and it is possible that allocating more registers can lead to an overall reduction in area once multiplexer optimization has been accounted for. Second, we initialize an $|R| \times |R|$ matrix, $B$, as described in Section 4.2, which is initially empty. Third, we process the vertices of $G$ in PEO order. Any color in the range $1..|R|$ not assigned to a vertex in $N_i(v_i)$ is a potential candidate for $v_i$.
Let $Free(v_i)$ denote this set of colors; recall that $f(v)$ is the color assigned to vertex $v$.

$$Free(v_i) = \{1..|R|\} - \bigcup_{v_j \in N_i(v_i)} f(v_j) \qquad (4)$$

We use the ϕ-edges incident on $v_i$ to help us decide the best color to assign to $v_i$. $\varphi_{in}(v_i)$ and $\varphi_{out}(v_i)$ are the sets of vertices adjacent to $v_i$ via ϕ-edges that have already been assigned colors. The sets of colors assigned to vertices in $\varphi_{in}(v_i)$ and $\varphi_{out}(v_i)$, respectively, that are available for $v_i$, are denoted $f_{in}(v_i)$ and $f_{out}(v_i)$. $\varphi_{in}(v_i)$ and $f_{in}(v_i)$ are defined as follows; $\varphi_{out}(v_i)$ and $f_{out}(v_i)$ are analogous:

$$\varphi_{in}(v_i) = \{v_j \mid (v_j, v_i) \in E_\varphi,\ j < i\} \qquad (5)$$

$$f_{in}(v_i) = \Big(\bigcup_{v_j \in \varphi_{in}(v_i)} f(v_j)\Big) \cap Free(v_i) \qquad (6)$$

We compute a cost, $F(c_k)$, for each color $c_k \in Free(v_i)$, and the color with lowest cost is assigned to $v_i$; ties are broken arbitrarily. The cost function that we have selected attempts to minimize the number of new connections that are created by assigning color $c_k$ to vertex $v_i$. It has two components: the first is the number of wires that will be allocated to transfer data from other registers to $r_k$ if $c_k$ is assigned to $v_i$; likewise, the second is the number of wires that will be allocated to transfer data from register $r_k$ to other registers. Writing $\overline{b_{jk}}$ for the complement of $b_{jk}$ (1 if no wire from $r_j$ to $r_k$ has been allocated yet, 0 otherwise):

$$F(c_k) = \sum_{c_j \in f_{in}(v_i),\ c_j \ne c_k} \overline{b_{jk}} \; + \sum_{c_j \in f_{out}(v_i),\ c_j \ne c_k} \overline{b_{kj}} \qquad (7)$$

If color $c_k$ is selected for $v_i$, then we must update the matrix $B$ to account for the new wires that have been allocated. Let $B^{i-1}$ denote the matrix prior to assigning a color to $v_i$, and $B^i$ be the matrix afterward; $b^{i-1}_{jk}$ and $b^i_{jk}$ represent individual elements. Let $row_k[B]$ and $col_k[B]$ represent the $k$th row and column of $B$ respectively. These are the only elements of $B$ that will be updated. Let $b^{i-1}_{kj} \in row_k[B^{i-1}]$ and $b^{i-1}_{jk} \in col_k[B^{i-1}]$. Then:

$$b^i_{kj} = \begin{cases} 1 & c_j \in f_{out}(v_i),\ c_j \ne c_k \\ b^{i-1}_{kj} & \text{otherwise} \end{cases} \qquad (8)$$

$$b^i_{jk} = \begin{cases} 1 & c_j \in f_{in}(v_i),\ c_j \ne c_k \\ b^{i-1}_{jk} & \text{otherwise} \end{cases} \qquad (9)$$

5.3 Complexity
The time complexity of the algorithm described in Section 5.2 is $O(|R|^2 |V| + |E|)$. The complexity of computing the PEO is $O(|V| + |E|)$. As in chordal coloring, the complexity of processing vertices in PEO order and eliminating colors assigned to vertices contained in $N_i(v_i)$ from consideration for $v_i$ is also $O(|V| + |E|)$. In the new algorithm, we consider up to $|R|$ colors for each vertex. The complexity of evaluating $F(c_k)$, per color, is $O(|R|)$ as well, yielding a complexity of $O(|R|^2)$ per vertex. Once color $c_k$ has been selected for vertex $v_i$, the cost of updating the $k$th row and column of $B^{i-1}$ is $O(|R|)$ if $f_{in}$ and $f_{out}$ are represented as bit-vectors. The overall time complexity is therefore $O(|R|^2 |V| + |E|)$.
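The heuristic of Section 5.2 can be sketched as follows. This is one reading of Eqs. (4) to (9), not the authors' code: the function name, the 0-based color indices, the list-of-pairs representation of ϕ-edges, and the tie-breaking rule (lowest-numbered color) are all assumptions. For brevity the sketch scans the ϕ-edge list for every vertex, so it does not achieve the complexity stated in Section 5.3.

```python
def assign_registers(order, adj, phi_edges, num_regs):
    """Greedy, interconnect-aware color assignment (illustrative sketch).

    order:     vertices in PEO order (earlier neighbors form a clique)
    adj:       dict vertex -> set of interfering vertices (edges in E)
    phi_edges: list of directed phi-edges (u, v), v <- phi(..., u, ...)
    num_regs:  |R|, assumed to be at least the chromatic number
    """
    color = {}
    # b[j][k] == 1 iff a wire from register j to register k is allocated.
    b = [[0] * num_regs for _ in range(num_regs)]
    for v in order:
        # Eq. (4): colors not used by already-colored interfering neighbors.
        used = {color[w] for w in adj[v] if w in color}
        free = [k for k in range(num_regs) if k not in used]
        # Eqs. (5)-(6): colors of already-colored phi-neighbors, restricted
        # to the colors that are free for v.
        f_in = {color[u] for (u, w) in phi_edges
                if w == v and u in color} & set(free)
        f_out = {color[w] for (u, w) in phi_edges
                 if u == v and w in color} & set(free)

        def cost(k):
            # Eq. (7), read as the number of wires not yet present in B.
            new_in = sum(1 for j in f_in if j != k and not b[j][k])
            new_out = sum(1 for j in f_out if j != k and not b[k][j])
            return new_in + new_out

        k = min(free, key=cost)           # ties: lowest-numbered color
        color[v] = k
        # Eqs. (8)-(9): record the wires implied by this choice.
        for j in f_in:
            if j != k:
                b[j][k] = 1
        for j in f_out:
            if j != k:
                b[k][j] = 1
    return color, b
```

As a small usage example, let `p` and `q` interfere and let `r` be defined by a ϕ-function with operand `q`. Plain smallest-color greedy coloring would give `r` color 0, which needs a wire from register 1 to register 0; the sketch instead reuses `q`'s register, so no wire is needed.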
6. SIMULATED ANNEALING
Simulated annealing is an iterative improvement heuristic that provides locally optimal solutions to classically hard problems. Due to space limitations, we assume that the reader is familiar with simulated annealing; if not, please refer to the paper by Johnson et al. [] for an overview.

6.1 Representation
Simulated annealing begins with a problem instance and constructs an initial solution. An objective function evaluates the quality of the solution. The objective function is nonnegative, with smaller values defined as being superior to larger ones. A MOVE operation is then applied to the initial solution. The move is a small perturbation, for example, swapping the colors of two vertices. If the move improves the current solution, then the move is accepted without question; otherwise, the move is either accepted or rejected based on a randomized computation.

A clique is a subset of vertices whose induced subgraph is complete, i.e. there is an edge between every pair of vertices in the subgraph. A clique partition partitions the interference graph into a set of non-overlapping cliques. A clique partition of a chordal graph can be computed in $O(|V| + |E|)$ time [8]. $\kappa(G)$ is defined to be the clique partition number of $G$, i.e. the smallest number of cliques that can partition $G$; $\kappa$ will denote $\kappa(G)$ in order to simplify notation.

Let $C = \{C_1, \ldots, C_\kappa\}$ be the clique partition. Let $Cl(v) = i$ if $v \in C_i$. An edge $(u, v) \in E$ is an intra-clique edge if $Cl(u) = Cl(v)$ and an inter-clique edge otherwise. The MOVE operation is restricted to swapping the colors of two vertices belonging to the same clique. Therefore, the coloring constraint imposed by each intra-clique edge is satisfied trivially. Using this move operation, intra-clique edges can be removed from the graph, thus reducing the number of constraints.

If $|C_i| < \chi$, then it will not be possible for some vertex $v \in C_i$ to receive every possible color in the interference graph. To rectify this, $\chi - |C_i|$ dummy vertices are added to each clique.
A dummy vertex is adjacent to each vertex in $C_i$ (including other dummies) but no other vertices in the interference graph. A dummy vertex is simply a placeholder for a color not used by a clique. After adding dummies, there will be exactly $\kappa\chi$ vertices in the graph. Fig. 3 shows Fig. 2 after clique partitioning. The ϕ-edges from Fig. 2 are not shown in Fig. 3. The number of edges is reduced from 22 to .

[Figure 3. The graph from Fig. 2(a) with intra-clique edges removed and dummy vertices; ϕ-edges are not shown.]

6.2 MOVE Operation
A vertex $v$ is an illegal vertex if there is some edge $(u, v) \in E$ such that $f(u) = f(v)$, i.e. the color assignment is illegal. A vertex $v$ is defined to be a sub-optimal vertex if $v$ is not illegal and there is at least one edge $(v, w) \in E_\varphi$ such that $f(w) \ne f(v)$. The MOVE operation is defined as follows:

(1) Randomly select an illegal vertex $v$; randomly select another vertex $u$ from $v$'s clique and swap their colors.
(2) If there are no illegal vertices, randomly select a sub-optimal vertex $v$; randomly select another vertex $w$ from $v$'s clique and swap their colors.
(3) If there are no illegal or sub-optimal vertices, then the current color assignment is optimal; annealing terminates.

In Fig. 2, the solution that would result from swapping the colors of $v_8$ and $v_{10}$ would yield the matrix $C = \{c_{ij}\}$ of Eq. (10). The updated vector $M$ becomes [2, , , 2]. Colors 2 and 3 are swapped, and the corresponding rows and columns of $C$ are modified.

[Eq. (10): the updated matrix $C$; its entries did not survive in the source.]

6.3 Objective Function
Here, we describe the objective function used by the simulated annealing heuristic. For legal solutions, our objective function is the sum of the areas of the multiplexers allocated. Recall that $m_i$ is the number of inputs to register $r_i$ that must be multiplexed. If a color solution is legal, then

$$Obj_{Legal} = \sum_{i=1}^{|R|} A(m_i). \qquad (11)$$

Our implementation of simulated annealing allows illegal coloring solutions to be accepted. A sequence of illegal solutions could lead to a new area of the search space that was not likely to be explored otherwise. Since illegal solutions are of no practical use, we desire an objective function that ensures that all illegal solutions have a higher value than any legal one. The maximum value of a legal objective function is

$$MAX_{Legal} = |R| \, A(|R|). \qquad (12)$$

$MAX_{Legal}$ is the value of $Obj_{Legal}$ that would occur if the largest possible multiplexer was placed on the input to every register.

If the current solution is illegal, the objective function should influence the annealing procedure toward legalization. An edge $(u, v)$ such that $f(u) = f(v)$ is an illegal edge. Let $E'$ be the subset of interference edges that are illegal based on the current color assignment. Then the objective function when the current solution is illegal is:

$$Obj_{Illegal} = MAX_{Legal} + |E'| \qquad (13)$$

If $MAX_{Legal}$ is sufficiently large, there is virtually no chance that a move that causes a legal solution to become illegal will be accepted, due to the large differential in objective function value.
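The objective of Eqs. (11) to (13) can be sketched directly. The code below is illustrative only: the function name and argument layout are assumptions, and the `area` callable stands in for a lookup into a multiplexer library, which the paper does not list.

```python
def objective(color, edges, m, num_regs, area):
    """Objective of Section 6.3, Eqs. (11)-(13) (illustrative sketch).

    color:    dict vertex -> color
    edges:    iterable of interference edges (u, v)
    m:        list of multiplexer input counts, one per register
    num_regs: |R|
    area:     callable A(m_i) giving the area of an m_i-input multiplexer
              (hypothetical; in practice taken from a multiplexer library)
    """
    # Illegal edges: interfering variables that share a color.
    illegal = sum(1 for u, v in edges if color[u] == color[v])
    max_legal = num_regs * area(num_regs)          # Eq. (12)
    if illegal == 0:
        return sum(area(m_i) for m_i in m)         # Eq. (11)
    return max_legal + illegal                     # Eq. (13)
```

With a placeholder area model such as `area = lambda n: n`, any assignment containing an illegal edge scores strictly above `max_legal`, which is what steers the search back toward legal colorings.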
To increase the likelihood of accepting illegal moves, we normalize the objective function such that the legal objective $Obj_{Legal}$ takes values in the range [0, 1] and the illegal objective takes values in the range (1, 2]. With $E'$ denoting the set of illegal interference edges, the complete objective function, $Obj^*$, is therefore:

$$Obj^* = \begin{cases} \dfrac{Obj_{Legal}}{MAX_{Legal}} & |E'| = 0 \\[2ex] 1 + \dfrac{|E'|}{|E|} & |E'| > 0 \end{cases} \qquad (14)$$

7. EXPERIMENTAL RESULTS
The coloring heuristics described in the preceding sections were integrated into an experimental SSA-based synthesis framework developed by Brisk et al. [4]; this framework has been built on top of the Machine SUIF compiler [8]. We targeted an Altera ACEX 1K FPGA. We generated a library of multiplexers starting with the 5 listed in a table of the paper by Avakian and Ouaiss []; we then used the algorithm of Mitra and McCluskey [5] to generate a library of multiplexers from 2-2 inputs. Due to space limitations, the parameters of this library are not shown.

For synthesis, we allocated 5 resources: 2 adders, 2 multipliers, and one ALU that performs logical operations such as AND, OR, etc. We then synthesized each design and studied the effects of the interconnect optimization heuristics described in this paper. We tested 3 approaches for interconnect optimization: chordal coloring [4, 8] (Section 5.1), optimized chordal coloring (Section 5.2), and simulated annealing (Section 6). Table 1 shows the parameters used for the simulated annealing; the parameters are taken from the paper by Johnson et al. []. Our benchmarks were interference graphs taken from the paper by Brisk et al. [4]; these graphs were selected because they are large, and present a significant challenge to the synthesizer.

Table 2 shows the results of the experiments. The area for each design is shown in terms of logic blocks, and the runtime is presented in milliseconds. The chromatic number of each graph is the number of registers allocated, and each register (and multiplexer) is 32 bits wide. The area results are shown for registers and multiplexers only, because the cost of the other resources is fixed and does not depend on the allocation.
On average, the enhanced color assignment heuristic of Section 5.2 reduced the number of logic cells by 24.8% compared to chordal coloring [8] (from an average of 4,494 to 3,378 cells in Table 2). On average, simulated annealing yielded an additional reduction of 47 cells, for a total reduction of 25.9%. The average runtimes were 8.87ms for chordal coloring, 39.5ms for optimized chordal coloring, and 2,662ms for simulated annealing. We believe that this reflects favorably on the first heuristic, which remains competitive with simulated annealing at a fraction of the runtime.

[Table 1. Simulated Annealing Parameters: N, T, SIZE_FACTOR, CUTOFF, TEMPFACTOR, MINPERCENT, FREEZE_LIM. The parameter values did not survive in the source.]

Table 2. Area Results and Runtime of the 3 Heuristics
Columns: Benchmark | Chromatic Number | Area (# Logic Cells): Chordal, Optimized, Annealing | Runtime (milliseconds): Chordal, Optimized, Annealing

try_combine | 24 | 5,366 | 4,32 | 4,32 | 3.3 | 43.5 | 2,83
simplify_rtx | 4 | 2,83 | 2,368 | 2,368 | 5.65 | 8.74 | ,7
yyparse | 6 | 2,887 | 2,752 | 2,72 | 8.53 | .4 | 5,373
fold_rtx | 22 | 5,48 | 3,776 | 3,744 | 8.4 | 72.8 | 2,964
expand_expr | 7 | 4,66 | 3,358 | 3,75 | 9.49 | 4.6 | 95,88
fold | 23 | 5,76 | 4,35 | 4,7 | 2. | 79.4 | 8,36
recog_5 | 8 | ,48 | ,32 | ,32 | 3.4 | 4.57 | ,467
jump_optimize | 29 | 8,84 | 4,999 | 4,999 | .2 | 55.3 | 8,843
Average | - | 4,494 | 3,378 | 3,33 | 8.87 | 39.5 | 2,662

[Figure 4. Distribution of multiplexers allocated by the 3 color assignment heuristics (Chordal, Optimized, Annealing) for the benchmark fold, plotted by number of multiplexer inputs.]

Fig. 4 shows the distribution of multiplexers allocated for the benchmark fold by the 3 coloring heuristics. It is easy to see that chordal coloring performs poorly, allocating three multiplexers with more than 9 inputs, including one 15-input multiplexer; in contrast, the largest multiplexer allocated by both optimized chordal coloring and simulated annealing has 9 inputs. Simulated annealing also allocates the largest number of 5-input multiplexers, 5 being the smallest number of inputs among all allocated multiplexers.

It is important to note that the distribution of multiplexers, as shown in Fig. 4, does not completely characterize the solution to the problem. The area metric reported in Table 2, for example, would be considerably different if a standard cell design was considered instead of an FPGA. It is well-known that a multiplexer is not easily synthesized from look-up tables, so its cost, relative to other circuit elements, is significantly higher in an FPGA than in a standard cell design. Thus, the impetus for optimizing multiplexers is considerably greater for FPGAs.

A secondary issue is whether complete or incomplete multiplexers are used. The number of inputs of a complete multiplexer is always a power of 2, so, for example, a 7-input multiplexer would be implemented via an 8-input multiplexer, with one input never used.
If the only multiplexers available are complete, all of the 5-, 6-, and 7-input multiplexers in Fig. 4 would have the same cost as an 8-input multiplexer, and all 9- to 15-input multiplexers would have the same cost as a 16-input multiplexer; of course, the impact of this decision is considerably greater for an FPGA than for a standard cell design. In practice, one would want to generate a large library of incomplete multiplexers using the technique of Mitra et al. [15], or a comparable approach. This must be done prior to running the simulated annealing heuristic to ensure a correct estimate of the area of each multiplexer; it is not necessary to generate the library in advance if the optimized chordal coloring heuristic is used, because its objective function is not based on the area of a specific multiplexer.
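The rounding effect of complete multiplexers can be illustrated with a short sketch. The function names `complete_mux_inputs` and `unused_inputs` are hypothetical helpers introduced here for illustration; they only model the input count of the complete multiplexer that would be instantiated, not its area in logic cells.

```python
def complete_mux_inputs(k: int) -> int:
    """Return the input count of the smallest complete multiplexer
    (a power of 2) that can implement a k-input multiplexer."""
    if k < 1:
        raise ValueError("a multiplexer needs at least one input")
    # (k - 1).bit_length() is the exponent of the next power of 2 >= k.
    return 1 << (k - 1).bit_length()


def unused_inputs(k: int) -> int:
    """Number of inputs left unconnected when a complete multiplexer is used."""
    return complete_mux_inputs(k) - k


# Sizes 5 to 8 all map to an 8-input multiplexer, and 9 to 15 to a 16-input one.
sizes = {k: complete_mux_inputs(k) for k in range(5, 16)}
```

With this model, a 7-input multiplexer wastes one input and a 9-input multiplexer wastes seven, which is why a library of incomplete multiplexers matters most for sizes just above a power of 2.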

For datapath circuits, it is quite likely that the size of the largest multiplexer could affect the clock frequency. Ignoring the generation of control signals, the register-to-register path with maximum combinational delay will constrain the clock frequency. It is likely, although not guaranteed, that the largest multiplexer will lie along this path. Consequently, there could be some benefit gained from attempting to constrain the largest multiplexer allocated to the design, rather than focusing solely on area optimization. Doing so accurately, however, would require a detailed model of the layout of the datapath portion of the circuit, which is not likely to be available during high-level synthesis. A reasonable model may be available if register allocation and interconnect optimization are performed as the final steps during high-level synthesis. At the very least, interconnect allocation must occur before logic synthesis can optimize the datapath and certainly before the final layout; and for FPGA-based designs, the delays depend on placement and routing as well. Thus, we have focused on area rather than delay as the metric to study in this paper; however, we may attempt to optimize the latter as future work.

8. CONCLUSION AND FUTURE WORK

The problem of interconnect optimization in high-level synthesis of SSA Form applications has been introduced. SSA Form is an ideal representation for synthesis because the interference graph for each procedure is a chordal graph, which can be colored optimally in O(|V| + |E|) time. To solve this problem, we have introduced two heuristics: the first is an extension of Gavril's optimal algorithm for chordal graph coloring, and the second is based on simulated annealing. Although simulated annealing performs .% better than the other on average, the improvement requires an average increase in runtime of three orders of magnitude. The interconnect optimization problem studied in this paper specifically focuses on minimizing the connections between registers that arise from ϕ-functions.
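The claim that a chordal interference graph can be colored optimally is easy to demonstrate on a small example. The sketch below is not the heuristic of this paper and not Gavril's original formulation: it swaps in maximum cardinality search to obtain a vertex order whose reverse is a perfect elimination order, then colors greedily along that order. For brevity the search is quadratic rather than linear, and the function names and the example graph are hypothetical.

```python
def mcs_order(adj):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    with the most visited neighbors. On a chordal graph, the reverse of
    the visit order is a perfect elimination order."""
    weight = {v: 0 for v in adj}
    unvisited = set(adj)
    order = []
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])  # ties broken arbitrarily
        order.append(v)
        unvisited.remove(v)
        for w in adj[v]:
            if w in unvisited:
                weight[w] += 1
    return order


def greedy_color(adj, order):
    """Give each vertex the smallest color unused by its colored neighbors.
    Along an MCS order of a chordal graph, those neighbors form a clique,
    so the number of colors equals the size of the largest clique."""
    color = {}
    for v in order:
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Hypothetical interference graph: triangles {0, 1, 2} and {1, 2, 3},
# plus vertex 4 attached to vertex 3. It is chordal with largest clique 3.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
colors = greedy_color(adj, mcs_order(adj))
```

Every optimal coloring uses the same number of registers, so the freedom that remains is which color each vertex receives; that choice is where a color assignment heuristic can reduce interconnect cost.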
In the future, we intend to extend the ideas presented in this paper so that operation binding and port assignment are performed concurrently. This makes sense because, as Brisk et al. [4] showed, it is possible to color an interference graph for an SSA Form procedure by making a reverse post-order pass over its dominator tree and processing the operations in each node in forward order. The extension to chordal graph coloring suggested in Section 5.2 can be implemented in this fashion as well, because it augments traditional chordal graph coloring with an improved method to select the color to assign to each vertex. We believe that the color assignment heuristic can be modified to make operation and port binding decisions as each operation is processed during the traversal of the application. This would be similar in principle to the approach advocated by Cong and Chen [5], while accounting for ϕ-functions.

REFERENCES

[1] Avakian, A., and Ouaiss, I. Optimizing register binding in FPGAs using simulated annealing. In Proc. of the Int. Conf. Reconfigurable Computing and FPGAs (ReConFig 05) (Puebla City, Mexico, September 28-30, 2005) 6.
[2] Bouchez, F., Darte, A., Guillon, C., and Rastello, F. Register Allocation and Spill Complexity Under SSA. Technical Report RR2005-33, ENS-Lyon, Lyon, France, 2005.
[3] Briggs, P., Cooper, K. D., Harvey, T. J., and Simpson, L. T. Practical improvements to the construction and destruction of static single assignment form. Software Practice and Experience, 28, 8 (July 1998), 859-881.
[4] Brisk, P., Dabiri, F., Jafari, R., and Sarrafzadeh, M. Optimal register sharing for high-level synthesis of SSA form programs. IEEE Trans. Computer-Aided Design, 25, 5 (May 2006), 772-779.
[5] Chen, D., and Cong, J. Register binding and port assignment for multiplexer optimization. In Proc. of the Asia South Pacific Design Automation Conf. (ASP-DAC 04) (Yokohama, Japan, 2004), 68-73.
[6] Choi, J-D., Cytron, R., and Ferrante, J. Automatic construction of sparse data flow evaluation graphs. In Proc. of the ACM/SIGPLAN Conf. Principles of Progr. Languages (POPL 91) (Orlando, FL, USA, 1991), 55-66.
[7] Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Prog. Lang. and Systems, 13, 4 (October 1991), 451-490.
[8] Gavril, F. Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph. SIAM J. Computing, 1, 2 (June 1972), 180-187.
[9] Hack, S., and Goos, G. Optimal register allocation for SSA-form programs in polynomial time. Information Processing Letters, 98, 4 (May 2006), 150-155.
[10] Huang, C-Y., Chen, Y-S., Lin, Y-L., and Hsu, Y-C. Data path allocation based on bipartite weighted matching. In Proc. of the Design Automation Conf. (DAC 90) (Orlando, FL, USA, 1990), 499-504.
[11] Johnson, D. S., Aragon, C. R., McGeoch, L. A., and Schevon, C. Optimization by simulated annealing, part II: graph coloring and number partitioning. Operations Research, 39, 3 (May-June 1991), 378-406.
[12] Kaplan, A., Brisk, P., and Kastner, R. Data communication estimation and reduction for reconfigurable systems. In Proc. of the Design Automation Conf. (DAC 03) (Anaheim, CA, USA, 2003), 66-62.
[13] Kastner, R., et al. Layout driven data communication optimization for high level synthesis. In Proc. of the Conf. on Design Automation and Test in Europe (DATE 06) (Munich, Germany, 2006), 85-9.
[14] Kim, T., and Liu, C. L. An integrated data path synthesis algorithm based on network flow method. In Proc. of the Custom Integrated Circuits Conf. (CICC 95) (Santa Clara, CA, USA, 1995), 65-68.
[15] Mitra, S., Avra, L. J., and McCluskey, E. J. Efficient multiplexer synthesis techniques. IEEE Design & Test of Computers, (October-December 1999), 2-9.
[16] Pangrle, B. On the complexity of connectivity binding. IEEE Trans. Computer-Aided Design, 10, 11 (November 1991), 1460-1465.
[17] Rim, M., Jain, R., and De Leone, R. Optimal allocation and binding in high-level synthesis. In Proc. of the Design Automation Conf. (DAC 92) (Anaheim, CA, USA, 1992), 2-23.
[18] Smith, M. D., and Holloway, G. An Introduction to Machine SUIF and its Portable Libraries for Analysis and Optimization. Technical Report, Harvard University, Cambridge, MA, USA, 2002.
[19] Zhu, H. W., and Jong, C. C. Interconnection optimization in data path allocation using minimal cost maximal flow algorithm. Microelectronics Journal, 33, 9 (September 2002), 749-759.