Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances


In: Proc. 10th Int. Symposium on Hardware/Software Codesign (CODES'02), Estes Park, Colorado, USA, May 6-8, 2002

Todor Stefanov, Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands, stefanov@liacs.nl
Bart Kienhuis, Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
Ed Deprettere, Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands

ABSTRACT
Following the Y-chart paradigm for designing a system, an application and an architecture are modeled separately and mapped onto each other in an explicit design step. Next, a performance analysis for alternative application instances, architecture instances and mappings has to be done, thereby exploring the design space of the target system. Deriving alternative application instances is not trivial. Nevertheless, many instances of a single application exist that are worth deriving for exploration. In this paper, we present algorithmic transformation techniques for systematic and fast generation of alternative application instances that express the task-level concurrency hidden in an application in some degree of explicitness. These techniques help a system designer to significantly speed up the design space exploration process.

Keywords
system-level design, design space exploration, application instances, algorithmic transformations

1. INTRODUCTION
In system-level design of embedded signal-processing systems, a system designer sees the target system as the pair Application(s) specification - Architecture template. An example of such a pair is shown in the left part of Figure 1. The application specification provides the functional behavior of the system.
The architecture template specifies the organization of the resources of the system onto which the functional behavior is to be mapped. At this stage, a designer has to make some design decisions, for example, how to partition the application into tasks, how to map the tasks onto the architecture template, what kind of communication structure to use in the architecture template, etc. In order to evaluate different design decisions, a system designer uses a model of the target system and does performance analysis for alternative application instances, architecture instances and mappings, thereby exploring the design space of the Application - Architecture pair. A general scheme for a design space exploration is the Y-chart paradigm [4]. Tools like SPADE [9] and ORAS [6] implement techniques that support the Y-chart paradigm, but they focus only on the exploration of alternative architecture instances and mappings [8].

Figure 1: Alternative instances of the application have to be generated, mapped onto the architecture template and explored in order to evaluate the performance of the Application-Architecture pair.

In this paper, however, we focus on techniques that support efficient exploration of alternative application instances in system-level design. An application instance is every partitioning of an application into a composition of concurrent tasks. We use the Kahn Process Network (KPN) model of computation [3] to describe application instances. In the Kahn model, concurrent processes communicate via unbounded FIFO channels. In Figure 1, we show a simple application and a set of alternative KPN instances of this application (KPN 1 to KPN 5). Each application instance differs from the others in the degree of exploited task-level parallelism. The performance of the Application - Architecture pair can depend significantly on the application instance.
So, a system designer needs support to generate and explore a set of instances of an application in order to evaluate the performance of the system and to choose an application partitioning that satisfies the requirements the target system has to meet.
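To make the Kahn model mentioned above concrete, here is a minimal Python sketch of a three-process network (source, worker, sink) connected by unbounded FIFO channels. It is an illustration only and not part of the authors' tool flow: the process names, the `DONE` end-of-stream sentinel and the squaring function are assumptions chosen for this example.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, an assumption of this sketch


def source(out_ch, n):
    """Write the tokens 1..n to the output channel, then signal end of stream."""
    for token in range(1, n + 1):
        out_ch.put(token)
    out_ch.put(DONE)


def worker(in_ch, out_ch, func):
    """Apply func to each token; the blocking read is the only synchronisation."""
    while True:
        token = in_ch.get()  # blocks until a token is available
        if token is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(func(token))


def sink(in_ch, result):
    """Collect every token that arrives until the end-of-stream marker."""
    while True:
        token = in_ch.get()
        if token is DONE:
            return
        result.append(token)


def run_network(n):
    """Run source -> worker -> sink as three concurrent processes."""
    a, b = queue.Queue(), queue.Queue()  # maxsize=0: unbounded FIFO channels
    result = []
    procs = [
        threading.Thread(target=source, args=(a, n)),
        threading.Thread(target=worker, args=(a, b, lambda v: v * v)),
        threading.Thread(target=sink, args=(b, result)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return result
```

Because each channel has a single writer and a single reader and reads block, the result does not depend on how the threads are scheduled, which is the property that makes the Kahn model attractive for describing application instances.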

In general, a system designer is only able to derive at most a few alternative application instances. This is so because no systematic way to derive an application instance, let alone alternatives, from an application specification is known, as a result of which heuristic and time-consuming approaches are taken in practice. Nevertheless, many instances of a single application exist that are worth deriving for exploration. We present in this paper algorithmic transformations that we have developed and implemented in order to help a system designer derive alternative application instances systematically and quickly. These transformations, together with an aggressive parallel compiler called COMPAAN, are encapsulated in an Application Transformation Layer that automatically generates a set of application instances. The transformations and the tools presented in this paper are not generally applicable, in the sense that the application specification has to be an affine nested loop program (NLP). In the next section we show the position of the Application Transformation Layer in the Y-chart paradigm. In Section 3 two specific algorithmic transformations are given. The COMPAAN tool is briefly described in Section 4. In Section 5 we show how our algorithmic transformations are used in practice. In Section 6 we present a number of experiments and associated results. Finally, we discuss related work and draw conclusions in Section 7 and Section 8, respectively.

2. APPLICATION TRANSFORMATION LAYER
In this section, we discuss the application transformation layer in the context of the design space exploration process. We use this layer as an extension to the Y-chart environment [4].
The positioning of the transformation layer is shown in Figure 2. We start with an application specification written in an imperative language like Matlab or C, and we have to generate and explore a set of instances (Kahn Process Networks) functionally equivalent to the application. First, algorithmic transformations are applied to the application specification. The transformations are controlled by a set of parameters. At the beginning, some initial values are assigned to the parameters depending on the available resources in the architecture template. With these values, the original code of the application is automatically transformed and structured in a particular way in order to make the parallelism that is inherently available in the application explicit or to enhance the task-level parallelism in the application. Second, the transformed code is converted automatically to a KPN description by an aggressive parallel compiler called COMPAAN. Third, we use a Y-chart environment to map the KPN onto an architecture template and do performance analysis. The result of this performance analysis can be used to change the values of the parameters (step 4 in Figure 2) if the system performance is not satisfactory. Then, we repeat the procedure described above, resulting in a design space exploration of alternative instances of the application. This is shown in Figure 2 as a feed-back arrow to the transformation layer.

Figure 2: The Y-chart extended with the Application Transformation Layer.

Footnote: For lack of space we confine ourselves to only two algorithmic transformations. We have identified and implemented other transformations as well, e.g., plane-cutting, look-ahead, loop transformations. The approach and technique is uniform over all transformations.
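The four-step iteration described above can be sketched as a small driver loop. This is a hedged illustration, not the authors' implementation: `transform`, `compile_to_kpn`, `map_and_analyze`, `update` and `is_ok` are hypothetical callbacks standing in for the algorithmic transformations, the compiler, the Y-chart environment and the designer's parameter update.

```python
def explore(initial_params, transform, compile_to_kpn, map_and_analyze,
            update, is_ok, max_iter=20):
    """Iterate transform -> compile -> map/analyze -> update parameters.

    All callbacks are placeholders supplied by the caller (assumptions of
    this sketch). Returns the list of (parameters, performance) pairs tried.
    """
    params = initial_params
    history = []
    for _ in range(max_iter):
        code = transform(params)        # step 1: algorithmic transformations
        kpn = compile_to_kpn(code)      # step 2: conversion to a process network
        perf = map_and_analyze(kpn)     # step 3: mapping and performance analysis
        history.append((params, perf))
        if is_ok(perf):                 # performance satisfactory: stop exploring
            break
        params = update(params, perf)   # step 4: new values of the parameters
    return history
```

A toy use would pass an unfolding factor as `params`, an invented cost model as `map_and_analyze`, and `update=lambda u, perf: u + 1`; the returned history is then the explored slice of the design space.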
By changing the values of the parameters, the application transformation layer automatically generates a set of KPNs corresponding to a single application. The difference among the KPNs is the degree of the task-level parallelism that is exploited. In the remainder of this paper we describe in more detail the techniques and tools we have developed and incorporated in the transformation layer.

3. ALGORITHMIC TRANSFORMATIONS
In this section, we present two algorithmic transformations, namely Unfolding and Skewing. These transformations take as input an affine nested loop program (NLP) [2] and a set of parameters. The output of the unfolding transformation is an affine nested loop program which is functionally equivalent to the input program but with enhanced task-level parallelism. The skewing transformation makes the potential parallelism in the input affine nested loop program explicit. We have developed and implemented these and other transformations in a tool box called MATTRANSFORM. The transformations in this tool box operate directly on the NLP source code without using an intermediate representation like dependence graphs, signal-flow graphs or data-flow graphs corresponding to the NLP. First, we explain what unfolding and skewing mean in the context of our algorithmic transformations. Next, we define the unfolding and skewing transformations as procedures that operate on an affine nested loop program. For convenience, in our further explanations, we assume that affine nested loop programs (NLPs) are expressed in Matlab code. The NLPs could also be expressed in other imperative programming languages like, for example, C.

3.1 Unfolding and Skewing
Consider the application program (NLP) and its dependence graph (DG) shown in Figure 3-a). The DG is a graphical representation of the NLP. The nodes in the DG represent the NLP functions that are executed in each loop iteration and the edges represent the data dependences between the functions. The NLP has two loops (with iterators j, i) which can be unrolled to yield the DG.
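As a rough illustration of how unrolling the two loops yields the DG, the following Python sketch enumerates nodes and edges for a 2-deep loop nest. It assumes the loop body has the form `[y(i), x(j)] = F(y(i), x(j))`, which is inferred from the skewed program in Figure 3-c) and is therefore an assumption; the function name `dependence_graph` is hypothetical.

```python
def dependence_graph(n_outer, n_inner):
    """Unroll `for j = 1:n_outer, for i = 1:n_inner` into DG nodes and edges.

    Assumed body: [y(i), x(j)] = F(y(i), x(j)), so each node (j, i) reads
    y(i) written at (j-1, i) and x(j) written at (j, i-1).
    """
    nodes = [(j, i)
             for j in range(1, n_outer + 1)
             for i in range(1, n_inner + 1)]
    edges = []
    for (j, i) in nodes:
        if j > 1:
            edges.append(((j - 1, i), (j, i)))  # y(i) flows along the j direction
        if i > 1:
            edges.append(((j, i - 1), (j, i)))  # x(j) flows along the i direction
    return nodes, edges
```

For the loop bounds of Figure 3 (outer bound 4, inner bound 3) this gives 12 nodes, one per execution of F.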
Unlike common approaches, in which either the loop control is removed through loop unrolling [10] or the DG is folded [11], our new approach to get the desired degree of parallelism - at the task level - is to copy the loop body a number of times in such a way that these copies are mutually exclusive. We call this new approach unfolding and we have implemented it in our unfolding transformation.

Figure 3: Simple example illustrating the unfolding and skewing transformations. a) Application program (NLP) and its dependence graph (outer loop for j = 1:1:4, inner loop for i = 1:1:3). b) NLP with the outer loop unfolded by a factor of 2 (copies guarded by if (j mod 2) = 1 and if (j mod 2) = 0). c) NLP with skewed loop: for j = 2:1:4+3, for i = max(1, j-4):1:min(j-1, 3), [y(i), x(j-i)] = F(y(i), x(j-i)).

An example of our unfolding is shown in Figure 3-b), where the outer loop (the j-loop) of the program in Figure 3-a) is unfolded by a factor of 2. The two pieces of code bounded by the if statements in Figure 3-b) are mutually exclusive. This mutual exclusiveness can be exploited by an aggressive parallel compiler to partition the program in Figure 3-b) into two processes (tasks) that can operate in parallel. The graphical interpretation of the unfolding transformation is given by the dependence graph in Figure 3-b). For this simple example the unfolding transformation partitions the computational workload over two parallel processes. The first process will execute the nodes bounded by the dashed boxes. The second process will execute the nodes bounded by the solid boxes. An example of the network connecting these two processes is shown in Figure 7 - see KPN 1. In general, our unfolding transformation is used to partition an NLP into as many processes as the unfolding factor. The process network corresponding to a fully unfolded NLP is equal to the dependence graph of this NLP.

Now, consider the same application program (NLP) shown in Figure 3-a). The transformation of skewing is to create a new NLP in which the bounds of the loops and the indexes of the variables are changed in a particular way to make the potential parallelism in the original NLP explicit.
For example, skewing the program in Figure 3-a) leads to the NLP in Figure 3-c). The effect of our skewing transformation is visualized by the dependence graph (DG) in Figure 3-c). This DG explicitly shows that the nodes inside a dashed box can be executed in parallel because there are no data dependences between these nodes. This property can be exploited by an aggressive parallel compiler, in combination with the unfolding described above, to partition the program into processes (tasks) that run in parallel. An example of a network of such parallel processes corresponding to the NLP in Figure 3-c) is given in Figure 8 - see KPN 4. Moreover, inside these processes some pieces of code can be executed in parallel or in a pipeline fashion because of the skewing transformation. Note that in both cases (unfolding and skewing), the transformations proceed along the NLP code in Figure 3. The dependence graphs are only shown to visualize the effect of the transformations.

    UNFOLD(NLP, U, I)
      if (I is empty set)
        print(NLP);
        return();
      else
        u = first element of the set U;
        i = first element of the set I;
        head = take the code from the beginning of NLP till the "for" statement with loop iterator i, including;
        body = take the body of loop i from NLP;
        print(head);
        for (k = 1; k <= u; k++)
          println("if (" + i + " mod " + u + ") = " + (u-k) + ",");
          U' = the set U without the first element;
          I' = the set I without the first element;
          UNFOLD(body, U', I');
          println("end");
        println("end");
        return();

Figure 4: Pseudo code describing the UNFOLD transformation.

3.2 Unfolding procedure
Let NLP be an N-deep affine nested loop program with an iteration vector $I = (i_1, i_2, \ldots, i_N)$. For each $i_k \in I$, $k = 1, 2, \ldots, N$, a parameter $u_k$ (a positive integer) is associated. All these parameters form a parameter vector $U = (u_1, u_2, \ldots, u_N)$, which we call the unfolding vector. We define a transformation UNFOLD(NLP,U,I), which is described in Figure 4. The pseudo code in Figure 4 describes the unfolding transformation as a recursive procedure. This procedure operates on the affine nested loop program NLP with its iteration vector $I$ and the value of the unfolding vector $U$.
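The recursive procedure can be sketched in Python as a small code generator. This is a simplified reading of the UNFOLD pseudo code, not the MATTRANSFORM implementation: it assumes a perfectly nested loop program given as a list of (iterator, range) pairs plus a body string, and the names `unfold`, `loops`, `body` and `factors` are hypothetical.

```python
def unfold(loops, body, factors, indent=0):
    """Generate the lines of an unfolded Matlab-style loop nest.

    loops   -- list of (iterator, range_text) pairs, outermost loop first
    body    -- text of the innermost loop body
    factors -- unfolding vector U, one positive integer per loop
    Mirrors the recursion of UNFOLD: emit the loop header, then u guarded
    copies of the (recursively unfolded) loop body.
    """
    pad = "  " * indent
    if not loops:                                # empty iteration vector
        return [pad + body]
    (iterator, rng), rest = loops[0], loops[1:]
    u = factors[0]
    lines = [f"{pad}for {iterator} = {rng},"]
    for k in range(1, u + 1):
        # guards are mutually exclusive: residues u-1, u-2, ..., 0
        lines.append(f"{pad}  if ({iterator} mod {u}) = {u - k},")
        lines.extend(unfold(rest, body, factors[1:], indent + 2))
        lines.append(f"{pad}  end")
    lines.append(f"{pad}end")
    return lines
```

Calling `unfold([("i", "1:1:N")], "[x(i)] = F(x(i));", [10])` produces one loop header, ten guarded copies of the body and the closing `end`; joining the returned list with newlines gives the transformed source text.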
In order to explain the behavior of the procedure UNFOLD we consider the following simple example. Let NLP be the program shown in the left part of Figure 5. NLP has only one loop, with an iterator (index) i. Hence, the iteration vector $I$ corresponding to NLP has only one element, $I = (i)$, and the unfolding vector $U$ also has one element, $U = (u)$. In our example the parameter $u$ is equal to 10. Following the procedure UNFOLD, first we check whether $I$ is an empty set. In our example we start with $I = (i)$, which is not an empty set. Then, we initialize four variables (the four assignments at the start of the else branch in Figure 4). As a result we have: the iterator variable takes the character i; the unfolding-factor variable takes the value 10; the loop-header variable takes the string for i = 1:1:N, and the loop-body variable takes the code in the body of the loop with iterator i. This code is marked in Figure 5 as a rectangle. The print statement that follows in Figure 4 prints the loop header to the output. The result is shown in Figure 5 - the first line in the unfolded NLP. Executing the for loop over k and the final println in Figure 4 generates the rest of the code of the unfolded NLP in Figure 5.

Figure 5: Simple example illustrating the UNFOLD() transformation shown in Figure 4. The application program (NLP), a single loop for i = 1:1:N, is passed to UNFOLD(NLP, U, I) with U = {10} and I = {i}; the unfolded NLP keeps the loop header and guards the copies of the loop body with if (i mod 10) = 9, if (i mod 10) = 8, ..., if (i mod 10) = 0.

As a result the unfolded NLP in Figure 5 has ten copies of the loop body, bounded by if statements with a mod statement making them mutually exclusive. The example in Figure 5 shows that the input NLP is transformed into a functionally equivalent NLP, which we call an unfolded NLP. The unfolded NLP can be easily converted into ten tasks that operate in parallel. That is why we say that the unfolded NLP has enhanced task-level parallelism compared with the input NLP.

3.3 Skewing procedure
Let NLP be an N-deep affine nested loop program with an iteration vector $I = (i_1, i_2, \ldots, i_N)^T$. For each $i_k \in I$, $k = 1, 2, \ldots, N$, a parameter vector $m_k = (m_{k1}, m_{k2}, \ldots, m_{kN})$ is associated, where each $m_{kl} \in \mathbb{Z}$, $l = 1, 2, \ldots, N$. All parameter vectors form a parameter matrix

$$M = \begin{pmatrix} m_{11} & \cdots & m_{1N} \\ \vdots & \ddots & \vdots \\ m_{N1} & \cdots & m_{NN} \end{pmatrix},$$

which we call the skewing matrix. We require $M$ to be unimodular. We define a transformation SKEW(NLP,M) as described below:

STEP 1 - Represent the iteration space of NLP as a polytope $P = \{ I \in \mathbb{Z}^N \mid A \cdot I \geq b \}$, where $A$ is an integral matrix and $b$ is an integral vector;
STEP 2 - Use the skewing matrix $M$ to transform $P$ as follows: $A \cdot M^{-1} \cdot M \cdot I \geq b \Rightarrow A' \cdot I' \geq b$, where $A' = A \cdot M^{-1}$ and $I' = M \cdot I$;
STEP 3 - Use the Fourier-Motzkin (FM) procedure [1] to represent the iteration space, described by $A' \cdot I' \geq b$, in terms of nested loops. This is the new iteration space of NLP with iteration vector $I'$;
STEP 4 - Change all indexes of the variables in NLP according to the equation $I' = M \cdot I$.

The four steps described above are illustrated in Figure 6 in the context of a simple example. We start with a 2-deep affine nested loop program and a skewing matrix $M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.
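The effect of the four steps on a 2-deep loop nest can be checked numerically with a short Python sketch. It assumes the original nest is `for j = 1:N, for i = 1:K`, the skewing matrix is M = [[1, 1], [0, 1]] (so j' = j + i and i' = i), and the skewed bounds are the ones that appear in the skewed programs of Figures 3 and 6; the function names are hypothetical.

```python
def original_space(n, k):
    """Iteration points (j, i) of `for j = 1:n, for i = 1:k`."""
    return [(j, i) for j in range(1, n + 1) for i in range(1, k + 1)]


def skewed_space(n, k):
    """Iteration points (j', i') of the skewed nest, with j' = j + i and i' = i.

    Loop bounds: for j' = 2:n+k, for i' = max(1, j'-n):min(j'-1, k).
    """
    points = []
    for jp in range(2, n + k + 1):
        for ip in range(max(1, jp - n), min(jp - 1, k) + 1):
            points.append((jp, ip))
    return points


def to_original(jp, ip):
    """Inverse mapping I = M^-1 * I': j = j' - i', i = i'."""
    return (jp - ip, ip)
```

Mapping every skewed point back with `to_original` recovers exactly the original iteration space, and all points that share the same j' have pairwise different j and i, so under the assumed body `[y(i), x(j)] = F(y(i), x(j))` they touch different array elements and can run in parallel.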
In STEP 1, the ranges of the loop indexes $j$ and $i$ are represented as a system of linear inequalities $A \cdot I \geq b$. Next, we use the skewing matrix $M$ to do the mathematical manipulations described in STEP 2. As a result we have a new iteration space for the input NLP, defined by the loop indexes $j'$ and $i'$ and bounded by the system $A' \cdot I' \geq b$. The Fourier-Motzkin (FM) procedure is used to represent the new iteration space as nested loops, as shown in Figure 6 - STEP 3. After this step all variables inside the loops are still indexed by the old indexes $j$ and $i$. We have to replace them with the new indexes $j'$ and $i'$. In order to do this we know from STEP 2 that $\begin{pmatrix} j' \\ i' \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} j \\ i \end{pmatrix}$, which implies that $\begin{pmatrix} j \\ i \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} j' \\ i' \end{pmatrix}$. So, we have to replace index $j$ with $j' - i'$ and index $i$ with $i'$ in all variables. This is illustrated in Figure 6 - STEP 4.

Figure 6: Simple example illustrating the four steps in the SKEW(NLP,M) procedure. The skewed NLP is: for j = 2:1:N+K, for i = max(1, j-N):1:min(j-1, K), [y(i), x(j-i)] = F(y(i), x(j-i)).

4. COMPILER
In this section, we briefly describe our aggressive parallel compiler COMPAAN, which exploits the result of the transformations presented in Section 3. COMPAAN (Compilation of Matlab to Process Networks) [7] is a method and tool set (MATPARSER, DGPARSER, PANDA) for transforming affine nested loop programs (NLP) [2] written in Matlab into a Kahn Process Network (KPN) specification. COMPAAN starts the transformation by converting a Matlab specification into a single assignment code (SAC) specification. SAC describes all parallelism available in the original Matlab specification. The tool which does the Matlab-to-SAC transformation is MATPARSER [5]. MATPARSER is an array dataflow analysis compiler that finds all parallelism available in NLPs written in Matlab using a very aggressive data-dependency analysis technique. This technique is based on parametric integer linear programming.
Also, MATPARSER can handle non-linear operators like Max, Min, Ceil, Floor, Mod and Div. Therefore, it can handle the result of the skewing and unfolding transformations presented in Section 3. Next, a tool called DGPARSER [2] converts the SAC description into a Polyhedral Reduced Dependence Graph (PRDG) [7] description. The PRDG is a compact graphical representation of the SAC using parameterized polyhedral embeddings of the atomic functions. Finally, the PANDA tool [7] uses the PRDG description in order to generate the Kahn Process Network description and the individual processes.

Figure 7: An example of generating two possible Kahn Process Networks from a single application using the unfolding transformation and the COMPAAN tool. Unfold(U) with U = [u1, u2] = [2,1] leads to KPN 1; Unfold(U) with U = [u1, u2] = [2,2] leads to KPN 2.

Figure 8: An example of generating two possible Kahn Process Networks from a single application using the skewing and unfolding transformations and the COMPAAN tool. Skew(M) with M = [m11 m12; m21 m22] = [1 1; 0 1] leads to KPN 3; Skew(M) followed by Unfold(U) with U = [u1, u2] = [2,1] leads to KPN 4.

5. EXAMPLES
In this section, we demonstrate the use of our algorithmic transformations in combination with the COMPAAN tool set. We show how, merely by changing the values of the parameters, a set of Kahn Process Networks (KPN) can be easily generated from a single application. Consider the application shown in the top-left corner of Figure 7. It is a 2-deep affine nested loop program written in Matlab. In Figure 7, we first apply the unfolding transformation to our application and then use COMPAAN to convert the transformed code into a KPN description. We assign two different values to the parameter vector $U$, namely $U = [2, 1]$ and $U = [2, 2]$. As a result we obtain two different KPNs. They have different numbers of processes and different communication structures (see Figure 7 - KPN 1 and KPN 2). In Figure 8, we show another example in which we use the same application as in Figure 7.
We obtain KPN 3, which has only one process, by applying the skewing transformation with a parameter matrix $M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. Also, we show that the skewing transformation and the unfolding transformation can be applied in combination. KPN 4 in Figure 8 is derived by applying first the skewing transformation with $M = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and then the unfolding transformation with $U = [2, 1]$.

6. EXPERIMENTS AND RESULTS
In this section, we present some of the experiments we have done in order to evaluate and show the usefulness of the algorithmic transformation techniques presented in this paper. We built a Y-chart environment extended with the Application Transformation Layer as shown in Figure 2. As an input application for the transformation layer we used the QR-decomposition algorithm [2] because it is a common computationally intensive task in many signal processing applications like Digital Beamforming, Adaptive Digital Filtering, etc. The algorithm was written in Matlab. The application transformation layer applies the Unfolding and Skewing transformations to the QR algorithm and generates alternative application instances - Process Networks - as synthesizable VHDL. We mapped these instances onto a Xilinx XCV1000E FPGA device, which was the architecture template for our experiments. The mapping was done by the synthesizer and place-and-route tools provided by Xilinx. The performance analysis was done using the timing analysis and simulation tools from the Xilinx Foundation package. Figure 9 shows the estimated total execution time for three application instances of the QR-decomposition algorithm. These instances were derived automatically by applying the transformation techniques presented in Section 3. The results show that the effect of applying our transformations is that we can generate alternative application instances with different performance when mapping them onto an architecture template (in our case an FPGA). It can be seen from Figure 9 that the unfolding and skewing transformations improve the performance significantly.

Figure 9: Execution time of the QR algorithm transformed by using the unfolding and skewing transformations. The unfolding factor is 3 and the size of the input data matrix is 0 by 6. (Bar chart of time in microseconds for three instances: Skewing + Unfolding, Unfolding, No transform.)

Figure 10 shows the results obtained from the exploration of the performance of ten application instances of the QR algorithm derived by applying only the unfolding transformation with unfolding factors from 1 to 10. Again, the results show that the performance can be significantly improved. In this experiment we also measured how much time it takes to obtain the results presented in Figure 10. The time taken for these ten experiments to be processed automatically from Matlab to a hardware mapping onto an FPGA and VHDL simulation was within 8 hours.

Figure 10: Exploration of the performance of the QR algorithm unfolded by factors from 1 to 10. The size of the input data matrix is 48 by 6. (Bar chart of number of cycles versus unfolding factor.)

Table 1 shows the processing times for some of the experiments in more detail. The second row, Transform+Compile, shows the processing times for our tools MATTRANSFORM and COMPAAN (step 1 and step 2 in Figure 2). The row Mapping+Simulation gives the time needed to express the Process Networks in terms of synthesizable VHDL code, to map this VHDL code onto an FPGA and finally to obtain performance numbers from VHDL simulation (step 3 in Figure 2).

Table 1: Processing Times (hh:mm:ss).

                      Unfold 2    Unfold 5    Unfold 10
Transform+Compile     00:00:08    00:00:18    00:00:29
Mapping+Simulation    00:22:54    01:24:44    04:47:30
Total                 00:23:02    01:25:02    04:47:59

The last row of Table 1 suggests that an extensive design space exploration of alternative application instances can be done in a relatively short amount of time. Moreover, the accuracy of the results obtained during the exploration is within 5%, because we did very detailed, cycle-accurate VHDL simulation. The results given in the second row of Table 1 show that the application transformation layer presented in Section 2 generates alternative application instances from a given application very fast. The time to do this is only a few seconds, whereas the time to map the instances onto an FPGA and simulate them varies from minutes to hours - see row 3 of Table 1. However, there is a potential to improve the mapping and simulation time (row 3 of Table 1) by using system-level design space exploration tools like SPADE [9] and ORAS [6].
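The claim that transformation and compilation take a negligible part of the processing time can be checked with a few lines of Python. The values below are read from Table 1; the helper names `to_seconds` and `share` are hypothetical and the sketch only does arithmetic on those values.

```python
def to_seconds(hms):
    """Convert an hh:mm:ss string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# (Transform+Compile, Mapping+Simulation, Total) as read from Table 1
TABLE_1 = {
    "Unfold 2": ("00:00:08", "00:22:54", "00:23:02"),
    "Unfold 5": ("00:00:18", "01:24:44", "01:25:02"),
    "Unfold 10": ("00:00:29", "04:47:30", "04:47:59"),
}


def share(name):
    """Fraction of the total time spent in Transform+Compile."""
    transform, _, total = TABLE_1[name]
    return to_seconds(transform) / to_seconds(total)
```

For all three columns the two stage times add up to the listed total, and `share` stays below one percent, which is consistent with the observation that mapping and simulation dominate the exploration time.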
Preliminary results indicate that the mapping and simulation time can be reduced to a few minutes instead of several hours while still obtaining performance numbers with reasonable accuracy.

7. RELATED WORK

The Unfolding and Skewing transformations presented in this paper are related to the unfolding and retiming transformation techniques used in the Signal-Processing community [11]. They are also related to the loop unrolling and loop skewing techniques used in compiler design [10]. However, there are some important differences. First, we use our transformations to generate a set of Kahn Process Networks corresponding to an application (a nested loop program), thereby generating alternative application instances. Using the Unfolding transformation to generate Process Networks, we do reverse partitioning compared to [13]: we start by putting all computational workload in one process and, by unfolding, we partition the workload over more processes. Second, we developed procedures to do these transformations on the algorithmic (source code) level, whereas in [11] similar transformations are applied to signal-flow graphs, data-flow graphs or dependence graphs corresponding to an algorithm. Third, our transformations aim at exposing and exploiting the task-level parallelism available in an application, whereas the transformations in [10] aim at exploiting the fine-grain instruction-level parallelism.

8. CONCLUSIONS

In this paper, we presented algorithmic transformation techniques for deriving a set of application instances (Kahn Process Networks) corresponding to an application. These techniques support a system designer in exploring alternative instances of an application mapped onto an architecture template. We have implemented our techniques in the tools MATTRANSFORM and COMPAAN, which means that the process of deriving alternative instances is fully automated for applications described as affine nested loop programs. Therefore, the presented techniques help a system designer to significantly speed up the process of exploring alternative application instances in system-level design.
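The reverse partitioning idea from Section 7 (start with all workload in one process, then spread it over more processes by unfolding) can be pictured with a small sketch. This is not the procedure implemented in MATTRANSFORM: the round-robin assignment of iterations by index and the function name `unfold` are assumptions made purely for illustration.

```python
from collections import defaultdict


def unfold(iterations, factor):
    """Distribute loop iterations over `factor` hypothetical processes.

    Assumption: iteration i is assigned to process i % factor (round-robin).
    With factor == 1 the whole workload stays in a single process.
    """
    if factor < 1:
        raise ValueError("unfolding factor must be at least 1")
    processes = defaultdict(list)
    for i in iterations:
        processes[i % factor].append(i)
    return dict(processes)


# One process holding all six iterations, then the same loop unfolded by 3.
print(unfold(range(6), 1))
print(unfold(range(6), 3))
```

Each key in the returned dictionary stands for one process of an alternative application instance, so increasing the factor yields instances with more, smaller processes.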
Our experiments and results show that an extensive design space exploration of alternative application instances can be done in a relatively short amount of time with accuracy of the results within 5%.

9. REFERENCES

[1] C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. In Proc. ACM SIGPLAN '91, pages 39-50, June 1991.
[2] P. Held. Functional Design of Data-Flow Networks, 1996. PhD thesis, Delft University of Technology, The Netherlands.
[3] G. Kahn. The semantics of a simple language for parallel programming. In Proc. of the IFIP Congress 74. North-Holland Publishing Co., 1974.
[4] B. Kienhuis. Design Space Exploration of Stream-based Dataflow Architectures: Methods and Tools, Jan. 1999. PhD thesis, Delft University of Technology, The Netherlands.
[5] B. Kienhuis. MatParser: An array dataflow analysis compiler. Technical report, University of California at Berkeley, 2000. UCB/ERL M00/9.
[6] B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf. The Construction of a Retargetable Simulator for an Architecture Template. In Proc. 6-th Int. Workshop on Hardware/Software Codesign (CODES'98), Seattle, Washington, Mar. 15-18 1998.
[7] B. Kienhuis, E. Rijpkema, and E. F. Deprettere. Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures. In Proc. 8th International Workshop on Hardware/Software Codesign (CODES'2000), San Diego, CA, USA, May 3-5 2000.
[8] P. Lieverse, T. Stefanov, P. van der Wolf, and E. Deprettere. System Level Design with SPADE: an M-JPEG Case Study. In Proc. Int. Conference on Computer Aided Design (ICCAD'01), pages 31-38, San Jose CA, USA, Nov. 4-8 2001.
[9] P. Lieverse, P. van der Wolf, K. Vissers, and E. Deprettere. A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems. Int. Journal of VLSI Signal Processing for Signal, Image and Video Technology, 29(3):197-207, 2001.
[10] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, Inc., 1997.
[11] K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons, Inc., 1999.
[12] J. Proakis, C. Rader, F. Ling, C. Nikias, M. Moonen, and I. Proudler. Algorithms for Statistical Signal Processing. Prentice Hall, Inc., 2002.
[13] J. Teich and L. Thiele. Exact Partitioning of Affine Dependence Algorithms. Lecture Notes in Computer Science (LNCS), Springer, 2268:33 5, 2002.