Incorporating Speculative Execution into Scheduling of Control-flow Intensive Behavioral Descriptions

Ganesh Lakshminarayana, Anand Raghunathan, and Niraj K. Jha
Dept. of Electrical Engineering, Princeton University, Princeton, NJ 08544
C&C Research Laboratories, NEC USA, Princeton, NJ 08540

Abstract

Speculative execution refers to the execution of parts of a computation before the execution of the conditional operations that decide whether it needs to be executed. It has been shown to be a promising technique for eliminating performance bottlenecks imposed by control flow in hardware and software implementations alike. In this paper, we present techniques to incorporate speculative execution in a fine-grained manner into scheduling of control-flow intensive behavioral descriptions. We demonstrate that failing to take into account information such as resource constraints and branch probabilities can lead to significantly sub-optimal performance. We also demonstrate that it may be necessary to speculate simultaneously along multiple paths, subject to resource constraints, in order to minimize the delay overheads incurred when prediction errors occur. Experimental results on several benchmarks show that our speculative scheduling algorithm can result in significant (up to seven-fold) improvements in performance (measured in terms of the average number of clock cycles) as compared to scheduling without speculative execution. Also, the best- and worst-case execution times for the speculatively performed schedules are the same as or better than the corresponding values for the schedules obtained without speculative execution.

1 Introduction

Speculative execution refers to the execution of a part of a computation before it is known if the control path to which it belongs will be executed (for example, execution of the code after a branch statement before the branch condition itself is evaluated). It has been used to overcome, to some extent, the scheduling bottlenecks imposed by control flow. There has been previous work on speculative execution in the areas of high-level synthesis [1, 2, 3] as well as high-performance compilation [4, 5].
Previous work [1, 2, 3] in high-level synthesis has attempted to locate single or multiple paths for speculation prior to scheduling. This paper presents techniques to integrate speculative execution into scheduling during high-level synthesis of control-flow intensive designs. In that context, we demonstrate that not using information such as resource constraints and branch probabilities while deciding when to speculate can lead to significantly sub-optimal performance. We also demonstrate that it is necessary to perform speculative execution along multiple paths at a fine-grain level during the course of scheduling, in order to obtain maximal benefits. In addition, we present techniques to automatically manage the additional speculative results that are generated by speculatively executed operations. We show how to incorporate speculative execution into a generic scheduling methodology, and in particular present the results of its integration into an efficient scheduler, Wavesched [6]. Experimental results for various benchmarks and examples are presented that indicate up to seven-fold improvement in performance (average number of clock cycles required to perform the computation).

This work was supported in part by NSF under Grant No. 9319269 and in part by Alternative System Concepts, Inc. under an SBIR contract from Air Force Rome Laboratories.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 98, San Francisco, California. (c) 1998 ACM 0-89791-920-3/97/06..$3.50

2 Background and Motivation

Scheduling tools typically work using one or more intermediate representations of the behavioral description, such as a data flow graph (DFG), control flow graph (CFG), or control-data flow graph (CDFG).
In this paper, we use the CDFG as the intermediate representation of a behavioral description, and state transition graphs (STGs) to represent the scheduled behavioral description, as explained in later sections. In addition to the behavioral description, our scheduler also accepts the following information:

- A constraint on the number of resources of each type available (resource allocation constraints).
- The target clock period for the implementation, or constraints that limit the extent of data and control chaining allowed.
- Profiling information that indicates the branch probabilities for the various conditional constructs present in the behavioral description.

We now present some motivational examples to illustrate the use of speculative execution during scheduling.

Example 1: Consider a part of a behavioral description and the corresponding CDFG fragment shown in Figure 1, which contains a while loop. The CDFG contains vertices corresponding to operations of the behavioral description, where solid lines indicate data dependencies and dotted lines indicate control dependencies. Control edges in the CDFG are annotated with a variable that represents the result of the conditional operation that generates them. For example, the control edges fed by operation >1 are marked c in Figure 1. The initial values of variables i and t4 used in the loop body are indicated in parentheses beside the corresponding CDFG data edges. Let us now consider the task of scheduling the CDFG shown in Figure 1. Suppose we have the following constraints to be used during scheduling.

- The target clock period allows the execution of +, ++, >, and memory access operations in one clock cycle, while the * operation requires two clock cycles. In addition, we assume that the * operation will be implemented using a 2-stage pipelined multiplier. No operation chaining is allowed, since it leads to a violation of the target clock period constraint (in general, however, our algorithm can handle chaining).
- The aim is to optimize the performance of the design as much as possible. Hence, no resource constraints are specified for the purpose of illustration for this example. This is not a limitation of our scheduling algorithm, which does handle resource constraints as described in later sections.

[Figure 1: A CDFG to illustrate speculative execution. The behavioral description shown alongside the CDFG is:]

    ...
    i := 0; t4 := 0;
    while (k > t4) {
        i := i + 1;
        t1 := M1[i];
        t2 := t1 * C1;
        t3 := t2 * C2;
        t4 := t3 + C3;
        M2[i] := t4;
    }
    ...

[Figure 2: (a) Non-speculative schedule for the CDFG of Figure 1, and (b) schedule incorporating speculative execution]

A schedule for the CDFG that does not incorporate speculative execution is shown in Figure 2(a). This schedule can be obtained by applying either the loop-directed scheduling [7] technique or the Wavesched [6] technique to the CDFG. Vertices in the STG represent schedule states, which directly correspond to states in the controller of the RTL implementation. Each state is annotated with the names of the CDFG operations that are performed in that state, including a suffix that represents a symbolic iteration index of the CDFG loop that the operation belongs to. For example, consider operation >1 of the CDFG.
When >1 is encountered the first time during scheduling, it is assigned a subscript 0, resulting in operation >1_0 in the STG of Figure 2(a). In general, multiple copies of an operation may be generated during scheduling, corresponding to different conditional paths, or different iterations of a loop. For example, operation >1_1 in the STG of Figure 2(a) corresponds to the execution of the first unrolled instance of CDFG operation >1. An edge in the STG represents a controller state transition, and is annotated with the conditions that activate the transition.

Each iteration of the loop in the scheduled CDFG requires eight clock cycles. For this example, the data dependencies among the operations within the loop require them to be performed serially. In addition, the control dependencies between the comparison operation >1 and operations ++1 and M1, together with the inter-iteration data dependency (see footnote 1) from +1 to >1, prevent the parallel computation of multiple loop iterations, even when loop unrolling is employed.

A schedule for the CDFG of Figure 1 that incorporates speculative execution is shown in the STG of Figure 2(b). This schedule was derived by techniques we present in later sections. Speculatively executed operations are annotated with the conditional operations whose results they depend upon, using the following notation: op/cond represents an operation op that is executed assuming that the speculation condition cond will evaluate to true. The speculation condition cond could, in general, be an expression that is a conjunction of the results of various conditional operations in the STG. For example, consider operation ++1_1/c_1 in state S1 of Figure 2(b). This is a speculatively executed operation that corresponds to the second instance of CDFG operation ++1 in the schedule, and assumes that the result of conditional operation >1_1, which is executed only in state S7, is going to be true. States S7 and S8 represent the steady state of the schedule. Note that, when in the steady state, a new iteration is initiated every cycle, as opposed to one in eight cycles.
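The difference between the two schedules of Figure 2 can be made concrete with a small, idealized calculation. The sketch below is not from the paper: it assumes the loop runs for a known number of iterations, ignores the cost of the misprediction at loop exit, and models the ramp-up to the steady state with a hypothetical `fill_cycles` parameter whose value here is only illustrative.

```python
def nonspeculative_cycles(iterations, cycles_per_iteration=8):
    """Cycles for the schedule of Figure 2(a): iterations run back to back."""
    return iterations * cycles_per_iteration


def speculative_cycles(iterations, fill_cycles):
    """Idealized cycles for Figure 2(b): after `fill_cycles` (an assumed
    ramp-up cost), one new iteration is initiated every cycle."""
    return fill_cycles + iterations


if __name__ == "__main__":
    n = 100
    base = nonspeculative_cycles(n)
    spec = speculative_cycles(n, fill_cycles=7)  # 7 is an illustrative value
    print(base, spec, round(base / spec, 2))
```

Under this model the ratio approaches eight as the iteration count grows, which is the steady-state gain described above; a real schedule would fall short of that bound because of ramp-up and misprediction costs.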
The following example illustrates the impact of branch probabilities and resource constraints on the performance of speculatively derived schedules, and makes a case for the integration of speculation into the scheduling process.

Example 2: Consider the example CDFG shown in Figure 3. The select operation Sel1 selects the data operand at its l (r) port if the value at its select port is 1 (0). Figure 4 shows three different schedules that use speculative execution, generated using different resource constraints and branch probabilities.

[Figure 3: CDFG demonstrating the effect of resource constraints and branch probabilities on speculative execution]

[Figure 4: Three speculative schedules, (a), (b), and (c), derived using different resource constraints or branch probabilities]

The STG of Figure 4(a) was generated assuming the following information. Available resources consist of one incrementer (++), one adder (+), one comparator (>), one shifter (>>), and one multiplier (*), all of which require one cycle. Also, the probability of comparison >1 evaluating to false is higher than the probability of it evaluating to true. Since the result of >1 evaluates to false more often, the schedule of Figure 4(a) gives preference to executing operations from the corresponding control path (e.g., +2). As a result, +2 rather than +1 is given the sole adder first, even though the data operands for both operations are available. The average number of clock cycles, CC_a, required for the STG in Figure 4(a) can be calculated as follows.

$CC_a = 4 P(c) + 2 (1 - P(c)) = 2 P(c) + 2$    (1)

In the above equation, P(c) represents the probability that the result of comparison >1 evaluates to true.

The STG of Figure 4(b) was derived with the same information as above, except that it was assumed that comparison >1 evaluates to true more often than it evaluates to false. Hence, operation +1 is given preference over operation +2 and is scheduled ahead of it. The average number of clock cycles, CC_b, required for the STG in Figure 4(b) is given by the following expression.

$CC_b = 3 P(c) + 3 (1 - P(c)) = 3$

Suppose the resource constraints were relaxed to allow two adders. The speculative schedule that results is shown in Figure 4(c). The average number of clock cycles, CC_c, required for the STG in Figure 4(c) is given by the following expression.

$CC_c = 3 P(c) + 2 (1 - P(c)) = P(c) + 2$    (2)

The values of CC_a, CC_b, and CC_c for various values of P(c) ranging from 0 to 1 are plotted in Figure 5. As expected, the schedule of Figure 4(a) outperforms the schedule of Figure 4(b) when P(c) < 0.5, and the schedule of Figure 4(b) performs better when P(c) > 0.5. Moreover, the schedule of Figure 4(c), which was derived using one extra adder, is at least as good as the other two schedules for all values of P(c). Thus, we can conclude that branch probabilities and resource constraints do influence the trade-offs involved in deciding which conditional paths to speculate upon, making the case for the integration of speculative execution into the scheduling step, where such information is available.

[Figure 5: Comparison of the speculative schedules. Expected number of cycles (CC_a, CC_b, CC_c) plotted against probability P(c)]

Footnote 1: An intra-iteration data or control dependency is between operations that correspond to the same iteration of a loop, while an inter-iteration dependency is between operations in different (e.g., consecutive) iterations. We refer to intra-iteration data and control dependencies simply as data and control dependencies.

The following example illustrates that it is necessary to perform speculative execution along multiple paths, in a fine-grained manner, in order to obtain maximal performance improvement.

Example 3: The schedules shown in Figure 4 were all generated by speculatively executing operations from both the conditional paths of the CDFG in a fine-grained manner, as allowed by the resource constraints. For the purpose of comparison, we scheduled the CDFG shown in Figure 3, assuming the same scheduling information that was assumed to derive the schedule of Figure 4(b). However, in this case, we restricted the scheduler to allow speculative execution along only one path. The resulting schedule is shown in Figure 6.

[Figure 6: Speculation along a single path]

The average number of clock cycles, CC_d, required for the STG in Figure 6 is given by the following expression.

$CC_d = 3 P(c) + 4 (1 - P(c)) = 4 - P(c)$    (3)

Comparing the expression for CC_d to the expression for CC_b from the previous example indicates that $CC_d \geq CC_b$ for all feasible values of P(c). Thus, in this example, simultaneously speculating
along multiple paths according to resource availability results in a schedule that is provably better than one derived by speculating along only the most probable path. Our scheduling algorithm automatically decides the best paths to speculate upon for the given resource constraints and branch probabilities.

3 The Algorithm

In this section, we present the changes that need to be made to a generic scheduling algorithm to support speculative execution.

3.1 A generic scheduling algorithm

Figure 7 shows the pseudocode for a generic scheduling algorithm. The inputs to the scheduler are a CDFG, G, to be scheduled, the target clock period of the design, allocation constraints, which specify the number and types of functional units available, and module selection information, which gives the type of functional unit an operation is mapped to. The output of the scheduler is an STG which describes the schedule.

    Generic_scheduler(CDFG G, ALLOCATION_CONSTRAINTS K,
                      MODULE_SELECTION_INFO M_inf, CLOCK_PERIOD clk) {
        SET<OPERATION> Unscheduled_operations;
        SET<OPERATION> Schedulable_operations;
    1   while (|Unscheduled_operations| > 0) {
    2       op = Select_schedulable_operation(Schedulable_operations, K, M_inf, clk);
                // Select an operation for scheduling. The selected
                // operation must honor allocation and clock cycle constraints
    3       Schedule(op);
    4       Unscheduled_operations.remove_operation(op);
    5       Schedulable_operations.remove_operation(op);
    6       SET<OPERATION> schedulable_successors = Compute_schedulable_successors(op);
                // Find the set of operations in op's fanout which
                // become schedulable when op is scheduled
    7       Schedulable_operations.append(schedulable_successors);
                // Augment Schedulable_operations by addition of
                // operations in schedulable_successors
        }
    }

Figure 7: Pseudocode for a generic scheduling algorithm

At any point, a generic scheduler maintains (a) the set of unscheduled operations whose data and control dependencies have been satisfied, and can therefore be scheduled (Schedulable_operations), and (b) the set of operations which are unscheduled (Unscheduled_operations).
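The loop of Figure 7 can be sketched in runnable form. The Python below is a minimal illustration rather than the authors' implementation: it assumes single-cycle operations, no chaining, no conditionals, and a selection policy that simply takes schedulable operations in list order. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def generic_schedule(ops, fanin, unit_of, capacity):
    """Return a list of states; each state lists the ops executed in that cycle.

    ops: operation names; fanin: op -> list of predecessor ops;
    unit_of: op -> functional-unit type; capacity: unit type -> unit count.
    """
    fanout = defaultdict(list)
    for op in ops:
        for pred in fanin.get(op, []):
            fanout[pred].append(op)

    unscheduled = set(ops)
    finished = set()  # ops whose results are available to later states
    schedulable = [op for op in ops if not fanin.get(op)]
    states = []

    while unscheduled:                        # statement 1
        used = defaultdict(int)
        state = []
        for op in list(schedulable):          # statement 2: pick in list order
            unit = unit_of[op]
            if used[unit] < capacity.get(unit, 0):
                used[unit] += 1
                state.append(op)              # statement 3
                unscheduled.discard(op)       # statement 4
                schedulable.remove(op)        # statement 5
        if not state:
            raise ValueError("no operation can be scheduled; check inputs")
        finished.update(state)
        for op in state:                      # statements 6 and 7
            for succ in fanout[op]:
                ready = all(p in finished for p in fanin[succ])
                if succ in unscheduled and succ not in schedulable and ready:
                    schedulable.append(succ)
        states.append(state)
    return states


if __name__ == "__main__":
    ops = ["add1", "add2", "mul1"]
    fanin = {"mul1": ["add1", "add2"]}
    unit_of = {"add1": "adder", "add2": "adder", "mul1": "mult"}
    print(generic_schedule(ops, fanin, unit_of, {"adder": 1, "mult": 1}))
    print(generic_schedule(ops, fanin, unit_of, {"adder": 2, "mult": 1}))
```

With one adder the two additions are serialized into separate states; with two adders they share a state, which is the kind of allocation-constraint effect the selection step must honor.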
The scheduling process proceeds as follows: an operation from Schedulable_operations is selected for scheduling in a given state (statement 2). The selection should honor allocation and clock cycle constraints. The manner in which the selection is done varies from one scheduling algorithm to another. The selected operation, op, is scheduled in the state. Since op no longer belongs to either Schedulable_operations or Unscheduled_operations, it is removed from these sets (statements 4 and 5). Also, the scheduling of op might render some of the operations in its fanout schedulable. The routine Compute_schedulable_successors (statement 6) identifies such operations, and these operations are subsequently included in the set Schedulable_operations (statement 7).

3.2 Incorporating speculative execution into a generic scheduler: An overview

We now provide an overview of the changes that need to be made to incorporate speculative execution into the framework of the generic scheduler shown in Figure 7. To support speculative execution, the generic scheduler needs to be modified as follows (the details of these steps are provided in Section 3.3).

1. When an operation is scheduled, one needs to recognize all its schedulable successors, including the ones which can be speculatively scheduled. In addition, speculatively executed operations and their successors need to be specially marked. Clearly, procedure Compute_schedulable_successors needs to be augmented to consider such cases. Note that, at any stage, every speculatively schedulable operation is added to the list of schedulable operations. However, few of them are actually scheduled. Operations which are not worth being speculated on are ignored, and eventually removed from the list of schedulable operations, using procedures described later in this section.

[Figure 8: A CDFG fragment illustrating speculative execution]

Example 4: Consider the CDFG fragment shown in Figure 8. We assume that operation op0 is scheduled, operation op2 has just been scheduled, and operations op1, op3, Sel1, and op4 are unscheduled.
The output of the routine Compute_schedulable_successors(op2) must include operation op4, which can now be speculatively executed, i.e., its operands can be assumed to be the results of operations op2 and op0.

2. When operations are scheduled, control and data dependencies of speculatively executed operations are resolved. This would potentially validate or invalidate speculatively performed operations. Operations which are validated should be considered normal, i.e., they need not be specially marked any longer. Operations in Unscheduled_operations and Schedulable_operations which are invalidated need no longer be considered for scheduling. They can, therefore, be removed from these sets. In general, the resolution of the control or data dependencies of a speculatively performed operation creates two separate threads of execution, which correspond to the success and failure of the speculation.

Example 5: Consider again the CDFG fragment shown in Figure 8. Suppose operations op0, op2, and op4 have been scheduled, and operation op3 is unscheduled. Operation op4 uses as its operands the results of operations op2 and op0. Assume that operation op1 has just been scheduled. If op1 evaluates to true, then the execution of op4 can be considered fruitful, because the operands chosen for its computation are correct. Therefore, op4 and its scheduled and schedulable successors need not be considered conditional on the result of op1 anymore, and the data structures can be modified to reflect this fact. If, however, op1 evaluates to false, then op4 should use as its operands the results of operations op3 and op0, thus invalidating the result of our speculation. Therefore, schedulable operations whose computations are influenced by the result computed by op4 are invalid, and can be removed from the set Schedulable_operations.

3. The set Schedulable_operations, from which an operation is selected for scheduling, contains operations whose execution is speculative, i.e., whose results are not always useful. The selection procedure, represented by the routine Select_schedulable_operation() (statement 2), needs to be modified to account for this fact.
For example, operations whose execution is extremely improbable would make poor selection candidates, as the resources consumed by them might be better utilized by operations whose execution is more probable. Also, operations which fall on critical paths would be better candidates for selection than those on off-critical paths.

3.3 Incorporating speculative execution into a generic scheduler: A closer look

In this section, we fill in the details of the changes outlined in Section 3.2. This is preceded by a formal treatment of concepts related to speculative execution. A scheduler which supports speculative execution works with conditioned operations as its atomic schedulable units, just as a normal scheduler uses operations. Therefore, the fanin-fanout relationships between operations, captured by the CDFG, need to be defined for conditioned operations. Since all speculatively performed operations are conditioned on some event, the adjective "speculatively performed", when applied to an operation, implies that it is conditioned on some event or combination of events. As mentioned in Section 3.2, when an operation is scheduled, its schedulable successors need to be computed.

[Figure 9: Illustrating the scheduling of successors of speculatively performed operations]

Example 6: Consider the CDFG fragment shown in Figure 9. Assume that operations op5 and op6 have been scheduled, operations op1, op3, and op4 are unscheduled, and op2 has just been scheduled. It is now possible to schedule two versions of operation op7, with the first version, op7', using op2 and op5 as its operands, and the second, op7'', using op2 and op6. op7' is conditioned on $c(op1) \wedge c(op4)$, and op7'' is conditioned on $c(op1) \wedge \overline{c(op4)}$. The following analysis presents a structured means of identifying such relationships.

We now present a result which helps derive fanin-fanout relationships among speculatively performed operations.

Lemma 1: Consider an operation, op, whose fanins are op1, op2, ..., opn. If the fanins of op have been speculatively scheduled, so can op. In particular, if the ith fanin, opi, is conditioned on C_i, then op would be conditioned on $\bigwedge_{i=1}^{n} C_i$.

We now present details of Steps 1, 2, and 3, outlined in Section 3.2.
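Lemma 1 and Example 6 can be illustrated with a small sketch. The representation below is an assumption made for illustration, not the paper's data structure: a condition is a set of (conditional operation, assumed outcome) literals, and a conjunction is the union of those literals, rejected when two literals disagree.

```python
def conjoin(*conditions):
    """Return the conjunction of the given conditions, or None if they conflict.

    Each condition is a frozenset of (conditional_op, outcome) literals;
    the empty frozenset means "unconditional".
    """
    merged = {}
    for condition in conditions:
        for cond_op, outcome in condition:
            if merged.setdefault(cond_op, outcome) != outcome:
                return None  # e.g. c(op4) and not c(op4) can never both hold
    return frozenset(merged.items())


# Conditions under which each operand reaches op7 in Example 6.
via_op2 = frozenset({("op1", True)})   # op2 is selected by Sel1 when c(op1) holds
via_op5 = frozenset({("op4", True)})   # op5 is selected by Sel2 when c(op4) holds
via_op6 = frozenset({("op4", False)})  # op6 is selected by Sel2 otherwise

op7_first = conjoin(via_op2, via_op5)    # c(op1) and c(op4)
op7_second = conjoin(via_op2, via_op6)   # c(op1) and not c(op4)
print(sorted(op7_first), sorted(op7_second))
print(conjoin(via_op5, via_op6))  # None: the two versions are mutually exclusive
```

Keeping conditions as sets of literals is only one possible encoding; it suffices here because the speculation conditions in the text are conjunctions of conditional results.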
Step 1: This step addresses the issue of deriving all schedulable successors of a scheduled operation, op0. The result of Lemma 1 is used for this procedure.

Observation 1: Every set, S = {op0, op1, ..., opi}, of scheduled operations which satisfies the following condition sources a schedulable operation.

Condition: There exists an operation, fanout, in the CDFG, all of whose fanins are reachable from the outputs of the operations in S through paths which consist exclusively (if at all) of select operations.

The path connecting the output of an operation opj in S to an input of fanout is denoted by Pj, and the operations on Pj are Selj_1, Selj_2, ..., Selj_aj. Note that aj can equal 0. C_j represents the condition that path Pj is selected, i.e., the result of operation opj is propagated through path Pj to the appropriate input of fanout. Operation fanout is conditioned on $\bigwedge_{k=0}^{i} (C(op_k) \wedge C_k)$, where C(opk) represents the expression opk is conditioned on.

Observation 1 can be used to infer the schedulable successors of an operation. The procedure Compute_schedulable_successors, which is called in statement 6 of the pseudocode shown in Figure 7, is appropriately augmented.

So far, we have described the technique used to identify all schedulable successors of an operation. This was accomplished by tagging operations with the conditions under which their results would be valid. Note that our procedure allows us to speculate on all possible outcomes of a branch, and arbitrarily deeply into nested branches. If integrated with a scheduler which supports loop unrolling, the speculation could also cross loop boundaries. We now present the technique used to validate or invalidate speculatively performed operations whose dependencies have just been resolved.

Step 2: Suppose operation op, which resolves a condition c, has just been scheduled. The resolution of c results in the creation of two different threads of execution, where (i) c = true, and (ii) c = false. The following procedure is carried out for every operation, op, which belongs either to the set Schedulable_operations or to the set of scheduled operations. Let op be conditioned on $C = \bigwedge_{j=1}^{i} c_j$.
In the true (false) branch, C is evaluated assuming a value of 1 (0) for c, and the resultant expression is the new expression that op is conditioned on.

Step 3: We now describe the procedure employed by the scheduler to select an operation to schedule from a pool of schedulable operations, Schedulable_operations. Schedulable_operations can contain operations which are conditioned on different sets of events, i.e., we can choose different paths to speculate upon. We need to decide the best candidate to map to a given resource, where, by best, we mean the operation whose mapping on the given resource would minimize the expected number of cycles for the schedule. Formally, the problem can be stated as follows: given (i) a partial schedule, (ii) a functional unit, fu, (iii) a set of operations, S (some of which may be speculative), which can execute on the functional unit, and (iv) typical input traces, select the operation which, if mapped to fu, would minimize the expected number of cycles.

The above problem has been proven to be NP-complete, even for conditional- and loop-free behavioral descriptions [8]. We, therefore, use the following heuristic, whose guiding principle has been successfully employed by several scheduling algorithms [9]. The heuristic is based on the following premise: operations in the CDFG which feed primary outputs through long paths are more critical than operations which feed primary outputs through short paths and, therefore, need to be scheduled earlier. The rationale behind this heuristic is that operations which belong to short paths are more mobile than those on long paths, i.e., the total schedule length is less sensitive to variations in their schedules.

The length of a path is measured as the sum of the delays of its constituent operations. In data-dominated descriptions, with no loops and conditional operations, the longest path between any pair of operations is fixed. In control-flow intensive descriptions, some paths could be input-dependent. Therefore, the longest path between a pair of operations must be defined with respect to a given input.
In the CDFG shown in Figure 3, for example, the longest path connecting primary inputs with output out depends upon the value taken by operation >1. Since our scheduling algorithm is geared toward minimizing the average execution time, we use the expected length of the longest path from an operation to a primary output as a metric to rank different operations. We use the notation λ(op) to denote this quantity for operation op.

Speculation adds a new dimension to this problem: the result computed by an operation is not guaranteed to be useful. For an

operation, op, we account for this effect by multiplying the probability that an operation's output is utilized with λ(op) to derive a metric of an operation's criticality. This is expressed by means of the following equation:

$$\mathrm{criticality}(op) = \lambda(op) \prod_{j=1}^{i} P(c_j) \qquad (4)$$

where criticality(op) measures the desirability of scheduling op, $\prod_{j=1}^{i} P(c_j)$ is the product of the probabilities of the events that op is conditioned on, and λ(op) is as defined above.

4 Experimental Results

The techniques described in this paper were implemented in a program called Wavesched-spec, written in C++. We evaluated this program by using it to produce schedules for several commonly available benchmarks. These schedules were compared against those produced by the scheduling algorithm, Wavesched [6], without the use of speculative execution, with respect to the following metrics: (a) expected number of cycles, (b) number of states in the STG produced, (c) the smallest number of cycles taken to execute the behavioral description, and (d) the largest number of cycles taken to execute the behavioral description. In general, finding the largest number of cycles taken to execute a behavioral description is a hard problem. However, for the examples considered in this paper, static analysis of the descriptions was sufficient to find the number.

Table 1: Expected number of cycles, number of states, best- and worst-case number of cycles results

| Circuit | E.N.C. (WS) | E.N.C. (SP) | #states (WS) | #states (SP) | bc (WS) | bc (SP) | wc (WS) | wc (SP) |
|---|---|---|---|---|---|---|---|---|
| Barcode | 450 | 227 | 8 | 3 | 4 | 1 | 502 | 251 |
| GCD | 95 | 49 | 6 | 6 | 3 | 3 | 515 | 260 |
| Test1 | 580 | 80 | 10 | 10 | 1 | 1 | 4050 | 265 |
| TLC | 507 | 507 | 4 | 4 | 507 | 507 | 507 | 507 |
| Findmin | 522 | 265 | 4 | 3 | 2 | 2 | 472 | 237 |

Table 2: Allocation constraints for the examples in Table 1

| Circuit | add1 | sub1 | mult1 | comp1 | eq | inc |
|---|---|---|---|---|---|---|
| Barcode | 2 | - | - | 3 | 3 | 3 |
| GCD | - | 2 | - | 1 | 2 | - |
| Test1 | 2 | - | 4 | 1 | - | 1 |
| TLC | - | - | - | 1 | 1 | 1 |
| Findmin | - | - | - | 2 | 2 | 1 |

Table 1 summarizes the results obtained. The columns labeled E.N.C., #states, bc, and wc represent, respectively, the expected number of cycles, the number of states in the STG produced, the smallest number of cycles taken to execute the STG, and the largest number of cycles taken to execute the STG.
Minor columns WS and SP represent schedules produced by Wavesched and Wavesched-spec, respectively.

We used a library of functional units which consisted of (a) an adder, add1, (b) a subtracter, sub1, (c) a multiplier, mult1, (d) a less-than comparator, comp1, (e) an equality comparator, eq, and (f) an incrementer, inc. An unlimited number of single-input logic gates (OR, AND, and NOT) was assumed to be available. All functional units except mult1, which executes in two cycles, take one cycle to execute. The allocation constraints for an example can be found by looking up the entry corresponding to the example in Table 2. For example, the allocation constraints for GCD are two sub1, one comp1, and two eq. The expected number of cycles for the final design was measured by simulating a VHDL description of the schedule using the Synopsys VSS simulator. The input traces used for simulation were obtained as zero-mean Gaussian sequences.

Of our examples, Barcode, GCD, TLC, and Findmin are borrowed from the literature. Test1 is the example shown in Figure 1. Barcode represents a barcode reader, GCD computes the greatest common divisor of its inputs, TLC represents a traffic light controller, and Findmin returns the index of the minimum element in an array.

The results obtained indicate that Wavesched-spec produced an average expected schedule length speedup of 2.8 over schedules obtained using Wavesched. Note that Wavesched [6] was reported to have achieved an average speedup of 2 over schedules produced by existing scheduling algorithms, such as path-based scheduling [10] and loop-directed scheduling [7].

To get an idea of the area overhead of this technique, we obtained a 16-bit RTL implementation for the GCD example using an in-house high-level synthesis system, for the schedules produced by Wavesched-spec and Wavesched. These RTL circuits were technology-mapped using the MSU library, and the areas of the gate-level circuits were obtained. The area overhead for the circuit produced from Wavesched-spec was found to be only 3.1%.
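The reported average speedup can be cross-checked against the E.N.C. columns of Table 1. The short sketch below is only an illustration: the variable names are ours, and it assumes that the average is the arithmetic mean of the per-circuit ratios, which the text does not state explicitly.

```python
# E.N.C. values copied from Table 1 as (Wavesched, Wavesched-spec) pairs.
enc = {
    "Barcode": (450, 227),
    "GCD": (95, 49),
    "Test1": (580, 80),
    "TLC": (507, 507),
    "Findmin": (522, 265),
}

# Per-circuit speedup: expected cycles without speculation / with speculation.
speedups = {name: ws / sp for name, (ws, sp) in enc.items()}

# Assumption: the reported average is the arithmetic mean of these ratios.
average_speedup = sum(speedups.values()) / len(speedups)

for name, value in speedups.items():
    print(f"{name:8s} {value:.2f}")
print(f"average  {average_speedup:.2f}")
```

Under that assumption the mean comes to roughly 2.83, in line with the reported 2.8. Most of the gain comes from Test1 (7.25), while TLC shows no change.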
We also note that for Wavesched-spec, the number of cycles in the shortest and longest paths is smaller than or equal to the corresponding number for Wavesched.

5 Conclusions

In this paper, we presented a technique for incorporating speculative execution into scheduling of control-flow intensive designs. We demonstrated that in order to fully exploit the power of speculative execution, one needs to integrate it with scheduling. We introduced a node-tagging scheme for the identification of operations which can be speculatively scheduled in a given state, and a heuristic to select the best operation to schedule. Our techniques were fully integrated into an existing scheduling algorithm which can support implicit unrolling of loops and functional pipelining of control-flow intensive behaviors, and can parallelize the execution of independent loops whose bodies share resources. Experimental results demonstrate that the presented techniques can improve the performance of the generated schedule significantly. Schedules produced using speculative execution were, on average, 2.8 times faster than schedules produced without its benefit.

References

[1] U. Holtmann and R. Ernst, "Experiments with low-level speculative computation based on multiple branch prediction," IEEE Trans. VLSI Systems, vol. 1, pp. 262-267, Sept. 1993.
[2] K. Wakabayashi and H. Tanaka, "Global scheduling independent of control dependencies based on condition vectors," in Proc. Design Automation Conf., pp. 112-115, June 1991.
[3] U. Holtmann and R. Ernst, "Combining MBP-speculative computation and loop pipelining in high-level synthesis," in Proc. European Design & Test Conf., pp. 550-556, Mar. 1995.
[4] J. A. Fisher, "Trace scheduling: A technique for global microcode compaction," IEEE Trans. Computers, vol. 30, pp. 478-490, July 1981.
[5] S. A. Mahlke et al., "Sentinel scheduling: A model for compiler-controlled speculative execution," IEEE Trans. Computers, vol. 11, pp. 376-408, Nov. 1993.
[6] G. Lakshminarayana, K. S. Khouri, and N. K. Jha, "Wavesched: A novel scheduling technique for control-flow intensive behavioral descriptions," in Proc. Int. Conf. Computer-Aided Design, pp. 244-250, Nov. 1997.
[7] S. Bhattacharya, S. Dey, and F. Brglez, "Performance analysis and optimization of schedules for conditional and loop-intensive specifications," in Proc. Design Automation Conf., pp. 491-496, June 1994.
[8] M. Garey and D. Johnson, Computers and Intractability. W.H. Freeman & Company, New York, 1979.
[9] R. Jain, A. Majumdar, A. Sharma, and H. Wang, "Empirical evaluation of some high-level synthesis scheduling heuristics," in Proc. Design Automation Conf., pp. 686-689, June 1991.
[10] R. Camposano, "Path-based scheduling for synthesis," IEEE Trans. Computer-Aided Design, vol. 10, pp. 85-93, Jan. 1991.