High Level Synthesis. Shankar Balachandran Assistant Professor, Dept. of CSE IIT Madras


1 High Level Synthesis Shankar Balachandran Assistant Professor, Dept. of CSE IIT Madras

2 References and Copyright. Textbooks referred (none required): [Mic94] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994. Slides used: Giovanni De Micheli's slides on synthesis; Kia Bazargan's material on high-level synthesis; [Gupta] Rajesh Gupta, UC Irvine.

3 High Level Synthesis (HLS)
The process of converting a high-level description of a design to a netlist.
Input:
o High-level languages (e.g., C)
o Behavioral hardware description languages (e.g., VHDL)
o Structural HDLs (e.g., VHDL)
o State diagrams / logic networks
Tools:
o Parser
o Library of modules
Constraints:
o Area constraints (e.g., number of modules of a certain type)
o Delay constraints (e.g., a set of operations should finish in λ clock cycles)
Output:
o Operation scheduling (time) and binding (resource)
o Control generation and detailed interconnections

4 Architectural-level synthesis motivation: raise the input abstraction level; reduce specification of details; extend the designer base; self-documenting design specifications; ease modifications and extensions; reduce design time; explore and optimize macroscopic structure (series/parallel execution of operations).

5 Synthesis: transform a behavioral view into a structural view. Architectural-level synthesis: architectural abstraction level; determines the macroscopic structure (example: major building blocks). Logic-level synthesis: logic abstraction level; determines the microscopic structure (example: logic gate interconnection).

6 High-Level Synthesis Compilation Flow [Figure: compilation flow. Front end: Lex, Parse, producing an intermediate form. Behavioral optimization. HLS back end: architectural synthesis, logic synthesis, library binding. The illustrated input is an expression over a, b, c and d.]

7 Example
diffeq {
    read (x, y, u, dx, a);
    repeat {
        xl = x + dx;
        ul = u - (3 * x * u * dx) - (3 * y * dx);
        yl = y + u * dx;
        c = xl < a;
        x = xl; u = ul; y = yl;
    } until (c);
    write (y);
}

8 Compilation and behavioral optimization. Software compilation: compile the program into an intermediate form; optimize the intermediate form; generate target code for an architecture. Hardware compilation: compile programs/HDL into a sequencing graph; optimize the sequencing graph; generate a gate-level interconnection for a cell library.

9 Behavioral-level optimization: semantic-preserving transformations that aim to simplify the model, applied to parse trees or during their generation. Taxonomy: data-flow-based transformations; control-flow-based transformations.

10 Architectural synthesis and optimization: synthesize the macroscopic structure in terms of building blocks. Explore the area/performance trade-off: maximize performance subject to area constraints, or minimize area subject to performance constraints. Determine an optimal implementation. Create a logic model for the datapath and control.

11 Design space and objectives. Design space: the set of all feasible implementations. Implementation parameters: Area; Performance: o Cycle time o Latency o Throughput (for pipelined implementations); Power consumption.

12 Design evaluation space [Figure: plots of area against latency and cycle time, bounded by Area Max and Latency Max.]

13 Dependency Graph (Sequencing Graph) [Figure: sequencing graph of the diffeq example from slide 7, drawn from the inputs x, y, u, dx and a to the results xl, ul, yl and c.]

14 Hardware modeling. Circuit behavior: sequencing graphs. Building blocks: resources. Constraints: timing and resource usage.

15 Resources. Functional resources: perform operations on data (example: arithmetic and logic blocks). Storage resources: store data (example: memory and registers). Interface resources: example: buses and ports.

16 Resources and circuit families. Resource-dominated circuits: area and performance depend on a few well-characterized blocks; most common in DSP circuits. Non-resource-dominated circuits: area and performance are strongly influenced by sparse logic, control and wiring; example: some ASIC circuits.

17 Implementation constraints. Timing constraints: cycle time; latency of a set of operations; time spacing between operation pairs. Resource constraints: resource usage (or allocation); partial binding.

18 Sequence Graph. Remove all nodes corresponding to constants (with respect to the loop), together with the edges leaving those constants. Remove nodes that are written in the loop and the corresponding edges. Add NOP nodes at both ends to represent the reads and writes of the loop. Such a graph is much simpler to operate on. [Figure: the simplified sequencing graph between the source and sink NOP nodes.]

19 Synthesis in the temporal domain. Scheduling: associate a start time with each operation; determine the latency and parallelism of the implementation. The schedule is called φ. Scheduled sequencing graph: a sequencing graph with start-time annotation.

20 Synthesis in Temporal Domain. Schedule: mapping of operations to time slots (cycles). A scheduled sequencing graph is a labeled graph. [Figure: two alternative schedules, Schedule 1 and Schedule 2, of the same sequencing graph.]

21 Operation Types. For each operation, define its type. For each resource, define a resource type and a delay (in number of cycles). T is a relation that maps an operation to a resource type that can implement it: $T : V \to \{1, 2, \ldots, n_{res}\}$. More general case: a resource type may implement more than one operation type (e.g., an ALU). Resource binding: map each operation to a resource of the same type; there might be multiple options.

22 Binding: synthesis in the spatial domain. Associate each operation with a resource of the same type. Determine the area of the implementation. Sharing: bind a resource to more than one operation; those operations must not execute concurrently. Bound sequencing graph: a sequencing graph with resource annotation.

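23 Resource sharing (schedule in spatial domain). More than one operation is bound to the same resource; such operations have to be serialized. Sharing can be represented using hyperedges (which define a vertex partition). [Figure: scheduled graph with hyperedges grouping the operations that share a resource.]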

24 Binding Should Change with Schedules [Figure: two schedules of the same graph; one needs 4 multipliers and 2 adders, the other 2 multipliers and 2 adders.]

25 Scheduling and Binding
Resource constraints: number of resource instances of each type, $\{a_k : k = 1, 2, \ldots, n_{res}\}$.
Scheduling: labeled vertices, $\varphi(v_i) = t_i$.
Binding: hyperedges (or vertex partitions), e.g. $\beta(v_2) = $ adder.
Cost:
o Resource-dominated circuits: number of resources ≈ area
o Control-dominated circuits: registers, steering logic (muxes, busses), wiring, control unit
Delay: start time of the sink node; it might be affected by steering logic and the schedule (control logic), depending on whether the circuit is resource-dominated or control-dominated.

26 Architectural Optimization: optimization in view of design space flexibility. A multi-criteria optimization problem: determine the schedule φ and binding β under area A, latency λ and cycle time τ objectives. Find non-dominated points in the solution space. Solution-space trade-off curves: nonlinear, discontinuous; area / latency / cycle time (more?).

27 Area/latency trade-off [Figure: area plotted against latency and cycle time; the points are labeled with resource-count pairs such as (2,2).]

28 Scheduling and Binding. Cost: λ and A are determined by both φ and β, and are also affected by the floorplan and detailed routing. β is affected by φ: resources cannot be shared among concurrent operations. φ is affected by β: operations bound to the same resource cannot be scheduled concurrently. When register and steering-logic delays are added to the execution delays, the cycle time might be violated. Order? Apply either one (scheduling or binding) first.

29 How Is the Datapath Implemented? In the following schedule and binding, every operation has two inputs. If an input is not shown explicitly, it comes from a unique register. [Figure: scheduled and bound sequencing graph.]

30 Operation Scheduling. Input: sequencing graph G(V, E) with n vertices; cycle time τ; operation delays $D = \{d_i : i = 0, \ldots, n\}$. Output: a schedule φ that determines the start time $t_i$ of each operation $v_i$; latency $\lambda = t_n - t_0$. Goal: determine the area/latency trade-off. Classes: non-hierarchical and unconstrained; latency constrained; resource constrained; hierarchical.

31 Min Latency Unconstrained Scheduling. Simplest case: no constraints, find the minimum latency. Given a set of vertices V, delays D and a partial order > on operations E, find an integer labeling of operations $\varphi : V \to Z^+$ such that: $t_i = \varphi(v_i)$; $t_i \ge t_j + d_j$ for all $(v_j, v_i) \in E$; $\lambda = t_n - t_0$ is minimum. Solvable in polynomial time. Gives bounds on latency for resource-constrained problems. The ASAP algorithm is used: topological order.

32 ASAP Schedules. Schedule $v_0$ at $t_0 = 0$. While ($v_n$ not scheduled): o Select $v_i$ with all predecessors scheduled o Schedule $v_i$ at $t_i = \max\{t_j + d_j\}$, $v_j$ being a predecessor of $v_i$. Return $t_n$. [Figure: ASAP schedule of the example graph.]
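
The ASAP rule above can be sketched in a few lines of Python. This is only an illustration: the six-vertex graph, its delays and the names `asap`, `preds` and `delay` are assumptions made for the example, not part of the slides. Vertex 0 plays the source NOP and vertex 5 the sink NOP.

```python
from graphlib import TopologicalSorter

def asap(preds, delay):
    """Return ASAP start times; preds[v] lists the predecessors of v."""
    start = {}
    # static_order() yields every vertex after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        # The source has no predecessors and starts at time 0.
        start[v] = max((start[p] + delay[p] for p in preds[v]), default=0)
    return start

# Hypothetical graph: chain 1 -> 2 -> 3 plus an independent operation 4.
preds = {0: [], 1: [0], 2: [1], 3: [2], 4: [0], 5: [3, 4]}
delay = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0}

asap_times = asap(preds, delay)
print(asap_times[5])  # start time of the sink, i.e. the minimum latency
```

Processing vertices in topological order means each operation is visited exactly once, which is why the unconstrained problem is solvable in polynomial time.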

33 ALAP Schedules. Schedule $v_n$ at $t_n = \lambda$. While ($v_0$ not scheduled): o Select $v_i$ with all successors scheduled o Schedule $v_i$ at $t_i = \min\{t_j\} - d_i$, $v_j$ being a successor of $v_i$. [Figure: ALAP schedule of the example graph.]
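
A matching sketch of ALAP, under the same assumptions as any small illustration: the graph, the delays and the function name `alap` are hypothetical, and the latency bound is simply passed in as `sink_time`. Each operation is placed as late as its earliest-starting successor allows, minus its own delay.

```python
from graphlib import TopologicalSorter

def alap(preds, delay, sink_time):
    """Return ALAP start times, with the sink fixed at sink_time."""
    # Derive the successor lists from the predecessor lists.
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    order = list(TopologicalSorter(preds).static_order())
    start = {}
    # Reverse topological order: every successor is handled before v.
    for v in reversed(order):
        if not succs[v]:
            start[v] = sink_time
        else:
            start[v] = min(start[s] for s in succs[v]) - delay[v]
    return start

# Hypothetical graph: chain 1 -> 2 -> 3 plus an independent operation 4.
preds = {0: [], 1: [0], 2: [1], 3: [2], 4: [0], 5: [3, 4]}
delay = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0}

alap_times = alap(preds, delay, sink_time=3)
print(alap_times[4])  # operation 4 can be postponed until step 2
```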

34 Remarks. ALAP solves a latency-constrained problem; the latency bound can be set to the latency computed by the ASAP algorithm. Mobility: defined for each operation as the difference between its ALAP and ASAP start times; it is the slack on the start time.
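
As a rough illustration of mobility, the sketch below computes ASAP and ALAP times in one function and subtracts them. The graph, the delays and the name `mobility` are assumptions for the example; the function also assumes a single sink that comes last in the given topological order.

```python
def mobility(order, preds, delay, latency=None):
    """order: vertices in topological order (source first, sink last)."""
    asap = {}
    for v in order:
        asap[v] = max((asap[p] + delay[p] for p in preds[v]), default=0)
    # Default latency bound: the latency found by ASAP.
    bound = asap[order[-1]] if latency is None else latency
    succs = {v: [s for s in order if v in preds[s]] for v in order}
    alap = {}
    for v in reversed(order):
        # The sink has no successors and is pinned to the bound.
        latest_end = min((alap[s] for s in succs[v]), default=bound + delay[v])
        alap[v] = latest_end - delay[v]
    return {v: alap[v] - asap[v] for v in order}

# Hypothetical graph: chain 1 -> 2 -> 3 plus an independent operation 4.
preds = {0: [], 1: [0], 2: [1], 3: [2], 4: [0], 5: [3, 4]}
delay = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0}

mob = mobility([0, 1, 2, 3, 4, 5], preds, delay)
critical = [v for v, m in mob.items() if m == 0]
print(mob, critical)
```

Operations with zero mobility form the critical path; in this toy graph only operation 4 has slack.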

35 Example. Operations with zero mobility: {v1, v2, v3, v4, v5} (critical path). Operations with mobility one: {v6, v7}. Operations with mobility two: {v8, v9, v10, v11}. [Figure: ASAP and ALAP schedules of the example graph over four time steps.]

36 Scheduling under Relative Timing Constraints. Motivation: deadlines and release times are absolute, but it also makes sense to have relative constraints. o E.g., a memory fetch must be done within 6 cycles and takes a minimum of 2 cycles. Constraints: upper/lower bounds on the start-time difference of any operation pair. A minimum timing constraint $l_{ij} \ge 0$ for specified (i, j) pairs; a maximum timing constraint $u_{ij} \ge 0$ for specified (i, j) pairs. Feasibility of a solution.

37 Constraint graph model. Start from the sequencing graph; model delays as weights on edges. Add forward edges for minimum constraints: edge $(v_i, v_j)$ with weight $l_{ij}$, since $t_j \ge t_i + l_{ij}$. Add backward edges for maximum constraints: for a constraint from $v_i$ to $v_j$, add the backward edge $(v_j, v_i)$ with weight $-u_{ij}$, o because $t_j \le t_i + u_{ij} \Rightarrow t_i \ge t_j - u_{ij}$.

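38 Example [Figure: sequencing graph with MIN TIME and MAX TIME constraints, and the constraint graph derived from it.] 1. Ensure that there are no positive cycles in the graph. 2. Find the longest paths in the graph between two nodes i and j and use them as delay separations. Start-time table for the vertices $v_0$, $v_1$, $v_2$, $v_3$, $v_4$, $v_n$: only the entries $v_4 = 5$ and $v_n = 6$ are legible.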
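
The two steps above (check for positive cycles, then take longest paths) can be sketched with a Bellman-Ford style relaxation. The function name, the four-vertex graph and its weights are assumptions for illustration and are not the graph from the slide. An edge `(u, v, w)` encodes $t_v \ge t_u + w$; a maximum constraint $u_{ij}$ becomes a backward edge with weight $-u_{ij}$.

```python
def longest_path_schedule(n_vertices, edges, source=0):
    """Return longest-path start times, or None if a positive cycle exists."""
    NEG = float("-inf")
    t = [NEG] * n_vertices
    t[source] = 0
    # At most n-1 rounds are needed when there is no positive cycle.
    for _ in range(n_vertices - 1):
        changed = False
        for u, v, w in edges:
            if t[u] != NEG and t[u] + w > t[v]:
                t[v] = t[u] + w
                changed = True
        if not changed:
            break
    # Any further improvement means a positive cycle: constraints infeasible.
    for u, v, w in edges:
        if t[u] != NEG and t[u] + w > t[v]:
            return None
    return t

# Hypothetical graph: 0 -> 1 -> 2 -> 3 with delays 0, 2, 1 on the edges.
sequencing = [(0, 1, 0), (1, 2, 2), (2, 3, 1)]
min_edge = (0, 2, 3)        # minimum constraint l_02 = 3
max_ok = (2, 1, -3)         # maximum constraint u_12 = 3 (backward edge)
max_bad = (2, 1, -1)        # maximum constraint u_12 = 1, too tight

feasible = longest_path_schedule(4, sequencing + [min_edge, max_ok])
infeasible = longest_path_schedule(4, sequencing + [min_edge, max_bad])
print(feasible, infeasible)
```

In the infeasible case the cycle 1 -> 2 -> 1 has weight 2 - 1 = 1 > 0, so no start times can satisfy both the delay and the maximum constraint.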

39 Resource Constraint Scheduling. Constrained scheduling: the general case is NP-complete. Minimize latency given constraints on area or resources (MLRCS). Minimize resources subject to a bound on latency (MRLCS). Exact solution methods: ILP (Integer Linear Programming); Hu's heuristic algorithm for identical processors. Heuristics: list scheduling; force-directed scheduling.

40 ILP Formulation of MLRCS
Use binary decision variables $x_{il}$, with $i = 0, 1, \ldots, n$ and $l = 1, 2, \ldots, \lambda + 1$, where $\lambda$ is a given upper bound on latency.
$x_{il} = 1$ if operation i starts at step l, 0 otherwise.
Set of linear inequalities (constraints), and an objective function (min latency).
Observations:
o $x_{il} = 0$ for $l < t_i^S$ and $l > t_i^L$, where $t_i^S = \mathrm{ASAP}(v_i)$ and $t_i^L = \mathrm{ALAP}(v_i)$
o $t_i = \sum_l l \cdot x_{il}$ is the start time of operation i
o $\sum_{m=l-d_i+1}^{l} x_{im} = 1$ holds exactly when operation $v_i$ is (still) executing at step l
[Mic94] p. 198

41 Constraints
Operations start only once: $\sum_l x_{il} = 1$, $i = 1, 2, \ldots, n$.
Sequencing relations must be satisfied: $t_i \ge t_j + d_j \Rightarrow t_i - t_j - d_j \ge 0$ for all $(v_j, v_i) \in E$.
Resource bounds must be satisfied; simple case (unit delay): $\sum_{i : T(v_i) = k} x_{il} \le a_k$, $k = 1, 2, \ldots, n_{res}$, for all l.
The sequencing relation (equation 2) can be rewritten as $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0$ for all $(v_j, v_i) \in E$.

42 Start Time vs. Execution Time. For each operation $v_i$, there is only one start time. If $d_i = 1$, then the following questions are the same: Does operation $v_i$ start at step l? Is operation $v_i$ running at step l? But if $d_i > 1$, then the two questions should be formulated as: Does operation $v_i$ start at step l? o Does $x_{il} = 1$ hold? Is operation $v_i$ running at step l? o Does $\sum_{m=l-d_i+1}^{l} x_{im} = 1$ hold?

43 Operation $v_i$ Still Running at Step l? Is $v_9$ running at step 6? That is, is $x_{9,6} + x_{9,5} + x_{9,4} = 1$? [Figure: the three cases $x_{9,6} = 1$, $x_{9,5} = 1$ and $x_{9,4} = 1$, each with $v_9$ covering step 6.] Note: only one (if any) of the above three cases can happen. To meet resource constraints, we have to ask the same question for ALL steps and ALL operations of that type.

44 Operation $v_i$ Still Running at Step l? Is $v_i$ running at step l? That is, is $x_{i,l} + x_{i,l-1} + \ldots + x_{i,l-d_i+1} = 1$? [Figure: the $d_i$ cases, from $x_{i,l} = 1$ to $x_{i,l-d_i+1} = 1$, each with $v_i$ covering step l.]
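
The "still running" test can be checked numerically with a tiny sketch. The function name `running_at` and the chosen start step are assumptions for illustration; the delay of 3 for $v_9$ is inferred from the three-term sum on slide 43. The start variables of one operation are stored as a dictionary from step to 0/1.

```python
def running_at(x_i, d_i, l):
    """True if the windowed sum x_{i,l-d_i+1} + ... + x_{i,l} equals 1."""
    return sum(x_i.get(m, 0) for m in range(l - d_i + 1, l + 1)) == 1

# Hypothetical assignment: v9 starts at step 5 and takes 3 cycles,
# so it occupies steps 5, 6 and 7.
x9 = {4: 0, 5: 1, 6: 0}

occupied = [l for l in range(1, 10) if running_at(x9, 3, l)]
print(occupied)
```

Summing this indicator over all operations of one type gives the left-hand side of the resource constraint for that step.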

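45 ILP Formulation of MLRCS (cont.)
Constraints:
o Unique start times: $\sum_l x_{il} = 1$, $i = 0, 1, \ldots, n$
o Sequencing (dependency) relations must be satisfied: $t_i \ge t_j + d_j$ for all $(v_j, v_i) \in E$, i.e. $\sum_l l \cdot x_{il} \ge \sum_l l \cdot x_{jl} + d_j$
o Resource constraints: $\sum_{i : T(v_i) = k} \sum_{m=l-d_i+1}^{l} x_{im} \le a_k$, $k = 1, \ldots, n_{res}$, $l = 1, \ldots, \lambda + 1$
Objective: $\min c^T t$, where t is the vector of start times and c is the cost-weight vector (e.g., [0 0 ... 1]).
When c = [0 0 ... 1], $c^T t = \sum_l l \cdot x_{nl}$.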

46 ILP Example. Assume λ = 4. First, perform ASAP and ALAP (we can write the ILP without ASAP and ALAP, but using them simplifies the inequalities). [Figure: ASAP and ALAP schedules of the example graph, with vertices v1 to v11 between the source and sink NOPs.]

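47 ILP Example: Unique Start Times Constraint
Without using ASAP and ALAP values:
$x_{1,1} + x_{1,2} + x_{1,3} + x_{1,4} = 1$
$x_{2,1} + x_{2,2} + x_{2,3} + x_{2,4} = 1$
...
Using ASAP and ALAP:
$x_{1,1} = 1$, $x_{2,1} = 1$, $x_{3,2} = 1$, $x_{4,3} = 1$, $x_{5,4} = 1$
$x_{6,1} + x_{6,2} = 1$
$x_{7,2} + x_{7,3} = 1$
$x_{8,1} + x_{8,2} + x_{8,3} = 1$
$x_{9,2} + x_{9,3} + x_{9,4} = 1$
...
[Figure: example sequencing graph.]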

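48 ILP Example: Dependency Constraints
Using ASAP and ALAP, the non-trivial inequalities are (assuming unit delay for + and *):
$2 x_{7,2} + 3 x_{7,3} - x_{6,1} - 2 x_{6,2} - 1 \ge 0$
$2 x_{9,2} + 3 x_{9,3} + 4 x_{9,4} - x_{8,1} - 2 x_{8,2} - 3 x_{8,3} - 1 \ge 0$
$2 x_{11,2} + 3 x_{11,3} + 4 x_{11,4} - x_{10,1} - 2 x_{10,2} - 3 x_{10,3} - 1 \ge 0$
$4 x_{5,4} - 2 x_{7,2} - 3 x_{7,3} - 1 \ge 0$
$5 x_{n,5} - 2 x_{9,2} - 3 x_{9,3} - 4 x_{9,4} - 1 \ge 0$
$5 x_{n,5} - 2 x_{11,2} - 3 x_{11,3} - 4 x_{11,4} - 1 \ge 0$
[Figure: example sequencing graph.]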

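49 ILP Example: Resource Constraints
Resource constraints (assuming 2 adders and 2 multipliers):
$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$
$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$
$x_{7,3} + x_{8,3} \le 2$
$x_{10,1} \le 2$
$x_{9,2} + x_{10,2} + x_{11,2} \le 2$
$x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} \le 2$
$x_{5,4} + x_{9,4} + x_{11,4} \le 2$
Objective: since λ = 4 and the sink has no mobility, any feasible solution is optimum, but we can use the following anyway:
$\min \; x_{n,1} + 2 x_{n,2} + 3 x_{n,3} + 4 x_{n,4}$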

50 Example [Figure: the resulting schedule over four time steps.] Note that the schedule is different from both the ALAP and ASAP schedules.

51 ILP Formulation of MRLCS. Dual problem to MLRCS. Objective: optimize the total resource usage a; the objective function is $c^T a$, where the entries of c are the respective area costs of the resources. Constraints: same as the MLRCS constraints, plus a latency constraint: $\sum_l l \cdot x_{nl} \le \lambda + 1$. Note: the unknowns $a_k$ appear in the constraints; resource usage is unknown in the constraints and is the objective to minimize.

52 ILP Solution. Use standard ILP packages; transform into an LP problem. Advantages: exact method; other constraints can be incorporated. Disadvantages: works well only up to a few thousand variables.

53 Hu's Algorithm. Simple case of the scheduling problem: operations of unit delay; operations (and resources) of the same type. Hu's algorithm: greedy; polynomial AND optimal; computes a lower bound on the number of resources for a given latency, OR a lower bound on latency subject to resource constraints. Basic idea: label operations by their distance from the sink; try to schedule nodes with higher labels first (i.e., the most critical operations have priority).

54 Hu's algorithm. Assumptions: the graph is a forest; all operations have unit delay; all operations have the same type. Algorithm: greedy strategy; exact solution.

55 Example. Assumptions: one resource type only; all operations have unit delay. Labels: distance to sink. [Figure: example graph with vertex labels.]

56 Hu's Algorithm
HU (G(V,E), a) {
    Label the vertices  // label = length of the longest path from the vertex to the sink
    l = 1
    repeat {
        U = unscheduled vertices in V whose predecessors have been scheduled (or that have no predecessors)
        Select S ⊆ U such that |S| ≤ a and the labels in S are maximal
        Schedule the S operations at step l by setting t_i = l for all i : v_i ∈ S
        l = l + 1
    } until v_n is scheduled
}
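
A minimal Python sketch of the procedure, under the slide's assumptions (unit delays, one resource type, forest-shaped graph). The data layout is an assumption made for illustration: `succ[v]` holds the single successor of v, or None for the last operation, and the five-operation tree is hypothetical.

```python
def hu_schedule(succ, a):
    """Schedule an in-tree with a identical unit-delay resources."""
    label = {}

    def dist(v):
        # Label = number of operations on the path from v to the end.
        if v not in label:
            label[v] = 1 if succ[v] is None else 1 + dist(succ[v])
        return label[v]

    for v in succ:
        dist(v)
    preds = {v: [] for v in succ}
    for v, s in succ.items():
        if s is not None:
            preds[s].append(v)

    start, step = {}, 1
    while len(start) < len(succ):
        # Ready: unscheduled operations whose predecessors ran in earlier steps.
        ready = [v for v in succ
                 if v not in start and all(p in start for p in preds[v])]
        ready.sort(key=lambda v: -label[v])   # highest label first
        for v in ready[:a]:                   # at most a operations per step
            start[v] = step
        step += 1
    return start

# Hypothetical in-tree: 1 and 2 feed 3; 3 and 4 feed 5.
succ = {1: 3, 2: 3, 3: 5, 4: 5, 5: None}
two_units = hu_schedule(succ, a=2)
one_unit = hu_schedule(succ, a=1)
print(two_units, max(one_unit.values()))
```

With two resources the tree finishes in three steps, which equals the longest label; with one resource every operation needs its own step.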

57 Example, a = 3. Step 1: Op 1, 2, 6. Step 2: Op 3, 7, 8. Step 3: Op 4, 9, 10. Step 4: Op 5, 11. [Figure: example graph with vertex labels.]

58 Hu's Algorithm: Example (a = 3)

59 List Scheduling. Greedy algorithm for MLRCS and MRLCS. Does NOT guarantee an optimum solution. Similar to Hu's algorithm: operation selection is decided by criticality; O(n) time complexity. More general input: resource constraints on different resource types.

60 List Scheduling Algorithm: MLRCS
LIST_L (G(V,E), a) {
    l = 1
    repeat {
        for each resource type k {
            U_{l,k} = available vertices in V
            T_{l,k} = operations in progress
            Select S_k ⊆ U_{l,k} such that |S_k| + |T_{l,k}| ≤ a_k
            Schedule the S_k operations at step l
        }
        l = l + 1
    } until v_n is scheduled
}
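
LIST_L can be sketched as below. Everything concrete here is an assumption for illustration: the function and argument names, the five-operation graph, and the priority values (chosen as the longest weighted path to the end of the graph). The sketch assumes every delay is at least 1 and every used resource type has at least one unit.

```python
def list_schedule(preds, optype, delay, avail, priority):
    """Resource-constrained list scheduling; returns the start step of each op."""
    start, step = {}, 1
    while len(start) < len(preds):
        for k in avail:
            # T_{l,k}: operations of type k that are still executing at this step.
            busy = sum(1 for v, t in start.items()
                       if optype[v] == k and t <= step < t + delay[v])
            # U_{l,k}: unscheduled operations whose predecessors have finished.
            ready = [v for v in preds
                     if v not in start and optype[v] == k
                     and all(p in start and start[p] + delay[p] <= step
                             for p in preds[v])]
            ready.sort(key=lambda v: -priority[v])    # most critical first
            # |S_k| + |T_{l,k}| <= a_k
            for v in ready[:max(0, avail[k] - busy)]:
                start[v] = step
        step += 1
    return start

# Hypothetical graph: three 2-cycle multiplications and two 1-cycle ALU ops.
preds = {1: [], 2: [], 3: [], 4: [1, 2], 5: [3, 4]}
optype = {1: "mul", 2: "mul", 3: "mul", 4: "alu", 5: "alu"}
delay = {1: 2, 2: 2, 3: 2, 4: 1, 5: 1}
priority = {1: 4, 2: 4, 3: 3, 4: 2, 5: 1}

schedule = list_schedule(preds, optype, delay, {"mul": 2, "alu": 1}, priority)
print(schedule)
```

With only two multipliers, the third multiplication has to wait until step 3, which pushes the final ALU operation to step 5.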

61 Example [Figure: list schedule of the example graph over time steps 1 to 7.] Resource bounds: 3 multipliers with delay 2; 1 ALU with delay 1.

62 List Scheduling Algorithm: MRLCS
LIST_R (G(V,E), λ) {
    a = 1, l = 1
    Compute the ALAP times t^L
    if t_0^L < 0 return (not feasible)
    repeat {
        for each resource type k {
            U_{l,k} = available vertices in V
            Compute the slacks { s_i = t_i^L − l, ∀ v_i ∈ U_{l,k} }
            Schedule operations with zero slack, update a
            Schedule additional S_k ⊆ U_{l,k} under the a constraints
        }
        l = l + 1
    } until v_n is scheduled
}
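
A unit-delay sketch of LIST_R follows. It is an illustration under stated assumptions, not the slide's exact procedure: the source and sink NOPs are omitted, steps are numbered 1 to λ (so infeasibility shows up as an ALAP step below 1), and the graph and all names are hypothetical.

```python
def list_schedule_min_resources(preds, optype, latency):
    """Return (start, a) for a latency bound, or None if the bound is infeasible."""
    succs = {v: [s for s in preds if v in preds[s]] for v in preds}
    alap = {}

    def latest(v):
        # ALAP start step with unit delays; ops without successors end at `latency`.
        if v not in alap:
            alap[v] = min((latest(s) for s in succs[v]), default=latency + 1) - 1
        return alap[v]

    if min(latest(v) for v in preds) < 1:
        return None
    a = {k: 1 for k in set(optype.values())}      # start with one unit per type
    start = {}
    for step in range(1, latency + 1):
        done = set(start)                         # finished in earlier steps
        for k in a:
            ready = [v for v in preds if v not in start and optype[v] == k
                     and all(p in done for p in preds[v])]
            ready.sort(key=lambda v: alap[v])     # smallest slack first
            urgent = sum(1 for v in ready if alap[v] == step)   # zero slack
            a[k] = max(a[k], urgent)              # add units only when forced
            for v in ready[:a[k]]:
                start[v] = step
    return start, a

# Hypothetical graph: 1, 2 -> 3 -> 4 and 5 -> 6.
preds = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [], 6: [5]}
optype = {1: "mul", 2: "mul", 3: "alu", 4: "alu", 5: "mul", 6: "alu"}

result = list_schedule_min_resources(preds, optype, latency=3)
too_tight = list_schedule_min_resources(preds, optype, latency=2)
print(result, too_tight)
```

Zero-slack operations are always scheduled, so the latency bound is met by construction; the resource counts grow only when a step has more zero-slack operations than available units.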

63 Example
Assumptions: unit-delay resources; maximum latency = 4. Start with $a_1 = 1$ multiplier and $a_2 = 1$ ALU.
Step 1: two multiplications on the critical path, so set $a_1 = 2$; schedule Mult 1, 2; schedule ALU 10.
Step 2: schedule Mult 3, 6; schedule ALU 11.
Step 3: schedule Mult 7, 8; schedule ALU 4.
Step 4: set $a_2 = 2$; schedule ALU 5, 9.
[Figure: example graph and the resulting schedule over four time steps.]

64 Summary. Scheduling algorithms are used by tools: compilers use them when you write code for DSP processors; tools like Xilinx ISE, Synopsys DC etc. use them when you compile HDL models. A good understanding of the under-the-hood operation of tools is useful. The constraint-solving techniques can be used directly for your custom designs; e.g., in DSP software, if you know the resources, you can write assembly code to minimize latency.

65 Force-Directed Scheduling. Similar to list scheduling; can handle MLRCS and MRLCS. For MLRCS, it schedules step by step, BUT the selection of operations tries to find the globally best set of operations. Idea: find the mobility $\mu_i = t_i^L - t_i^S$ of operations; look at the operation-type probability distributions; try to flatten the operation-type distributions. Definition: operation probability density $p_i(l) = \Pr\{v_i \text{ starts at step } l\}$. Assume a uniform distribution: $p_i(l) = \frac{1}{\mu_i + 1}$ for $l \in [t_i^S, t_i^L]$.

66 Force-Directed Scheduling: Definitions. Operation-type distribution (NOT normalized to 1): $q_k(l) = \sum_{i : T(v_i) = k} p_i(l)$. Operation probabilities over control steps: $p_i = \{p_i(0), p_i(1), \ldots, p_i(n)\}$. Distribution graph of type k over all steps: $\{q_k(0), q_k(1), \ldots, q_k(n)\}$. $q_k(l)$ can be thought of as the expected operator cost for implementing operations of type k at step l.
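
The type distributions can be computed directly from ASAP and ALAP times. The sketch below uses exact fractions and a hypothetical set of four operations; the function name, the operation types and the time frames are assumptions for illustration, with the uniform probability taken as one over the number of candidate steps (mobility plus one).

```python
from fractions import Fraction

def type_distribution(asap, alap, optype, n_steps):
    """q[k][l] = sum of p_i(l) over operations of type k; index 0 is unused."""
    q = {k: [Fraction(0)] * (n_steps + 1) for k in set(optype.values())}
    for v in asap:
        width = alap[v] - asap[v] + 1          # mobility + 1 candidate steps
        for l in range(asap[v], alap[v] + 1):
            q[optype[v]][l] += Fraction(1, width)
    return q

# Hypothetical time frames [ASAP, ALAP] for three multiplications and one addition.
asap = {1: 1, 2: 1, 3: 1, 4: 2}
alap = {1: 1, 2: 2, 3: 3, 4: 3}
optype = {1: "mult", 2: "mult", 3: "mult", 4: "add"}

q = type_distribution(asap, alap, optype, n_steps=3)
print(q["mult"][1:], q["add"][1:])
```

Each operation contributes a total probability of 1, so the values of a distribution graph sum to the number of operations of that type. A peak such as the multiplier demand at step 1 is what force-directed scheduling tries to flatten.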

67 Example [Figure: example sequencing graph with its distribution graphs $q_{mult}(l)$ and $q_{add}(l)$ for l = 1, 2, 3, 4.]

68 Force-Directed Scheduling Algorithm: Idea. Very similar to LIST_L(G(V,E), a): compute the mobility of operations using ASAP and ALAP; compute operation probabilities and type distributions; select and schedule operations; update operation probabilities and type distributions; go to the next control step. The difference from list scheduling is in selecting operations: select operations with the least force; consider the effect on the type distribution; consider the effect on successor nodes and their type distributions.
