SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta
Rajesh Gupta, Nikil Dutt, Alex Nicolau
Center for Embedded Computer Systems
University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
Supported by Semiconductor Research Corporation & Intel Inc.
System-Level Synthesis
[Diagram: a system-level model goes through task analysis and HW/SW partitioning; the hardware behavioral description feeds high-level synthesis (targeting ASIC or FPGA), the software behavioral description feeds a software compiler (targeting a processor core), with shared I/O and memory]
High-Level Synthesis
Transform behavioral descriptions to the RTL/gate level — from C to CDFG to architecture:

x = a + b;
c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d * g;
l = e + x;

[Diagram: the corresponding CDFG with an if-node (condition c, true branch d = e - f, false branch g = h + i), and the synthesized architecture: control logic, memory, ALU, and datapath]
High-Level Synthesis (contd.)
Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions
Problem #2: Poor or no controllability of the HLS results
Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Code Transformation Techniques for PHLS
  - Parallelizing Transformations
  - Dynamic Transformations
- The PHLS Framework and Experimental Results
  - Multimedia and Image Processing Applications
  - Case Study: Intel Instruction Length Decoder
- Conclusions and Future Work
High-Level Synthesis
- Well-researched area, dating from the early 1980s
- Renewed interest due to new system-level design methodologies
- A large number of synthesis optimizations have been proposed
  - Either operation-level (algebraic transformations on DSP codes) or logic-level (Don't-Care-based control optimizations)
- In contrast, compiler transformations operate at both the operation level (fine-grain) and the source level (coarse-grain)
  - Parallelizing compiler transformations
  - Different optimization objectives and cost models than HLS
- Our aim: develop synthesis and parallelizing compiler transformations that are useful for HLS
  - Beyond scheduling results: improvements in circuit area and delay
  - For large designs with complex control flow (nested conditionals and loops)
Our Approach: Parallelizing HLS (PHLS)
Flow: C input → source-level compiler transformations → original CDFG → scheduling & binding (scheduling, compiler, and dynamic transformations) → optimized CDFG → VHDL output
- Optimizing and parallelizing compiler transformations applied at the source level (pre-synthesis) and during scheduling
- Source-level code refinement using pre-synthesis transformations
- Code restructuring by speculative code motions
- Operation replication to improve concurrency
- Dynamic transformations: exploit new opportunities during scheduling
PHLS Transformations Organized into Four Groups
1. Pre-synthesis: loop-invariant code motions, loop unrolling, CSE
2. Scheduling: speculative code motions, multi-cycling, operation chaining, loop pipelining
3. Dynamic: transformations applied dynamically during scheduling: dynamic CSE, dynamic copy propagation, dynamic branch balancing
4. Basic compiler transformations: copy propagation, dead code elimination
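As a minimal sketch (illustrative C, not SPARK's actual output), the group 1 pre-synthesis transformations — loop-invariant code motion and CSE — can be shown on a toy function; the names `sum_before` and `sum_after` are hypothetical:

```c
#include <assert.h>

/* Before: a * b is loop-invariant, and the expression (a * b)
 * is computed twice per iteration (a common subexpression). */
int sum_before(const int x[], int n, int a, int b) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i] * (a * b) + (a * b);
    return sum;
}

/* After loop-invariant code motion + CSE: a * b is hoisted out
 * of the loop and computed exactly once. */
int sum_after(const int x[], int n, int a, int b) {
    int ab = a * b;            /* hoisted, shared subexpression */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i] * ab + ab;
    return sum;
}
```

In hardware terms, the transformed version needs one multiplier evaluation for `a * b` instead of re-evaluating it every cycle of the loop.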
Speculative Code Motions
- Operation movement to reduce the impact of programming style on the quality of HLS results
- Code motions across hierarchical blocks:
  - Speculation
  - Reverse speculation
  - Conditional speculation
  - Early condition execution: evaluates conditions as soon as possible
[Diagram: operations a (+), b (+), and c (-) moved across an if-node by the four code motions]
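A sketch of plain speculation in C (illustrative only; `cond_before`/`cond_after` are hypothetical names, and in SPARK the motion is performed on the CDFG, not on source text):

```c
#include <assert.h>

/* Before: each operation executes only in its own branch, so the
 * result is control-dependent on c. */
int cond_before(int c, int e, int f, int h, int i) {
    int d;
    if (c) d = e - f;
    else   d = h + i;
    return d;
}

/* After speculation: both operations execute unconditionally
 * before the branch; the condition only selects the result,
 * shortening the schedule along both paths. */
int cond_after(int c, int e, int f, int h, int i) {
    int t1 = e - f;    /* speculated out of the true branch */
    int t2 = h + i;    /* speculated out of the false branch */
    return c ? t1 : t2;
}
```

In the synthesized circuit the selection becomes a multiplexer, so both functional units can be busy while the condition is still being evaluated.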
Dynamic Transformations
- Called dynamic since they are applied during scheduling (versus a pass before or after scheduling)
- Dynamic branch balancing
  - Increases the scope of code motions
  - Reduces the impact of programming style on HLS results
- Dynamic CSE and dynamic copy propagation
  - Exploit the operation movement and duplication due to speculative code motions
  - Create new opportunities to apply these transformations
  - Reduce the number of operations
Dynamic Branch Balancing
[Diagram: original design with an if-node spanning BB0-BB4; operations a (+), b (+), c (-), d (-), e (-); under the given resource allocation, the scheduled design has an unbalanced conditional — the longest path runs through one branch across steps S0-S3]
Insert New Scheduling Step in Shorter Branch
[Diagram: the same design and resource allocation; a new scheduling step is inserted in the shorter branch, and operation e is duplicated into it]
- Dynamic branch balancing inserts new scheduling steps in the shorter branch
- Enables conditional speculation
- Leads to further code compaction
Dynamic CSE: Going beyond Traditional CSE

C description:
a = b + c;
cd = b < c;
if (cd) d = b + c;
else    e = g + h;

After traditional CSE, d = b + c in BB2 becomes d = a, since a = b + c is computed in BB0.
[HTG representation: BB0 (a = b + c) → if-node BB1 → true: BB2 (d = a), false: BB3 (e = g + h) → BB4]
Dynamic CSE: Going beyond Traditional CSE (contd.)
We use the notion of dominance of basic blocks:
- A basic block BBi dominates BBj if all control paths from the initial basic block of the design graph leading to BBj go through BBi
- We can eliminate an operation op_j in BBj using the common expression in op_i if BBi dominates BBj
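The dominance test above can be sketched with the classic iterative data-flow computation (a generic textbook algorithm, not SPARK's implementation), here on the five-block CFG from the slide:

```c
#include <assert.h>

#define N 5  /* BB0..BB4: BB0 -> BB1(if) -> BB2 (T) / BB3 (F) -> BB4 */

/* dom[j] is a bitmask of the blocks that dominate block j. */
static unsigned dom[N];

/* preds[j]: bitmask of the predecessors of block j. */
static const unsigned preds[N] = {
    0,                       /* BB0: entry, no predecessors */
    1u << 0,                 /* BB1: from BB0 */
    1u << 1,                 /* BB2: from BB1 */
    1u << 1,                 /* BB3: from BB1 */
    (1u << 2) | (1u << 3),   /* BB4: from BB2 and BB3 */
};

void compute_dominators(void) {
    dom[0] = 1u << 0;                          /* entry dominates itself */
    for (int i = 1; i < N; i++)
        dom[i] = (1u << N) - 1;                /* start with "all blocks" */
    int changed = 1;
    while (changed) {                          /* iterate to a fixed point */
        changed = 0;
        for (int i = 1; i < N; i++) {
            unsigned meet = (1u << N) - 1;     /* intersect over preds */
            for (int p = 0; p < N; p++)
                if (preds[i] & (1u << p))
                    meet &= dom[p];
            unsigned d = meet | (1u << i);
            if (d != dom[i]) { dom[i] = d; changed = 1; }
        }
    }
}

/* BBi dominates BBj iff bit i is set in dom[j]. */
int dominates(int i, int j) { return (int)((dom[j] >> i) & 1u); }
```

For this CFG the computation confirms the slide's rule: BB0 and BB1 dominate BB4 (so an expression in BB0 can feed CSE in BB4), but BB2 does not, because the path through BB3 bypasses it.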
New Opportunities for Dynamic CSE
The scheduler decides to speculate a = b + c out of BB2 into BB0 (as dcse = b + c; a = dcse)
- Before the code motion: CSE of d = b + c in BB6 is not possible, since BB2 does not dominate BB6
- After the code motion: CSE is possible, since BB0 dominates BB6
New Opportunities for Dynamic CSE (contd.)
Dynamic CSE then replaces d = b + c in BB6 with d = dcse
- If the scheduler moves or duplicates an operation op, apply CSE on the remaining operations using op
- CSE was not possible earlier since BB2 does not dominate BB6; it is possible now since BB0 dominates BB6
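The interaction between speculation and dynamic CSE can be sketched in C (illustrative names; the real transformation operates on basic blocks, not source statements):

```c
#include <assert.h>

/* Before scheduling: b + c occurs in two different conditional
 * regions, and neither region dominates the other, so classical
 * CSE cannot share the computation. */
int cse_before(int p, int q, int b, int c) {
    int a = 0, d = 0;
    if (p) a = b + c;   /* "BB2" */
    if (q) d = b + c;   /* "BB6": BB2 does not dominate BB6 */
    return a + d;
}

/* After the scheduler speculates b + c into the common ancestor
 * ("BB0"), that block dominates both uses, so dynamic CSE can
 * replace both occurrences with the speculated result. */
int cse_after(int p, int q, int b, int c) {
    int dcse = b + c;   /* speculated into BB0 */
    int a = 0, d = 0;
    if (p) a = dcse;    /* was a = b + c */
    if (q) d = dcse;    /* dynamic CSE: was d = b + c */
    return a + d;
}
```

Two adders in the original become one in the transformed version — the operation count drops as a by-product of a motion the scheduler performed for other reasons.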
Conditional Speculation & Dynamic CSE
The scheduler decides to conditionally speculate a = b + c from BB5 into BB1 and BB2 (as a' = b + c; a = a')
- Dynamic CSE then replaces d = b + c in BB8 with d = a'
- This uses the notion of dominance by groups of basic blocks: all control paths leading up to BB8 come from either BB1 or BB2, so BB1 and BB2 together dominate BB8
Loop Shifting: An Incremental Loop Pipelining Technique
[Diagram: a loop node (BB1-BB3, with loop exit to BB4) containing operations a (+), b (+), c (-), d (-); loop shifting moves operation a from the start of the loop body to the end of the body, with a copy in the loop header BB0; compaction then schedules the shifted operation concurrently with the remaining body]
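A minimal C sketch of loop shifting (illustrative; `shift_before`/`shift_after` are hypothetical names, and SPARK performs this on the scheduled loop body rather than on source code):

```c
#include <assert.h>

/* Before: each iteration computes op a, then op b, in sequence;
 * b depends on a, so they cannot be scheduled together. */
int shift_before(int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        int a = i + 1;   /* op a */
        s = s + a;       /* op b, uses this iteration's a */
    }
    return s;
}

/* After loop shifting: one copy of op a is peeled into the loop
 * header, and the copy inside the body computes the value for the
 * NEXT iteration. Op b of iteration i and op a of iteration i+1
 * are now independent and can be scheduled concurrently. */
int shift_after(int n) {
    int s = 0;
    if (n > 0) {
        int a = 0 + 1;               /* shifted copy in header */
        for (int i = 0; i < n; i++) {
            s = s + a;               /* op b, iteration i   */
            a = (i + 1) + 1;         /* op a, iteration i+1 */
        }
    }
    return s;
}
```

Unlike full loop pipelining, only one operation crosses the iteration boundary, which is why the slide calls the technique incremental.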
SPARK High-Level Synthesis Framework
SPARK Parallelizing HLS Framework
- C input and synthesizable RTL VHDL output
- Toolbox of transformations and heuristics
  - Each can be developed independently of the others
  - Script-based control over transformations and heuristics
- Hierarchical intermediate representation (HTGs)
  - Retains structural information about the design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: does binding, control synthesis, and backend VHDL generation
  - Interconnect-minimizing resource binding
- Enables graphical visualization of the design description and intermediate results
- 100,000+ lines of C++ code
Synthesizable C
- ANSI-C front end from the Edison Design Group (EDG)
- Features of C not supported for synthesis:
  - Pointers (however, arrays and passing by reference are supported)
  - Recursive function calls
  - Gotos
- Features for which support has not been implemented:
  - Multi-dimensional arrays
  - Structs
  - Continues, breaks
- A hardware component is generated for each function
  - A called function is instantiated as a hardware component in the calling function
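A hypothetical function written in the style the supported subset suggests — single-dimensional arrays and an output array instead of returned pointers, and a guard in place of break/continue (the function itself is invented for illustration, not taken from SPARK's benchmarks):

```c
#include <assert.h>

/* Element-wise saturating add in the synthesizable subset:
 * no pointers-as-values, no structs, no recursion, no goto,
 * and no break/continue (a conditional guard is used instead). */
void saturating_add(const int a[], const int b[], int out[],
                    int n, int max) {
    int i;
    for (i = 0; i < n; i++) {
        int s = a[i] + b[i];
        if (s > max)
            s = max;     /* clamp instead of early exit */
        out[i] = s;
    }
}
```

Under the rules above, this function would become one hardware component; if another function called it, that call site would instantiate the component.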
HTG Graph Visualization
[Screenshot: hierarchical task graph and DFG visualization of a design]
Resource Utilization Graph
[Screenshot: scheduling view showing resource utilization]
Example of a Complex HTG
- A real design: the MPEG-1 pred2 function
- Multiple nested loops and conditionals
- (Shown just for demonstration; you are not expected to read the text)
Experiments
Results presented here for:
- Pre-synthesis transformations
- Speculative code motions
- Dynamic CSE
We used SPARK to synthesize designs derived from several industrial designs:
- MPEG-1, MPEG-2, GIMP image processing software
- Case study of the Intel Instruction Length Decoder
Scheduling results:
- Number of states in the FSM
- Cycles on the longest path through the design
VHDL → logic synthesis:
- Critical path length (ns)
- Unit area
Target Applications

Design          | # of Ifs | # of Loops | # Non-Empty Basic Blocks | # of Operations
MPEG-1 pred1    |    4     |     2      |            17            |       123
MPEG-1 pred2    |   11     |     6      |            45            |       287
MPEG-2 dp_frame |   18     |     4      |            61            |       260
GIMP tiler      |   11     |     2      |            35            |       150
Scheduling & Logic Synthesis Results
[Bar charts: MPEG-1 pred1 and pred2 functions; bars normalized to 1 for longest path (l, cycles), critical path (c, ns), total delay (c*l), and unit area; configurations compared: non-speculative code motions (within BBs & across hierarchical blocks), + speculative code motions, + pre-synthesis transforms, + dynamic CSE; annotated improvements: pred1 — 36%, 42%, 10%; pred2 — 39%, 36%, 8%]
Overall: 63-66% improvement in delay, with almost constant area
Scheduling & Logic Synthesis Results (contd.)
[Bar charts: MPEG-2 dp_frame and GIMP tiler functions, same metrics and configurations; annotated improvements: dp_frame — 33%, 20%, 1%; tiler — 52%, 41%, 14%]
Overall: 48-76% improvement in delay, with almost constant area
Case Study: Intel Instruction Length Decoder
[Diagram: a stream of instructions fills an instruction buffer; the instruction length decoder marks where the first, second, and third instructions begin]
Example Design: ILD Block from Intel
- Case study: a design derived from the Instruction Length Decoder of the Intel Pentium class of processors
- Decodes the length of instructions streaming from memory
- Has to look at up to 4 bytes at a time
- Has to execute in one cycle and decode about 64 bytes of instructions
- Characteristics of microprocessor functional blocks:
  - Low latency: single- or dual-cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic
Basic Instruction Length Decoder: Initial Description
[Diagram: bytes 1-4 each produce a length contribution; the contributions are summed to give the total instruction length, with "need byte 2?", "need byte 3?", "need byte 4?" decisions gating each step]
Instruction Length Decoder: Decoding the 2nd Instruction
[Diagram: the same examination applied after the first instruction, now starting at bytes 3-6]
- After decoding the length of an instruction, start looking from the next byte
- Again examine up to 4 bytes to determine the length of the next instruction
Instruction Length Decoder: Parallelized Description
[Diagram: bytes 1-4 feed their length contributions and "need byte 2/3/4?" decisions into the summation in parallel]
- Speculatively calculate the length contribution of all 4 bytes at a time
- Determine the actual total length of the instruction based on this data
ILD: Extracting Further Parallelism
[Diagram: an instruction-length calculation unit at every byte position (bytes 1-5 shown), all operating in parallel]
- Speculatively calculate the length of instructions assuming a new instruction starts at each byte
- Do this calculation for all bytes in parallel
- Traverse from the 1st byte to the last, determining the length of the instructions from the 1st till the last
- Discard unused calculations
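The scheme above can be sketched in C. The per-byte length rule here is invented purely for illustration (the real Pentium length-contribution logic is not reproduced), and all names are hypothetical:

```c
#include <assert.h>

#define NBYTES 8

/* Hypothetical per-byte decode: pretend the low two bits of a
 * byte encode an instruction length of 1..4. Stand-in for the
 * real length-contribution logic. */
static int len_at(const unsigned char buf[], int pos) {
    return (buf[pos] & 3) + 1;
}

/* Parallelized scheme from the slides: speculatively compute the
 * length of an instruction starting at EVERY byte (the per-byte
 * computations are independent, so hardware evaluates them all
 * concurrently), then a fast sequential traversal selects the
 * real instruction starts and discards the unused results.
 * Returns the number of instructions found; starts[] receives
 * their byte offsets. */
int decode_lengths(const unsigned char buf[], int n, int starts[]) {
    int spec_len[NBYTES];
    for (int i = 0; i < n; i++)   /* "parallel" speculation */
        spec_len[i] = len_at(buf, i);

    int count = 0, pos = 0;
    while (pos < n) {             /* selection traversal */
        starts[count++] = pos;
        pos += spec_len[pos];     /* other results discarded */
    }
    return count;
}
```

The traversal that remains on the critical path is just a chain of selections, which is what lets the synthesized design collapse to a single cycle.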
Initial: Multi-Cycle Sequential Architecture
[Diagram: the length contributions of bytes 1-4 computed and summed sequentially, gated by the "need byte 2/3/4?" decisions]
ILD Synthesis: Resulting Architecture
- Speculate operations, fully unroll the loop, eliminate the loop index variable
- Multi-cycle sequential architecture → single-cycle parallel architecture
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- The final design looks close to the actual implementation done by Intel
Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in the quality of HLS results
  - Make it possible to be competitive against manually designed circuits
  - Can enable productivity improvements in microelectronic design
- Built a synthesis system with a range of code transformations
  - Platform for applying coarse- and fine-grain optimizations
  - Toolbox approach in which transformations and heuristics can be developed independently
  - Enables the designer to find the right synthesis script for different application domains
- Performance improvements of 60-70% across a number of designs
- We have shown its effectiveness on an Intel design
Acknowledgements
- Advisors: Professors Rajesh Gupta, Nikil Dutt, Alex Nicolau
- Contributors to the SPARK framework: Nick Savoiu, Mehrdad Reshadi, Sunwoo Kim
- Intel Strategic CAD Labs (SCL): Timothy Kam, Mike Kishinevsky
- Supported by the Semiconductor Research Corporation and Intel SCL
Thank You