SPARK: A Parallelizing High-Level Synthesis Framework

SPARK: A Parallelizing High-Level Synthesis Framework
Sumit Gupta, Rajesh Gupta, Nikil Dutt, Alex Nicolau
Center for Embedded Computer Systems, University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
Supported by Semiconductor Research Corporation & Intel Inc.

System Level Synthesis
[Figure: system-level design flow. A system-level model goes through task analysis and HW/SW partitioning. The hardware behavioral description feeds High-Level Synthesis, producing ASIC/FPGA blocks alongside I/O and memory; the software behavioral description feeds a software compiler targeting a processor core.]

High-Level Synthesis
Transform behavioral descriptions to RTL/gate level: from C to CDFG to architecture.

x = a + b;
c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d * g;
l = e + x;

[Figure: the corresponding control-data flow graph (an if-node with true branch d = e - f and false branch g = h + i, followed by j = d * g and l = e + x), mapped onto an architecture with a controller, memory, ALU, and data path.]

High-Level Synthesis
Transform behavioral descriptions to RTL/gate level: from C to CDFG to architecture (same example as the previous slide).

Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions.
Problem #2: Poor/no controllability of the HLS results.

Outline
- Motivation and Background
- Our Approach to Parallelizing High-Level Synthesis
- Code Transformation Techniques for PHLS
  - Parallelizing Transformations
  - Dynamic Transformations
- The PHLS Framework and Experimental Results
  - Multimedia and Image Processing Applications
  - Case Study: Intel Instruction Length Decoder
- Conclusions and Future Work

High-Level Synthesis
- Well-researched area, active since the early 1980s; renewed interest due to new system-level design methodologies.
- A large number of synthesis optimizations have been proposed, either at the operation level (algebraic transformations on DSP codes) or at the logic level (don't-care based control optimizations).
- In contrast, compiler transformations operate at both the operation level (fine-grain) and the source level (coarse-grain).
- Parallelizing compiler transformations have different optimization objectives and cost models than HLS.
- Our aim: develop synthesis and parallelizing compiler transformations that are useful for HLS
  - beyond scheduling results: in circuit area and delay
  - for large designs with complex control flow (nested conditionals/loops).

Our Approach: Parallelizing HLS (PHLS)
[Figure: synthesis flow - C input, source-level compiler transformations (pre-synthesis), original CDFG, scheduling & binding with scheduling-time compiler and dynamic transformations, optimized CDFG, VHDL output.]
- Optimizing and parallelizing compiler transformations applied at the source level (pre-synthesis) and during scheduling.
- Source-level code refinement using pre-synthesis transformations.
- Code restructuring by speculative code motions; operation replication to improve concurrency.
- Dynamic transformations: exploit new opportunities during scheduling.

PHLS Transformations Organized into Four Groups
1. Pre-synthesis: loop-invariant code motions, loop unrolling, CSE
2. Scheduling: speculative code motions, multi-cycling, operation chaining, loop pipelining
3. Dynamic: transformations applied dynamically during scheduling: dynamic CSE, dynamic copy propagation, dynamic branch balancing
4. Basic compiler transformations: copy propagation, dead code elimination
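
As a concrete illustration of the pre-synthesis group, here is a minimal before/after sketch in C (invented for this writeup, not taken from the slides) of loop-invariant code motion and common sub-expression elimination applied at the source level.

```c
/* Illustrative sketch (not from the slides): how the pre-synthesis
 * transformations rewrite C source before scheduling begins.          */

/* Before: a * b is loop-invariant and b + c is computed twice. */
int kernel_original(int a, int b, int c, int n, int x[]) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i] * (a * b);        /* loop-invariant sub-expression */
    int d = b + c;
    int e = (b + c) * 2;            /* common sub-expression          */
    return s + d + e;
}

/* After loop-invariant code motion and CSE at the source level. */
int kernel_presynth(int a, int b, int c, int n, int x[]) {
    int ab = a * b;                 /* hoisted out of the loop        */
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i] * ab;
    int bc = b + c;                 /* computed once, reused twice    */
    int d = bc;
    int e = bc * 2;
    return s + d + e;
}
```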

Speculative Code Motions
Operation movement to reduce the impact of programming style on the quality of HLS results.
- Speculation: move an operation out of a conditional so it executes before the condition.
- Reverse speculation: move an operation from before a conditional into its branches.
- Conditional speculation: duplicate an operation into the branches of a conditional.
- Early condition execution: evaluate conditions as soon as possible.
- Code motions also operate across hierarchical blocks.
[Figure: an if-node with operations +a, +b, -c and arrows showing the directions of the four code motions.]
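
As an illustration of plain speculation, here is a hedged before/after sketch in C (not SPARK output; function and variable names are invented): the side-effect-free operations in both branches are hoisted above the condition, leaving only a selection, which maps to a multiplexer in hardware.

```c
/* Illustrative sketch (not SPARK output) of speculation.              */

/* Before: each operation waits for the condition to be evaluated. */
int select_original(int a, int b, int e, int f, int cond) {
    int r;
    if (cond) r = a + b;   /* executed only on the true branch   */
    else      r = e - f;   /* executed only on the false branch  */
    return r;
}

/* After speculation: both side-effect-free operations execute before
 * the branch; the conditional only selects a result, which becomes a
 * multiplexer in the synthesized data path.                           */
int select_speculated(int a, int b, int e, int f, int cond) {
    int t1 = a + b;        /* speculated out of the true branch  */
    int t2 = e - f;        /* speculated out of the false branch */
    return cond ? t1 : t2;
}
```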

Dynamic Transformations
Called dynamic since they are applied during scheduling (versus in a pass before/after scheduling).
- Dynamic branch balancing: increases the scope of code motions and reduces the impact of programming style on HLS results.
- Dynamic CSE and dynamic copy propagation: exploit the operation movement and duplication caused by speculative code motions, creating new opportunities to apply these transformations and reducing the number of operations.

Dynamic Branch Balancing
[Figure: original design and scheduled design for an if-node with operations +a, +b, -c on one branch and -d, -e on the other (basic blocks BB0-BB4), under a given resource allocation and scheduling steps S0-S3. The conditional is unbalanced: one branch forms the longest path through the schedule.]

Insert New Scheduling Step in Shorter Branch
[Figure: the same original design and its schedule; under the given resource allocation the shorter branch of the conditional has room for an additional scheduling step.]

Insert New Scheduling Step in Shorter Branch
- Dynamic branch balancing inserts new scheduling steps in the shorter branch of the conditional.
- This enables conditional speculation (of the operation -e in the figure) and leads to further code compaction.
[Figure: the scheduled design after a new step is inserted in the shorter branch.]

Dynamic CSE: Going Beyond Traditional CSE

C description:
a = b + c;
cd = b < c;
if (cd) d = b + c;
else e = g + h;

[Figure: the HTG representation of this code (an if-node over basic blocks BB0-BB4), and the result after traditional CSE: the second b + c in the true branch is replaced by d = a.]

Dynamic CSE: Going Beyond Traditional CSE
We use the notion of dominance of basic blocks:
- Basic block BBi dominates BBj if all control paths from the initial basic block of the design graph leading to BBj go through BBi.
- We can eliminate an operation op_j in BBj using the common expression in op_i if BBi dominates BBj.
[Figure: same C description and HTG as the previous slide; after traditional CSE, d = b + c in BB2 becomes d = a because BB0, which computes a = b + c, dominates BB2.]
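
A minimal sketch (an illustration written for this transcript, not SPARK code) of that dominance test, using the classic iterative data-flow computation of dominators on a five-block if-then-else CFG like the example's; the bitmask representation and block numbering are assumptions of the sketch.

```c
/* Sketch of the dominance test that licenses CSE: op_j in BB_j may
 * reuse the value of op_i in BB_i only if BB_i dominates BB_j.
 * Dominators are computed with the classic iterative algorithm over a
 * small CFG (<= 32 blocks, so one bitmask per block is enough).       */
#include <stdint.h>
#include <stdio.h>

#define NBLK 5                       /* BB0..BB4 as in the example     */

/* pred[b] lists the predecessors of block b, terminated by -1. */
static const int pred[NBLK][4] = {
    {-1},          /* BB0: entry block          */
    {0, -1},       /* BB1: if-node, after BB0   */
    {1, -1},       /* BB2: true branch          */
    {1, -1},       /* BB3: false branch         */
    {2, 3, -1},    /* BB4: join                 */
};

static uint32_t dom[NBLK];           /* dom[b] = bitmask of dominators */

static void compute_dominators(void) {
    dom[0] = 1u << 0;                              /* entry dominates itself */
    for (int b = 1; b < NBLK; b++) dom[b] = ~0u;   /* start from "all blocks" */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 1; b < NBLK; b++) {
            uint32_t d = ~0u;
            for (int k = 0; pred[b][k] >= 0; k++)
                d &= dom[pred[b][k]];              /* intersect over preds     */
            d |= 1u << b;                          /* a block dominates itself */
            if (d != dom[b]) { dom[b] = d; changed = 1; }
        }
    }
}

/* CSE legality check from the slide: BBi must dominate BBj. */
static int cse_legal(int bb_i, int bb_j) {
    return (dom[bb_j] >> bb_i) & 1u;
}

int main(void) {
    compute_dominators();
    printf("BB0 dominates BB2: %d\n", cse_legal(0, 2));  /* 1: d = b + c -> d = a  */
    printf("BB2 dominates BB4: %d\n", cse_legal(2, 4));  /* 0: no CSE across join  */
    return 0;
}
```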

New Opportunities for Dynamic CSE
[Figure: two HTGs (basic blocks BB0-BB8), before and after the scheduler decides to speculate a = b + c out of BB2 into BB0 as dcse = b + c, leaving a = dcse in BB2; a later occurrence d = b + c sits in BB6.]
- Before the code motion, CSE is not possible, since BB2 does not dominate BB6.
- After the speculation, CSE becomes possible, since BB0 dominates BB6.

New Opportunities for Dynamic CSE
[Figure: the same example; after speculation, dynamic CSE also rewrites d = b + c in BB6 as d = dcse.]
- If the scheduler moves or duplicates an operation op, apply CSE on the remaining operations using op.
- CSE was not possible originally because BB2 does not dominate BB6; it is possible now because BB0 dominates BB6.
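
A minimal before/after sketch of the same idea in C form (illustrative only, not generated by SPARK); the comments refer to the basic blocks named on the slide.

```c
/* Illustrative before/after sketch of dynamic CSE (not SPARK output). */

/* Before scheduling: the two b + c operations sit in conditionals, and
 * neither enclosing block dominates the other, so classical CSE cannot
 * merge them.                                                          */
int dyncse_original(int b, int c, int p, int q) {
    int a = 0, d = 0;
    if (p) a = b + c;      /* in BB2 on the slide */
    if (q) d = b + c;      /* in BB6 on the slide */
    return a + d;
}

/* After the scheduler speculates the first addition into BB0, dynamic
 * CSE rewrites the later occurrence to reuse the speculated value.     */
int dyncse_scheduled(int b, int c, int p, int q) {
    int dcse = b + c;      /* speculated into BB0, which dominates BB6 */
    int a = 0, d = 0;
    if (p) a = dcse;
    if (q) d = dcse;
    return a + d;
}
```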

Conditional Speculation & Dynamic CSE
[Figure: an HTG with basic blocks BB0-BB8. The scheduler decides to conditionally speculate a = b + c from BB5 into BB1 and BB2 as a' = b + c, leaving a = a' in BB5; a later operation d = b + c remains in BB8.]

Conditional Speculation & Dynamic CSE
[Figure: the same example; dynamic CSE then rewrites d = b + c in BB8 as d = a'.]

Conditional Speculation & Dynamic CSE
- Use the notion of dominance by groups of basic blocks.
- All control paths leading up to BB8 come from either BB1 or BB2, so BB1 and BB2 together dominate BB8; the duplicated a' = b + c can therefore be reused for d in BB8.
[Figure: the same example as the previous two slides.]
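
A minimal sketch (my own illustration, not SPARK's implementation) of this group-dominance test: a set G of basic blocks together dominates BBj exactly when BBj becomes unreachable from the entry block once the blocks in G are removed. The CFG below is an assumed example with the same flavor as the slide, where BB1 and BB2 are the only successors of the entry block.

```c
/* Sketch of group dominance: a set G of blocks dominates BBj when every
 * path from the entry to BBj passes through some member of G, i.e. BBj
 * is unreachable once the blocks in G are removed.                     */
#include <stdint.h>
#include <stdio.h>

#define NBLK 9                       /* BB0..BB8 (assumed example CFG)  */

/* succ[b] lists the successors of block b, terminated by -1. */
static const int succ[NBLK][4] = {
    {1, 2, -1},             /* BB0 -> BB1, BB2          */
    {3, 4, -1},             /* BB1 -> BB3, BB4          */
    {5, 6, -1},             /* BB2 -> BB5, BB6          */
    {7, -1}, {7, -1},       /* BB3, BB4 -> BB7          */
    {8, -1}, {8, -1},       /* BB5, BB6 -> BB8          */
    {8, -1},                /* BB7 -> BB8               */
    {-1},                   /* BB8: final join          */
};

static int reaches(int from, int to, uint32_t removed, uint32_t visited) {
    if ((removed >> from) & 1u) return 0;        /* path is cut here    */
    if (from == to) return 1;
    visited |= 1u << from;
    for (int k = 0; succ[from][k] >= 0; k++) {
        int s = succ[from][k];
        if (!((visited >> s) & 1u) && reaches(s, to, removed, visited))
            return 1;
    }
    return 0;
}

/* group is a bitmask of blocks; it dominates bb_j if removing it cuts
 * every path from the entry block BB0 to bb_j.                         */
static int group_dominates(uint32_t group, int bb_j) {
    return !reaches(0, bb_j, group, 0);
}

int main(void) {
    /* BB1 and BB2 together dominate BB8, although neither does alone. */
    printf("%d\n", group_dominates((1u << 1) | (1u << 2), 8));  /* 1 */
    printf("%d\n", group_dominates(1u << 1, 8));                /* 0 */
    return 0;
}
```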

Loop Shifting: An Incremental Loop Pipelining Technique
[Figure: a loop node with basic blocks BB0-BB4 and operations +a, -c, +b, -d in the loop body. Loop shifting moves operations (+a and -c in the example) from the beginning of the loop body into the loop head and re-inserts them at the end of the body, so successive iterations overlap.]

Loop Shifting: An Incremental Loop Pipelining Technique
[Figure: after shifting, the operations inside the loop body are compacted (scheduled together), shortening the loop body.]
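
A before/after C sketch of loop shifting (illustrative, not SPARK output): the first operation of the loop body is moved to the end of the body, where it now computes for the next iteration, and a copy of it is executed once before the loop, so an iteration's shifted operation overlaps with the remaining operations of the previous iteration.

```c
/* Illustrative sketch of loop shifting (not SPARK output).             */
#define N 16

/* Before: a[i] must be produced before b[i] in every iteration. */
void loop_original(int a[N], int b[N], int x[N], int y[N]) {
    for (int i = 0; i < N; i++) {
        a[i] = x[i] + 1;       /* op1 */
        b[i] = a[i] * y[i];    /* op2: depends on op1 of the same iteration */
    }
}

/* After shifting op1: op1 of iteration i + 1 executes in the same body
 * as op2 of iteration i, so the two can be scheduled in the same state. */
void loop_shifted(int a[N], int b[N], int x[N], int y[N]) {
    a[0] = x[0] + 1;                              /* shifted copy, before the loop */
    for (int i = 0; i < N; i++) {
        b[i] = a[i] * y[i];                       /* op2 of iteration i            */
        if (i + 1 < N) a[i + 1] = x[i + 1] + 1;   /* op1 of iteration i + 1        */
    }
}
```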

SPARK High-Level Synthesis Framework

SPARK Parallelizing HLS Framework
- C input and synthesizable RTL VHDL output.
- Tool-box of transformations and heuristics; each can be developed independently of the others.
- Script-based control over transformations and heuristics.
- Hierarchical intermediate representation (HTGs):
  - retains structural information about the design (conditional blocks, loops)
  - enables efficient and structured application of transformations.
- Complete HLS tool: does binding, control synthesis and back-end VHDL generation; interconnect-minimizing resource binding.
- Enables graphical visualization of the design description and intermediate results.
- 100,000+ lines of C++ code.
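
For intuition only, here is a hypothetical C sketch of what a hierarchical intermediate representation node might contain; SPARK itself is written in C++ and its actual classes are not reproduced here. The point is that if/loop nesting stays explicit instead of being flattened into a plain control-flow graph.

```c
/* Hypothetical HTG node layout (assumption for illustration only). */
#include <stddef.h>

typedef enum { HTG_BASIC_BLOCK, HTG_IF_ELSE, HTG_LOOP, HTG_COMPOUND } htg_kind;

typedef struct htg_node {
    htg_kind kind;
    /* HTG_BASIC_BLOCK: would hold a list of operations (omitted here). */
    /* HTG_IF_ELSE: condition block plus true/false sub-HTGs.           */
    struct htg_node *cond, *then_htg, *else_htg;
    /* HTG_LOOP: the loop body as a sub-HTG.                            */
    struct htg_node *body;
    /* HTG_COMPOUND: an ordered list of child HTG nodes.                */
    struct htg_node **children;
    size_t num_children;
} htg_node;
```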

Synthesizable C
- ANSI-C front end from Edison Design Group (EDG).
- Features of C not supported for synthesis:
  - pointers (however, arrays and passing by reference are supported)
  - recursive function calls
  - gotos.
- Features for which support has not been implemented:
  - multi-dimensional arrays
  - structs
  - continue, break.
- A hardware component is generated for each function; a called function is instantiated as a hardware component in the calling function.
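
A small hedged C sketch of the supported coding style, based on the restrictions listed above: pointer-based code is rewritten to use fixed-size arrays passed by reference. The function names and the 64-element size are invented for the example.

```c
/* Illustrative sketch of the supported style (names and sizes invented). */

/* Pointer arithmetic: not accepted for synthesis. */
int sum_ptr(int *p, int n) {
    int s = 0;
    while (n--) s += *p++;
    return s;
}

/* Synthesizable style: fixed-size array parameter, indexed access.
 * Each such function becomes its own hardware component.              */
#define N 64
int sum_array(int a[N]) {
    int s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}
```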

HTG Graph Visualization
[Figure: screenshot of an HTG and its data flow graph (DFG).]

Resource Utilization Graph
[Figure: resource utilization view of the scheduled design.]

Example of a Complex HTG
- Example of a real design: the MPEG-1 pred2 function.
- Shown just for demonstration; you are not expected to read the text.
- Multiple nested loops and conditionals.
[Figure: visualization of the pred2 HTG.]

Experiments
- Results presented here for: pre-synthesis transformations, speculative code motions, dynamic CSE.
- We used SPARK to synthesize designs derived from several industrial designs: MPEG-1, MPEG-2, the GIMP image processing software, and a case study of the Intel Instruction Length Decoder.
- Scheduling results: number of states in the FSM, cycles on the longest path through the design.
- VHDL / logic synthesis results: critical path length (ns), unit area.

Target Applications

Design            # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1          4          2                  17                     123
MPEG-1 pred2         11          6                  45                     287
MPEG-2 dp_frame      18          4                  61                     260
GIMP tiler           11          2                  35                     150

Scheduling & Logic Synthesis Results
[Bar charts: MPEG-1 pred1 and pred2 functions. Metrics (normalized to the baseline): Longest Path (l cycles), Critical Path (c ns), Total Delay (c*l), Unit Area. Series: non-speculative code motions (within basic blocks & across hierarchical blocks), + speculative code motions, + pre-synthesis transforms, + dynamic CSE. Annotated improvements of 36%, 42%, and 10% for pred1, and 39%, 36%, and 8% for pred2.]

Scheduling & Logic Synthesis Results
[Same bar charts as the previous slide for the MPEG-1 pred1 and pred2 functions.]
Overall: 63-66% improvement in delay with almost constant area.

Scheduling & Logic Synthesis Results
[Bar charts: MPEG-2 dp_frame and GIMP tiler functions, with the same metrics and transformation series as before. Annotated improvements of 33%, 20%, and 1% for dp_frame, and 52%, 41%, and 14% for tiler.]

Scheduling & Logic Synthesis Results
[Same bar charts as the previous slide for the MPEG-2 dp_frame and GIMP tiler functions.]
Overall: 48-76% improvement in delay with almost constant area.

Case Study: Intel Instruction Length Decoder
[Figure: a stream of instructions fills an instruction buffer; the instruction length decoder marks where the first, second, and third instructions begin.]

Example Design: ILD Block from Intel
- Case study: a design derived from the Instruction Length Decoder of the Intel Pentium class of processors.
- Decodes the length of instructions streaming from memory.
- Has to look at up to 4 bytes at a time.
- Has to execute in one cycle and decode about 64 bytes of instructions.
- Characteristics of microprocessor functional blocks:
  - low latency: single- or dual-cycle implementation
  - consist of several small computations
  - intermix of control and data logic.

Basic Instruction Length Decoder: Initial Description
[Figure: bytes 1-4 of the buffer each produce a length contribution; the contributions are summed to give the total length of the instruction, with "need byte 2/3/4?" checks deciding which contributions apply.]

Instruction Length Decoder: Decoding the 2nd Instruction
- After decoding the length of an instruction, start looking from the next byte.
- Again examine up to 4 bytes to determine the length of the next instruction.
[Figure: the same length-contribution structure applied to bytes 3-6, which follow the first instruction.]
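
To make the serial dependence concrete, here is a hypothetical C sketch of the initial, sequential description; length_contribution(), the 64-byte buffer size, and the loop structure are assumptions of the sketch, not Intel's or SPARK's actual code.

```c
/* Hypothetical sketch: sequential decode. Each instruction's length must
 * be known before decoding of the next one can start, so the outer loop
 * is inherently serial.                                                 */
#define BUF_BYTES 64

extern int length_contribution(unsigned char byte, int position);  /* assumed helper */

void decode_lengths_sequential(const unsigned char buf[BUF_BYTES],
                               int lengths[BUF_BYTES], int *num_insns) {
    int pos = 0, n = 0;
    while (pos < BUF_BYTES) {
        int len = 0;
        /* Look at up to 4 bytes to determine this instruction's length. */
        for (int k = 0; k < 4 && pos + k < BUF_BYTES; k++)
            len += length_contribution(buf[pos + k], k);
        if (len < 1) len = 1;  /* guard for the sketch only              */
        lengths[n++] = len;
        pos += len;            /* the next decode depends on this result */
    }
    *num_insns = n;
}
```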

Instruction Length Decoder: Parallelized Description
- Speculatively calculate the length contributions of all 4 bytes at a time.
- Determine the actual total length of the instruction based on this data.
[Figure: bytes 1-4 feed their length contributions and "need byte?" signals into a single combined sum.]

ILD: Extracting Further Parallelism
- Speculatively calculate the length of instructions assuming a new instruction starts at each byte.
- Do this calculation for all bytes in parallel.
- Traverse from the 1st byte to the last, determining the lengths of the instructions starting from the 1st until the last.
- Discard the unused calculations.
[Figure: an instruction-length calculation block per byte (bytes 1-5 shown), all operating in parallel.]
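
A matching hypothetical sketch of the parallelized description, under the same assumptions as the sequential sketch above: the per-byte length calculations are independent, so after speculation and full loop unrolling they can all be evaluated in parallel, and only the final selection walks the byte stream.

```c
/* Hypothetical sketch: parallelized decode. The speculative phase is a
 * fully unrollable loop whose iterations are independent, so hardware
 * can evaluate them all at once; the selection phase keeps the lengths
 * of instructions that actually start and discards the rest.           */
#define BUF_BYTES 64

extern int length_contribution(unsigned char byte, int position);  /* assumed helper */

void decode_lengths_parallel(const unsigned char buf[BUF_BYTES],
                             int lengths[BUF_BYTES], int *num_insns) {
    int spec_len[BUF_BYTES];

    /* Speculative phase: independent per byte, fully unrollable.       */
    for (int i = 0; i < BUF_BYTES; i++) {
        int len = 0;
        for (int k = 0; k < 4 && i + k < BUF_BYTES; k++)
            len += length_contribution(buf[i + k], k);
        spec_len[i] = (len > 0) ? len : 1;   /* guard for the sketch     */
    }

    /* Selection phase: traverse from the first byte to the last.       */
    int pos = 0, n = 0;
    while (pos < BUF_BYTES) {
        lengths[n++] = spec_len[pos];
        pos += spec_len[pos];
    }
    *num_insns = n;
}
```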

Initial: Multi-Cycle Sequential Architecture
[Figure: the initial architecture computes the length contributions of bytes 1-4 sequentially, chained through the "need byte 2/3/4?" decisions, over multiple cycles.]

ILD Synthesis: Resulting Architecture
Transformations applied: speculate operations, fully unroll the loop, eliminate the loop index variable.
[Figure: the synthesized architecture.]

ILD Synthesis: Resulting Architecture
- Transformations applied: speculate operations, fully unroll the loop, eliminate the loop index variable.
- The multi-cycle sequential architecture becomes a single-cycle parallel architecture.
- Our toolbox approach enables us to develop a script to synthesize applications from different domains.
- The final design looks close to the actual implementation done by Intel.

Conclusions
- Parallelizing code transformations enable a new range of HLS transformations and provide the needed improvement in the quality of HLS results.
  - Make it possible to be competitive against manually designed circuits.
  - Can enable productivity improvements in microelectronic design.
- Built a synthesis system with a range of code transformations:
  - a platform for applying coarse-grain and fine-grain optimizations
  - a tool-box approach in which transformations and heuristics can be developed independently
  - enables the designer to find the right synthesis script for different application domains.
- Performance improvements of 60-70% across a number of designs.
- We have shown the approach's effectiveness on an Intel design.

Acknowledgements
- Advisors: Professors Rajesh Gupta, Nikil Dutt, Alex Nicolau
- Contributors to the SPARK framework: Nick Savoiu, Mehrdad Reshadi, Sunwoo Kim
- Intel Strategic CAD Labs (SCL): Timothy Kam, Mike Kishinevsky
- Supported by the Semiconductor Research Corporation and Intel SCL

Thank You