High-Level Synthesis: Creating Custom Circuits from High-Level Code


High-Level Synthesis: Creating Custom Circuits from High-Level Code. Hao Zheng, Comp Sci & Eng, University of South Florida

Existing Design Flow Register-transfer (RT) synthesis - Specify RT structure (muxes, registers, etc.) - Allows precise specification - But time consuming, difficult, error prone [Flow diagram: Synthesizable HDL → RT Synthesis → Netlist → Technology Mapping → Physical Design (Placement, Routing) → Bitfile → implementation on ASIC / FPGA / Processor]

Future Design Flow [Flow diagram: C/C++, Java, etc. → High-Level Synthesis → Synthesizable HDL → RT Synthesis → Netlist → Technology Mapping → Physical Design (Placement, Routing) → Bitfile → implementation on ASIC / FPGA / Processor]

Overview Input: - High-level languages (e.g., C) - Behavioral hardware description languages (e.g., VHDL) - State diagrams / logic networks Tools: - Parser - Library of modules Constraints: - Area constraints (e.g., # modules of a certain type) - Delay constraints (e.g., a set of operations should finish in λ clock cycles) Output: - Operation scheduling (time) and binding (resource) - Control generation and detailed interconnections

High-level Synthesis - Benefits Ratio of C to VHDL developers (10,000:1?) Easier to specify complex designs Technology/architecture-independent designs Manual HW design potentially slower - Similar to the assembly code era - Programmers could always beat the compiler - But, no longer the case Ease of HW/SW partitioning - enhances overall system efficiency More efficient verification and validation - Easier to verify and validate high-level code

High-level Synthesis More challenging than SW compilation - Compilation maps behavior into assembly instructions - Architecture is known to the compiler HLS creates a custom architecture to execute the specified behavior - Huge hardware exploration space - Best solution may include microprocessors - Ideally, should handle any high-level code But not all code is appropriate for hardware

High-level Synthesis: An Example First, consider how to manually convert high-level code into a circuit. Code: acc = 0; for (i=0; i < 128; i++) acc += a[i]; Steps: 1) Build FSM for controller 2) Build datapath based on FSM

A Manual Example Build a FSM (controller) - Decompose the code into states. acc = 0; for (i=0; i < 128; i++) acc += a[i]; [FSM: state acc=0, i=0 → test i < 128; if true, state load a[i] → state acc += a[i], i++ → back to the test; if i >= 128, state Done]
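For reference, a minimal C sketch that simulates this decomposition in software; the state names, the sample input data, and the printed result are illustrative assumptions, not part of the slides.

    #include <stdio.h>

    enum state { S_INIT, S_TEST, S_LOAD, S_ADD, S_DONE };

    int main(void) {
        int a[128];
        for (int k = 0; k < 128; k++) a[k] = k;   /* sample input data */
        enum state s = S_INIT;
        int acc = 0, i = 0, a_reg = 0;
        while (s != S_DONE) {
            switch (s) {
            case S_INIT: acc = 0; i = 0;              s = S_TEST; break; /* acc=0, i=0        */
            case S_TEST: s = (i < 128) ? S_LOAD : S_DONE;         break; /* i < 128 ?         */
            case S_LOAD: a_reg = a[i];                s = S_ADD;  break; /* load a[i]         */
            case S_ADD:  acc += a_reg; i++;           s = S_TEST; break; /* acc += a[i]; i++  */
            default: break;
            }
        }
        printf("acc = %d\n", acc);                    /* 0 + 1 + ... + 127 = 8128 */
        return 0;
    }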

A Manual Example Build a datapath - Allocate resources for each state. acc = 0; for (i=0; i < 128; i++) acc += a[i]; [Datapath figure: registers i, acc, a[i], addr; constants 1 and 128; adders and comparator (<)]

A Manual Example Build a datapath - Determine register inputs. acc = 0; for (i=0; i < 128; i++) acc += a[i]; [Datapath figure: 2x1 muxes select initial values (0, 0, &a) or updated values for the i, acc, and addr registers; the a[i] register is loaded from memory; adders and comparator (i < 128)]

A Manual Example Build a datapath - Add outputs. acc = 0; for (i=0; i < 128; i++) acc += a[i]; [Datapath figure: same structure with outputs acc and memory address]

A Manual Example Build a datapath - Add control signals. acc = 0; for (i=0; i < 128; i++) acc += a[i]; [Datapath figure: same structure annotated with control signals]

A Manual Example Combine controller + datapath. acc = 0; for (i=0; i < 128; i++) acc += a[i]; [Figure: controller with Start/Done connected to the datapath; Memory Read signal, memory address output, "in from memory" input, and acc output]

A Manual Example - Optimization Alternatives - Use one adder (plus muxes). [Figure: datapath with a single shared adder fed through extra muxes, plus registers i, acc, a[i], addr, constants 1 and 128, the comparator, and the memory address output]

A Manual Example Summary Comparison with high-level synthesis - Determining when to perform each operation => Scheduling - Allocating resources for each operation => Resource allocation - Mapping operations to allocated resources => Binding

Another Example: Try it at home Your turn x = 0; for (i=0; i < 100; i++) { if (a[i] > 0) x++; else x--; a[i] = x; } //output x 1) Build FSM (do not perform if conversion) 2) Build datapath based on FSM

High-Level Synthesis high-level code: could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc. → High-Level Synthesis → Custom Circuit: usually an RT-level VHDL/Verilog description, but could be as low-level as a bitfile for an FPGA, or a gate netlist.

High-Level Synthesis Overview acc = 0; for (i=0; i < 128; i++) acc += a[i]; → High-Level Synthesis → [Figure: the controller + datapath circuit from the manual example: controller, 2x1 muxes, registers i, acc, a[i], addr, constants 1 and 128, comparator, Done, Memory Read, acc, memory address]

Main Steps High-level Code → Front-end (Syntactic Analysis) → Intermediate Representation: converts code to an intermediate representation - allows all following steps to use a language-independent format → Optimization → Back-end: Scheduling/Resource Allocation (determines when each operation will execute, and the resources used), then Binding/Resource Sharing (maps operations onto physical resources) → Cycle-accurate RTL code

Parsing & Syntactic Analysis

Syntactic Analysis Definition: analysis of code to verify syntactic correctness - Converts code into an intermediate representation Steps: similar to SW compilation 1) Lexical analysis (lexing) 2) Parsing 3) Code generation of the intermediate representation [Flow: High-level Code → Lexical Analysis → Parsing → Intermediate Representation]

Intermediate Representation Parser converts tokens to an intermediate representation - Usually, an abstract syntax tree. Example code: x = 0; if (y < z) x = 1; d = 6; [AST figure: assign(x, 0); if with condition (y < z) guarding assign(x, 1); assign(d, 6)]
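As a rough illustration, the tree above could be held in a structure like the C sketch below; the node kinds and field names are assumptions chosen for this example, not any particular tool's representation.

    /* Illustrative AST node; child[] holds lhs/rhs for assignments and cond/then/else for if. */
    typedef enum { N_ASSIGN, N_IF, N_LT, N_VAR, N_CONST } node_kind;

    typedef struct ast_node {
        node_kind kind;
        const char *name;            /* variable name, for N_VAR nodes     */
        int value;                   /* literal value, for N_CONST nodes   */
        struct ast_node *child[3];   /* subtrees; unused entries are NULL  */
    } ast_node;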

Intermediate Representation Why use an intermediate representation? - Easier to analyze/optimize than source code - Theoretically can be used for all languages: makes the synthesis back end language independent [Figure: C Code, Java, and Perl each go through Syntactic Analysis into a common Intermediate Representation, then the Back End] Scheduling, resource allocation, and binding are independent of the source language - sometimes optimizations too

Intermediate Representation Different Types - Abstract Syntax Tree - Control/Data Flow Graph (CDFG) - Sequencing Graph... We will focus on CDFG - Combines control flow graph (CFG) and data flow graph (DFG)

Control Flow Graphs (CFGs) Represents control flow dependencies of basic blocks A basic block is a section of code that always executes from beginning to end - i.e., no jumps into or out of the block acc = 0; for (i=0; i < 128; i++) acc += a[i]; [CFG: block acc=0, i=0 → test i < 128? yes → block acc += a[i]; i++ → back to the test; no → Done]

Control Flow Graphs: Your Turn Find a CFG for the following code. i = 0; while (i < 10) { if (x < 5) y = 2; else if (z < 10) y = 6; i++; }

Data Flow Graphs Represents data dependencies between operations x = a + b; y = c * d; z = x - y; [DFG: a, b → + → x; c, d → * → y; x, y → - → z]
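One simple way to hold such a DFG in memory, sketched in C; the structure and field names are assumptions, but the scheduling sketches later in this transcript reuse them to stay concrete.

    /* Each node is an operation or a primary input; in0/in1 are its data dependencies. */
    typedef enum { OP_INPUT, OP_ADD, OP_SUB, OP_MUL } op_kind;

    typedef struct dfg_node {
        op_kind kind;
        struct dfg_node *in0, *in1;  /* predecessors (NULL for primary inputs)  */
        int cycle;                   /* start cycle, filled in by the scheduler */
    } dfg_node;

    /* x = a + b; y = c * d; z = x - y; */
    dfg_node a = {OP_INPUT}, b = {OP_INPUT}, c = {OP_INPUT}, d = {OP_INPUT};
    dfg_node x = {OP_ADD, &a, &b};
    dfg_node y = {OP_MUL, &c, &d};
    dfg_node z = {OP_SUB, &x, &y};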

Control/Data Flow Graph Combines CFG and DFG - Maintains a DFG for each node of the CFG acc = 0; for (i=0; i < 128; i++) acc += a[i]; [CDFG: CFG node acc=0; i=0 with its constant-assignment DFG; node if (i < 128); node acc += a[i]; i++ with its add/increment DFG; node Done]

Transformation/Optimization

Synthesis Optimizations After creating the CDFG, high-level synthesis optimizes it with the following goals - Reduce area - Improve latency - Increase parallelism - Reduce power/energy 2 types of optimizations - Data flow optimizations - Control flow optimizations

Data Flow Optimizations Tree-height reduction - Generally made possible by commutativity, associativity, and distributivity. Example: x = a + b + c + d [Figure: the unbalanced chain of additions vs. the rebalanced tree]
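In source terms, the rebalancing might look like the sketch below; assuming two adders are available and each addition takes one cycle, the rebalanced form needs two cycles instead of three.

    /* Before: a chain of dependent additions, tree height 3 */
    x = ((a + b) + c) + d;

    /* After: (a + b) and (c + d) can execute in the same cycle, tree height 2 */
    x = (a + b) + (c + d);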

Data Flow Optimizations Operator Strength Reduction - Replacing an expensive ("strong") operation with a faster one - Common example: replacing multiply/divide with shift. 1 multiplication → 0 multiplications: b[i] = a[i] * 8; → b[i] = a[i] << 3; a = b * 5; → c = b << 2; a = b + c; a = b * 13; → c = b << 2; d = b << 3; a = c + d + b;

Data Flow Optimizations Constant propagation - Statically evaluate expressions with constants x = 0; y = x * 15; z = y + 10; → x = 0; y = 0; z = 10;

Data Flow Optimizations Function Specialization - Create specialized code for common inputs - Treat common inputs as constants - If inputs are not known statically, must include an if statement for each call to the specialized function. int f (int x) { y = x * 15; return y + 10; } for (i=0; i < 1000; i++) f(0); → treating the frequent input 0 as a constant: int f_opt () { return 10; } for (i=0; i < 1000; i++) f_opt();

Data Flow Optimizations Common sub-expression elimination - If an expression appears more than once, the repetitions can be replaced a = x + y; ... b = c * 25 + x + y; → a = x + y; ... b = c * 25 + a; (x + y already determined)

Data Flow Optimizations Dead code elimination - Remove code that is never executed May seem like stupid code, but often comes from constant propagation or function specialization int f (int x) { if (x > 0) a = b * 15; else a = b / 4; return a; } → int f_opt () { a = b * 15; return a; } Specialized version for x > 0 does not need the else branch - dead code

Data Flow Optimizations Code motion (hoisting/sinking) - Avoid repeated computation for (i=0; i < 100; i++) { z = x + y; b[i] = a[i] + z; } → z = x + y; for (i=0; i < 100; i++) { b[i] = a[i] + z; }

Control Flow Optimizations Loop Unrolling - Replicate the body of the loop May increase parallelism for (i=0; i < 128; i++) a[i] = b[i] + c[i]; → for (i=0; i < 128; i+=2) { a[i] = b[i] + c[i]; a[i+1] = b[i+1] + c[i+1]; }

Control Flow Optimizations Function inlining - replace a function call with the body of the function - Common for both SW and HW - SW: eliminates function call instructions - HW: eliminates unnecessary control states int f (int a, int b) { return a + b * 15; } for (i=0; i < 128; i++) a[i] = f(b[i], c[i]); → for (i=0; i < 128; i++) a[i] = b[i] + c[i] * 15;

Control Flow Optimizations Conditional Expansion - replace if with a logic expression - Execute if/else bodies in parallel y = ab; if (a) x = b+d; else x = bd; [De Micheli] → y = ab; x = a(b+d) + a'bd Can be further optimized to: y = ab; x = y + d(a+b)
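Reading the variables as single-bit Boolean values, the transformation can be written as the C sketch below (| for OR, & for AND); this is a hedged restatement of the De Micheli example above, not an additional technique.

    /* if/else form: only one branch is taken */
    y = a & b;
    x = a ? (b | d) : (b & d);

    /* expanded form: both branches evaluated, selected by a */
    x = (a & (b | d)) | (!a & (b & d));

    /* further optimized, reusing y = a & b */
    x = y | (d & (a | b));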

Example Optimize this: x = 0; y = a + b; if (x < 15) z = a + b - c; else z = x + 12; output = z * 12;
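One possible result of applying the optimizations above (constant propagation on x, dead-code elimination of the unreachable else branch, and common sub-expression elimination of a + b); other equally valid answers exist.

    /* x = 0, so (x < 15) is always true and the else branch is dead */
    y = a + b;
    z = y - c;             /* a + b reused via CSE */
    output = z * 12;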

Scheduling/Resource Allocation

Scheduling Scheduling assigns a start time to each operation in the DFG - Start times must not violate dependencies in the DFG - Start times must meet performance constraints - Alternatively, resource constraints Performed on the DFG of each CFG node - Cannot execute multiple CFG nodes in parallel

Examples [Figure: the same small DFG over inputs a, b, c, d scheduled in different ways, spreading its operations across different numbers of cycles]

Scheduling Problems Several types of scheduling problems - Usually some combination of performance and resource constraints Problems: - Unconstrained: not very useful, every schedule is valid - Minimum latency - Latency constrained - Minimum-latency, resource constrained: i.e., find the schedule with the shortest latency that uses no more than a specified # of resources (NP-complete) - Minimum-resource, latency constrained: i.e., find the schedule that meets the latency constraint (which may be anything) and uses the minimum # of resources (NP-complete)

Minimum Latency Scheduling ASAP (as soon as possible) algorithm - Find a candidate node: a candidate is a node whose predecessors have all been scheduled and completed (or that has no predecessors) - Schedule the node one cycle later than the max cycle of its predecessors - Repeat until all nodes are scheduled [Figure: ASAP schedule of a DFG over inputs a-h across Cycles 1-4] Minimum possible latency - 4 cycles
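A compact C sketch of ASAP over the dfg_node structure introduced earlier; the one-cycle-per-operation assumption and the candidate search by repeated scanning are simplifications for illustration.

    /* A predecessor is ready if it is absent, a primary input, or already scheduled. */
    static int pred_ready(const dfg_node *p) {
        return p == NULL || p->kind == OP_INPUT || p->cycle != 0;
    }
    static int pred_cycle(const dfg_node *p) {
        return (p == NULL || p->kind == OP_INPUT) ? 0 : p->cycle;
    }

    /* ASAP: nodes[] holds only operation nodes; cycle == 0 means "not yet scheduled". */
    void schedule_asap(dfg_node *nodes[], int n) {
        int scheduled = 0;
        for (int k = 0; k < n; k++) nodes[k]->cycle = 0;
        while (scheduled < n) {
            for (int k = 0; k < n; k++) {
                dfg_node *nd = nodes[k];
                if (nd->cycle != 0) continue;                          /* already placed      */
                if (!pred_ready(nd->in0) || !pred_ready(nd->in1)) continue;
                int latest = pred_cycle(nd->in0);
                if (pred_cycle(nd->in1) > latest) latest = pred_cycle(nd->in1);
                nd->cycle = latest + 1;                                /* one cycle after the latest predecessor */
                scheduled++;
            }
        }
    }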

Minimum Latency Scheduling ALAP (as late as possible) algorithm - Run ASAP, get minimum latency L - Find a candidate: a candidate is a node whose successors are all scheduled (or that has none) - Schedule the node one cycle before the min cycle of its successors; nodes with no successors are scheduled to cycle L - Repeat until all nodes are scheduled [Figure: ALAP schedule of the same DFG over inputs a-h across Cycles 1-4, L = 4 cycles]
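A matching ALAP sketch under the same assumptions; L is the latency obtained from ASAP, and successors are found by scanning the node array rather than storing explicit successor lists.

    /* ALAP: place each op one cycle before its earliest successor; ops with no successors go at cycle L. */
    void schedule_alap(dfg_node *nodes[], int n, int L) {
        int scheduled = 0;
        for (int k = 0; k < n; k++) nodes[k]->cycle = 0;
        while (scheduled < n) {
            for (int k = 0; k < n; k++) {
                dfg_node *nd = nodes[k];
                if (nd->cycle != 0) continue;
                int earliest = L + 1, has_succ = 0, unscheduled_succ = 0;
                for (int j = 0; j < n; j++) {                          /* scan for successors */
                    if (nodes[j]->in0 == nd || nodes[j]->in1 == nd) {
                        has_succ = 1;
                        if (nodes[j]->cycle == 0) unscheduled_succ = 1;
                        else if (nodes[j]->cycle < earliest) earliest = nodes[j]->cycle;
                    }
                }
                if (unscheduled_succ) continue;                        /* not a candidate yet */
                nd->cycle = has_succ ? earliest - 1 : L;
                scheduled++;
            }
        }
    }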

Latency-Constrained Scheduling Instead of finding the minimum latency, find a latency less than L Solutions: - Use ASAP, verify that the minimum latency is less than L - Use ALAP starting with cycle L instead of the minimum latency (don't need ASAP)

Scheduling with Resource Constraints Schedule must use no more than the specified number of resources Constraints: 1 ALU (+/-), 1 Multiplier [Figure: DFG over inputs a-g scheduled across 5 cycles]

Scheduling with Resource Constraints Schedule must use no more than the specified number of resources Constraints: 2 ALUs (+/-), 1 Multiplier [Figure: the same DFG scheduled across 4 cycles]

Minimum-Latency, Resource-Constrained Scheduling Definition: Given resource constraints, find the schedule that has the minimum latency - Example constraints: 1 ALU (+/-), 1 Multiplier [Figure, built up over several slides: DFG over inputs a-g scheduled step by step under these constraints]
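The slides do not commit to an algorithm for this problem; a common heuristic is list scheduling, sketched below in C under the same one-cycle-per-operation assumption, with alu_limit and mul_limit modeling constraints such as "1 ALU (+/-), 1 Multiplier". It reuses pred_ready and pred_cycle from the ASAP sketch.

    /* List scheduling: each cycle, start as many ready operations as the resource limits allow. */
    void schedule_list(dfg_node *nodes[], int n, int alu_limit, int mul_limit) {
        int scheduled = 0, cycle = 0;
        for (int k = 0; k < n; k++) nodes[k]->cycle = 0;
        while (scheduled < n) {
            cycle++;
            int alu_used = 0, mul_used = 0;
            for (int k = 0; k < n; k++) {
                dfg_node *nd = nodes[k];
                if (nd->cycle != 0) continue;
                /* ready only if both predecessors finished in an earlier cycle */
                if (!pred_ready(nd->in0) || pred_cycle(nd->in0) >= cycle) continue;
                if (!pred_ready(nd->in1) || pred_cycle(nd->in1) >= cycle) continue;
                if (nd->kind == OP_MUL) {
                    if (mul_used < mul_limit) { nd->cycle = cycle; mul_used++; scheduled++; }
                } else {
                    if (alu_used < alu_limit) { nd->cycle = cycle; alu_used++; scheduled++; }
                }
            }
        }
    }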

Binding/Resource Sharing

Binding During scheduling, we determined: - When operations will execute - How many resources are needed We still need to decide which operations execute on which resources - binding - If multiple operations use the same resource, we need to decide how the resource is shared - resource sharing.

Binding Map operations onto resources such that operations in the same cycle do not use the same resource 2 ALUs (+/-), 2 Multipliers [Figure: operations 1-8, scheduled over Cycles 1-4, bound to Mult1, Mult2, ALU1, ALU2]

Binding Many possibilities - A bad binding may increase resources, require huge steering logic, reduce the clock rate, etc. 2 ALUs (+/-), 2 Multipliers [Figure: an alternative binding of operations 1-8 onto Mult1, Mult2, ALU1, ALU2]
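A deliberately naive greedy binder in C, to make the idea concrete: within each cycle it simply hands out instance numbers per resource type. The binding[] output array is an assumption, and a real binder would also weigh the mux/steering-logic cost that this slide warns about.

    /* Greedy binding: operations in the same cycle get distinct instances of their resource type. */
    typedef struct { int unit; } binding;      /* which ALU or multiplier instance an op uses */

    void bind_greedy(dfg_node *nodes[], binding out[], int n, int max_cycle) {
        for (int c = 1; c <= max_cycle; c++) {
            int next_alu = 0, next_mul = 0;
            for (int k = 0; k < n; k++) {
                if (nodes[k]->cycle != c) continue;
                if (nodes[k]->kind == OP_MUL) out[k].unit = next_mul++;
                else                          out[k].unit = next_alu++;
            }
        }
    }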

Binding Can't do this - one resource can't perform multiple ops simultaneously! 2 ALUs (+/-), 2 Multipliers [Figure: an invalid binding where two operations in the same cycle map to the same unit]

Translation to Datapath [Figure: the scheduled/bound DFG over inputs a-i mapped to a datapath with input muxes feeding Mult(ops 1,5), ALU(ops 2,7,8,4), Mult(op 6), ALU(op 3), each with an output register] 1) Add resources and registers 2) Add a mux for each input 3) Add an input to the left mux for each left input in the DFG 4) Do the same for the right mux 5) If there is only 1 input, remove the mux

Summary

Main Steps Front-end (lexing/parsing) converts code into intermediate representation - We looked at CDFG Scheduling assigns a start time for each operation in DFG - CFG node start times defined by control dependencies - Resource allocation determined by schedule Binding maps scheduled operations onto physical resources - Determines how resources are shared Big picture: - Scheduled/Bound DFG can be translated into a datapath - CFG can be translated to a controller - => High-level synthesis can create a custom circuit for any CDFG!

Limitations Task-level parallelism - Parallelism in a CDFG is limited to individual control states: can't have multiple states executing concurrently - Potential solution: use a model other than CDFG, e.g., Kahn Process Networks: nodes represent parallel processes/tasks, edges represent communication between processes; high-level synthesis can create a controller + datapath for each process; must also consider communication buffers - Challenge: most high-level code does not have explicit parallelism; difficult/impossible to extract task-level parallelism from code

Limitations Coding practices limit circuit performance - Very often, languages contain constructs not appropriate for circuit implementation Recursion, pointers, virtual functions, etc. Potential solution: use specialized languages - Remove problematic constructs, add task-level parallelism Challenge: - Difficult to learn new languages - Many designers resist changes to tool flow

Limitations Expert designers can achieve better circuits - High-level synthesis has to work with the specification in code: can be difficult to automatically create an efficient pipeline; may require dozens of optimizations applied in a particular order - Expert designers can transform the algorithm: synthesis can transform code, but can't change the algorithm Potential solution: ??? - New language? - New methodology? - New tools?