Introduction to Model-Based High-Level Synthesis. Synphony Model Compiler


Introduction to Model-Based High-Level Synthesis: Synphony Model Compiler
Doug Johnson, Synopsys, Incorporated (dougj@synopsys.com)
April 22, 2014. Synopsys 2013

Agenda
- Introduction
- What is High-Level Synthesis?
- HLS versus RTL Synthesis
- Why use High-Level Synthesis?
- Overview - Synphony Model Compiler HLS
- Using Synphony Model Compiler HLS
- Signal Processing IP and High-Level Design
- Optimizations and Quality of Results
- Simulation and Verification Using RTL and C Models
- Optimization for ASIC Targets
- Design Example: Using SMC for ASIC Power Optimization for a Digital Downconverter
- Q&A

Some Definitions
HLS definition: High-level synthesis (HLS) tools raise the design abstraction level by automatically generating optimized RTL hardware from an algorithm description.
- Language-based (C code, MATLAB M-model, SystemC, ...)
- Model-based (MATLAB/Simulink)
HLS versus RTL synthesis:
- RTL synthesis: registers and clocks are defined; low-level hardware description in a language; IP blocks.
- High-level synthesis: the design description is abstracted away from the implementation, e.g. samples versus clocks, behavioral versus RTL description, sample-based models.
Retiming:
- System-level retiming adds pipelining registers to improve performance; latency is increased but functionality stays the same.
- Logic-level retiming moves registers around and redistributes clouds of logic to improve performance; latency stays the same.

Why High-Level Synthesis?
Higher design reuse and vendor independence
- Optimized results for multiple FPGAs and ASIC
- Fast migration to new FPGA families and devices
Design productivity
- Simulink/MATLAB design and verification flow
- High-value signal processing IP
- Custom high-level IP design methodology
Verification productivity
- Automated test from MATLAB
- Faster simulation using C-model generation
- Faster/earlier integrated system verification

Synphony Model Compiler (SMC)
High-Level Synthesis for FPGA & ASIC Algorithm Implementation
- Signal processing IP library (Synphony DSP IP Library)
- Optimizations for high QoR across multiple technologies, IP, and system architectures
- High-performance verification for RTL and C-model system simulation
[Flow diagram: High-Level Design (Simulink/MATLAB) feeds Synphony Model Compiler High-Level Synthesis, which feeds Synplify for FPGA (multi-vendor DSP mapping, timing closure), Design Compiler for ASIC (custom ASIC timing closure, power analysis), and Verification (RTL testbench, C-models, prototypes)]

Quality and Design Portability using HLS
- Achieve excellent results for DSP designs across technologies
- Migrate designs more quickly and reliably
- Enable area, speed and power trade-offs early in the design cycle
Design Compiler for ASIC targets
- Target custom ASIC libraries (.lib file)
- Direct timing characterization using Design Compiler
Synplify Pro/Premier for FPGA targets
- Optimized mapping for DSP and memory resources
- Direct timing characterization using Synplify
Integrated power analysis flows using Power Compiler
[Diagram: Algorithm Design feeds Synphony Model Compiler, which feeds the ASIC and FPGA flows]

Adopting High-Level Synthesis
More flexibility with Synphony Model Compiler
Use #1: Verification of RTL using the MATLAB/Simulink environment
- High-level verification advantages with speed and without the pain
Use #2: Use optimal Synphony MC IP blocks
- Reduces the need to hand-craft RTL
- Eliminates high-effort QoR issues like DSP mapping and BRAM mapping in a vendor-independent way
- Mix high-level blocks and RTL in a MATLAB/Simulink-based verification and design environment
Use #3: High-level design for complex subsystems
- Complex signal processing blocks created with much higher productivity
- Quickly migrate blocks to new FPGA devices or ASIC process nodes
Phase adoption over time and retain the value of existing HDL infrastructure

Overview: Synphony Model Compiler High-Level Synthesis

Design Entry Using SMC Library
High-level IP functions for wireless & communications applications
SMC Library (Blockset) for High-Level Synthesis
- Fixed-point tools for easy control of algorithm precision
- Vector math for fast and concise parallelism
- Full multi-rate support simplifies multiple clock designs
- Language blocks: use RTL & M-Control for easier design of control & interfaces
Example: Digital Radio Receiver
2-3X productivity, specification to verified fixed-point model

Optimized Signal Processing IP
Portable Signal Processing Functions for FPGA and ASIC
- Easy to use with high capacity and advanced features
- Achieve excellent DSP hardware mapping on advanced FPGAs and ASIC IP
- Target multiple FPGA and ASIC vendors
Synphony IP Cores

| Function (Blocks) | Key Features / Modes |
| FFT, FFT2 | Multichannel, Parallel / Serial; Flow Control; Dynamic Length |
| FIR, FIR2 | Multi-Channel, Parallel / Serial; Programmable Coef.; Polyphase multirate; Flow Control; Symmetry-optimized |
| DDS/NCO | Multichannel, Parallel / Serial; SFDR mode; Dithering; Flow Control |
| Viterbi Decoder / Convolutional Encoder | 1/K Rate, Mother Codes, Traceback Length |

Architecture Optimizations: Speed (HLS retiming), Area (HLS folding), Micro-architecture (IP-specific)
Target Optimizations: Xilinx, Altera, ASIC

Area Optimized Implementations
SMC High-Level Synthesis (example: Digital Radio Receiver)
System-Level Architecture Optimizations
- User-specified architecture optimizations for timing, area, and power
IP Micro-Architecture Optimizations
- IP blocks automatically select a micro-architecture optimized for the given target device and user-specified constraints
Target-Specific Optimizations
- FPGA: optimized mapping to on-chip resources (DSP, memories, shift registers)
- ASIC: real-time characterization, support for ASIC IP arithmetic units
5-10X productivity, model to optimized and verified HW
[Diagram: Synphony Model Compiler High-Level Synthesis produces RTL for multiple architectures and targets, ranging from sequential to parallel architectures with increasing throughput]

How SMC Optimizes for Different Targets
SMC Accuracy for ASIC & FPGA (example: GSM Radio Receiver)
Problem: accurate timing and area for HLS
- Required for timing closure and accurate optimization results
- User's logic synthesis settings might cause divergence (for example, logic optimizations for power)
SMC Advanced Timing Mode
- Dynamically characterizes blocks directly using the logic synthesis tool
- Utilizes user settings for Design Compiler (ASIC) or Synplify (FPGA)
Timing closure and high QoR with Design Compiler and Synplify flows
[Diagram: user-specified settings and DC or Synplify plus the technology library supply timing and area information to SMC Advanced Timing Mode; high-level synthesis optimizations produce optimized HW architectures; logic synthesis then closes timing with user-specific settings in Design Compiler (ASIC) or Synplify (FPGA)]

Using Synphony Model Compiler
1. Getting Started HLS Flow
2. Fixed-Point, Vectors, Latency
3. Multi-Rate Modeling
4. HLS Intro & Retiming
5. HLS - Folding
6. HLS Multi-Rate Optimizations
7. HLS IP-Level Optimizations
8. Verification Features
9. Using SMC for ASICs

Getting Started: Basic HLS Flow with SMC (Module 1)
- SMC flow overview
- Getting started with Simulink: creating new models, useful settings
- Synphony Model Compiler Blockset
- Creating a synthesizable model
- Four steps of HLS implementation: specify target device/technology, select optimizations, synthesize RTL implementation, logic synthesis and evaluating results
- Architectural exploration overview

Simulink: Getting Started
- Start MATLAB in the usual way, often by using the desktop icon.
- Access Simulink by creating or opening models. A new model provides a blank panel for starting a new block-based system model.
- Simulink models are saved as separate files with the extension .mdl.
- In this presentation we'll be using MATLAB in the All Tabbed desktop layout configuration: Desktop > Desktop Layout > All Tabbed.

Creating New Synphony HLS Models
- Create a new model and save it under a new name.
- Use the syn_get_dspstartup and syn_set_dspstartup commands to check and optimize the simulation mode. Popup messages display the result of the command, and the function returns values at the command line.
- What they do: check and modify some of the parameters in the Simulation > Configuration Parameters menu. The parameters are saved in the model file.
- To automatically apply these defaults for every new model you create, put dspstartup in matlabroot/toolbox/local/startup.m.

Useful Simulink Settings
Also make sure to turn on some of the Simulink display modes in Format > Port/Signal Displays. These are very useful for Synphony HLS designs:
- Sample Time Colors shows the sample rate of signals and blocks
- Port Data Types shows the fixed-point settings of signals
- Signal Dimensions shows the vector or matrix size of signals

Choosing the Sample Rate
- Defining global variables
- Model callbacks file (M-function)
- Model Properties > Callbacks

Assigning the Sample Rate
- Sample time = 1 / Speed
- Others: the same sample time or -1

Accessing the Synphony HLS Blockset
- Use shlslib to directly bring up the Synphony HLS Blockset browser window. The latest version will be used, as indicated in the window and at the command line.
- Bring up the Simulink Library Browser by typing simulink at the command line or by using the Library Browser icon on the model or the MATLAB window.
- The Simulink Library Browser can be used to access the Synphony HLS Blockset and all other installed blocksets.

SMC Blockset Summary
SMC's feature-rich blockset includes: CORDIC, Communications, Control Logic, Filtering, Floating Point, Math Functions, Memories, Ports & Subsystems, Sources, Signal Operations, Transforms, M Synthesis.
New in SMC / New in 14.03:
- Floating Point: Add, Compare, Constant, Fused Mult Add, Mult, Port In/Out, FP<->Fixed, Absolute, Accumulate, ArcTan, Div, Convert, InvSqrt, Log, Pow, SinCos, Sqrt, Square, Tan
- Filtering: Parallel FIR, Parallel CIC2, Min Max Filter, Frame FIR, Dynamic Farrow Resampler, FIR Resampler
- Math Functions: Complex Abs, Complex Mult, Divider2
- Memories: Loadable Shift Register, RAM Based Delay, Interleaver2
- Sources: Parallel DDS2
- Signal Operations: Pulse Extender
- Transforms: R2SDF FFT, Parallel FFT

Basic Synphony Model Compiler Flow
- Creating a synthesizable model
- Four steps of HLS implementation: specify target device/technology, select optimizations, synthesize RTL implementation, logic synthesis and evaluating results
- Architectural exploration overview

Synphony Modeling Basics
- Port In and Port Out blocks define the boundaries of a design. The fixed-point and sample rate settings you set here are propagated to downstream blocks.
- SMC blocks include operations and IP that inherit input signal formats and sample rates.
- In this example the inputs are defined as 18-bit signed numbers with an 8-bit fraction length at a 50 MHz sample rate. They are then multiplied by a constant gain of 0.333 and summed.
- Gain and Adder blocks propagate full-precision results to their outputs by default. This can be changed if desired.
- The red color indicates the sample rates. In this case they are all the same, i.e. no blocks have rate changes.
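The arithmetic in this example can be sketched in plain Python as a rough illustration. This is not SMC code: the gain coefficient format (16 bits, 15 fractional), the round-to-nearest quantizer, and the test input values are assumptions, and the growth rules shown (product word length is the sum of the input word lengths, an adder adds one bit) are the generic full-precision rules rather than a statement of what each SMC block does.

```python
import math

def quantize(value, fraction_length):
    """Return the stored integer for value at the given fraction length (round to nearest)."""
    return math.floor(value * (1 << fraction_length) + 0.5)

def full_precision_mult(a, a_fmt, b, b_fmt):
    """Multiply stored integers; formats are (word_length, fraction_length)."""
    return a * b, (a_fmt[0] + b_fmt[0], a_fmt[1] + b_fmt[1])

def full_precision_add(a, b, fmt):
    """Add two stored integers that share a format; one extra bit holds the carry."""
    return a + b, (fmt[0] + 1, fmt[1])

IN_FMT = (18, 8)     # from the slide: 18-bit signed, 8-bit fraction length
GAIN_FMT = (16, 15)  # assumption: the coefficient format is not given in the slide

gain = quantize(0.333, GAIN_FMT[1])
x1 = quantize(1.5, IN_FMT[1])     # hypothetical input samples
x2 = quantize(-2.25, IN_FMT[1])

p1, prod_fmt = full_precision_mult(x1, IN_FMT, gain, GAIN_FMT)
p2, _ = full_precision_mult(x2, IN_FMT, gain, GAIN_FMT)
total, sum_fmt = full_precision_add(p1, p2, prod_fmt)
result = total / (1 << sum_fmt[1])  # convert back to a real-world value
print(prod_fmt, sum_fmt, result)
```

Under these assumptions the word length grows from 18 bits at the inputs to 35 bits at the adder output, which is why a design may choose to quantize again downstream.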

Verification using Simulink and MATLAB
- You can use Simulink blocks to create stimulus waveforms to drive Port In blocks.
- Or you can read signals in from MATLAB variables using the Signal From Workspace block in the Simulink Signal Processing Blockset/Signal Processing Sources.
- You can connect Simulink analysis blocks to output ports or internal Synphony HLS signals to help analyze and verify the algorithm behavior. These are ignored for implementation.
- The Signal To Workspace block can be used to dump any signal into MATLAB for further analysis.

Creating Implementations
- Any Synphony HLS model can be implemented by instantiating the SHLSTool block inside the model. Double-clicking it opens a GUI that points to the model file. This can also be opened at the command line with shlstool.
- Define an implementation by clicking New Implementation to bring up the Implementation dialog box. Here you specify:
  - Implementation name
  - Device (vendor, technology, ...)
  - Output types (Verilog, VHDL, ...)
  - Design options (global reset, ...)

Architectural Optimizations and Implementation Synthesis
- The implementations you create are listed in the implementation window.
- When an implementation is selected, the target device is displayed at the top. You can change it with Edit Implementation or delete it with Delete Implementation.
- Choose the architectural optimizations you want applied to the selected implementation. In this example none are applied. These are explained in more detail later.
- Click Run to synthesize an implementation using the specified architectural optimizations.

Fixed-Point, Vectors, and Latency (Module 2)
- Fixed-point data types
- Port In block
- Data type propagation
- Convert and Recast blocks
- Bit extraction/concatenation
- Vectors
- Model latency management

Fixed Point in Synphony Model Compiler
Synphony Model Compiler uses the Simulink fixed-point features.
Synphony MC blocks that provide fixed-point manipulation features:
- Convert block: type conversion, quantization and rounding, scaling
- Bit manipulation
- Vector manipulation
- IP blocks with internal data path propagation rules
- IP blocks with output quantization control

Fixed-Point Basics
- All Synphony HLS blocks support Simulink fixed-point data types and propagation features.
- Up to 128-bit precision (most blocks).
- Simulink updates and propagates data types when you choose Edit > Update Diagram, press Ctrl-D, or start a simulation.
[Diagram: bit layout of a fixed-point word from MSB (bit WL-1, the sign bit) to LSB (bit 0), showing Word Length (WL), Integer Length (IL), the binary point, and Fraction Length (FL)]
Simulink annotation format:
- sfix<WL>_En<FL> (signed)
- ufix<WL>_En<FL> (unsigned)
Built-in type equivalents: boolean = ufix1, int8 = sfix8, int16 = sfix16, int32 = sfix32, int64 = sfix64, uint8 = ufix8, uint16 = ufix16, uint32 = ufix32, uint64 = ufix64
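As a quick illustration of how word length and fraction length determine what a type can represent, the following Python sketch computes the range and resolution for a given WL and FL. The function names are made up for this example, and it assumes a non-negative fraction length and two's-complement encoding for signed types.

```python
from fractions import Fraction

def fixed_point_range(word_length, fraction_length, signed=True):
    """Return (minimum, maximum, resolution) of a binary-point-scaled fixed-point type."""
    lsb = Fraction(2) ** -fraction_length  # weight of the least significant bit
    if signed:
        lowest = -(2 ** (word_length - 1)) * lsb
        highest = (2 ** (word_length - 1) - 1) * lsb
    else:
        lowest = Fraction(0)
        highest = (2 ** word_length - 1) * lsb
    return lowest, highest, lsb

def annotation(word_length, fraction_length, signed=True):
    """Format the type name in the sfix<WL>_En<FL> / ufix<WL>_En<FL> style."""
    prefix = "sfix" if signed else "ufix"
    if fraction_length == 0:
        return f"{prefix}{word_length}"
    return f"{prefix}{word_length}_En{fraction_length}"

lo, hi, step = fixed_point_range(18, 8)
print(annotation(18, 8), float(lo), float(hi), float(step))
```

For the 18-bit, 8-fraction-bit inputs used earlier in the deck this gives a range of -512 to just under 512 in steps of 1/256.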

Synphony Port In Block
- Converts from floating-point stimulus
- Defines the input to the design:
  - Fixed-point data type
  - Word length
  - Fraction length
  - Sample rate
- Wrap on overflow
- Round on underflow
- Put a Scope on signals of the same data type only

Data Type Propagation
- Floating-point formats are used by other Simulink blocks and can drive input ports for stimulus and analysis.
- Port In and Port Out blocks define data types and sample rates.
- All Synphony HLS blocks propagate the full-precision format to the output. Some blocks offer different propagation rules that allow a choice of quantization.
- Most blocks support up to 128-bit precision, with a few exceptions.
- A summary of quantization and precision support can be found in the Blockset Summary chapter.

Synphony Signal Conversion
Convert block features:
- Works on the real-world value of the input
- Data type conversion
- Saturation/rounding management
- Scaling by 2^N
- Inherited size variables
Size inheritance variables: syn_inp_wl, syn_inp_fl, syn_inh_wl, syn_inh_fl
[Dialog screenshot: sign options signed, unsigned, preserve, inherit; inherit port]

Synphony Convert Examples
Examples converting an sfix4 type:
- Full precision (nothing done)
- Convert to unsigned
- Convert to sfix3 with no saturation, i.e. wrap (strips the MSB)
- Convert to sfix3 with saturation
- Shift down by 1 bit before converting to sfix3 (scale and strip the LSB)
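The difference between the wrap, saturate, and shift-then-convert cases can be sketched on stored integer values. This is an illustrative Python model with made-up function names, not the Convert block itself; it assumes two's-complement values and truncation (floor) when the LSB is dropped.

```python
def wrap(value, word_length):
    """Keep only the low word_length bits and reinterpret them as two's complement."""
    raw = value & ((1 << word_length) - 1)
    if raw >= 1 << (word_length - 1):
        raw -= 1 << word_length
    return raw

def saturate(value, word_length):
    """Clamp to the representable signed range instead of wrapping."""
    lowest = -(1 << (word_length - 1))
    highest = (1 << (word_length - 1)) - 1
    return max(lowest, min(highest, value))

def shift_down_then_wrap(value, shift, word_length):
    """Scale by 2**-shift (dropping LSBs, floor rounding assumed), then convert."""
    return wrap(value >> shift, word_length)

for v in (5, -6):  # two sfix4 values that do not fit in sfix3
    print(v, wrap(v, 3), saturate(v, 3), shift_down_then_wrap(v, 1, 3))
```

For the value 5, wrapping to 3 bits flips it to -3, saturation clamps it to 3, and shifting down first gives 2 at half the resolution.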

Recast Block: Data Format Management
Recast works on the bit representation of the input.

Bit Concatenation and Extraction
Extract and Concat blocks:
- Work on the bit representation of the input
- Output is an unsigned integer
- Vector LSB:MSB of result
- Size variables: syn_inp_wl, syn_inp_fl
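Operating on the bit representation rather than the real-world value can be illustrated with a few lines of Python. The helper names and the argument order of the concatenation (first argument in the upper bits) are assumptions for this sketch, not the block interfaces.

```python
def raw_bits(stored, word_length):
    """Two's-complement bit pattern of a stored integer, as an unsigned number."""
    return stored & ((1 << word_length) - 1)

def extract_bits(stored, word_length, lsb, msb):
    """Return bits lsb..msb (inclusive) of the bit pattern as an unsigned integer."""
    width = msb - lsb + 1
    return (raw_bits(stored, word_length) >> lsb) & ((1 << width) - 1)

def concat_bits(high, high_wl, low, low_wl):
    """Place the bits of high above the bits of low; the result is unsigned."""
    return (raw_bits(high, high_wl) << low_wl) | raw_bits(low, low_wl)

def recast_signed(raw, word_length):
    """Reinterpret an unsigned bit pattern as a signed value without changing bits."""
    return raw - (1 << word_length) if raw >= 1 << (word_length - 1) else raw

value = -3                            # sfix4 bit pattern 1101
low = extract_bits(value, 4, 0, 1)    # bits 1..0
high = extract_bits(value, 4, 2, 3)   # bits 3..2
joined = concat_bits(high, 2, low, 2)
print(low, high, joined, recast_signed(joined, 4))
```

Splitting -3 into two 2-bit fields and joining them again gives the unsigned pattern 13; only a recast to a signed 4-bit type recovers -3.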

Overflow, Minimum and Maximum Logging
- Overflows, minimums and maximums are detected with the overflow, minimum and maximum logging feature.
- The logged overflows, minimums and maximums can be seen in the Fixed Point Toolbox of Simulink.
- To log them, choose the "Minimums, maximums and overflows" option as the fixed-point instrumentation mode.

Creating Vectors
- A vector signal dimension is annotated as shown.
- Port In accepts vector input signals from Simulink.
- Vector Concat combines scalar and vector signals to create a vector signal.
- Shift register has a vector output option that provides a sliding window function.
- Decommutation with vector output distributes samples into each element of the vector. The output is at a slower sample rate of 1/N.
- Vector Expand replicates a scalar into a vector.

Vector Processing
- 16-element vector inputs are propagated to downstream blocks.
- 16-element vectorized multiply and add.
- 16-element gain.
- All memories also support vectorized operations.
- All inputs must have the same vector size, including enables, resets, etc. Use Vector Expand to drive these inputs from a scalar.

FIR and IIR Filter Banks
- Decommutator converts the input stream into a slower-rate vector.
- A vector input infers an FIR for each element, i.e. a 256-bank FIR or IIR.
- A single-input adder infers a sum-of-elements function, creating a scalar output.
Example: 256-channel polyphase decimating filters

Multi-Rate Modeling (Module 3)

Multi-Rate Support Overview
- Sources: Port In; Counter, Constant, etc.
- Basic rate changes: Upsample/Downsample, De/Commutator, Parallel2Serial/S2P, Mux
- FIFO, Dual Port RAM
- Application specific (covered in a later section): FIR Resampler/CIC, Puncturing

Example - Downsampling
- The downsample ratio determines the output rate.
- The offset allows a choice of which sample in the period to provide at the output.
- An offset of zero picks the first sample and has zero latency.
- An offset of one or higher picks the corresponding sample in the period and always has a latency of 1.
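The sample selection described here can be illustrated with a short Python sketch. It only shows which input samples survive for a given ratio and offset and how the output rate follows from the ratio; it does not model the latency behavior mentioned above, and the function names are invented for this example.

```python
def downsample(samples, ratio, offset=0):
    """Keep one sample out of every `ratio`, starting at position `offset` in each period."""
    if not 0 <= offset < ratio:
        raise ValueError("offset must lie within one period")
    return samples[offset::ratio]

def output_rate_hz(input_rate_hz, ratio):
    """The output sample rate is the input rate divided by the downsample ratio."""
    return input_rate_hz / ratio

stream = list(range(12))
print(downsample(stream, 4, 0))   # first sample of each period
print(downsample(stream, 4, 1))   # second sample of each period
print(output_rate_hz(50e6, 4))
```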

Commutation and Decommutation
Decommutation and commutation are common operations in multi-rate signal processing.
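A minimal Python sketch of the two operations, assuming decommutation groups every N consecutive samples into one N-element vector (so the vector stream runs at 1/N of the input rate) and commutation is its inverse. The element ordering inside each vector is an assumption made for illustration.

```python
def decommutate(samples, n):
    """Distribute a serial stream into n-element vectors at 1/n of the input rate."""
    usable = len(samples) - len(samples) % n  # ignore an incomplete final group
    return [samples[i:i + n] for i in range(0, usable, n)]

def commutate(vectors):
    """Interleave vector elements back into a serial stream at n times the vector rate."""
    return [sample for vector in vectors for sample in vector]

stream = list(range(8))
vectors = decommutate(stream, 4)
print(vectors)
print(commutate(vectors))
```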

Multi-Rate Memories
FIFO and Dual Port RAM blocks support multiple rates on different ports.

HLS Optimizations: System-Level Retiming (Module 4, HLS Intro & Retiming)
- HLS/SMC optimization terminology
- Basic optimizations:
  - Constant propagation and coefficient optimization
  - Multi-rate counter sharing
  - Multi-rate filter transformations
  - Memory/register optimizations
- Retiming and model latency management

HLS Optimizations Terminology
Top-level optimization control is done by constraints in the lower left panel.
SMC optimization terminology and features:
- System-wide optimizations are directly controlled by the constraints. They are applied globally to the entire design to create a system-wide architecture.
- IP-level optimizations are done automatically at the block level for more complicated IP-level functions.
- All optimizations are target-aware, based on the technology characterization of the selected target.
- Optimizations sometimes rely on logic synthesis inferencing in the downstream tool to map operations to device resources.
- A baseline implementation is created when no constraints are provided, but it still reflects many optimizations for target, inferencing, and IP.
- Advanced controls are also available (discussed in another module).

System-Wide Optimization Examples

| Optimization Constraint | Architectural Optimization Techniques | Benefit | Costs |
| None (Baseline) | Micro-architectural exploration at block and IP level | Automatic selection of the micro-architectures that give the best area for the given sample rate | None |
| Retiming | Insertion of additional pipeline stages if needed to meet sample rates | Higher speed | More registers |
| Folding | Serialization using a faster clock with resource sharing and scheduling | Reduced area | Tighter timing constraint; control/mux overhead |
| Folding (multi-rate designs) | Resource sharing and scheduling applied to all clock domains | Reduced area | Control/mux overhead |
| Multi-Channelization | Replication into multiple channels; resource sharing and scheduling using faster clocks | Reduced area; automatic multi-channel implementation | Tighter timing constraint; control/mux overhead |

Technology Characterization Options
1. Estimation Time Mode (EM)
  - Uses a pre-calculated characterization for estimating timing
  - Uses pre-determined heuristics for IP-architectural exploration
2. Advanced Timing Mode (AT)
  - Uses more accurate timing estimates by calling Synplify Pro (FPGA) or Design Compiler (ASIC)
  - Includes more precise device speed grade estimates
  - Includes fanout information in the exploration
  - Explores more micro-architectural options with more accurate feedback on inferencing and coding style
  - Reports on timing loops
  - Summarizes whether block-level timing exploration meets constraints

Retiming
Goal: meet timing with minimal increase in area.
What does it do?
System-wide:
- Uses technology characterization to analyze timing for all data paths
- Inserts new registers where needed
- Moves existing registers along the data path
- Maintains consistent latency across parallel paths
IP-architectural:
- Explores block-level architecture options in the target technology
- Includes pipelined versions of potential micro-architectures
- Chooses the smallest option that meets timing

Multi-Rate Filter Optimizations
- An FIR preceded or followed by up/down samplers has an equivalent polyphase version.
- This exploits the fact that rate conversions throw away samples.
- The transformation is automatic.
- The resulting filter bank runs at the slower rate; with folding this can result in lower area.
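The equivalence behind this transformation can be checked numerically. The Python sketch below compares a direct FIR followed by a downsampler with a polyphase bank of shorter sub-filters that only computes the outputs that are kept. It illustrates the textbook identity, not SMC's internal implementation; the function names, coefficients, and input data are invented for the example.

```python
def fir(x, h):
    """Direct-form FIR with zero initial state: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def decimate_direct(x, h, m):
    """Filter at the input rate, then throw away all but every m-th output."""
    return fir(x, h)[::m]

def decimate_polyphase(x, h, m):
    """Compute only the kept outputs using m sub-filters h[p::m] at the low rate."""
    phases = [h[p::m] for p in range(m)]
    n_out = (len(x) + m - 1) // m
    y = []
    for r in range(n_out):
        acc = 0
        for p, phase in enumerate(phases):
            for j, coeff in enumerate(phase):
                idx = (r - j) * m - p  # input sample seen by tap j of sub-filter p
                if idx >= 0:
                    acc += coeff * x[idx]
        y.append(acc)
    return y

h = list(range(1, 17))  # hypothetical 16-tap coefficient set
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9, 3, 2, -3, 8, 4]
print(decimate_direct(x, h, 4) == decimate_polyphase(x, h, 4))
```

With 16 taps and a decimation ratio of 4, each sub-filter has 4 taps and runs at a quarter of the input rate, which matches the four 4-tap filters at 12.5 MHz quoted for the 50 MHz decimate-by-4 example later in the deck.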

High-Level Synthesis: Folding (Module 5)

Folding
Goal: reduce area.
What does it do?
System-wide:
- Infers a faster clock of F x sample rate
- Applies retiming algorithms to pipeline data paths
- Applies algorithms to detect which operations can be implemented using resource sharing
- Schedules operations and implements scheduling logic
IP-architectural:
- Explores block-level sequential architectural options in the target technology
- Includes potential serialized and pipelined micro-architectures
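The idea of trading clock speed for shared resources can be sketched with a toy FIR model in Python. This is a simplified illustration under stated assumptions: taps are split evenly across the fast-clock cycles of one sample period, and coefficient symmetry and the tool's own micro-architecture choices are not modeled, so the multiplier counts it reports will not match the synthesis results shown in the deck.

```python
def inferred_clock_hz(sample_rate_hz, folding_factor):
    """Folding by F runs the hardware at F times the sample rate."""
    return sample_rate_hz * folding_factor

def folded_fir_output(window, coeffs, folding_factor):
    """Compute one FIR output by time-multiplexing a reduced set of multipliers.

    Returns (output value, number of multipliers in this simplified model).
    """
    n_mults = -(-len(coeffs) // folding_factor)  # ceiling division
    acc = 0
    for cycle in range(folding_factor):   # fast-clock cycles within one sample period
        for mult in range(n_mults):       # multipliers working in parallel each cycle
            tap = cycle * n_mults + mult
            if tap < len(coeffs):
                acc += coeffs[tap] * window[tap]
    return acc, n_mults

coeffs = list(range(1, 33))       # hypothetical 32-tap coefficient set
window = list(range(32, 0, -1))   # hypothetical contents of the delay line
for f in (1, 4, 8, 16, 32):
    value, mults = folded_fir_output(window, coeffs, f)
    print(f, inferred_clock_hz(2_000_000, f), mults, value)
```

The output value is the same for every folding factor; only the clock rate and the number of multipliers change.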

32-Tap FIR Folding Example
- 32-tap FIR
- 18-bit input
- Coefficients: 18-bit fixed
- Low-pass frequency spec
- Positive symmetric

Folding Exploration Results
Results for the 32-tap FIR targeting Spartan 3E:
- The baseline implementation barely fits into the device.
- Increasing the folding factor begins using multipliers and storage for an optimal serialized architecture implementation.
- This culminates in a fully serial architecture, which easily fits.

| Target Device | Sample Rate | Folding Factor | Inferred Clock | Speed Achieved? | Registers/SRL16Es | LUTs | HW 18x18 Multipliers | Max Util. % |
| XC3S100E-4 | 2 MHz | None | None | Yes | 534/0 | 1889 | 0 | 98% |
| XC3S100E-4 | 2 MHz | 4 | 8 MHz | Yes | 512/108 | 2077 | 4 | 108% |
| XC3S100E-4 | 2 MHz | 8 | 16 MHz | Yes | 422/126 | 518 | 4 | 26% |
| XC3S100E-4 | 2 MHz | 16 | 32 MHz | Yes | 306/72 | 300 | 2 | 15% |
| XC3S100E-4 | 2 MHz | 32 | 64 MHz | Yes | 200/36 | 225 | 1 | 11% |

XC3S100E-4: Xilinx Spartan 3E. Results after logic synthesis.

Folding on Multi-Rate Designs
Goal: reduce area.
What does it do?
System-wide:
- Infers a faster clock of F x fastest sample rate
- Applies retiming algorithms to pipeline data paths
- Applies folding clock optimizations to all sample rate domains
Micro-architectural:
- Counter sharing across upsample and downsample blocks
- Polyphase transformations on multi-rate FIR
- Explores block-level architectural options in the target technology, including potential serialized and pipelined micro-architectures

Multi-Rate Folding Results
Results for a 50 MHz 16-tap decimate-by-4 FIR:
- The polyphase transformation creates four 4-tap filters at 12.5 MHz.
- Fold x1 simply exploits the fastest available clock.

| Target Device | Sample Rate | Folding Factor | Inferred Clock | Speed Achieved? | Registers | LUTs | HW Multipliers (DSP48Es) |
| XC5VLX30-1 | 50 MHz | 1 | None | Yes | 166 | 88 | 4 |
| XC5VLX30-1 | 50 MHz | 4 | 200 MHz | Yes | 281 | 186 | 1 |

XC5VLX30-1: Xilinx Virtex 5. Results after logic synthesis.

Verification Features (Module 8)
- Testbench generation
- C-model generation
- RTL encapsulation
- Importing RTL into Simulink

SMC C-Model Generation
Create C-models of a Simulink model to:
- Increase simulation speeds
- Validate system integration in external simulators
- Verify SW/HW quickly
Automated, flexible simulator support:
- Re-use in Simulink
- RTL simulators (ModelSim, VCS)
- Native ANSI-C direct executable
- SystemC simulators
[Diagram: an SMC high-level model (high-level design and verification) passes through Synphony Model Compiler High-Level Synthesis to produce RTL for multiple architectures and targets, plus bit- and cycle-accurate C-models for system verification]

C-Output: Creating C Output Files
- The Generate C Code option enables the C-Output feature. It is enabled only if the Generate RTL test bench option is selected.
- When Verilog and VHDL are selected together, the tool generates C output for the Verilog RTL.

C-Output: Creating C Output Files (cont'd)
For a model test.mdl, the directory structure of the C output files is as follows:
- Base folder containing the test.mdl file
- Implementation folder containing the cout folder
- The cout folder containing the C output files:
  - File containing function definitions
  - C driver file to verify the generated C model
  - File containing functions useful for debugging
  - Configuration header file
  - Header file containing the function declarations
  - Supporting header file with variable declarations
  - Makefile to run and verify the C model
  - VCProj file for creating the executable in VC++ on Windows

C-Model Results
Design complexity:
- Digital Chaos Modem with large BER (bit-error rate) simulations
- Many instances of IP blocks: large filters, memories, Viterbi decoder
- 11 sample rates with synchronous relationships, up to 300 MHz
- IP folding for area optimization
C-model benefits:
- Up to 30X speedup over Simulink
- Up to 40X speedup over RTL simulator
- Over 10X higher verification productivity

| Simulation | Run time | Relative speed |
|---|---|---|
| Simulink model | 6165 s | 1X |
| C-model, direct executable | 206 s | 30X |
| C-model in Simulink | 230 s | 27X |
| C-model in RTL simulator | 200 s | 40X |
| RTL simulation of optimized RTL | 8342 s | 0.73X |
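
The relative speeds on this slide can be reproduced from the run times. The short sketch below does the arithmetic; it assumes the 30X, 27X and 0.73X figures are measured against the 6165 s Simulink run and the 40X figure against the 8342 s RTL simulation, which the slide implies but does not state. The function name is illustrative.

```python
def speedup(baseline_s, candidate_s):
    """Return how many times faster the candidate run is than the baseline."""
    return baseline_s / candidate_s


SIMULINK_S = 6165   # Simulink model run time from the slide
RTL_SIM_S = 8342    # RTL simulation run time from the slide

if __name__ == "__main__":
    runs = [
        ("C-model, direct executable vs. Simulink", SIMULINK_S, 206),
        ("C-model in Simulink vs. Simulink", SIMULINK_S, 230),
        ("RTL simulation vs. Simulink", SIMULINK_S, RTL_SIM_S),
        ("C-model in RTL simulator vs. RTL simulation", RTL_SIM_S, 200),
    ]
    for label, baseline, candidate in runs:
        print(f"{label}: {speedup(baseline, candidate):.2f}X")
```

Under these assumptions the ratios land close to the slide's rounded figures; the RTL-versus-Simulink ratio is nearer 0.74 than 0.73, so the slide value looks truncated rather than rounded.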

Using RTL in Simulink Models: RTL Encapsulation Block
RTL encapsulation features:
- Insert RTL in an SMC model
- Fast simulation
- No dependency on an external RTL simulator
- Verilog and VHDL support
RTL block use models:
- Use legacy IP or 3rd-party IP that was developed in RTL
- Add interface RTL to ease integrated system verification
- Add state machines and cycle-accurate control
- MATLAB-driven verification of RTL blocks

Advanced Features and IP for ASIC Targets
Using SMC for ASICs

User Flow for ASIC Targets
[Figure: flow diagram. The Synphony HLS synthesis engine performs architectural optimizations (DSP synthesis). If an FPGA target is selected, it produces FPGA RTL with SynPRO.prj and SynPRO.sdc for Synplify Pro synthesis (logic optimizations, RTL synthesis). For an ASIC target, it performs memory extraction and RTL resolution/retiming and produces ASIC RTL, memory RTL and an .sdc constraint file for ASIC logic synthesis; the user inserts 3rd-party memory.]
Synphony HLS creates:
- Implementation with separate memory modules using an RTL simulation model
- RTL testbench to verify with standard simulators
User selects memory implementation:
- Generate custom memories using the preferred vendor compiler (e.g. Artisan, Virage, ...)
- Replace memory references to point to the new memory implementation modules
- Use the testbench to verify
ASIC logic synthesis:
- Turn on retiming, which fine-tunes the placement of pipeline registers inserted by Synphony HLS
- Use the Synphony HLS generic.sdc file, which defines all clock constraints

ASIC RAM Extraction
- The tool lets the user selectively extract RAMs of different types from a model.
- Extracted RAMs that satisfy the user thresholds are reported in the shls.log file.
- Verilog and VHDL simulation models are generated for the extracted RAM(s).
Example report:
    EXTRACTED RAM INFORMATION
    *****************************************
    RAM type "1RW" : 1 items of size 128x14
    Total number of RAM modules = 1
Although RAM1 satisfies the Memory Width threshold, its depth is below the Memory Depth threshold, so it is not extracted.
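
The threshold rule described on this slide can be sketched as a simple predicate. This is only an illustration of the stated behavior (a RAM must meet both the Memory Width and the Memory Depth threshold to be extracted), not the tool's implementation. The class, the function, the threshold values, the RAM1 dimensions and the reading of 128x14 as depth x width are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Ram:
    name: str
    depth: int   # number of words (assumed to be the first number in "128x14")
    width: int   # bits per word


def should_extract(ram, min_depth, min_width):
    """Extract only when both user thresholds are met (assumed >= comparison)."""
    return ram.depth >= min_depth and ram.width >= min_width


if __name__ == "__main__":
    # Hypothetical thresholds and RAM sizes, chosen only to mirror the slide's example.
    min_depth, min_width = 64, 8
    rams = [Ram("RAM0", depth=128, width=14), Ram("RAM1", depth=32, width=14)]
    for ram in rams:
        verdict = "extracted" if should_extract(ram, min_depth, min_width) else "kept in logic"
        print(f"{ram.name} ({ram.depth}x{ram.width}): {verdict}")
```

With these made-up values, RAM1 passes the width check but fails the depth check, which is the case the slide calls out.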

Advanced Timing Mode for ASIC
Example DC setup file
- Users can also add their own customizations to the setup file (for example, setting the synthetic library).

Advanced Timing Mode Requirements
Advanced Timing Mode requires:
- Valid ASIC target libraries
- Valid Synopsys DC setup file
- Design Compiler Ultra license
Advanced Timing Mode for ASIC can only run on a supported x86 Linux environment.
If a license is available, Design Compiler instantiates basic DesignWare Building Blocks (adders, multipliers).

Advanced Timing Mode for ASIC
- The setup file can be defined in the Timing Engine Configuration window or in the implementation options.

Design Example
Using SMC for ASIC Power Optimization for a Digital Downconverter

Multirate High-Level Synthesis for ASIC
Easier design entry (Synphony IP Model Library):
- High-level MR IP library
- Concise parallelism with vector notation
- MATLAB-driven verification
10X reduction in implementation and verification effort:
- HLS MR system and subsystem optimizations
- HLS IP optimizations
- MR clock implementation
- Automatic RTL and C-model testbench creation
Fast and accurate power analysis and optimization:
- Automatic activity data for power estimation
- DesignWare datapath optimizations
- Clock-gating power optimizations
- Incorporate ASIC constraints back into HLS
Integrated HLS solution (Synphony Model Compiler, VCS, DC Ultra):
- Rapid architecture exploration
- More optimal designs
- Rapid area and power exploration
- Fast FPGA prototyping
- Retarget/explore new technologies

Digital Radio Example: High-Level Design Using SMC Library

Implementation Challenges
- Clocking: How many clocks?
  - Implement clock domain crossings
  - Impact on enables and resets
  - Synchronization in testbench
- Implementing larger functions (e.g. building blocks or IP)
  - Degrees of parallelism vs. sequential (resource sharing)
  - How much in each rate domain?
  - Are these building blocks available for the given degree of parallelism?
- Pipelining for timing
- Interfaces
- Testbench (block-level verification)
- Exploration
How much time and effort is required?
- RTL hand coding
- Verification hand coding
How much exploration is possible?

Applying High-Level Synthesis Optimizations
- Target: 45nm ASIC (Estimation Mode)
- Apply folding (applied system-wide)
- Apply retiming/pipelining
Outputs:
- RTL optimized for the target technology using an optimized HW architecture
- RTL testbench for HW verification
- C-model for system verification

How Folding Works Across Multiple Rates
Goal: reduce area.
What does it do?
- Infers a system-wide clock of (Folding Factor * Sample Rate)
- Uses the faster clock in each sample rate domain:
  - Applies time-domain multiplexing (TDM) to share HW for expensive operations (e.g. multiply)
  - Also applies retiming to ensure timing is met
- Sharing increases in slower sample rate domains
  - The ratio of system clock to sample rate allows more sharing
[Figure: operations in the high-level model. A datapath with coefficients c1, c2, c3 uses 3 multipliers and 2 adders; with Fold x3 the same datapath requires less HW: 1 multiplier and 1 adder.]
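
The folding idea on this slide can be illustrated with a behavioral sketch of a 3-tap filter. It models only the arithmetic and the cycle count, not the muxes, state machine or retiming that the tool would generate, and all names are illustrative rather than taken from SMC. The parallel version uses one multiplier per tap; the folded version reuses a single multiplier and adder across three fast-clock cycles per input sample.

```python
def inferred_clock_hz(sample_rate_hz, folding_factor):
    """System clock inferred by folding: Folding Factor * Sample Rate."""
    return folding_factor * sample_rate_hz


def fir_parallel(samples, coeffs):
    """One multiplier per tap: every product is computed in the same sample period."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out


def fir_folded(samples, coeffs):
    """One shared multiplier and adder, time-multiplexed over len(coeffs) fast cycles."""
    delay = [0] * len(coeffs)
    out = []
    clock_cycles = 0
    for x in samples:
        delay = [x] + delay[:-1]
        acc = 0
        for k in range(len(coeffs)):      # one fast-clock cycle per tap
            acc += coeffs[k] * delay[k]   # the single shared multiply-accumulate
            clock_cycles += 1
        out.append(acc)
    return out, clock_cycles


if __name__ == "__main__":
    samples, coeffs = [1, 2, 3, 4], [1, 2, 3]
    folded, cycles = fir_folded(samples, coeffs)
    assert folded == fir_parallel(samples, coeffs)
    print("fast-clock cycles used:", cycles)
    print("inferred clock (Hz):", inferred_clock_hz(50_000_000, 4))
```

The trade-off is visible in the cycle counter: the folded version needs three times as many clock cycles for the same samples, which is why the inferred clock scales with the folding factor.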

Impact of Folding on Area and Power
Folding costs:
- TDM overhead: muxes, state machines, and storage
- Higher clock frequencies
- Higher logic activity, hence higher power
Avoid over-folding:
- Incurs overhead costs with little benefit
- Power increase
Conclusion: apply folding separately on slower-rate subsystems.
[Figure: Digital Radio global folding optimization (Design & Power Compiler results on TSMC 40nm LP). Area (gates) and dynamic power for Parallel (no folding), Fold x1 and Fold x2. Annotations: area reduced but higher power; overfolding.]



Solution: Hierarchical Application of Folding Using HLS Subsystem
- Encapsulate the slower filter datapath in an HLS subsystem
- Apply folding separately to this subsystem (and other HLS optimizations)
- Increases exploration capability



Hierarchical Folding Results
DDC results: folding of the filter subsystem, no folding of the top level. Results using DC Ultra F-2011.09 with compile_ultra default settings.
- Lowest area: 25% reduction
- Lowest power: 34% reduction
- Overfolding: limited multiplier reduction
[Figure: area (gates), DW_Mults inferred, and dynamic power for Full Parallel, Fold x1, Fold x2, Fold x4, Fold x8, Fold x16 and Fold x32.]

Multi-rate Clock Implementation Choices
Dedicated clocking: slower rates are driven by dedicated clocks (e.g. PLL or clock divider)
- Advantage: lower activity, resulting in lower power
- Disadvantage: more complex clock domain crossings, logic synthesis constraints, and FPGA device support for multiple clock domains
Enabled clocking: slower rates are driven by the fastest clock, with clock enables determining the period
- Advantage: no clock domain crossings or multi-clock-domain infrastructure support required by the device
- Disadvantage: larger and more active enable nets can impact timing closure
[Figure: two block diagrams. Dedicated clocking: FCLK drives the fast clock domain logic and SCLK drives the slow clock domain logic, both gated by Global_EN. Enabled clocking: FCLK drives both domains, and enable logic derives the EN that sets the slow domain's rate.]
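
The enabled-clocking option can be pictured with a small cycle-level sketch: every register sees the fast clock, and a periodic enable decides on which fast cycles the slow-domain register actually updates. This is a behavioral illustration only, with hypothetical function names and a hypothetical 4:1 rate ratio; it is not the clock/reset circuit that SMC generates.

```python
def clock_enable(ratio, n_cycles):
    """Enable pulse on every ratio-th fast-clock cycle, starting at cycle 0."""
    return [1 if cycle % ratio == 0 else 0 for cycle in range(n_cycles)]


def enabled_register(data, enables, reset_value=0):
    """Register clocked by the fast clock that only captures data when enabled."""
    q = reset_value
    trace = []
    for d, en in zip(data, enables):
        if en:              # slow-rate update
            q = d
        trace.append(q)     # value held on all other fast cycles
    return trace


if __name__ == "__main__":
    enables = clock_enable(ratio=4, n_cycles=8)
    held = enabled_register(list(range(8)), enables)
    print("enable:", enables)
    print("slow-domain register:", held)
```

The enable net toggles at the fast rate and fans out to every slow-domain register, which is the source of the timing-closure concern listed on the slide.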

SMC Multi-rate Clock Implementation: Choice of Clock Strategy
- SMC automatically generates the clock/reset circuit
- Synchronizes clocks, resets, and enables
- Choice of:
  - Clock strategy
  - Input oscillator(s)
  - Clock synthesis types
  - Power-on and user reset polarities

Clocks & Resets: What SMC Generates
New top-level structure with:
- Core design (e.g. with many clock domains)
- Clock_Reset module that implements the selected clocking strategy
  - Uses RTL coding styles for power optimization in the ASIC flow
- Simplifies the top-level system interface while allowing more complex MR HW architectures to be used in the design core
[Figure: top-level design containing a Clock-Reset module (.v/.vhd) and a Core Design module (.v/.vhd). Signals shown: osc1, porst, g_en, rst, CLK1, CLK8, CLK64, CLK128, CLK256, CEN1, CEN64.]

Power Optimizations in ASIC RTL Flow
Previous results were for a default flow. Many options are available for lower power (from the Power Compiler and DW MinPower User Guides):
1. Turn on automatic clock-gating conversion
2. Use DesignWare MinPower
3. Turn on datapath clock gating (DW components)
4. Add activity data (significant accuracy improvement)
5. Turn on dynamic and leakage optimization
6. Constrain synthesis for area optimization

Automatic Activity Data Generation
The SMC testbench can be used to automatically create activity data (using a SAIF file).
Activity data enables significant improvement in power and area optimization in logic synthesis:
- Much higher accuracy in optimizations
- Enables some optimizations not available without activity data
Benefits of using SMC:
- Very easy to create in a higher-level environment
- Enables early measurement of architecture decisions
[Figure: flow diagram. SMC HLS with testbench generation produces RTL & SDC and a testbench; VCS produces the SAIF file; DC Ultra with power optimizations and IC Compiler follow. A plot inset is labeled "Adjacent Channel -18 dB".]

Automated & Complete Power Synthesis Using Design Compiler Ultra
RTL is processed by Design Compiler Ultra with Power Compiler to produce a netlist optimized for power, timing, area, and test.
Clock gating:
- Reduces clock-net switching power
- Reduces register internal power
- Reduces area
[Figure: non-clock-gated implementation (D_IN, EN and CLK feeding a register bank that drives D_OUT) next to the clock-gated implementation (EN and CLK feeding a latch-based ICG cell whose output G_CLK clocks the register bank).]

Turning On DesignWare minpower
- Uses heuristic power models for architecture evaluation
- Integrates with DC and Power Compiler for architecture tradeoffs
- Backs off solutions that have a timing impact
- Minimizes the area tradeoff (<1% on average)
- Optimizes automatically when the opportunity exists
- Requires no flow changes
[Figure: RTL enters DC Ultra with Power Compiler and minpower, iterating until the cost function is met.]

SMC Advanced Timing Mode
- Correlation between DC and HLS is critical to timing closure
- The advanced power and area RTL optimization techniques can have a significant impact on timing
- SMC Advanced Timing Mode dynamically characterizes blocks based on the user's DC settings
- Result: HLS optimization results will have high correlation with DC using low-power optimizations
[Figure: the user DC settings file and the user-specified tech lib drive DC synthesis of the high-level blocks; the resulting block timing and area feed the high-level synthesis optimizations. SMC in Advanced Timing Mode turns the high-level model into RTL optimized for the target technology using an optimized HW architecture.]

Incorporating ASIC Power Optimizations Into SMC Flow
- Specify the DC setup file with user settings:
  - TSMC 40nm Low Power lib.
  - Turn on power optimizations
  - Other logic synthesis settings
- Turn on Advanced Timing Mode
Outputs:
- RTL optimized for the target technology using an optimized HW architecture
- RTL testbench for HW verification

Results with Power-Optimized DC Flow
Results using the power optimizations described previously.
- 70% power reduction vs. the non-power-optimized result; this was mostly a reduction of net switching
- Similar area vs. power tradeoff curve using folding
[Figure: power in mW (axis 0.00 to 14.00) for Full Parallel, Fold x1, Fold x2, Fold x4, Fold x8, Fold x16 and Fold x32. Series: Eclk dynamic power with the optimized ASIC flow; Eclk dynamic power with the standard ASIC flow.]

HLS Results Summary
Results using the power optimizations described previously.
- Best power architecture: Parallel-Dclk (2.9 mW), but with 57% greater area (~66K gates)
- Best area architecture: Foldx16-Eclk (~42K gates), but with 43% higher power (4.26 mW)
[Figure: power in mW (axis 0.00 to 4.50) for Full Parallel, Fold x1, Fold x2, Fold x4, Fold x8, Fold x16 and Fold x32. Series: Dclk area, Eclk area, Dclk dynamic power, Eclk dynamic power.]

Synphony Model Compiler Summary
High-Level Synthesis for Model-Based Design
- Quickly create synthesizable multi-rate algorithms using the optimized IP model library
- Verify and validate early using Simulink simulation and debugging
- Globally optimize system architecture and IP using high-level synthesis
- Achieve superior QoR and capacity using high-quality RTL flows for ASIC and FPGA
[Figure: a technology-independent model built from the IP model library, plus user-specified optimizations, passes through Synphony Model Compiler high-level synthesis. Outputs: RTL for multiple architectures and targets, RTL hardware verification, and C-models for verification. Caption: implementation with automatically optimized system-wide architecture and IP cores.]