Introduction to Model-Based High-Level Synthesis: Synphony Model Compiler
Doug Johnson, Synopsys, Incorporated, dougj@synopsys.com
Synopsys 2013, April 22, 2014
Agenda
- Introduction
  - What is High-Level Synthesis?
  - HLS versus RTL Synthesis
  - Why use High-Level Synthesis?
- Overview - Synphony Model Compiler HLS
- Using Synphony Model Compiler HLS
  - Signal Processing IP and High-Level Design
  - Optimizations and Quality of Results
  - Simulation and Verification Using RTL and C Models
  - Optimization for ASIC Targets
- Design Example: Using SMC for ASIC Power Optimization for a Digital Downconverter
- Q&A
Some Definitions
- HLS Definition: High-level synthesis (HLS) tools raise the design abstraction level by automatically generating optimized RTL hardware from an algorithm description
  - Language-based (C code, Matlab M-model, SystemC, ...)
  - Model-based (Matlab/Simulink)
- HLS versus RTL Synthesis
  - RTL Synthesis: registers and clocks are defined; low-level hardware description in a language; IP blocks
  - High-Level Synthesis: design description abstracted away from implementation, e.g. samples versus clocks, behavioral versus RTL description, sample-based models
- Retiming
  - System-level retiming adds pipelining registers to improve performance; latency is increased but functionality stays the same
  - Logic-level retiming moves registers around and redistributes clouds of logic to improve performance; latency stays the same
Why High-Level Synthesis?
- Higher Design Reuse & Vendor Independence
  - Optimized results for multiple FPGAs and ASIC
  - Fast migration to new FPGA families and devices
- Design Productivity
  - Simulink/MATLAB design & verification flow
  - High-Value Signal Processing IP
  - Custom high-level IP design methodology
- Verification Productivity
  - Automated test from MATLAB
  - Faster simulation using C-model generation
  - Faster/earlier integrated system verification
Synphony Model Compiler (SMC)
- High-Level Synthesis for FPGA & ASIC Algorithm Implementation
- Signal Processing IP library (Synphony DSP IP Library)
- Optimizations for high QoR across multiple technologies, IP, and system architectures
- High-performance verification for RTL and C-model system simulation
- (Flow diagram: High-Level Design (Simulink/MATLAB) feeds Synphony Model Compiler High-Level Synthesis, whose outputs go to Synplify for FPGA (multi-vendor DSP mapping, timing closure), Design Compiler for ASIC (custom ASIC, timing closure, power analysis), and Verification (RTL testbench, C-models, prototypes).)
Quality and Design Portability using HLS
- Achieve excellent results for DSP designs across technologies
- Migrate designs more quickly and reliably
- Enable area, speed and power trade-offs early in design cycle
- Flow: Algorithm Design, then Synphony Model Compiler, then one of:
  - Design Compiler for ASIC Targets
    - Target custom ASIC libraries (.lib file)
    - Direct timing characterization using Design Compiler
  - Synplify Pro/Premier for FPGA Targets
    - Optimized mapping for DSP and memory resources
    - Direct timing characterization using Synplify
- Integrated Power Analysis Flows using Power Compiler
Adopting High-Level Synthesis: More flexibility with Synphony Model Compiler
- Use #1: Verification of RTL using MATLAB/Simulink environment
  - High-level verification advantages with speed and without the pain
- Use #2: Use optimal Synphony MC IP blocks
  - Reduces need to hand-craft RTL
  - Eliminates high-effort QoR issues like DSP mapping and BRAM mapping in a vendor-independent way
  - Mix high-level blocks and RTL in a MATLAB/Simulink-based verification & design environment
- Use #3: High-level design for complex subsystems
  - Complex signal processing blocks created with much higher productivity
  - Quickly migrate blocks to new FPGA devices or ASIC process nodes
- Phase adoption over time, retain value of existing HDL infrastructure
Overview: Synphony Model Compiler High-Level Synthesis
Design Entry Using SMC Library
- High-level IP functions for wireless & communications applications
- SMC Library (Blockset) for High-Level Synthesis
- Fixed-point tools for easy control of algorithm precision
- Vector math for fast and concise parallelism
- Full multi-rate support simplifies multiple clock designs
- Language blocks: use RTL & M-Control for easier design of control & interfaces
- Example: Digital Radio Receiver
- 2-3X Productivity, Specification-to-Verified Fixed-Point Model
Optimized Signal Processing IP
- Portable Signal Processing Functions for FPGA and ASIC
- Easy to use with high capacity and advanced features
- Achieve excellent DSP hardware mapping on advanced FPGAs and ASIC IP
- Target multiple FPGA and ASIC vendors
- Synphony IP Cores, Function (Blocks) and Key Features/Modes:
  - FFT, FFT2: Multichannel, Parallel / Serial, Flow Control, Dynamic Length
  - FIR, FIR2: Multi-Channel, Parallel / Serial, Programmable Coef., Polyphase multirate, Flow Control, Symmetry-optimized
  - DDS/NCO: Multichannel, Parallel / Serial, SFDR mode, Dithering, Flow Control
  - Viterbi Decoder / Convolutional Encoder: 1/K Rate, Mother Codes, Traceback Length
- Architecture Optimizations: Speed (HLS retiming), Area (HLS Folding), Micro-architecture (IP-specific)
- Target Optimizations: Xilinx, Altera, ASIC
Area Optimized Implementations: SMC High-Level Synthesis
- Example: Digital Radio Receiver
- System-Level Architecture Optimizations: user-specified architecture optimizations for timing, area, and power
- IP Micro-Architecture Optimizations: IP blocks automatically select the micro-architecture optimized for the given target device and user-specified constraints
- Target-Specific Optimizations
  - FPGA: Optimized mapping to on-chip resources (DSP, Memories, Shift Registers)
  - ASIC: Real-time characterization, support for ASIC IP arithmetic units
- 5-10X Productivity, Model-To-Optimized and Verified HW
- (Figure: Synphony Model Compiler High-Level Synthesis produces RTL for multiple architectures and targets; architectural optimizations range from sequential to parallel implementations with increasing throughput.)
How SMC Optimizes for Different Targets: SMC Accuracy for ASIC & FPGA
- Example: GSM Radio Receiver
- Problem: Accurate Timing and Area for HLS
  - Required for timing closure and accurate optimization results
  - User-specified settings: the user's logic synthesis settings might cause divergence (i.e. logic optimizations for power)
- SMC Advanced Timing Mode
  - Dynamically characterizes blocks directly using the logic synthesis tool (DC or Synplify plus the technology library supply timing & area info)
  - Utilizes user settings for Design Compiler (ASIC) or Synplify (FPGA)
- Timing Closure & High QoR with Design Compiler and Synplify Flows
  - High-Level Synthesis optimizations produce optimized HW architectures
  - Logic synthesis performs timing closure with user-specific settings in Design Compiler (ASIC) or Synplify (FPGA)
Using Synphony Model Compiler
1. Getting Started HLS Flow
2. Fixed-Point, Vectors, Latency
3. Multi-Rate Modeling
4. HLS Intro & Retiming
5. HLS - Folding
6. HLS Multi-Rate Optimizations
7. HLS IP-Level Optimizations
8. Verification Features
9. Using SMC for ASICs
Getting Started: Basic HLS Flow with SMC (Module 1)
- SMC Flow Overview
- Getting Started with Simulink
  - Creating New Models
  - Useful Settings
  - Synphony Model Compiler Blockset
- Creating a synthesizable model
- Four steps of HLS implementation
  - Specify Target Device/Technology
  - Select Optimizations
  - Synthesize RTL Implementation
  - Logic Synthesis and Evaluating Results
- Architectural Exploration Overview
Simulink: Getting Started
- Start Matlab in the usual way, often by using the Desktop Icon
- Access Simulink by Creating or Opening Models
- A new model provides a blank panel for starting a new block-based system model.
- Simulink Models are saved as separate files with the extension .mdl
- In this presentation we'll be using Matlab in the All Tabbed Desktop Layout Configuration: Desktop > Desktop Layout > All Tabbed
Creating New Synphony HLS Models
- Create a new model and save it under a new name
- Use the syn_get_dspstartup and syn_set_dspstartup commands to check and optimize the simulation mode. Popup messages will display the result of the command and the function will return values at the command line.
- What they do: check and modify some of the parameters in the Simulation > Configuration Parameters menu
- The parameters are saved in the model file
- To automatically apply these defaults for every new model you create, put dspstartup in matlabroot/toolbox/local/startup.m
Useful Simulink Settings
Also make sure to turn on some of the Simulink display modes in Format > Port/Signal Displays. These are very useful for Synphony HLS designs:
- Sample Time Colors will show the sample rate of signals and blocks
- Port Data Types will show the fixed-point settings of signals
- Signal Dimensions will show the vector or matrix size of signals
Choosing the Sample Rate
- Defining global variables
- Model Callbacks File (M-function)
- Model Properties > Callbacks
Assigning the Sample Rate
- Sample time = 1 / Speed
- Others: the same sample time or -1 (inherited)
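The relationship on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SMC or Simulink code; the function name and the 50 MHz value are assumptions chosen for the example.

```python
def sample_time(speed_hz: float) -> float:
    """Return the sample time in seconds for a sample rate ("Speed") in Hz."""
    if speed_hz <= 0:
        raise ValueError("sample rate must be positive")
    return 1.0 / speed_hz


# Hypothetical example value: a 50 MHz sample rate.
SPEED_HZ = 50e6
TS = sample_time(SPEED_HZ)  # 20 ns per sample
```

In a model, a value like TS would typically be defined once as a global variable and referenced by the source blocks, while other blocks use the same value or -1.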
Accessing the Synphony HLS Blockset
- Use shlslib to directly bring up the Synphony HLS Blockset browser window. The latest version will be used, as indicated in the window and at the command line.
- Bring up the Simulink Library Browser by typing simulink at the command line or using the Library Browser Icon on the model or the Matlab window.
- The Simulink Library Browser can be used to access the Synphony HLS Blockset and all other installed Blocksets.
SMC Blockset Summary
SMC's feature-rich Blockset includes: CORDIC, Communications, Control Logic, Filtering, Floating Point, Math Functions, Memories, Ports & Subsystems, Sources, Signal Operations, Transforms, M Synthesis
New in SMC / New in 14.03:
- Floating Point: Add, Compare, Constant, Fused Mult Add, Mult, Port In/Out, FP<->Fixed, Absolute, Accumulate, ArcTan, Div, Convert, InvSqrt, Log, Pow, SinCos, Sqrt, Square, Tan
- Filtering: Parallel FIR, Parallel CIC2, Min Max Filter, Frame FIR, Dynamic Farrow Resampler, FIR Resampler
- Math Functions: Complex Abs, Complex Mult, Divider2
- Memories: Loadable Shift Register, RAM Based Delay, Interleaver2
- Sources: Parallel DDS2
- Signal Operations: Pulse Extender
- Transforms: R2SDF FFT, Parallel FFT
Basic Synphony Model Compiler Flow
- Creating a synthesizable model
- Four steps of HLS implementation
  - Specify Target Device/Technology
  - Select Optimizations
  - Synthesize RTL Implementation
  - Logic Synthesis and Evaluating Results
- Architectural Exploration Overview
Synphony Modeling Basics
- Port In and Port Out blocks define the boundaries of a design. The fixed-point and sample rate settings you set here will be propagated to downstream blocks.
- SMC blocks include operations and IP that inherit input signal formats and sample rates.
- In this example the inputs are defined as an 18-bit signed number with 8-bit fraction length at a 50 MHz sample rate. They are then multiplied by a constant gain of 0.333 and summed.
- Gain and Adder blocks will propagate full precision results to their outputs by default. This can be changed if desired.
- The red color indicates the sample rates. In this case they are all the same, i.e. no blocks have rate changes.
Verification using Simulink and Matlab
- You can use Simulink blocks to create stimulus waveforms to drive Port In blocks.
- Or you can read signals in from Matlab variables using the Signal From Workspace block in the Simulink Signal Processing Blockset/Signal Processing Sources.
- You can connect Simulink analysis blocks to output ports or internal Synphony HLS signals to help analyze and verify the algorithm behavior. These will be ignored for implementation.
- The Signal To Workspace block can be used to dump any signal into Matlab for further analysis.
Creating Implementations
- Any Synphony HLS model can be implemented by instantiating the SHLSTool block inside the model. Double-clicking will open up a GUI which points to the model file. This can also be opened at the command line with shlstool.
- Define an implementation by clicking on New Implementation to bring up the Implementation dialog box. Here you will specify:
  - Implementation Name
  - Device (Vendor, Technology, ...)
  - Output types (Verilog, VHDL, ...)
  - Design options (Global Reset, ...)
Architectural Optimizations and Implementation Synthesis
- The implementations you create will be listed in the implementation window.
- When an implementation is selected, the target device is displayed at the top. You can change it with Edit Implementation or delete it with Delete Implementation.
- Choose the architectural optimizations you want applied to the selected implementation. In this example none are applied. These will be explained in more detail later.
- Click Run to synthesize an implementation using the specified architectural optimizations.
Fixed-Point, Vectors, and Latency (Module 2)
- Fixed-Point Data Types
  - Port In Block
  - Data Type Propagation
  - Convert and Recast Blocks
  - Bit Extraction/Concatenation
- Vectors
- Model Latency Management
Fixed Point in Synphony Model Compiler
- Synphony Model Compiler uses the Simulink Fixed-Point features
- Synphony MC blocks that provide fixed-point manipulation features:
  - Convert Block: type conversion, quantization and rounding, scaling
  - Bit Manipulation
  - Vector Manipulation
  - IP blocks with internal data path propagation rules
  - IP blocks with output quantization control
Fixed-Point Basics
- All Synphony HLS blocks support Simulink fixed-point data types and propagation features
- Up to 128-bit precision (most blocks)
- Simulink will update and propagate data types when you:
  - Select Edit > Update Diagram
  - Hit <ctrl>-d
  - Start a simulation
- (Figure: fixed-point word layout. Bits are numbered from WL-1 (MSB, the sign bit) down to 0 (LSB); the binary point separates the Integer Length (IL) from the Fraction Length (FL); the Word Length (WL) spans the whole word.)
- Simulink Annotation Format: sfix<WL>_En<FL> (signed), ufix<WL>_En<FL> (unsigned)
- Built-in type equivalents:
  boolean = ufix1
  int8 = sfix8, int16 = sfix16, int32 = sfix32, int64 = sfix64
  uint8 = ufix8, uint16 = ufix16, uint32 = ufix32, uint64 = ufix64
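As a rough illustration of what a word length and fraction length imply, the following Python sketch computes the representable range and resolution of a fixed-point format and builds the annotation string. It assumes ordinary two's-complement fixed point; the helper names are hypothetical and are not part of Simulink or SMC.

```python
def fixed_point_range(word_length, fraction_length, signed=True):
    """Return (min, max, step) of a two's-complement fixed-point format."""
    step = 2.0 ** -fraction_length            # weight of the LSB
    if signed:
        lo = -(2 ** (word_length - 1)) * step
        hi = (2 ** (word_length - 1) - 1) * step
    else:
        lo = 0.0
        hi = (2 ** word_length - 1) * step
    return lo, hi, step


def annotation(word_length, fraction_length, signed=True):
    """Build the sfix<WL>_En<FL> / ufix<WL>_En<FL> annotation string."""
    prefix = "sfix" if signed else "ufix"
    return f"{prefix}{word_length}_En{fraction_length}"


# An 18-bit signed word with an 8-bit fraction length.
LO, HI, STEP = fixed_point_range(18, 8)
LABEL = annotation(18, 8)
```

For the 18-bit, 8-fraction-bit case this gives a range of -512 to just under 512 in steps of 1/256.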
Synphony Port In Block
- Converts from floating point stimulus
- Defines input to design
  - Fixed point data type: word length, fraction length
  - Sample rate
  - Wrap on overflow
  - Round on underflow
- Put Scope on signals of same data type only
Data Type Propagation
- Floating Point formats are used by other Simulink blocks and can drive input ports for stimulus & analysis.
- Port In and Port Out blocks define data types and sample rates.
- All Synphony HLS blocks propagate full precision format to the output. Some blocks will offer different propagation rules that allow a choice of quantization.
- Most blocks support up to 128-bit precision with a few exceptions. A summary of quantization and precision support can be found in the Blockset Summary chapter.
Synphony Signal Conversion: Convert Block Features
- Works on real world value of input
- Data type conversion
- Saturation/Rounding management
- Scaling by 2^N
- Inherited size variables
- Dialog options shown: inherit port, signed, unsigned, preserve, inherit
- Size Inheritance Variables: syn_inp_wl, syn_inp_fl, syn_inh_wl, syn_inh_fl
Synphony Convert Examples
Examples converting an sfix4 type:
- Full Precision (nothing done)
- Convert to unsigned
- Convert to sfix3 with no saturation, i.e. wrap (strips MSB)
- Convert to sfix3 with saturation
- Shift down by 1 bit before converting to sfix3 (scale and strip LSB)
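The difference between wrapping and saturating can be sketched on plain integers. This is a minimal model of the idea, not the Convert block itself: the function name is hypothetical, fraction length is taken as zero, and the shift uses floor rounding, which is an assumption about the rounding mode.

```python
def convert(value, word_length, signed=True, saturate=False):
    """Fit an integer into word_length bits by wrapping or saturating."""
    if signed:
        lo = -(1 << (word_length - 1))
        hi = (1 << (word_length - 1)) - 1
    else:
        lo, hi = 0, (1 << word_length) - 1
    if saturate:
        return max(lo, min(hi, value))        # clamp to the representable range
    wrapped = value & ((1 << word_length) - 1)  # keep only the low bits (strips MSBs)
    if signed and wrapped > hi:
        wrapped -= 1 << word_length           # reinterpret as two's complement
    return wrapped


x = 5                                    # sfix4 value 0101
wrap3 = convert(x, 3)                    # wrap into sfix3
sat3 = convert(x, 3, saturate=True)      # saturate into sfix3
shifted3 = convert(x >> 1, 3)            # shift down by 1 bit, then convert
unsigned4 = convert(-6, 4, signed=False)  # reinterpret sfix4 -6 (1010) as unsigned
```

Wrapping 5 into three signed bits flips the sign, saturation clamps to the largest sfix3 value, and scaling first keeps the magnitude at the cost of the LSB.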
Recast Block: Data Format Management
- Recast works on bit representation of input
Bit Concatenation and Extraction (Extract and Concat blocks)
- Works on bit representation of input
- Output is unsigned integer
- Vector LSB:MSB of result
- Variables shown: syn_inp_wl, syn_inp_fl
Overflow, Minimum and Maximum Logging
- Overflows, minimums and maximums are detected with the Overflow, Minimum and Maximum logging feature.
- Overflows, minimums and maximums can be seen in the Fixed Point Toolbox of Simulink.
- Choose the "Minimums, maximums and overflows" option as the Fixed Point instrumentation mode to log them.
Creating Vectors
- A vector signal dimension is annotated as shown.
- Port In will accept vector input signals from Simulink.
- Vector Concat will combine scalar and vector signals to create a vector signal.
- Shift register has a vector output option that provides a sliding window function.
- Decommutation with vector output will distribute samples into each element of the vector. The output is at a slower sample rate of 1/N.
- Vector Expand will replicate a scalar into a vector.
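Decommutation with vector output can be pictured as grouping a serial stream into N-element vectors, one vector per N input samples. The sketch below is a behavioral illustration only; the function names and the element ordering (first sample into the first element) are assumptions rather than documented SMC behavior.

```python
def decommutate(samples, n):
    """Group a serial stream into n-element vectors (output rate is 1/n)."""
    usable = len(samples) - len(samples) % n   # drop an incomplete final group
    return [samples[i:i + n] for i in range(0, usable, n)]


def commutate(vectors):
    """Inverse operation: serialize vectors back into one fast stream."""
    return [sample for vector in vectors for sample in vector]


stream = list(range(8))
vectors = decommutate(stream, 4)   # two 4-element vectors
restored = commutate(vectors)      # the original stream again
```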
Vector Processing
- 16-element vector inputs are propagated to downstream blocks
- 16-element vectorized multiply and add
- 16-element gain
- All memories also support vectorized operations
- All inputs must have the same vector size, including enables, resets, etc. Use Vector Expand to drive these inputs from a scalar
FIR and IIR Filter Banks
- Decommutator converts the input stream into a slower-rate vector
- Vector input infers an FIR for each element, i.e. a 256-bank FIR or IIR
- A single-input adder will infer a sum-of-elements function, creating a scalar output
- Example: 256-channel polyphase decimating filters
Multi-Rate Modeling (Module 3)
Multi-Rate Support Overview
- Sources: Port In; Counter, Constant, etc.
- Basic Rate Changes: Upsample/Downsample; De/Commutator; Parallel2Serial/S2P; Mux; FIFO; Dual Port RAM
- Application Specific (covered in later section): FIR; Resampler/CIC; Puncturing
Example - Downsampling
- Downsample ratio determines the output rate
- Offset allows choice of which sample in the period to provide at the output
- Offset of zero picks the first sample and has zero latency
- Offset of one or higher picks the corresponding sample in the period and always has a latency of 1
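The sample selection described above can be sketched as list slicing. This is an illustrative model only: the function name is hypothetical and the sketch shows which samples are kept, without modeling the block's latency.

```python
def downsample(samples, ratio, offset=0):
    """Keep one sample per period of `ratio` inputs, chosen by `offset`."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    if not 0 <= offset < ratio:
        raise ValueError("offset must lie within one period")
    return samples[offset::ratio]


stream = list(range(10))
first_of_period = downsample(stream, 4)             # offset 0: first sample of each period
second_of_period = downsample(stream, 4, offset=1)  # offset 1: second sample of each period
```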
Commutation and Decommutation
- Decommutation and commutation are common operations in multi-rate signal processing
Multi-Rate Memories
- FIFO and Dual Port RAM blocks support multiple rates on different ports
HLS Optimizations: System-Level Retiming (Module 4, HLS Intro & Retiming)
- HLS/SMC Optimization Terminology
- Basic Optimizations
  - Constant propagation and coefficient optimization
  - Multi-rate counter sharing
  - Multi-rate filter transformations
  - Memory/Register optimizations
- Retiming & Model Latency Management
HLS Optimizations Terminology
- Top-level optimization control is done by constraints in the lower left panel
- SMC optimization terminology and features:
  - System-wide optimizations are directly controlled by the constraints. They are applied globally to the entire design to create a system-wide architecture
  - IP-level optimizations are automatically done at the block level for more complicated IP-level functions
  - All optimizations are target-aware, based on the technology characterization of the selected target. Optimizations will sometimes rely on logic synthesis inferencing in the downstream tool to optimize operations to device resources
  - A baseline implementation is created when no constraints are provided, but will still reflect many optimizations for target, inferencing, and IP
- Advanced controls are also available (discussed in another module)
System-Wide Optimization Examples
For each optimization constraint: architectural optimization techniques, benefit, and costs.
- None (Baseline)
  - Techniques: Micro-architectural exploration at block and IP-level
  - Benefit: Automatic selection of the micro-architectures that give the best area for the given sample rate
  - Costs: None
- Retiming
  - Techniques: Insertion of additional pipeline stages if needed to meet sample rates
  - Benefit: Higher speed
  - Costs: More registers
- Folding
  - Techniques: Serialization using a faster clock with resource sharing and scheduling
  - Benefit: Reduced area
  - Costs: Tighter timing constraint; control/mux overhead
- Folding (Multi-rate designs)
  - Techniques: Resource sharing and scheduling applied to all clock domains
  - Benefit: Reduced area
  - Costs: Control/mux overhead
- Multi-Channelization
  - Techniques: Replication into multiple channels; resource sharing and scheduling using faster clocks
  - Benefit: Reduced area; automatic multi-channel implementation
  - Costs: Tighter timing constraint; control/mux overhead
Technology Characterization Options
1. Estimation Time Mode (EM)
  - Uses a pre-calculated characterization for estimating timing
  - Uses pre-determined heuristics for IP-architectural exploration
2. Advanced Timing Mode (AT)
  - Uses more accurate timing estimates by calling Synplify Pro (FPGA) or Design Compiler (ASIC)
  - Includes more precise device speed grade estimates
  - Includes fanout information in the exploration
  - Explores more micro-architectural options with more accurate feedback on inferencing and coding style
  - Reports on timing loops
  - Summarizes whether block-level timing exploration meets constraints
Retiming
- Goal: Meet timing with minimal increase in area
- What does it do?
  - System-Wide:
    - Uses technology characterization to analyze timing for all data paths
    - Inserts new registers where needed
    - Moves existing registers along the data path
    - Maintains consistent latency across parallel paths
  - IP-Architectural:
    - Explores block-level architecture options in the target technology
    - Includes pipelined versions of potential micro-architectures
    - Chooses the smallest option that meets timing
Multi-Rate Filter Optimizations
- FIR filters preceded or followed by up/down samplers have an equivalent polyphase version
- Exploits the fact that rate conversions throw away samples
- Automatic transformation
- The resulting filter bank runs at the slower rate; with folding this can result in lower area
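The polyphase equivalence can be checked numerically. The sketch below is a plain Python illustration of the standard decimation identity, not the transformation code used by SMC; the function names and the integer test data are assumptions. It filters at the input rate and discards samples, then computes the same outputs with M sub-filters that each run at the output rate.

```python
def fir(x, h):
    """Direct-form FIR with zero initial state: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_then_decimate(x, h, m):
    """Filter at the fast rate, then keep every m-th output sample."""
    return fir(x, h)[::m]


def polyphase_decimate(x, h, m):
    """Same outputs from m sub-filters that all run at the slow rate."""
    n_out = (len(x) + m - 1) // m
    y = [0] * n_out
    for p in range(m):
        hp = h[p::m]   # sub-filter p uses taps h[p], h[p + m], ...
        # Branch p sees x[n*m - p]; samples before the stream starts are zero.
        xp = [x[n * m - p] if n * m - p >= 0 else 0 for n in range(n_out)]
        y = [a + b for a, b in zip(y, fir(xp, hp))]
    return y


x = list(range(1, 33))   # 32 integer input samples
h = list(range(1, 17))   # 16 integer taps
direct = fir_then_decimate(x, h, 4)
poly = polyphase_decimate(x, h, 4)   # four 4-tap sub-filters
```

Integer data is used so that the two results can be compared exactly; with M = 4 and 16 taps, each sub-filter has four taps and processes one sample per output sample.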
High-Level Synthesis - Folding (Module 5)
Folding
- Goal: Reduce area
- What does it do?
  - System-Wide:
    - Infers a faster clock of F x sample rate
    - Applies retiming algorithms to pipeline data paths
    - Applies algorithms to detect which operations can be implemented using resource sharing
    - Schedules operations and implements scheduling logic
  - IP-Architectural:
    - Explores block-level sequential architectural options in the target technology
    - Includes potential serialized and pipelined micro-architectures
32-Tap FIR Folding Example
- 32-Tap FIR
- 18-bit input
- Coefficients: 18-bit fixed
- Low Pass frequency spec
- Positive Symmetric
Folding Exploration Results
Results for 32-tap FIR into Spartan 3E:
- Baseline implementation barely fits into device
- Increasing folding factor begins using mults and storage for optimal serialized architecture implementation
- Culminates in a fully serial architecture which easily fits

Target Device  Sample Rate  Folding Factor  Inferred Clock  Speed Achieved?  Registers/SRL16Es  LUTs  HW 18x18 Multipliers  Max Util. %
XC3S100E-4     2 MHz        None            None            Yes              534/0              1889  0                     98%
XC3S100E-4     2 MHz        4               8 MHz           Yes              512/108            2077  4                     108%
XC3S100E-4     2 MHz        8               16 MHz          Yes              422/126            518   4                     26%
XC3S100E-4     2 MHz        16              32 MHz          Yes              306/72             300   2                     15%
XC3S100E-4     2 MHz        32              64 MHz          Yes              200/36             225   1                     11%

XC3S100E-4: Xilinx Spartan 3E. Results after Logic Synthesis.
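The Inferred Clock column follows directly from the rule that folding infers a clock of F x sample rate. A minimal Python check of that arithmetic, with a hypothetical helper name, looks like this:

```python
def inferred_clock_mhz(sample_rate_mhz, folding_factor):
    """Clock inferred by folding: folding factor times the sample rate."""
    return sample_rate_mhz * folding_factor


# 2 MHz sample rate with the folding factors explored for the 32-tap FIR.
FOLDING_FACTORS = (4, 8, 16, 32)
CLOCKS_MHZ = [inferred_clock_mhz(2, f) for f in FOLDING_FACTORS]
```

Area and multiplier counts do not follow from such a formula; they come from logic synthesis and have to be read from the reported results.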
Folding on Multi-Rate Designs
- Goal: Reduce area
- What does it do?
  - System-Wide:
    - Infers a faster clock of F x fastest sample rate
    - Applies retiming algorithms to pipeline data paths
    - Applies folding clock optimizations to all sample rate domains
  - Micro-Architectural:
    - Counter sharing across upsample and downsample blocks
    - Polyphase transformations on multi-rate FIR
    - Explores block-level architectural options in the target technology, including potential serialized and pipelined micro-architectures
Multi-Rate Folding Results
Results for 50 MHz 16-tap decimate x4 FIR:
- Polyphase transformation creates four 4-tap filters at 12.5 MHz
- Fold x1 simply exploits the fastest available clock

Target Device  Sample Rate  Folding Factor  Inferred Clock  Speed Achieved?  Registers  LUTs  HW Multipliers (DSP48Es)
XC5VLX30-1     50 MHz       1               None            Yes              166        88    4
XC5VLX30-1     50 MHz       4               200 MHz         Yes              281        186   1

XC5VLX30-1: Xilinx Virtex 5. Results after Logic Synthesis.
Verification Features (Module 8)
- Testbench generation
- C-model generation
- RTL encapsulation
- Importing RTL into Simulink
SMC C-Model Generation
- Create C-models of the Simulink model to:
  - Increase simulation speeds
  - Validate system integration in external simulators
  - Verify SW/HW quickly
- Automated, flexible simulator support:
  - Re-use in Simulink
  - RTL simulators (ModelSim, VCS)
  - Native ANSI-C direct executable
  - SystemC simulators
- (Figure: an SMC high-level model used for high-level design & verification feeds Synphony Model Compiler High-Level Synthesis, which produces RTL for multiple architectures and targets plus bit- and cycle-accurate C-models for system verification.)
C-Output - Creating C Output Files
- The Generate C Code option (enabled only if the Generate RTL test bench option is selected) enables the C-Output feature.
- The tool generates C output for the Verilog RTL when Verilog and VHDL are selected together.
C-Output - Creating C Output Files (cont'd)
For a model test.mdl, the directory structure of the C Output files is as follows:
- Base folder containing the test.mdl file
  - Implementation folder containing the cout folder
    - cout folder containing the C Output files:
      - File containing function definitions
      - C driver file to verify the generated C Model
      - File containing functions useful for debugging
      - Configuration header file
      - Header file containing the function declarations
      - Supporting header file with variable declarations
      - Makefile to run and verify the C Model
      - VCProj file for creating the executable in VC++ on Windows
C-Model Results
- Design Complexity
  - Digital Chaos Modem with large BER (bit-error rate) simulations
  - Many instances of IP blocks: large filters, memories, Viterbi Decoder
  - 11 sample rates with synchronous relationships, up to 300 MHz
  - IP folding for area optimization
- C-Model Benefits
  - Up to 30X speedup over Simulink
  - Up to 40X speedup over RTL simulator
  - Over 10X higher verification productivity
- Simulation times:
  - Simulink Model: 6165 sec (1X)
  - C-Model Direct Exe.: 206 s (30X)
  - C-Model in Simulink: 230 s (27X)
  - C-Model in RTL Simulator: 200 s (40X)
  - Optimized RTL, RTL Simulation: 8342 s (.73X)
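The speedup factors can be reproduced from the run times on the slide. The short Python sketch below assumes, as the benefit bullets suggest, that the 30X and 27X figures are relative to the Simulink run and the 40X figure is relative to the RTL simulation; the variable names are hypothetical.

```python
SIMULINK_S = 6165   # Simulink model run time in seconds
RTL_SIM_S = 8342    # RTL simulation run time in seconds


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


direct_exe = speedup(SIMULINK_S, 206)     # C-model as direct executable
in_simulink = speedup(SIMULINK_S, 230)    # C-model inside Simulink
in_rtl_sim = speedup(RTL_SIM_S, 200)      # C-model inside an RTL simulator
rtl_vs_simulink = speedup(SIMULINK_S, RTL_SIM_S)
```

Rounded, these come out at about 30X, 27X, a little over 40X, and roughly 0.74X, consistent with the quoted figures.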
Using RTL in Simulink Models: RTL Encapsulation Block
- RTL Encapsulation Features
  - Insert RTL in SMC model
  - Fast simulation
  - No dependency on external RTL simulator
  - Verilog and VHDL support
- RTL Block Use Models
  - Use legacy IP or 3rd-party IP that was developed in RTL
  - Add interface RTL to ease integrated system verification
  - Add state machines and cycle-accurate control
  - MATLAB-driven verification of RTL blocks
Advanced Features and IP for ASIC Targets (Module 9, Using SMC for ASICs)
User Flow for ASIC Targets
- (Flow diagram: the Synphony HLS synthesis engine performs architectural optimizations (DSP synthesis). If an FPGA target is selected, it produces FPGA RTL with SynPRO.prj and SynPRO.sdc for Synplify Pro synthesis (logic optimizations, RTL synthesis). For an ASIC target, it performs memory extraction and RTL resolution/retiming and produces ASIC RTL, memory RTL, and an .sdc constraint file for ASIC logic synthesis; the user inserts 3rd-party memory.)
- Synphony HLS creates:
  - Implementation with separate memory modules using RTL simulation model
  - RTL testbench to verify with standard simulators
- User Selects Memory Implementation:
  - Generate custom memories using preferred vendor compiler (i.e. Artisan, Virage, ...)
  - Replace memory references to point to new memory implementation modules
  - Use testbench to verify
- ASIC Logic Synthesis
  - Turning on retiming fine-tunes placement of pipeline registers inserted by Synphony HLS
  - Use the Synphony HLS generic .sdc file, which defines all clock constraints
ASIC RAM Extraction
- The tool gives the user control to selectively extract RAMs of different types from a model.
- Extracted RAMs satisfying user thresholds are reported in the shls.log file.
- Verilog and VHDL simulation models are generated for the extracted RAM(s).
- Example log output:
  EXTRACTED RAM INFORMATION
  *****************************************
  RAM type "1RW" : 1 items of size 128x14
  Total number of RAM modules = 1
- Although RAM1 satisfies the Memory Width threshold, its depth is below the Memory Depth threshold, so it's not extracted.
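The threshold rule implied by the last bullet, where a RAM is extracted only if it satisfies both the width and the depth threshold, can be sketched as follows. This is a guess at the decision logic for illustration only: the function name, the threshold values, and the use of "greater than or equal" comparisons are all assumptions, not documented tool behavior.

```python
def should_extract(depth, width, min_depth, min_width):
    """Assumed rule: extract a RAM only if it meets both user thresholds."""
    return depth >= min_depth and width >= min_width


# Hypothetical thresholds and RAM sizes (depth x width).
MIN_DEPTH, MIN_WIDTH = 64, 8
extracted = should_extract(128, 14, MIN_DEPTH, MIN_WIDTH)    # wide and deep enough
too_shallow = should_extract(32, 14, MIN_DEPTH, MIN_WIDTH)   # width passes, depth fails
```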
Advanced Timing Mode for ASIC
- Example DC setup file
- Users can also add their customizations to the setup file (e.g. setting the synthetic library)
Advanced Timing Mode Requirements
- Advanced Timing Mode requires:
  - Valid ASIC target libraries
  - Valid Synopsys DC setup file
  - Design Compiler Ultra license
- Advanced Timing Mode for ASIC can only run on a supported x86 Linux environment.
- If a license is available, Design Compiler instantiates basic DesignWare Building Blocks (Adders, Multipliers)
Advanced Timing Mode for ASIC
- The setup file can be defined in the Timing Engine Configuration window or in the implementation options.
Design Example: Using SMC for ASIC Power Optimization for a Digital Downconverter
Multirate High-Level Synthesis for ASIC
- Easier Design Entry
  - High-Level MR IP Library (Synphony IP Model Library)
  - Concise Parallelism with Vector Notation
  - MATLAB-Driven Verification
- 10X Reduction in Implementation & Verification Effort
  - HLS MR System & Subsystem Optimizations
  - HLS IP Optimizations
  - MR Clock Implementation
  - Automatic RTL and C-Model Testbench Creation
- Fast and Accurate Power Analysis & Optimization
  - Automatic Activity Data for Power Estimation
  - DesignWare Datapath Optimizations
  - Clock-Gating Power Optimizations
  - Incorporate ASIC constraints back into HLS
- Integrated HLS Solution (Synphony Model Compiler, VCS, DC Ultra)
  - Rapid Arch. Exploration
  - More Optimal Designs
  - Rapid Area & Power Exploration
  - Fast FPGA Prototyping
  - Retarget/Explore New Technologies
Digital Radio Example: High-Level Design Using SMC Library
Implementation Challenges
- Clocking? How many?
  - Implement Clock Domain Crossings
  - Impact on enables and resets
  - Synchronization in testbench
- Implementing larger functions (i.e. building blocks or IP)
- Degrees of Parallelism vs. Sequential (Resource Sharing)
  - How much in each rate domain?
  - Are these building blocks available for the given degree of parallelism?
- Pipelining for Timing
- Interfaces
- Testbench (block-level verification)
- Exploration
- How Much Time & Effort is Required?
  - RTL Hand Coding
  - Verification Hand Coding
- How Much Exploration is Possible?
Applying High-Level Synthesis Optimizations
- Target: 45nm ASIC (Estimation Mode)
- Apply Folding (Applied System-Wide)
- Apply Retiming/Pipelining
- Outputs:
  - RTL Optimized for Target Technology using Optimized HW Architecture
  - RTL Testbench for HW Verification
  - C-Model for System Verification
How Folding Works Across Multiple Rates
- Goal: Reduce area
- What does it do?
  - Infers a system-wide clock of (Folding Factor * Sample Rate)
  - Using the faster clock in each sample rate domain:
    - Apply time-domain multiplexing (TDM) to share HW for expensive operations (i.e. multiply)
    - Also apply retiming to ensure timing is met
  - Sharing increases in slower sample rate domains: the ratio of system clock to sample rate allows more sharing
- (Figure: operations in the high-level model, with three coefficient multipliers c1, c2, c3, adders, and enabled, resettable registers (3 Mult + 2 Add), versus the Fold x3 implementation, which requires less HW (1 Mult + 1 Add).)
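The resource-sharing idea can be modeled behaviorally: a parallel datapath uses one multiplier per coefficient, while a Fold x3 datapath reuses a single multiplier and adder over three fast-clock cycles per sample. The Python sketch below is a simplified sum-of-products model rather than the exact register structure in the figure, and the function names and data values are hypothetical.

```python
def parallel_sum_of_products(coeffs, xs):
    """Unfolded datapath: one multiplier per coefficient plus an adder chain."""
    return sum(c * x for c, x in zip(coeffs, xs))


def folded_sum_of_products(coeffs, xs):
    """Folded datapath: one shared multiplier and one shared adder,
    time-multiplexed over len(coeffs) fast-clock cycles per sample."""
    accumulator = 0
    fast_cycles = 0
    for c, x in zip(coeffs, xs):       # one iteration per fast-clock cycle
        product = c * x                # the single shared multiplier
        accumulator += product         # the single shared adder
        fast_cycles += 1
    return accumulator, fast_cycles


coeffs = [3, -1, 2]   # stand-ins for c1, c2, c3
xs = [10, 20, 30]
reference = parallel_sum_of_products(coeffs, xs)
folded, cycles = folded_sum_of_products(coeffs, xs)
```

Both datapaths produce the same value; the folded one pays for its smaller arithmetic with three fast-clock cycles per sample plus the control and multiplexing logic that the loop stands in for.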
Impact of Folding on Area and Power
- Folding Costs
  - TDM overhead: muxes, state machines, and storage
  - Higher clock frequencies
  - Higher logic activity, hence higher power
- Avoid Over-Folding
  - Incur overhead costs with little benefit
  - Power increase
- Conclusion: Apply folding separately on slower rate subsystems
- (Chart: Digital Radio Global Folding Optimization, Design & Power Compiler results on TSMC 40nm LP. Area (Gates) and Dynamic Power for Parallel (No Folding), Fold x1, and Fold x2; annotations: "Area reduced but higher power" and "Overfolding".)
Solution: Hierarchical Application of Folding Using HLS Subsystem
- Encapsulate Slower Filter Datapath in an HLS Subsystem
- Apply Folding Separately to this Subsystem (and other HLS optimizations)
- Increases Exploration Capability
Hierarchical Folding Results
- Results using DC Ultra F-2011.09, compile_ultra default settings
- DDC results: folding of filter subsystem, no folding of top-level
- Lowest area: 25% reduction
- Lowest power: 34% reduction
- Overfolding: limited multiplier reduction
- [Figure: Area (Gates), DW_Mults Inferred, and Dynamic Power for Full Parallel, Fold x1, Fold x2, Fold x4, Fold x8, Fold x16, Fold x32]
Multi-rate Clock Implementation Choices
Dedicated Clocking
- Slower rates are driven by dedicated clocks (e.g. PLL or clock divider)
- Advantage: lower activity, resulting in lower power
- Disadvantage: more complex clock domain crossings and logic synthesis constraints; requires FPGA device support for multiple clock domains
Enabled Clocking
- Slower rates are driven by the fastest clock, with clock enables determining the period
- Advantage: no clock domain crossings, and no multi-clock-domain infrastructure support required of the device
- Disadvantage: larger and more active enable nets can impact timing closure
[Figure: block diagrams of Dedicated Clocking and Enabled Clocking, labeled with FCLK, SCLK, Global_EN, EN, Enable Logic, Fast Clock Domain Logic, and Slow Clock Domain Logic]
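Enabled clocking can be pictured with a short behavioral sketch: a counter on the fast clock raises an enable once every N cycles, and a slow-rate register loads only on those cycles. The function names and the counter-based enable generator are assumptions made for illustration, not the circuit SMC generates.

```python
def clock_enable(divide_by, n_fast_cycles):
    """Return a list with 1 on the fast-clock cycles where slow logic updates."""
    count = 0
    enables = []
    for _ in range(n_fast_cycles):
        enables.append(1 if count == 0 else 0)
        count = (count + 1) % divide_by       # free-running divider counter
    return enables


def run_enabled_register(data_in, enables):
    """Register clocked by the fast clock that loads only when enable is high.

    The returned trace holds the register value after each fast-clock edge.
    """
    q = 0
    trace = []
    for d, en in zip(data_in, enables):
        if en:
            q = d                              # load on enabled cycles only
        trace.append(q)
    return trace
```

For a divide-by-4 enable, the register changes value once every four fast-clock cycles, so it behaves like logic on a clock four times slower while staying in the single fast clock domain.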
SMC Multi-rate Clock Implementation
- Choice of clock strategy
- SMC automatically generates the clock/reset circuit, which synchronizes clocks, resets, and enables
- Choice of:
  - Clock strategy
  - Input oscillator(s)
  - Clock synthesis types
  - Power-on and user reset polarities
Clocks & Resets: What SMC Generates
- New top-level structure with:
  - Core Design (i.e. with many clock domains)
  - Clock_Reset module that implements the selected clocking strategy
- Uses RTL coding styles for power optimization in ASIC flow
- Simplifies the top-level system interface while allowing more complex multi-rate (MR) HW architectures to be used in the design core
- [Figure: top-level design containing the Clock-Reset Module (.v/.vhd) and the Core Design Module (.v/.vhd); signal labels: osc1, porst, g_en, CLK1, CLK8, CLK64, CLK128, CLK256, CEN1, CEN64, rst]
Power Optimizations in ASIC RTL Flow
- Previous results were for a default flow
- Many options are available for lower power (from the Power Compiler and DW MinPower User Guides):
  1. Turn on automatic clock-gating conversion
  2. Use DesignWare MinPower
  3. Turn on datapath clock gating (DW components)
  4. Add activity data (significant accuracy improvement)
  5. Turn on dynamic and leakage optimization
  6. Constrain synthesis for area optimization
Automatic Activity Data Generation
- The SMC testbench can be used to automatically create activity data (using a SAIF file)
- Activity data enables significant improvement in power and area optimization in logic synthesis:
  - Much higher accuracy in optimizations
  - Enables some optimizations not available without activity data
- Benefits of using SMC:
  - Very easy to create in a higher-level environment
  - Enables early measurement of architecture decisions
- [Figure: flow diagram labeled SMC HLS, Testbench Gen., RTL & SDC, Testbench, VCS, SAIF, DC Ultra w/power Optimizations, IC Compiler; stimulus annotation: "Adjacent Channel -18 dB"]
Automated & Complete Power Synthesis Using Design Compiler Ultra
- Flow: RTL into Design Compiler Ultra with Power Compiler, producing a netlist optimized for power, timing, area, and test
- Clock gating:
  - Reduces clock-net switching power
  - Reduces register internal power
  - Reduces area
- [Figure: non-clock-gated implementation (D_IN, EN, CLK, Register Bank, D_OUT) next to the clock-gated implementation (EN, CLK, Latch, ICG, G_CLK, D_IN, Register Bank, D_OUT)]
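The effect of clock gating on clock activity can be sketched behaviorally. The model below is an assumption-laden illustration, not Power Compiler behavior: it treats the non-gated register as a free-running clock plus a feedback mux, and the gated register as receiving a clock pulse only in cycles where EN is high.

```python
def simulate_bank(data_in, enables, gated):
    """Simulate one enabled register and count clock events at its clock pin.

    Returns (outputs, clock_events). Both styles produce the same outputs;
    only the number of clock events reaching the register differs.
    """
    q = 0
    outputs = []
    clock_events = 0
    for d, en in zip(data_in, enables):
        if gated:
            if en:                     # gated clock pulses only when EN is high
                clock_events += 1
                q = d
        else:
            clock_events += 1          # free-running clock arrives every cycle
            q = d if en else q         # feedback mux holds the old value
        outputs.append(q)
    return outputs, clock_events
```

With an enable that is high in 3 of 8 cycles, both versions give identical outputs, but the gated register sees 3 clock events instead of 8. Removing the per-bit feedback mux is also one reason gating can reduce area.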
Turning On DesignWare minpower
- Use heuristic power models for architecture evaluation
- Integrate with DC and Power Compiler for architecture tradeoff
- Back off the solutions with timing impact
- Minimize area tradeoff (<1% on average)
- Optimize automatically when opportunity exists
- Requires no flow changes
- [Figure: flow labeled RTL, DC Ultra, Power Compiler, minpower, with a "Cost Function Met?" decision]
SMC Advanced Timing Mode
- Correlation between DC and HLS is critical to timing closure
- The advanced power and area RTL optimization techniques can have significant impact on timing
- SMC Advanced Timing Mode dynamically characterizes blocks based on the user's DC settings
- Result: HLS optimization results will have high correlation with DC using low power optimizations
- [Figure: flow labeled User DC Settings File, User-specified tech lib, High-Level Blocks, DC Synthesis, Block Timing & Area, High-Level Model, SMC in Advanced Timing Mode, High-Level Synthesis Optimizations, RTL Optimized for Target Technology using Optimized HW Architecture]
Incorporating ASIC Power Optimizations Into SMC Flow
- Specify DC setup file with user settings:
  - TSMC 40nm Low Power lib.
  - Turn on power optimizations
  - Other logic synthesis settings
- Turn on Advanced Timing Mode
- RTL Optimized for Target Technology using Optimized HW Architecture
- RTL Testbench for HW Verification
Results with Power-Optimized DC Flow
- Results using power optimizations described previously
- 70% power reduction vs. non-power-optimized result; this was mostly reduction of net switching
- Similar area vs. power tradeoff curve using folding
- [Figure: power in mW (axis 0.00 to 14.00) for Full Parallel, Fold x1, Fold x2, Fold x4, Fold x8, Fold x16, Fold x32; series: Eclk Dynamic Pwr Opt. ASIC Flow, Eclk Dynamic Pwr Std ASIC Flow]
HLS Results Summary
- Results using power optimizations described previously
- Best power architecture: Parallel-Dclk (2.9 mW), but with 57% greater area (~66K gates)
- Best area architecture: Foldx16-Eclk (~42K gates), but 43% higher power (4.26 mW)
- [Figure: power in mW (axis 0.00 to 4.50) for Full Parallel, Fold x1, Fold x2, Fold x4, Fold x8, Fold x16, Fold x32; series: Dclk Area, Eclk Area, Dclk Dynamic Power, Eclk Dynamic Power]
Synphony Model Compiler Summary
High-Level Synthesis for Model-Based Design
- Quickly create synthesizable multi-rate algorithms using optimized IP model library
- Verify & validate early using Simulink simulation and debugging
- Globally optimize system architecture and IP using high-level synthesis
- Achieve superior QoR and capacity using high quality RTL flows for ASIC and FPGA
- Implementation with automatically optimized system-wide architecture and IP cores
- [Figure: flow labeled Technology-Independent Model, IP Model Library, Synphony Model Compiler High-Level Synthesis, User-Specified Optimizations, RTL for multiple architectures and targets, RTL Hardware Verification, C-Models for Verification; block labels include fft and filter]