Platform-Based Behavior-Level and System-Level Synthesis. Prof. Jason Cong UCLA Computer Science Department


1 Platform-Based Behavior-Level and System-Level Synthesis Prof. Jason Cong UCLA Computer Science Department

2 Outline Motivation xpilot system framework Behavior-level synthesis in xpilot Advantages of behavioral synthesis Scheduling Resource binding System-level synthesis in xpilot Synthesis for ASIP platforms Design exploration for heterogeneous MPSoCs Conclusions

3 ASIC SOC Example: Philips Nexperia. Philips Nexperia SoC platform for high-end digital video. General-purpose scalable RISC processor: 50 to 300+ MHz, 32-bit or 64-bit. Scalable VLIW media processor: 100 to 300+ MHz, 32-bit or 64-bit. Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB. Nexperia system buses. [Block diagram labels: MIPS CPU (PRxxxx) and TriMedia CPU (TM-xxxx), each with D$ and I$; device IP blocks; PI buses; MPEG, video, MSP, and access-control blocks; MMI; DVP memory bus; SDRAM; DVP system silicon.] Courtesy Philips.

4 Field-Programmable SOC Example: Xilinx Virtex-4 FPGA. Soft-core processor MicroBlaze: 180 MHz, < ~1300 LUTs, 166 DMIPS. PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS, RISC core (32-bit Harvard architecture). [Block diagram labels: IP blocks, IBM CoreConnect bus, MicroBlaze, H.264/AVC hardware blocks.] Courtesy Xilinx.

5 Needs for Electronic System-Level (ESL) Design Automation Need executable models for system-level specification Need common specification for SW/HW co-design Need better complexity management

6 ESL Landscape. Modeling: SystemC (open source), SystemVerilog. Simulation and verification: behavior-level simulation and verification; system-level simulation and verification. SystemC provides behavior-level and system-level synthesis capabilities for free and is rapidly gaining popularity. Synthesis: behavior-level synthesis, from a behavior specification (e.g., C, SystemC, or Matlab) to RTL or netlists; system-level synthesis, from a system specification to a system implementation.

7 xpilot: Platform-Based Synthesis System. Inputs: SystemC/C; platform description & constraints. xpilot flow: xpilot front end; SSDM (System-Level Synthesis Data Model); profiling, analysis, mapping; processor & architecture synthesis (processor cores + executables); interface synthesis (drivers + glue logic); behavioral synthesis (custom logic); embedded SoC. Uniqueness of xpilot: platform-based synthesis and optimization; communication-centric synthesis with interconnect optimization.

8 Outline Motivation xpilot system framework Behavior-level synthesis in xpilot Advantages of behavioral synthesis Scheduling Resource binding System-level synthesis in xpilot Synthesis for ASIP platforms Design exploration for heterogeneous MPSoCs Conclusions

9 xpilot: Behavioral-to-RTL Synthesis Flow. Inputs: behavioral spec. in C/SystemC; platform description. Frontend compiler; SSDM. Presynthesis optimizations: loop unrolling/shifting, strength reduction / tree height reduction, bitwidth analysis, memory analysis. Core synthesis optimizations: scheduling; resource binding, e.g., functional unit binding and register/port binding. µarch generation & RTL/constraints generation: Verilog/VHDL/SystemC. Output: RTL + constraints for FPGAs/ASICs (FPGAs: Altera, Xilinx; ASICs: Magma, Synopsys).

10 xpilot Advantages. Advanced algorithms for platform-based, communication-centric optimization, e.g., a versatile scheduling engine based on solving a system of difference constraints (SDC). Platform-based behavior and system synthesis, e.g., resource binding based on a distributed register architecture. Communication/interconnect-centric approach, e.g., behavior and communication co-optimization. Complete validation through final P&R on FPGAs.

11 Advanced Behavior System Algorithms. Example: Versatile Scheduling Algorithm Based on SDC. The scheduling problem in behavioral synthesis is NP-complete under general design constraints. ILP-based solutions are versatile but very inefficient (exponential time complexity). Our solution: an efficient and versatile scheduler based on SDC (system of difference constraints). Applicable to a broad spectrum of applications: computation/data-intensive, control-intensive, memory-intensive, partially timed. Scalable to large designs (finishes in a few seconds). Amenable to a rich set of scheduling constraints: resource constraints, latency constraints, frequency constraints, relative IO timing constraints. Capable of a variety of synthesis optimizations: operation chaining, pipelining, multi-cycle communication, incremental scheduling, etc. [Figure: example schedule over control steps CS0 and CS1.]

12 Scheduling: Our Approach. Overall approach (current objective: high performance): use a system of integer difference constraints to express all kinds of scheduling constraints, and represent the design objective as a linear function. Example. Platform characterization: adder (+/-): 2 ns; multiplier (*): 5 ns. Target cycle time: 10 ns. Resource constraint: only ONE multiplier is available. With $x_i$ the control step of operation $v_i$, the constraints are as follows. Dependency constraints: $v_1 \to v_3$: $x_3 - x_1 \ge 0$; $v_2 \to v_3$: $x_3 - x_2 \ge 0$; $v_3 \to v_5$: $x_4 - x_3 \ge 0$; $v_4 \to v_5$: $x_5 - x_4 \ge 0$. Frequency constraint $\langle v_2, v_5 \rangle$: $x_5 - x_2 \ge 1$. Resource constraint $\langle v_2, v_3 \rangle$: $x_3 - x_2 \ge 1$. In matrix form the constraints are written with a coefficient matrix $A$, the vector $x = (x_1, \dots, x_5)$, and a right-hand side $b$; $A$ is totally unimodular, which guarantees integral solutions.
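A minimal sketch of how such a system can be solved, assuming each constraint on the slide is read as x_j - x_i >= c and that the earliest (ASAP) non-negative schedule is wanted. It uses a plain Bellman-Ford style longest-path relaxation; the function name `solve_sdc` and the tuple encoding are illustrative choices, not xpilot's implementation.

```python
def solve_sdc(num_vars, constraints):
    """Earliest non-negative integer solution of constraints x[j] - x[i] >= c.

    constraints: list of (i, j, c) with 1-based variable indices.
    Returns {variable: control step}, or None if the system is infeasible.
    """
    x = {v: 0 for v in range(1, num_vars + 1)}
    # Longest-path relaxation: a feasible system converges within
    # num_vars passes; if values still change, there is a positive cycle.
    for _ in range(num_vars):
        changed = False
        for i, j, c in constraints:
            if x[j] < x[i] + c:
                x[j] = x[i] + c
                changed = True
        if not changed:
            return x
    return None


# Constraints as listed on the slide, encoded as (i, j, c) for x_j - x_i >= c.
SLIDE_CONSTRAINTS = [
    (1, 3, 0), (2, 3, 0), (3, 4, 0), (4, 5, 0),  # dependency
    (2, 5, 1),                                   # frequency <v2, v5>
    (2, 3, 1),                                   # resource <v2, v3>
]

SCHEDULE = solve_sdc(5, SLIDE_CONSTRAINTS)
print(SCHEDULE)  # {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Because every constraint involves only a difference of two variables, the relaxation stays in integers throughout, which is the practical consequence of the total unimodularity noted on the slide.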


13 Platform Modeling & Characterization. Target platform specification: a high-level resource library with delay/latency/area/power curves for various input/bitwidth configurations. Functional units: adders, ALUs, multipliers, comparators, etc. Connectors: mux, demux, etc. Memories: registers, synchronous memories, etc. Chip layout description: on-chip resource distributions; on-chip interconnect delay/power estimation. Two binding solutions for the same behavior: which one is better? The answer is platform-dependent: how large/fast are the MUX and ALU? [Figures: two bindings built from MUX and ALU blocks; 3x3 delay matrix for Stratix-EP1S40.]

14 Communication- and Interconnect-Centric Synthesis. Example: Use of Distributed Register-File Architectures. A scheduled DFG with register binding indicated on each variable (assume a one-functional-unit constraint). Binding using discrete registers versus binding using a register file: the register file gives a more efficient design. Distributed register-file microarchitecture: efficiently uses on-chip embedded memories; fully explores operation and data-transfer parallelism. [Figure: Islands A, B, and C, each with a local register file, data-routing logic, input buffers, a MUX, and a functional unit pool (FUP) of ALU, MUL, ALU.]

15 Distributed Register-File Microarchitecture. [Figure: Islands A, B, and C, each with a local register file, data-routing logic, input buffers, a MUX, and a functional unit pool (FUP) of ALU, MUL, ALU; on-chip memory blocks; Xilinx XC-2V 2000.] [Table: on-chip RAM resources on Virtex II FP-SoC devices; columns: number of 18Kb BRAMs, distributed RAM (Kb).]

16 Resource Binding for DRF-Microarchitecture. Facts under simplified assumptions: operations bound onto an island form a chain in the given scheduled DFG; inter-chain data transfers may share a physical inter-island connection; the number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance. Example: operations v1 to v10 bound to islands (chains) A, B, C, and D, with intra-island and inter-island transfers. Inter-island connections = 5: (A,B) = (A,D) = 1; (A,C) = 1, where two data transfers share one connection; (C,D) = 2.

17 Example: Behavior and Communication Co-Optimization in Platform-Based Interface Synthesis. Focus on sequential communication media (SCM): FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon), etc. The transmission order may have a dramatic impact on performance: the best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission. Interface synthesis for SCM: consider both behavior and communication to determine the optimal transmission order. DCT example: process P1 (PE1, custom logic 1) sends data[8] over a FIFO to process P2 (PE2, custom logic 2). P1: for (int i = 0; i < 8; i++) { S1: data[i] = ; } P2: int s07 = data[0] + data[7]; int s16 = data[1] + data[6]; ..
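The effect of transmission order in the DCT example can be sketched with a toy model. The assumptions here are not from the slide: the FIFO delivers one word per cycle, and a sum can start as soon as its later operand has arrived. Only the two sums shown on the slide are modeled, and the reordered sequence is one illustrative choice.

```python
def ready_times(order, sums):
    """Cycle at which each sum has both operands, for a given send order.

    order: list of data indices in the order the producer sends them.
    sums:  {name: (index_a, index_b)} for the consumer's additions.
    """
    # Word k of the order arrives in cycle k (1-based), one word per cycle.
    arrival = {idx: cycle for cycle, idx in enumerate(order, start=1)}
    return {name: max(arrival[a], arrival[b]) for name, (a, b) in sums.items()}


SUMS = {"s07": (0, 7), "s16": (1, 6)}    # the two sums shown on the slide
IN_ORDER = list(range(8))                # data[0], data[1], ..., data[7]
REORDERED = [0, 7, 1, 6, 2, 3, 4, 5]     # operands of each sum sent back to back

print(ready_times(IN_ORDER, SUMS))    # {'s07': 8, 's16': 7}
print(ready_times(REORDERED, SUMS))   # {'s07': 2, 's16': 4}
```

With the natural loop order the consumer cannot start s07 until the last word arrives; sending the operands pairwise lets it start after two cycles.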

18 Proposed SCM Co-Optimization Design Flow. Inputs: process network; platform description & constraints. Front end; System-Level Synthesis Data Model; SCOOP (SCM Co-Optimization): communication order detection, code transformation and interface generation, indices compression for loop reordering. Outputs: drivers + glue logic; process behavior.

19 Initial Results of Interface Synthesis. Target: sequential communication channels, in particular FSL in Virtex-II. Consider two communicating processes. [Table: total latency (cycle #) for traditional vs. SCOOP ordering with the resulting reduction, and RAs compress (before/after), for the designs DCT, Haar, DWT, Mat_mul, DCT, Masking, and Dot.] An average of 26% improvement in total latency can be achieved.

20 SystemC/C-to-RTL Design Flow. SystemC/C specification. xpilot behavioral synthesis: front-end compiler; SSDM (System-Level Synthesis Data Model); SSDM/CDFG; behavioral synthesis; SSDM/FSMD; RTL generation. Platform description & constraints. Outputs: FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints. RTL synthesis; ASICs/FPGAs platform.

21 Preliminary Results of xpilot: Shorter Simulation/Verification Cycle. From other projects: simulation on a behavior model is 100X faster than the RTL-based method [NEC, ASPDAC04]. Our experience: motion-compensation module in an MPEG-4 decoder. Behavior-level (C language) simulation: less than 1 second per frame. RTL SystemC simulation: about 310 seconds per frame.

22 Preliminary Results of xpilot: Better Complexity Management. Significant code size reduction from RTL design to behavioral design: 10x code size reduction. VHDL code generated by UCLA xpilot targeting the Altera Stratix platform.

23 Preliminary Results of xpilot: Rapid System Exploration. Quick evaluation of different hardware/software boundaries. Example: Motion-JPEG implementation: all-HW implementation; all-SW implementation (using embedded processors); SW/HW co-design: optimal partitioning? Repeated manual RTL coding is not a solution!


24 Preliminary Results on Motion-JPEG Example. Pipeline: RAW images; Preprocess; DCT or HW-DCT; Quant; Huffman; encoded JPEG images; with table modification. Model #1: 5 MicroBlazes with FSL-based communication. Model #2: 4 MicroBlazes + DCT on FPGA fabric. Platform: Xilinx XUP board. [Table: system cycle #, Fmax (MHz), execution time (ms), and area (slice #) for Model #1 and Model #2, including a (-38%) change.]

25 Preliminary Results of xpilot: Better QoR (Comparison with UCI/UCSD SPARK). Device setting: Xilinx Virtex-II Pro (xc2v4000-6). Target frequency: 200 MHz. [Table: for SPARK and for xpilot, slice resource usage (slices, LUTs, FFs), DSP, and Fmax (MHz), plus the xpilot/SPARK delay ratio, for the designs PR, WANG, LEE, MCM, and DIR, with the average ratio.]

26 Outline Motivation xpilot system framework Behavior-level synthesis in xpilot Advantages of behavioral synthesis Scheduling Resource binding System-level synthesis in xpilot Synthesis for ASIP platforms Design exploration for heterogeneous MPSoCs Conclusions

27 Design Exploration for Heterogeneous MPSoC Platforms. Heterogeneous MPSoC exploration. Processors: heterogeneous vs. homogeneous; general-purpose vs. application-specific. On-chip communication architecture (OCA): bus (e.g., AMBA, CoreConnect), packet-switching network (e.g., Alpha 21364). Memory hierarchy. [Figure: tiles of µPs, IP, FPGA, and DSP blocks running tasks, OS, and drivers, each attached through a network interface to a communication network.]


28 Configurable SoC Platforms. General-purpose processor cores + programmable fabric. Tight integration using extended instructions (ASIPs). Example: Altera Nios / Nios II. Loose integration using FIFOs/busses for communications. Example: Xilinx MicroBlaze, etc. Custom instruction logic for Nios II [source: Xilinx MicroBlaze [source:

29 ASIP Compilation: Problem Statement. Given: a CDFG $G(V, E)$; the basic instruction set $I$; pattern constraints: number of inputs $|PI(p_i)| \le N_{in}$, number of outputs $|PO(p_i)| = 1$, total area $\sum_{1 \le i \le N} area(p_i) < A$. Objective: generate a pattern library $P$ and map $G$ to the extended instruction set $I \cup P$ so that the total execution time is minimized. Example (a multiplication takes 2 clock cycles, an addition 1 clock cycle): t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4; With extended instructions ext-inst1 (MAC1: 2 cycles) and ext-inst2 (MAC2: 2 cycles): t4 = ext-inst1(a, b, c); t5 = ext-inst2(b, c, d, e); t6 = t4 + t5; Performance speedup = 9 / 5 = 1.8X.
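The speedup figure in the example can be checked with a short sketch. It assumes the latencies from the slide's legend (multiply 2 cycles, add 1 cycle, each extended MAC instruction 2 cycles) and a single-issue core, so instruction latencies simply add up; the operation labels are illustrative.

```python
# Assumed latencies, taken from the slide's legend. Single-issue execution
# means the total is the sum of per-instruction latencies.
LATENCY = {"mul": 2, "add": 1, "ext-inst1": 2, "ext-inst2": 2}


def total_cycles(ops):
    """Total cycles for a straight-line instruction sequence."""
    return sum(LATENCY[op] for op in ops)


# t1..t3 are multiplies, t4..t6 are adds.
ORIGINAL = ["mul", "mul", "mul", "add", "add", "add"]
# t4 and t5 come from the two MAC patterns, t6 is a plain add.
EXTENDED = ["ext-inst1", "ext-inst2", "add"]

speedup = total_cycles(ORIGINAL) / total_cycles(EXTENDED)
print(total_cycles(ORIGINAL), total_cycles(EXTENDED), speedup)  # 9 5 1.8
```

Note that t2 = b * c is computed inside both extended instructions; the mapping trades that duplicated multiplier for fewer issued instructions.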

30 Target Core Processor Model. Core processor model: classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back). The number of input and output operands of an instruction is pre-determined. An instruction reads the core register file during the execute stage and commits the result during the write-back stage. [Figure: pipeline datapath with PC, adder, instruction cache, IF/ID, register file (RS1, RS2), ID/EX, ALU (OP1, OP2), EX/MEM, memory, MEM/WB, and MUX, with custom logic attached to the core processor and returning a result.]

31 ASIP Compilation Flow. C code; front-end compilation; CDFG; 1. pattern generation; 2. pattern selection (pattern library, µarch constraint); 3. application mapping & graph covering; optimized CDFG; backend compilation; optimized assembly. Pattern generation: satisfying input/output constraints. Pattern selection: select a subset to maximize the potential speedup while satisfying the resource constraint. Application mapping: graph covering to minimize the total execution time.

32 Experimental Results on Altera Nios. Altera Nios is used for ASIP implementation: 5 extended instruction formats, up to 2048 instructions for each format. Small DSP applications are taken as benchmarks. [Table: number of extended instructions, speedup (estimation, Nios), and resource overhead (LE, memory, DSP block) for fft_br, iir, fir, pr, dir, and mcm, with averages.]

33 Architecture Extension for ASIPs. Data bandwidth problem: limited register file bandwidth (two read ports, one write port); ~40% of the ideal performance speedup will be lost. Shadow-register-based architectural extension: core registers are augmented by an extra set of shadow registers, conditionally written during the write-back stage; low power/area overhead. Novel shadow-register binding algorithms are developed. [Figure: the core pipeline datapath extended with a hashing unit, k = hash(j), and shadow registers SR1 to SRK feeding the custom logic.]

34 Ongoing Work: Mapping for Heterogeneous Integration with Multiple Processing Cores. Given: a library of processing cores $P$ and a communication library $C$; a task graph $G(V, E)$; for each $v$ in $V$, the execution time $t(v, p_i)$ on $p_i$; for each $(u, v)$ in $E$, the communication data size $s(u, v)$; a throughput constraint. Problem: select and instantiate the processing elements and communication channels from $P$ and $C$ respectively, and map the tasks onto the processing elements and the communications onto the channels, so that the optimal latency is achieved subject to the throughput constraint and the implementation cost is minimized.
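A deliberately simplified sketch of the mapping problem, to make the inputs and the constraint concrete. It is exhaustive search, not the method under development, and it assumes things the slide does not state: communication channels and latency are ignored, the throughput constraint is modeled as a per-core load limit (`period`), and all task names, core names, times, and costs are hypothetical.

```python
from itertools import product


def cheapest_mapping(tasks, cores, exec_time, core_cost, period):
    """Try every task-to-core assignment; return (cost, mapping) or None.

    A mapping is feasible when no core's summed execution time exceeds
    `period` (a stand-in for the throughput constraint). Cost counts only
    the cores that are actually used.
    """
    best = None
    for choice in product(cores, repeat=len(tasks)):
        load = {}
        for task, core in zip(tasks, choice):
            load[core] = load.get(core, 0) + exec_time[task][core]
        if max(load.values()) > period:
            continue
        cost = sum(core_cost[c] for c in load)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(tasks, choice)))
    return best


# Hypothetical data: t(v, p) per task and core, and a cost per core.
TASKS = ["parse", "idct", "mc"]
CORES = ["cpu0", "cpu1", "hw"]
EXEC_TIME = {
    "parse": {"cpu0": 6, "cpu1": 6, "hw": 50},
    "idct": {"cpu0": 5, "cpu1": 5, "hw": 1},
    "mc": {"cpu0": 4, "cpu1": 4, "hw": 1},
}
CORE_COST = {"cpu0": 10, "cpu1": 10, "hw": 4}

BEST = cheapest_mapping(TASKS, CORES, EXEC_TIME, CORE_COST, period=8)
print(BEST)  # (14, {'parse': 'cpu0', 'idct': 'hw', 'mc': 'hw'})
```

The search space grows as |cores|^|tasks|, so this only illustrates the formulation on toy inputs.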

35 MPEG-4 Simple Profile Decoder: Architecture Profiling. C specification overview (module name, original C source file, original C line count): Copy Controller, copycontrol.c, 287; Display Controller, displaycontrol.c, 358; Motion Comp., Motion-Compensation.c; Parser/VLD, parser.c and texture_vld.c; Texture/IDCT, texture_idct.c; Texture Update, textureupdate.c. Runtime profiling (PowerPC/XUP board): Parser/VLD 59.0%, Texture/IDCT 18.1%, Motion Comp. 15.7%, Copy Controller 3.6%.

36 MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation. HW block integrated with a PowerPC single-process design; software blocks running on the PowerPC. 15% speed improvement.
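As a sanity check on the reported improvement, Amdahl's law bounds what offloading one block can gain. The sketch assumes that the block moved to hardware is motion compensation and that its share of runtime is the 15.7% read from the flattened profiling chart on the previous slide; both are interpretations, not statements from the slides.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


MC_FRACTION = 0.157  # assumed runtime share of motion compensation

# Upper bound: the offloaded block takes no time at all.
UPPER_BOUND = 1.0 / (1.0 - MC_FRACTION)

print(f"upper bound: {UPPER_BOUND:.3f}x")
print(f"10x faster block: {amdahl_speedup(MC_FRACTION, 10.0):.3f}x")
```

Under these assumptions the bound is roughly 1.19x, so a 15% overall improvement is consistent with accelerating a block of that size.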

37 MPEG-4 Simple Profile Decoder: Alternate Implementations. [Table: throughput (frames per second) and improvement (%) for single uBlaze, 7-uBlaze, single PowerPC, and single PowerPC w/ HW Motion Comp.] xpilot synthesis report of HW blocks. [Table: C line count, RTL SystemC and RTL VHDL line counts, slices (FFs, LUTs), MUL, clock period (ns), and latency (cycles); (FFs, LUTs) entries: Motion Comp (1111, 1017), Block IDCT (2376, 2438), Texture Update (1696, 1931).]

38 Conclusions. xpilot has fairly mature and advanced behavior synthesis capability, from C or SystemC to RTL code with the necessary design constraints. xpilot advantages include: platform-based behavior and system synthesis; communication/interconnect-centric approach; advanced algorithms for platform-based, communication-centric optimization; promising results demonstrated on available FPGAs. xpilot system synthesis capabilities: performance simulation of multi-processor systems; exploration of the efficient use of (multiple) on-chip processors; compilation and optimization for reconfigurable processors.

39 Acknowledgements. We would like to thank the following for their support: Gigascale Systems Research Center (GSRC); National Science Foundation (NSF); Semiconductor Research Corporation (SRC); industrial sponsors under the California MICRO program (Altera, Xilinx). Team members: Yiping Fan, Guoling Han, Wei Jiang, Zhiru Zhang.
