Pilot: A Platform-based HW/SW Synthesis System


SOC Group, VLSI CAD Lab, UCLA
Led by Jason Cong
Zhong Chen, Yiping Fan, Xun Yang, Zhiru Zhang
ICSOC Workshop, Beijing, August 20, 2002

Outline
- Overview
- The Platform Concept
- Pilot Design Flow
- System Data Model (SDM)
  - FunState MOC
- Work Accomplished
  - Example: Jpeg Encoder
- Ongoing Research
  - Architectural Synthesis with Multi-cycle Interconnect Communication
- Future Work

Overview
- Pilot: Platform-based HW/SW Synthesis
  - Start from a system-level design description
  - Target highly programmable FPSoC platforms
  - Automate the process as much as possible
- System Data Model (SDM)
  - Model of Computation (MOC)
    - Incorporates the FunState MOC
    - System-level synthesis algorithms
  - Internal Representation
    - Covers the whole life cycle of the flow
    - SDM-API supports interoperability of synthesis tools

The Platform Concept
- A platform is a coordinated family of hardware-software architectures that satisfies a set of architectural constraints, imposed to allow the reuse of hardware and software components.
- Design regularity and pre-assembly of critical components and interconnections provide the necessary manufacturability, yield, and predictability.
- Application-specific customization with various regularized components.
[Figure: platform block diagram with SIP, Analog, PLL, CPU, ASIC, µP, Cache, MEMORY, and FPGA blocks]
Source: Gigascale Silicon Research Center (GSRC)

Our Candidate Platform: Excalibur Field Programmable Platform
- Candidate platform: Excalibur FPSoC
  - PLD: APEX EP20K200E (8320 LEs)
  - Processor: Nios, 16-bit or 32-bit configurable
  - Memory: on-chip, 106,496 bits
  - I/O: customizable, on-chip peripherals
- Up to 150K gates available for customization
- Pre-assembly of critical components plus programmable logic enables designers to quickly customize for different applications

Pilot Design Flow
[Figure: design flow. Design Spec. in SpecC and Altera's Platform Info. feed the System Data Model. Simulation and Synthesis (Estimation, Partitioning, Scheduling, Interface Synthesis, HW synthesis, SW synthesis) operate on the SDM. SW Code Gen produces C Code for the Target SW; HW Code Gen produces VHDL for the Target PLD.]
Tools Developed:
- Converter: translates SpecC to SDM
- Simulator: validates the design in SDM; simulates the design at different levels of abstraction
- SW code generator: generates C source code from SDM for the target platform
- HW code generator: generates VHDL source code from SDM for the target platform
- Profiler: generates a profile based on the generated SW/HW system

System Data Model (SDM)
- Core MOC: FunState (Function Driven by State Machine)
  - Capable of representing several well-known computing paradigms (CDFG, SDF, CFSM, Petri Nets, SPI, etc.)
- Supplementary Information
  - Abstract Syntax Tree (AST)
  - Platform Specification
- Capable of representing heterogeneous embedded systems
  - Separates communication from computation explicitly
  - Handles the concurrency in the system
[Figure: SDM structure. FunState core; language-specific info (AST); Platform Spec. (component library, interconnect topology)]

FunState MOC: Definition
Definition: The basic FunState component consists of a network $N$ and a finite state machine $M$. The network $N = (F, S, E)$ itself contains a set of storage units $s \in S$, a set of functions $f \in F$, and a set of directed edges $e \in E$, where $E \subseteq (F \times S) \cup (S \times F)$.
Reference: Karsten Strehl et al., "FunState: An Internal Design Representation for Codesign," IEEE Transactions on VLSI Systems, Vol. 9, No. 4, Aug. 2001.
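The network part of the definition can be written down as a small data structure. The following is a minimal sketch, not the SDM implementation: the class name `Network` and the method `is_well_formed` are hypothetical, and the node names are borrowed from the filter example in this deck. It only checks the edge constraint, that every edge joins a function and a storage unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """Hypothetical container for a FunState network N = (F, S, E)."""
    functions: frozenset  # F
    storage: frozenset    # S
    edges: frozenset      # E, a set of (source, target) pairs

    def is_well_formed(self) -> bool:
        """Check that E is a subset of (F x S) union (S x F)."""
        for src, dst in self.edges:
            f_to_s = src in self.functions and dst in self.storage
            s_to_f = src in self.storage and dst in self.functions
            if not (f_to_s or s_to_f):
                return False
        return True


FUNCTIONS = frozenset({"Producer", "Controller", "Filter", "Consumer"})
STORAGE = frozenset({"in", "coef", "out"})

# Edges as read from the filter example: functions write to and read from queues.
filter_net = Network(
    functions=FUNCTIONS,
    storage=STORAGE,
    edges=frozenset({
        ("Producer", "in"), ("Controller", "coef"),
        ("in", "Filter"), ("coef", "Filter"),
        ("Filter", "out"), ("out", "Consumer"),
    }),
)

# A function-to-function edge violates the constraint.
bad_net = Network(
    functions=FUNCTIONS,
    storage=STORAGE,
    edges=frozenset({("Producer", "Filter")}),
)
```

Because edges may only connect a function to a storage unit or the reverse, all communication is forced through explicit storage, which is how the model keeps communication separate from computation.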

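FunState MOC: Filter Example
[Figure: Producer writes pixels to queue "in"; Controller writes to queue "coef"; Filter reads "in" and "coef" and writes pixels to queue "out"; Consumer reads "out".]

    input byte in, coef;
    output byte out;
    int line, pix;
    byte k;
    int buffer[];
    forever {
        /* update the coefficient only when a new one is present */
        if (present(coef, 1))
            k = read(coef, 1);
        /* read a block of 64 pixels, scale it, write it out */
        buffer = read(in, 64);
        for (pix = 0; pix < 64; pix++)
            buffer[pix] = buffer[pix] * k;
        write(out, buffer, 64);
    }

Network: Producer produces 64 tokens on "in"; Controller produces 1 token on "coef"; Filter consumes 64 tokens from "in" and 1 from "coef" and produces 64 tokens on "out"; Consumer consumes 64 tokens from "out".
State machine transitions (condition / action):
- in# ≥ 64, coef# ≥ 1 / Filter
- out# ≥ 64 / Consumer
- / Producer, Controller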
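The token-driven behaviour of the filter example can be tried out in a few lines. This is a minimal sketch under stated assumptions, not the Pilot simulator: the class `FilterComponent`, its `step` method, and the initial coefficient of 1 are all hypothetical. Following the `present` test in the slide's code, Filter is treated as enabled once 64 pixels are queued, and it picks up a new coefficient only if one is waiting.

```python
from collections import deque


class FilterComponent:
    """Hypothetical token-level model of the Producer/Filter/Consumer network."""
    BLOCK = 64  # tokens per firing, as in the slide

    def __init__(self):
        self.q = {"in": deque(), "coef": deque(), "out": deque()}
        self.k = 1  # assumed initial coefficient (not given in the source)

    def producer(self, pixels):
        self.q["in"].extend(pixels)

    def controller(self, coef):
        self.q["coef"].append(coef)

    def filter(self):
        # Take a new coefficient only when one is present.
        if self.q["coef"]:
            self.k = self.q["coef"].popleft()
        block = [self.q["in"].popleft() for _ in range(self.BLOCK)]
        self.q["out"].extend(p * self.k for p in block)

    def consumer(self):
        return [self.q["out"].popleft() for _ in range(self.BLOCK)]

    def step(self):
        """Fire one enabled transition; return a consumed block or None."""
        if len(self.q["out"]) >= self.BLOCK:
            return self.consumer()
        if len(self.q["in"]) >= self.BLOCK:
            self.filter()
        return None


comp = FilterComponent()
comp.producer(range(64))   # Producer emits one block of pixels
comp.controller(3)         # Controller emits one coefficient
first = comp.step()        # fires Filter; nothing has been consumed yet
second = comp.step()       # fires Consumer; returns the scaled block
third = comp.step()        # no transition is enabled any more
```

The state machine only inspects queue fill levels, never the data itself, which is the sense in which functions are "driven by" the state machine.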

Work Accomplished: Jpeg Encoder
- Jpeg Encoder: an example to validate the design flow
- JPEG: a standard for image compression
- DCT: Discrete Cosine Transform (ChenDCT)
- Four modes of operation in the JPEG standard:
  - Sequential DCT-based mode
  - Progressive DCT-based mode
  - Lossless mode
  - Hierarchical mode
[Figure: encoder pipeline. BMP Image File → Image Fragmentation → DCT → Quantization → Entropy Coding → JPG Image File]

Jpeg Example: HW/SW Partitioning
- HW/SW Partitioning: implement the most computation-intensive part in hardware

Table: Run-time profiling of Jpeg program

Module Name       PC (PIII 650MHz)                       NIOS (SW)
HandleData        391259.70/s   2.56 µs     1.72%        21422.59/s   46.68 µs     0.72%
DCT               8659.61/s     115.48 µs   77.47%       194.82/s     5132.94 µs   79.18%
Quantization      138533.91/s   7.22 µs     4.84%        3229.26/s    309.67 µs    4.78%
HuffmanEncode     42010.25/s    23.8 µs     15.97%       1006.88/s    993.17 µs    15.32%
Total (times/s)   31.62                                  0.75
Speedup           42.16                                  1

[Figure: Jpeg representation in SDM. SW side: Input, ReceiveData, JpegEncodeStripe, HuffmanEncode, Output; HW side: DCT; the two sides are connected through Send/Recv channels.]
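The percentage column and the speedup row of the profiling table can be re-derived from the per-call times and the total rates. The sketch below uses only numbers from the table; the helper name `shares` is hypothetical. It shows why DCT is the natural candidate for hardware: it accounts for more than three quarters of the run time on both machines.

```python
# Per-call times in microseconds, copied from the profiling table.
pc_us = {"HandleData": 2.56, "DCT": 115.48,
         "Quantization": 7.22, "HuffmanEncode": 23.8}
nios_us = {"HandleData": 46.68, "DCT": 5132.94,
           "Quantization": 309.67, "HuffmanEncode": 993.17}


def shares(times_us):
    """Return each module's share of the summed time, in percent."""
    total = sum(times_us.values())
    return {name: 100.0 * t / total for name, t in times_us.items()}


pc_share = shares(pc_us)
nios_share = shares(nios_us)

# Overall speedup of the PC over the Nios software version,
# from the "Total (times/s)" row: 31.62 versus 0.75 runs per second.
pc_over_nios = 31.62 / 0.75

hottest_on_nios = max(nios_share, key=nios_share.get)
```

Here `hottest_on_nios` evaluates to `"DCT"`, and the computed shares agree with the table to two decimal places.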

Jpeg Example: Experiment Framework
1. Download the design through the parallel cable to the APEX configuration controller
2. Generate the program enclosed with BMP image data
3. Download the program and data through the serial cable
4. Run the program on the APEX device containing our design
5. Return the resulting JPEG image data through the serial cable
Board components:
- APEX configuration controller: contains the device programming data
- SRAM: contains the program and BMP data for running
- Parallel port: for downloading the design to the APEX configuration controller
- Serial port: for communication between the PC and the Nios board (downloading program and data, returning results)
- APEX device: a programmable device containing the Excalibur platform
Test image: 116x96x8 in .bmp format (12214 bytes), returned in .jpg format (1704 bytes)

Jpeg Example: Experimental Results
Run-time result of Jpeg example (time in 10^-6 s; values in parentheses are invocations per second)

Module Name      NIOS(SW)             NIOS(SW+HW1)         NIOS(SW+HW2)         NIOS(SW+HW3)
                 time       rate(%)   time       rate(%)   time       rate(%)   time       rate(%)
HandleData       50.31      1.22%     50.31      1.92%     50.31      1.84%     50.31      4.59%
                 (19878.67)           (19878.67)           (19878.67)           (19878.67)
DCT              3160.56    76.46%    1641.04    62.78%    1756.67    64.35%    123.51     11.26%
                 (316.4)              (609.37)             (569.26)             (8096.46)
Quantization     176.42     4.27%     176.42     6.75%     176.42     6.46%     176.42     16.09%
                 (5668.41)            (5668.41)            (5668.41)            (5668.41)
HuffmanEncode    746.29     18.05%    746.29     28.55%    746.29     27.34%    746.29     68.06%
                 (1339.96)            (1339.96)            (1339.96)            (1339.96)
Total            4133.57    100.00%   2614.05    100.00%   2729.68    100.00%   1096.52    100.00%

HW1: Half DCT implementation with message passing communication
HW2: Full DCT implementation with buffering communication
HW3: Full DCT implementation with shared memory communication
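The table reports times but not speedups, so the short calculation below derives them from the Total and DCT rows. It uses only figures from the table; the variable names are hypothetical.

```python
# Totals and DCT times in microseconds, copied from the results table.
total_us = {"SW": 4133.57, "SW+HW1": 2614.05,
            "SW+HW2": 2729.68, "SW+HW3": 1096.52}
dct_us = {"SW": 3160.56, "SW+HW1": 1641.04,
          "SW+HW2": 1756.67, "SW+HW3": 123.51}

# Overall speedup of each HW-assisted version over the software-only run.
overall_speedup = {cfg: total_us["SW"] / t
                   for cfg, t in total_us.items() if cfg != "SW"}

# Speedup of the DCT module alone.
dct_speedup = {cfg: dct_us["SW"] / t
               for cfg, t in dct_us.items() if cfg != "SW"}

for cfg in overall_speedup:
    print(f"{cfg}: overall {overall_speedup[cfg]:.2f}x, "
          f"DCT {dct_speedup[cfg]:.2f}x")
```

The overall speedups come out at roughly 1.58x, 1.51x, and 3.77x for HW1, HW2, and HW3. With shared memory communication (HW3) the DCT itself runs about 25.6 times faster, but the overall gain is bounded by the three modules left in software, and HuffmanEncode becomes the dominant cost at 68.06%.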

Ongoing Research: Architectural Synthesis with Multi-cycle Interconnect Communication
- Need for multi-cycle interconnect communication
  - Dominant role of interconnect delay in deep sub-micron (DSM) process technology
- Proposed solutions:
  - Regular Distributed Register (RDR) architecture
  - Incorporate layout information to better guide scheduling and binding
  - Perform simultaneous scheduling (binding) with placement

Motivation: How Far Can We Go in Each Clock Cycle
- NTRS'97 0.07 µm technology
- 5 GHz across-chip clock
- Die area: 620 mm² (24.9 mm x 24.9 mm)
- IPEM BIWS estimations
  - Buffer size: 100x
  - Driver/receiver size: 100x
- From corner to corner: 7 clock cycles
[Figure: die map with regions reachable in 1 to 7 clock cycles; distance axis marked at 0, 7.52, 15.04, 22.56, and 24.9 mm]
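The 7-cycle figure can be reproduced with a short calculation. This sketch rests on two assumptions that the slide does not state explicitly: that the axis spacing of 7.52 mm is the distance a signal covers in one clock cycle, and that corner-to-corner distance is measured along Manhattan (horizontal plus vertical) routes. The function name `cycles_to_cross` is hypothetical.

```python
import math

REACH_MM_PER_CYCLE = 7.52   # assumed: spacing of the axis marks in the figure
DIE_SIDE_MM = 24.9          # from the slide: 24.9 mm x 24.9 mm die


def cycles_to_cross(distance_mm, reach_mm=REACH_MM_PER_CYCLE):
    """Clock cycles needed to cover a wire length, rounded up to whole cycles."""
    return math.ceil(distance_mm / reach_mm)


# Along one edge of the die.
edge_cycles = cycles_to_cross(DIE_SIDE_MM)

# Corner to corner, assuming Manhattan routing (one full width plus one full height).
corner_cycles = cycles_to_cross(2 * DIE_SIDE_MM)
```

Under these assumptions one die edge takes 4 cycles and the corner-to-corner route takes 7, which matches the slide.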

Regular Distributed Register Architecture
[Figure: a regular grid of islands of width $W_i$ and height $H_i$ connected by global interconnect. Each island holds a Function Unit Cluster (FUC, for example DIV, MUX, ADD; a cluster with area constraint) and a register file with banks for 1-cycle, 2-cycle, ..., k-cycle communication.]
$D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{opt-int}}(2W_i + 2H_i) \le T$
- Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
- Highly regular

Example: Impact of Interconnect on Scheduling
- Data flow graph extracted from the discrete cosine transform (DCT)
- The delay of a * operation is 2 ns; the delay of + and - operations is 1 ns
- The resources available are 2 multipliers and 2 ALUs
- Nodes with the same color are assigned to the same functional unit
[Figure: data flow graph with nodes -1, +2, *3, *4, -5, -6, *7, *8, -9, -10, *11, *12, and the wirelength-driven placement of the functional units. One edge style represents long interconnect delay (2 ns), the other short interconnect delay (1 ns).]
Binding:
- Mul1: 4, 8, 11
- Mul2: 3, 7, 12
- Alu1: 1, 5, 10
- Alu2: 2, 6, 9
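The binding on this slide can be checked mechanically. The sketch below is illustrative only: the operation type of each node is read off the node labels in the figure, the helper names `unit_supports`, `check_binding`, and `busy_ns` are hypothetical, and the data dependences of the graph are not modelled because they are not recoverable from the transcript.

```python
# Operation type of each DFG node, as labelled in the figure.
OP_TYPE = {1: "-", 2: "+", 3: "*", 4: "*", 5: "-", 6: "-",
           7: "*", 8: "*", 9: "-", 10: "-", 11: "*", 12: "*"}

# Operation delays in nanoseconds, from the slide.
DELAY_NS = {"*": 2, "+": 1, "-": 1}

# Functional-unit binding, from the slide.
BINDING = {"Mul1": [4, 8, 11], "Mul2": [3, 7, 12],
           "Alu1": [1, 5, 10], "Alu2": [2, 6, 9]}


def unit_supports(unit, op):
    """Multipliers execute '*'; ALUs execute '+' and '-'."""
    return op == "*" if unit.startswith("Mul") else op in {"+", "-"}


def check_binding(binding):
    """Every node is bound exactly once, to a unit that can execute it."""
    bound = [n for nodes in binding.values() for n in nodes]
    if sorted(bound) != sorted(OP_TYPE):
        return False
    return all(unit_supports(unit, OP_TYPE[n])
               for unit, nodes in binding.items() for n in nodes)


def busy_ns(binding):
    """Total computation time assigned to each unit, ignoring interconnect."""
    return {unit: sum(DELAY_NS[OP_TYPE[n]] for n in nodes)
            for unit, nodes in binding.items()}


binding_ok = check_binding(BINDING)
load = busy_ns(BINDING)
```

Each multiplier carries 6 ns of computation and each ALU 3 ns, so the binding is balanced. The difference between the schedules that follow therefore comes from interconnect delay rather than from the binding.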

Single-cycle vs. Multi-cycle Interconnect Communication
[Figure: two cycle-by-cycle schedules of the 12-node DFG, one for single-cycle and one for multi-cycle interconnect communication; a marker on the edges represents registers.]
Single-cycle interconnect communication:
- Scheduled in 6 clock cycles
- Clock period is 4 ns
- Total latency is 24 ns
Multi-cycle interconnect communication:
- Scheduled in 9 clock cycles
- Clock period is 2 ns
- Total latency is 18 ns

Enhancement: Simultaneous Placement and Scheduling for Performance Optimization
[Figure: schedule of the 12-node DFG over Cycle1 to Cycle8 and the corresponding placement of Mul1 (4, 8, 11), Mul2 (3, 7, 12), Alu1 (1, 5, 10), and Alu2 (2, 6, 9)]
Simultaneous Placement and Scheduling:
- With placement integrated with scheduling, the critical path is reduced
- The DFG can be scheduled in 8 clock cycles, with a clock period of 2 ns
- The total latency is 16 ns
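The three latency figures quoted for this example all follow from latency = number of cycles × clock period. The sketch below tabulates them; the dictionary keys are hypothetical labels and the numbers are taken from the slides.

```python
# (clock cycles, clock period in ns) for each approach, from the slides.
schedules = {
    "single-cycle": (6, 4),
    "multi-cycle": (9, 2),
    "simultaneous placement and scheduling": (8, 2),
}

# Total latency in nanoseconds.
latency_ns = {name: cycles * period
              for name, (cycles, period) in schedules.items()}

best = min(latency_ns, key=latency_ns.get)
```

Allowing multi-cycle communication adds cycles but halves the clock period, since the clock no longer has to cover the longest wire. Integrating placement with scheduling then removes one of the extra cycles.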

Experimental Results
Input DFG:
- DFG: DCT Loop1
- Nodes #: 35
- Op Types #: 3 (+ - *)
Binding Result:
- Resources and bit widths: ALU (24 bits), Multiplier (24 bits), Mem (64*24 bits), Register (24 bits)
- Usage: 7, 19
Scheduling Result:
- Clock Period: 17.905 ns
- Latency: 23 cycles
[Figure: Final Layout]

Future Work
- System-level Synthesis
  - System-level scheduling
  - Hardware/Software partitioning
  - Performance estimation
- Communication Synthesis
  - Protocol selection (generation)
- Software Synthesis
  - Code optimization under resource constraints