Pilot: A Platform-based HW/SW Synthesis System

SOC Group, VLSI CAD Lab, UCLA
Led by Jason Cong
Zhong Chen, Yiping Fan, Xun Yang, Zhiru Zhang
ICSOC Workshop, Beijing, August 20, 2002

Outline
  Overview
    The Platform Concept
    Pilot Design Flow
  System Data Model (SDM)
    FunState MOC
  Work Accomplished
    Example: Jpeg Encoder
  Ongoing Research
    Architectural Synthesis with Multi-cycle Interconnect Communication
  Future Work

Overview
  Pilot: Platform-based HW/SW Synthesis
    Start from a system-level design description
    Target highly programmable FPSoC platforms
    Automate the process as much as possible
  System Data Model (SDM)
    Model of Computation (MOC)
      Incorporates the FunState MOC
      System-level synthesis algorithms
    Internal Representation
      Covers the whole life cycle of the flow
      SDM-API supports interoperability of synthesis tools

The Platform Concept
  A platform is a coordinated family of hardware-software architectures that satisfies a set of architectural constraints imposed to allow the reuse of hardware and software components.
  Design regularity and pre-assembly of critical components and interconnections provide the necessary manufacturability, yield, and predictability.
  Application-specific customization with various regularized components
  [Figure: platform built from SIP, analog, PLL, CPU, ASIC, µP, cache, memory, and FPGA blocks]
  Source: Gigascale Silicon Research Center (GSRC)

Our Candidate Platform: Excalibur Field Programmable Platform
  Candidate platform: Excalibur FPSoC
    PLD: APEX EP20K200E (8320 LEs)
    Processor: Nios, configurable as 16-bit or 32-bit
    Memory: 106,496 bits on-chip
    I/O: customizable on-chip peripherals
    Up to 150K gates available for customization
  Pre-assembly of critical components plus programmable logic enables designers to quickly customize for different applications

Pilot Design Flow
  [Figure: design flow from the design spec. in SpecC through the System Data Model (with Altera's platform info.) and the simulation, synthesis, estimation, partitioning, scheduling, interface synthesis, HW synthesis, and SW synthesis steps to SW code gen (C code, target SW) and HW code gen (VHDL, target PLD)]
  Tools developed:
    Converter: translates SpecC to SDM
    Simulator: validates the design in SDM and simulates it at different levels of abstraction
    SW code generator: generates C source code from SDM for the target platform
    HW code generator: generates VHDL source code from SDM for the target platform
    Profiler: generates a profile based on the generated SW/HW system

System Data Model (SDM)
  Core MOC: FunState (Function Driven by State Machine)
    Capable of representing several well-known computing paradigms (CDFG, SDF, CFSM, Petri nets, SPI, etc.)
  Supplementary information
    Abstract Syntax Tree (AST)
    Platform specification
  Capable of representing heterogeneous embedded systems
    Separates communication from computation explicitly
    Handles the concurrency in the system
  [Figure: SDM contents. FunState; language-specific info. (AST); platform spec. (component library, interconnect topology)]

FunState MOC: Definition
  Definition: The basic FunState component consists of a network $N$ and a finite state machine $M$. The network $N = (F, S, E)$ contains a set of storage units $s \in S$, a set of functions $f \in F$, and a set of directed edges $e \in E$, where $E \subseteq (F \times S) \cup (S \times F)$.
  Reference: Karsten Strehl et al., "FunState: An Internal Design Representation for Codesign," IEEE Transactions on VLSI Systems, vol. 9, no. 4, Aug. 2001.
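
The edge constraint in the definition can be made concrete with a small data structure. The following is a minimal sketch, not the SDM implementation: the class name `FunStateNetwork` and its `add_edge` method are hypothetical, and the instance borrows the function and queue names from the filter example slide.

```python
from dataclasses import dataclass, field


@dataclass
class FunStateNetwork:
    """Network N = (F, S, E) of one FunState component (illustrative only)."""
    functions: set                              # F
    storages: set                               # S
    edges: set = field(default_factory=set)     # E, stored as (source, target) pairs

    def add_edge(self, src, dst):
        # An edge must join a function to a storage unit or a storage unit
        # to a function, i.e. E is a subset of (F x S) u (S x F).
        f_to_s = src in self.functions and dst in self.storages
        s_to_f = src in self.storages and dst in self.functions
        if not (f_to_s or s_to_f):
            raise ValueError(f"illegal edge {src} -> {dst}")
        self.edges.add((src, dst))


# Hypothetical instance using the names from the filter example.
net = FunStateNetwork(
    functions={"Producer", "Controller", "Filter", "Consumer"},
    storages={"in", "coef", "out"},
)
for src, dst in [("Producer", "in"), ("Controller", "coef"),
                 ("in", "Filter"), ("coef", "Filter"),
                 ("Filter", "out"), ("out", "Consumer")]:
    net.add_edge(src, dst)
```

A direct function-to-function edge such as `("Producer", "Filter")` is rejected, which is how the model keeps communication (storage units) separate from computation (functions).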

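FunState MOC: Filter Example
  [Figure: Producer writes pixels to queue in, Filter reads in and writes pixels to queue out, Consumer reads out; Controller supplies the coefficient through queue coef]
  Filter behavior (pseudocode):
    input byte in, coef;
    output byte out;
    int line, pix;
    byte k;
    int buffer[64];
    forever {
      if (present(coef, 1))
        k = read(coef, 1);              /* take a new coefficient when one is available */
      buffer = read(in, 64);            /* consume one block of 64 pixels */
      for (pix = 0; pix < 64; pix++)
        buffer[pix] = buffer[pix] * k;  /* scale every pixel */
      write(out, buffer, 64);           /* produce the scaled block */
    }
  FunState model:
    Network: Producer writes 64 tokens to in; Controller writes 1 token to coef; Filter reads 64 tokens from in and 1 token from coef and writes 64 tokens to out; Consumer reads 64 tokens from out
    State machine transitions (condition / action):
      in# >= 64, coef# >= 1 / Filter
      out# >= 64 / Consumer
      (no condition) / Producer, Controller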
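
The interplay between the queues and the state machine can be sketched as a small simulation. This is an illustrative reading of the slide, not the Pilot simulator: the function names, the initial coefficient `k = 1`, and the `run_fsm` loop are assumptions, and the coefficient guard is folded into the filter function so that it mirrors the `present(coef, 1)` test of the pseudocode.

```python
from collections import deque

BLOCK = 64  # tokens moved per firing on the pixel queues

queues = {"in": deque(), "coef": deque(), "out": deque()}
k = 1  # hypothetical initial coefficient; the slide leaves it unspecified


def producer(pixels):
    """Write a block of pixels to queue 'in'."""
    queues["in"].extend(pixels)


def controller(value):
    """Write one coefficient token to queue 'coef'."""
    queues["coef"].append(value)


def filter_block():
    """Scale one block of pixels by the current coefficient."""
    global k
    if queues["coef"]:                      # present(coef, 1)
        k = queues["coef"].popleft()        # k = read(coef, 1)
    block = [queues["in"].popleft() for _ in range(BLOCK)]
    queues["out"].extend(p * k for p in block)


def consumer():
    """Read one block of pixels from queue 'out'."""
    return [queues["out"].popleft() for _ in range(BLOCK)]


def run_fsm():
    """Fire enabled transitions until none remains; return the consumed blocks."""
    consumed = []
    progress = True
    while progress:
        progress = False
        if len(queues["in"]) >= BLOCK:      # predicate in# >= 64
            filter_block()
            progress = True
        if len(queues["out"]) >= BLOCK:     # predicate out# >= 64
            consumed.append(consumer())
            progress = True
    return consumed


controller(2)
producer(range(BLOCK))
result = run_fsm()
```

The functions only move data; the decision of when each function may run lives entirely in the predicates of `run_fsm`, which is the "functions driven by a state machine" split that the model is named after.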

Work Accomplished: Jpeg Encoder
  Jpeg Encoder: an example to validate the design flow
    JPEG: a standard for image compression
    DCT: Discrete Cosine Transform (ChenDCT)
    Four modes of operation in the JPEG standard:
      Sequential DCT-based mode
      Progressive DCT-based mode
      Lossless mode
      Hierarchical mode
  [Figure: encoder pipeline. BMP image file, image fragmentation, DCT, quantization, entropy coding, JPG image file]
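
To show what the DCT and quantization stages compute on one 8x8 block, here is a minimal sketch. It uses the direct (naive) 2-D DCT-II definition rather than the fast ChenDCT the encoder uses, and the uniform quantization step `q` is an assumption made for brevity; the JPEG standard uses a per-coefficient quantization table.

```python
import math

N = 8  # JPEG operates on 8x8 blocks


def dct_2d(block):
    """Direct 2-D DCT-II of an 8x8 block (O(N^4); not the fast Chen algorithm)."""
    def c(u):
        return math.sqrt(0.5) if u == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def quantize(coeffs, q):
    """Uniform quantization with a single step q (assumption, see lead-in)."""
    return [[round(coeffs[u][v] / q) for v in range(N)] for u in range(N)]


# A flat block has energy only in the DC coefficient.
flat = [[16] * N for _ in range(N)]
coeffs = dct_2d(flat)
levels = quantize(coeffs, 16)
```

The four nested loops per block make clear why the DCT is a natural candidate for hardware implementation: the arithmetic is regular, data-independent, and dominated by multiply-accumulate operations.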

Jpeg Example: HW/SW Partitioning
  HW/SW partitioning: implement the most computation-intensive part in hardware
  [Figure: Jpeg representation in SDM, with the DCT on the HW side and the remaining modules on the SW side, connected through Send/Recv channels]
  Table: Run-time profiling of Jpeg program (executions per second, time per execution, share of total time)
    Module Name      PC (PIII 650MHz)                  NIOS (SW)
                     rate (/s)    time (µs)  share     rate (/s)   time (µs)  share
    HandleData       391259.70    2.56       1.72%     21422.59    46.68      0.72%
    DCT              8659.61      115.48     77.47%    194.82      5132.94    79.18%
    Quantization     138533.91    7.22       4.84%     3229.26     309.67     4.78%
    HuffmanEncode    42010.25     23.8       15.97%    1006.88     993.17     15.32%
    Total (times/s)  31.62                             0.75
    Speedup          42.16                             1
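
The partitioning decision follows directly from the profile. The sketch below recomputes the time shares from the per-execution times in the table, picks the dominant module, and adds an Amdahl-style upper bound on the speedup obtainable by accelerating that module alone. The bound is an addition for illustration and assumes the other modules are unaffected; the variable names are hypothetical.

```python
# Per-execution times in microseconds, copied from the profiling table.
nios_us = {"HandleData": 46.68, "DCT": 5132.94,
           "Quantization": 309.67, "HuffmanEncode": 993.17}
pc_us = {"HandleData": 2.56, "DCT": 115.48,
         "Quantization": 7.22, "HuffmanEncode": 23.8}


def shares(times):
    """Percentage of the total time spent in each module."""
    total = sum(times.values())
    return {name: 100.0 * t / total for name, t in times.items()}


nios_share = shares(nios_us)
pc_share = shares(pc_us)

# The module with the largest share is the hardware candidate.
hotspot = max(nios_us, key=nios_us.get)

# Upper bound on overall speedup if only the hotspot became free
# (assumes all other modules keep their software run time).
total_nios = sum(nios_us.values())
max_speedup = total_nios / (total_nios - nios_us[hotspot])
```

On the Nios numbers the DCT accounts for about 79% of the run time, so even an infinitely fast DCT would cap the overall speedup at roughly 4.8x under these assumptions.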

Jpeg Example: Experiment Framework
  Steps:
    1. Download the design through the parallel cable to the APEX configuration controller
    2. Generate the program enclosed with BMP image data
    3. Download the program and data through the serial cable
    4. Run the program on the APEX device containing our design
    5. Return the resulting JPEG image data through the serial cable
  Board components:
    APEX configuration controller: contains the device programming data
    SRAM: contains the program and BMP data for running
    APEX device: a programmable device containing the Excalibur platform
    Parallel port: downloads the design to the APEX configuration controller
    Serial port: communication between the PC and the Nios board (downloading program and data, returning results)
  Test image: 116x96x8 in .bmp format (12214 bytes); result in .jpg format (1704 bytes)
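
The quoted file sizes can be sanity-checked with a few lines of arithmetic. The sketch assumes an uncompressed 8-bit BMP with the standard 14-byte file header, a 40-byte info header, and a full 256-entry palette; the slide does not state the header layout, so these are assumptions that happen to reproduce the 12214-byte figure.

```python
width, height, bits_per_pixel = 116, 96, 8

# BMP rows are padded to a multiple of 4 bytes.
row_bytes = (width * bits_per_pixel + 31) // 32 * 4
pixel_bytes = row_bytes * height

file_header = 14          # assumed: standard BMP file header
info_header = 40          # assumed: standard 40-byte info header
palette = 256 * 4         # assumed: full 256-entry palette, 4 bytes per entry

bmp_size = file_header + info_header + palette + pixel_bytes

jpg_size = 1704           # from the slide
compression_ratio = bmp_size / jpg_size
```

Under these assumptions the encoder shrinks the test image by a factor of about 7.2.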

Jpeg Example: Experimental Results
  Run-time result of Jpeg example (time per execution in µs, share of total time in %, executions per second in parentheses)
    Module Name    NIOS(SW)              NIOS(SW+HW1)          NIOS(SW+HW2)          NIOS(SW+HW3)
                   time (µs)   rate(%)   time (µs)   rate(%)   time (µs)   rate(%)   time (µs)   rate(%)
    HandleData     50.31       1.22%     50.31       1.92%     50.31       1.84%     50.31       4.59%
                   (19878.67)            (19878.67)            (19878.67)            (19878.67)
    DCT            3160.56     76.46%    1641.04     62.78%    1756.67     64.35%    123.51      11.26%
                   (316.4)               (609.37)              (569.26)              (8096.46)
    Quantization   176.42      4.27%     176.42      6.75%     176.42      6.46%     176.42      16.09%
                   (5668.41)             (5668.41)             (5668.41)             (5668.41)
    HuffmanEnco    746.29      18.05%    746.29      28.55%    746.29      27.34%    746.29      68.06%
                   (1339.96)             (1339.96)             (1339.96)             (1339.96)
    Total          4133.57     100.00%   2614.05     100.00%   2729.68     100.00%   1096.52     100.00%
  HW1: half DCT implementation with message-passing communication
  HW2: full DCT implementation with buffering communication
  HW3: full DCT implementation with shared-memory communication
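
The table reports absolute times; the speedups they imply are easy to derive. The following sketch computes the overall speedup of each configuration over the software-only run and the speedup of the DCT module alone. The dictionary names are hypothetical; all numbers are taken from the table.

```python
# Total time per block in microseconds, from the "Total" row.
total_us = {"SW": 4133.57, "SW+HW1": 2614.05,
            "SW+HW2": 2729.68, "SW+HW3": 1096.52}

# DCT time per block in microseconds, from the "DCT" row.
dct_us = {"SW": 3160.56, "SW+HW1": 1641.04,
          "SW+HW2": 1756.67, "SW+HW3": 123.51}

# Speedup of each configuration relative to the software-only run.
overall_speedup = {cfg: total_us["SW"] / t for cfg, t in total_us.items()}
dct_speedup = {cfg: dct_us["SW"] / t for cfg, t in dct_us.items()}

best = max(overall_speedup, key=overall_speedup.get)
```

HW2 implements the full DCT yet is slightly slower than the half-DCT HW1, while HW3 with the same full DCT is far faster, so on these numbers the communication scheme matters more than how much of the DCT is moved to hardware.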

Ongoing Research: Architectural Synthesis with Multi-cycle Interconnect Communication
  Need for multi-cycle interconnect communication
    Dominant role of interconnect delay in deep sub-micron (DSM) process technology
  Proposed solutions:
    Regular Distributed Register (RDR) architecture
    Incorporate layout information to better guide scheduling and binding
    Perform simultaneous scheduling (binding) with placement

Motivation: How Far Can We Go in Each Clock Cycle
  NTRS'97 0.07 µm technology
  5 GHz across-chip clock
  620 mm² die (24.9 mm x 24.9 mm)
  IPEM BIWS estimations
    Buffer size: 100x
    Driver/receiver size: 100x
  From corner to corner: 7 clock cycles
  [Figure: die map showing the regions reachable in 1 to 7 clock cycles; axis ticks at 0, 7.52, 15.04, 22.56, and 24.9 mm]
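
The 7-cycle figure can be reproduced with simple arithmetic. The sketch assumes that a signal covers 7.52 mm per clock cycle (a value read off the axis ticks of the figure, not stated as such on the slide) and that wires follow Manhattan routes, so a corner-to-corner path is twice the die edge.

```python
import math

clock_ghz = 5.0
clock_period_ns = 1.0 / clock_ghz        # 0.2 ns per cycle at 5 GHz

die_edge_mm = 24.9
reach_mm_per_cycle = 7.52                # assumption: inferred from the axis ticks

# Manhattan distance between opposite corners of the die.
corner_to_corner_mm = 2 * die_edge_mm

cycles = math.ceil(corner_to_corner_mm / reach_mm_per_cycle)
crossing_time_ns = cycles * clock_period_ns
```

Under these assumptions a corner-to-corner transfer occupies 7 cycles, which is why a synthesis flow that budgets a single cycle for every wire cannot keep the clock at this frequency.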

Regular Distributed Register Architecture
  [Figure: grid of islands of width $W_i$ and height $H_i$; each island holds a Function Unit Cluster (FUC, e.g. DIV, MUX, ADD) and a register file with banks for 1-cycle, 2-cycle, ..., k-cycle communication; islands are connected by global interconnect]
  Cluster with area constraint
  $D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{opt-int}}(2W_i + 2H_i) \le T$
  Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication
  Highly regular

Example: Impact of Interconnect on Scheduling
  Data flow graph (DFG) extracted from the discrete cosine transform (DCT)
  The delay of a * operation is 2 ns; the delay of a + or - operation is 1 ns.
  The resources available are 2 multipliers and 2 ALUs.
  Nodes with the same color are assigned to the same functional unit.
  DFG nodes: 1 (-), 2 (+), 3 (*), 4 (*), 5 (-), 6 (-), 7 (*), 8 (*), 9 (-), 10 (-), 11 (*), 12 (*)
  Binding:
    Mul1: 4, 8, 11
    Mul2: 3, 7, 12
    Alu1: 1, 5, 10
    Alu2: 2, 6, 9
  Long interconnect delay is 2 ns; short interconnect delay is 1 ns.
  [Figure: DFG and wirelength-driven placement of the functional units, with long and short interconnects marked]

Single-cycle vs. Multi-cycle Interconnect Communication
  [Figure: the two schedules of the DFG side by side, cycle by cycle, with registers marked]
  Single-cycle interconnect communication
    Scheduled in 6 clock cycles
    Clock period is 4 ns
    Total latency is 24 ns
  Multi-cycle interconnect communication
    Scheduled in 9 clock cycles
    Clock period is 2 ns
    Total latency is 18 ns

Enhancement: Simultaneous Placement and Scheduling for Performance Optimization
  [Figure: schedule over Cycle 1 to Cycle 8 together with the placement of Mul1 (4, 8, 11), Mul2 (3, 7, 12), Alu1 (1, 5, 10), and Alu2 (2, 6, 9)]
  With placement integrated with scheduling, the critical path is reduced.
  The DFG can be scheduled in 8 clock cycles with a clock period of 2 ns.
  The total latency is 16 ns.
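
The three schedules of the DCT example can be compared with a short calculation. The clock-period rule below is one plausible reading of how the 4 ns and 2 ns periods arise, not a statement of the authors' method: with single-cycle communication the clock must cover the slowest operation plus the longest wire, whereas with multi-cycle communication a wire gets its own cycle, so the clock only has to cover the slower of the two. Function and variable names are hypothetical.

```python
op_delay_ns = {"*": 2, "+": 1, "-": 1}   # from the example slide
long_wire_ns = 2                         # long interconnect delay


def clock_period_ns(multi_cycle):
    """Clock period under the assumed rule described in the lead-in."""
    slowest_op = max(op_delay_ns.values())
    if multi_cycle:
        return max(slowest_op, long_wire_ns)
    return slowest_op + long_wire_ns


# (number of clock cycles, clock period in ns) for each approach.
schedules = {
    "single-cycle": (6, clock_period_ns(multi_cycle=False)),
    "multi-cycle": (9, clock_period_ns(multi_cycle=True)),
    "simultaneous placement and scheduling": (8, clock_period_ns(multi_cycle=True)),
}

latency_ns = {name: cycles * period for name, (cycles, period) in schedules.items()}
fastest = min(latency_ns, key=latency_ns.get)
```

Multi-cycle communication needs more cycles but halves the clock period, and integrating placement removes one further cycle, giving 24 ns, 18 ns, and 16 ns respectively.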

Experimental Results
  Input DFG: DCT Loop1, 35 nodes, 3 op types (+, -, *)
  Binding result (resource: bit width): ALU: 24 bits; Multiplier: 24 bits; Mem: 64*24 bits; Register: 24 bits; usage values shown: 7, 19
  Scheduling result: clock period 17.905 ns, latency 23 cycles
  [Figure: final layout]

Future Work
  System-level synthesis
    System-level scheduling
    Hardware/software partitioning
    Performance estimation
  Communication synthesis
    Protocol selection (generation)
  Software synthesis
    Code optimization under resource constraints