Folding. Lan-Da Van ( 范倫達 ), Ph. D. Department of Computer Science National Chiao Tung University Taiwan, R.O.C. Fall,


Folding. Lan-Da Van (范倫達), Ph.D., Department of Computer Science, National Chiao Tung University, Taiwan, R.O.C. Fall 2010. ldvan@cs.nctu.edu.tw http://www.cs.nctu.tw/~ldvan/

Outline
- Introduction
- Folding Transformation
- Register Minimization Techniques
- Register Minimization in Folded Architecture
- Conclusions
VLSI-DSP-6-2

Introduction (1/2)
- The folding transformation systematically determines the control circuits in DSP architectures in which multiple algorithm operations are time-multiplexed onto a single functional unit.
- It is used to synthesize DSP architectures that operate with a single clock or with multiple clocks.
- It reduces the number of hardware functional units (FUs) by a factor of N at the expense of increasing the computation time by a factor of N.
- Folding can lead to an architecture that uses a large number of registers, which motivates the register minimization techniques presented here.

Introduction (2/2)


Folding Transformation (1/3)
- A systematic technique for designing control circuits for hardware in which several algorithm operations are time-multiplexed on a single functional unit.
- Notation
  - $U$, $V$: nodes (operations) of the original DFG
  - $H_U$, $H_V$: nodes (functional units) of the folded DFG
  - $W(x)$: $x$-th iteration of node $W$
  - $U \xrightarrow{e} V$: an edge $e$ from node $U$ to node $V$
  - $w(e)$: number of delays on edge $e$
- Folding factor $N$: the number of operations that share one FU
- Folding set: an ordered set of operations that are executed by the same FU
  - The position of an operation $U$ in its folding set is the folding order of $U$.
  - Folding sets are typically obtained from a scheduling and allocation algorithm (ref. Appendix B).
  - The folding sets define the underlying folding transformation.

Folding Transformation (2/3)
- $P_U$: number of pipeline stages of $H_U$; $P_U = 0$ indicates that $H_U$ is not pipelined.
- $D_F(U \xrightarrow{e} V)$ (folding equation): number of cycles for which the result of $H_U$ must be stored,
  $$D_F(U \xrightarrow{e} V) = [N(l + w(e)) + v] - [Nl + P_U + u] = N w(e) - P_U + v - u$$
  where $u$ and $v$ are the folding orders of $U$ and $V$.
- A negative value of $D_F$ is possible before the DFG is retimed for folding.

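Folding Transformation (3/3)
[Figure: an edge $U \xrightarrow{e} V$ with $w(e)$ delays, before and after folding by $N$. In the original DFG, iteration $l$ of $U$ feeds iteration $l + w(e)$ of $V$. In the folded DFG, $H_U$ executes at time $Nl + u$, its result passes through $P_U$ pipeline stages and $D_F$ storage delays, and $H_V$ consumes it at time $N(l + w(e)) + v$.]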
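The timing in this figure can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `folded_times` and the sample values (N = 4, w(e) = 1, P_U = 1, u = 3, v = 1) are illustrative assumptions. It shows that the storage time between H_U producing a result and H_V consuming it is the same for every iteration l and equals the folding equation.

```python
def folded_times(n, w, p_u, u, v, iteration):
    """Return (cycle the result of H_U is available, cycle H_V consumes it)."""
    available = n * iteration + u + p_u   # N*l + u, plus P_U pipeline stages
    consumed = n * (iteration + w) + v    # N*(l + w(e)) + v
    return available, consumed


# Illustrative values only (assumed for this sketch).
N, W, P_U, U_ORDER, V_ORDER = 4, 1, 1, 3, 1

# Storage time for the first few iterations l = 0..4.
storage = []
for l in range(5):
    available, consumed = folded_times(N, W, P_U, U_ORDER, V_ORDER, l)
    storage.append(consumed - available)

# Closed form of the folding equation: D_F = N*w(e) - P_U + v - u.
closed_form = N * W - P_U + V_ORDER - U_ORDER

print("storage per iteration:", storage)
print("folding equation D_F :", closed_form)
```

Because the term Nl cancels, a single set of D_F registers per edge serves all iterations.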

Folding Retimed Biquad Filter (1/2)
- Folding factor $N = 4$
- Folding sets $S_1 = \{4, 2, 3, 1\}$ and $S_2 = \{5, 8, 6, 7\}$, where $S_1$ contains all addition operations and $S_2$ contains all multiplication operations.
- Addition and multiplication are assumed to require 1 and 2 u.t. (units of time), respectively.
- 1-stage pipelined adders and 2-stage pipelined multipliers are available.

Folding Retimed Biquad Filter (2/2) folding equations
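As a rough illustration of how these folding equations are evaluated, the sketch below derives the folding orders from the folding sets and applies D_F = N w(e) - P_U + v - u to each edge. The edge delays in `retimed_edges` are an assumption: they are read off the w(e) terms of the folding equations in the Biquad Filter Example slides of this deck, since the DFG figure itself is not preserved in the transcript. All identifiers are hypothetical.

```python
N = 4  # folding factor

# Folding sets from the slide; the list index is the folding order.
folding_sets = {
    "adder": [4, 2, 3, 1],       # S1
    "multiplier": [5, 8, 6, 7],  # S2
}
pipeline_stages = {"adder": 1, "multiplier": 2}

# Per-node folding order (u or v) and pipeline depth (P_U).
order = {}
stages = {}
for unit, operations in folding_sets.items():
    for position, node in enumerate(operations):
        order[node] = position
        stages[node] = pipeline_stages[unit]

# Assumed edge delays w(e) of the retimed biquad DFG: (source, destination) -> w.
retimed_edges = {
    (1, 2): 1, (1, 5): 1, (1, 6): 1, (1, 7): 1, (1, 8): 2,
    (3, 1): 0, (4, 2): 0, (5, 3): 0, (6, 4): 1, (7, 3): 1, (8, 4): 1,
}


def folding_delay(n, w, p_u, u, v):
    """Folding equation D_F = N*w(e) - P_U + v - u."""
    return n * w - p_u + v - u


folded = {
    (src, dst): folding_delay(N, w, stages[src], order[src], order[dst])
    for (src, dst), w in retimed_edges.items()
}

for (src, dst), delay in folded.items():
    print(f"D_F({src}->{dst}) = {delay}")

# A folding is valid only if no edge needs a negative number of delays.
print("valid folding:", all(delay >= 0 for delay in folded.values()))
```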

Retiming (1/3)
- What happens if a folding equation $D_F$ is negative? A negative number of delays cannot be implemented, so the folding is invalid.
- Remedy: retime the original DFG (move its delay elements) prior to folding.
- Constraint on the retimed DFG: $D_F'(U \xrightarrow{e} V) = N w_r(e) - P_U + v - u \ge 0$ (1)
- Substituting $w_r(e) = w(e) + r(V) - r(U)$ into (1) gives $r(U) - r(V) \le D_F(U \xrightarrow{e} V)/N$.
- Since the retiming values of the nodes are restricted to integers, this can be rewritten as $r(U) - r(V) \le \lfloor D_F(U \xrightarrow{e} V)/N \rfloor$.

Retiming (2/3)
- Example: $D_F(1 \to 2) = N w(e) - P_U + v - u = 4(0) - 1 + 1 - 3 = -3$
- $r(1) - r(2) \le \lfloor D_F(1 \to 2)/N \rfloor = \lfloor -3/4 \rfloor = -1$

Retiming (3/3)
- $r(1) = -1$, $r(2) = 0$, $r(3) = -1$, $r(4) = 0$
- $r(5) = -1$, $r(6) = -1$, $r(7) = -2$, $r(8) = -1$
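The retiming values above can be checked against the integer constraint r(U) - r(V) <= floor(D_F / N). The sketch below does this for the four edges that the Biquad Filter Example slide lists as invalid before retiming. It is only a check, not a solver for the retiming values, and the helper names are hypothetical. It also uses the identity D_F' = D_F + N (r(V) - r(U)), which follows from w_r(e) = w(e) + r(V) - r(U).

```python
N = 4  # folding factor

# Folding equations that are negative before retiming (from the biquad example).
negative_delays = {(1, 2): -3, (6, 4): -4, (8, 4): -3, (7, 3): -3}

# Retiming values from the slide.
r = {1: -1, 2: 0, 3: -1, 4: 0, 5: -1, 6: -1, 7: -2, 8: -1}


def retiming_bound(d_f, n):
    """Upper bound on r(U) - r(V); Python's // floors toward minus infinity."""
    return d_f // n


def retimed_delay(d_f, n, r_u, r_v):
    """Folding equation after retiming: D_F' = D_F + N*(r(V) - r(U))."""
    return d_f + n * (r_v - r_u)


results = {}
for (u, v), d_f in negative_delays.items():
    bound = retiming_bound(d_f, N)
    satisfied = (r[u] - r[v]) <= bound
    results[(u, v)] = (bound, satisfied, retimed_delay(d_f, N, r[u], r[v]))

for (u, v), (bound, satisfied, new_delay) in results.items():
    print(f"edge {u}->{v}: r({u})-r({v}) <= {bound}: {satisfied}, D_F' = {new_delay}")
```

For these four edges the retimed folding equations come out as 1, 0, 1 and 1, so none of them is negative after retiming.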


Lifetime Analysis
- Lifetime analysis is a procedure used to compute the minimum number of registers required to implement a DSP algorithm in hardware.
  - Linear lifetime analysis
  - Circular lifetime analysis
- In lifetime analysis, the number of live variables at each time unit is computed; the maximum over all time units is the minimum number of registers.
- Forward-backward register allocation technique

Linear Lifetime Analysis
- Variables: {a, b, c}
- Live variables per time unit: max{0, 1, 2, 2, 2, 2, 2, 2} = 2
- Periodicity is implicit in the linear chart.
- [Figure: three iterations with N = 6]

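Matrix Transpose Example (1/3)
$$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \xrightarrow{\text{Transpose}} \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}$$
[Figure: the input stream i h g f e d c b a enters the Matrix Transposer, which produces the output stream i f c h e b g d a; a is the first sample in both streams.]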

Matrix Transpose Example (2/3)
- $T_{zlout}$: zero-latency output time
- $T_{diff} = T_{zlout} - T_{input}$
- $T_{output} = T_{zlout} + \max\{-T_{diff}\}$

Matrix Transpose Example (3/3)
[Figures: linear lifetime chart and circular lifetime chart]
The minimum number of registers is 4.
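The lifetime table and the register count for the 3 x 3 transposer can be reproduced with a short script. This is a sketch under stated assumptions rather than the procedure from the slides verbatim: one sample arrives per cycle starting at cycle 0, the period is N = 9, and a variable is counted as live from cycle T_input + 1 through cycle T_output. The function names are hypothetical.

```python
def lifetime_table(input_order, output_order):
    """Map each variable to (T_input, T_output) using the slide's formulas."""
    t_input = {var: t for t, var in enumerate(input_order)}
    t_zlout = {var: t for t, var in enumerate(output_order)}
    # T_diff = T_zlout - T_input; the latency is max{-T_diff}.
    t_diff = {var: t_zlout[var] - t_input[var] for var in input_order}
    latency = max(-diff for diff in t_diff.values())
    return {var: (t_input[var], t_zlout[var] + latency) for var in input_order}


def min_registers(lifetimes, period):
    """Circular lifetime analysis: count live variables modulo the period."""
    live = [0] * period
    for start, end in lifetimes.values():
        # Assumed convention: live during cycles start+1 .. end inclusive.
        for cycle in range(start + 1, end + 1):
            live[cycle % period] += 1
    return max(live)


inputs = list("abcdefghi")    # row-major arrival order, a first
outputs = list("adgbehcfi")   # transposed departure order, a first

table = lifetime_table(inputs, outputs)
for var, (t_in, t_out) in table.items():
    print(f"{var}: {t_in} -> {t_out}")

registers = min_registers(table, period=len(inputs))
print("minimum number of registers:", registers)
```

Under these assumptions the count is 4, which agrees with the number stated on the slide.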

Procedures of Forward-Backward Register Allocation
Steps:
- Step 1: Determine the minimum number of registers using lifetime analysis.
- Step 2: Input each variable at the time step corresponding to the beginning of its lifetime.
- Step 3: Allocate each variable in a forward manner until it is dead or it reaches the last register.
- Step 4: Since the allocation is periodic, the allocation of the current iteration repeats itself in subsequent iterations. The register positions occupied in the current iteration are therefore hashed out (marked as occupied) at multiples of the period N.
- Step 5: If a variable reaches the last register and is still alive, allocate it to a register in a backward manner.
- Step 6: Repeat Steps 4 and 5 as required until the allocation is complete.

Register Allocation for Matrix Transpose Example


Procedures of Register Minimization in Folded Architectures
Steps:
- Step 1: Perform retiming for folding.
- Step 2: Write the folding equations.
- Step 3: Use the folding equations to construct a lifetime table.
- Step 4: Draw the lifetime chart and determine the required number of registers.
- Step 5: Perform forward-backward register allocation.
- Step 6: Draw the folded architecture that uses the minimum number of registers.

Folding Architecture Example

Folded Architecture for Matrix Transpose Example

Biquad Filter Example (1/4)
Step 1: Retiming
Folding the DFG before retiming is invalid because these folding equations are negative:
- $D_F(1 \to 2) = -3$
- $D_F(6 \to 4) = -4$
- $D_F(8 \to 4) = -3$
- $D_F(7 \to 3) = -3$

Biquad Filter Example (2/4)
Step 2: Folding equations, $D_F(U \to V) = N w(e) - P_U + v - u$
- $D_F(1 \to 2) = 4(1) - 1 + 1 - 3 = 1$
- $D_F(1 \to 5) = 4(1) - 1 + 0 - 3 = 0$
- $D_F(1 \to 6) = 4(1) - 1 + 2 - 3 = 2$
- $D_F(1 \to 7) = 4(1) - 1 + 3 - 3 = 3$
- $D_F(1 \to 8) = 4(2) - 1 + 1 - 3 = 5$
- $D_F(3 \to 1) = 4(0) - 1 + 3 - 2 = 0$
- $D_F(4 \to 2) = 4(0) - 1 + 1 - 0 = 0$
- $D_F(5 \to 3) = 4(0) - 2 + 2 - 0 = 0$
- $D_F(6 \to 4) = 4(1) - 2 + 0 - 2 = 0$
- $D_F(7 \to 3) = 4(1) - 2 + 2 - 3 = 1$
- $D_F(8 \to 4) = 4(1) - 2 + 0 - 1 = 1$
Step 3: Construct the lifetime table
- $T_{input} = u + P_U$
- $T_{output} = u + P_U + \max_V \{D_F(U \to V)\}$
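Steps 3 and 4 can be sketched in a few lines for this example. The script below builds the lifetime table from T_input = u + P_U and T_output = T_input + max D_F, then counts live variables modulo N = 4. Assumptions: a variable is live from cycle T_input + 1 through T_output; node 2 has no outgoing edge in the list and so does not appear in the table; and D_F(6 -> 4) is taken as 0, which is what 4(1) - 2 + 0 - 2 evaluates to. All identifiers are hypothetical.

```python
N = 4  # folding factor

# Folding order u (position in S1 = {4,2,3,1}, S2 = {5,8,6,7}) and pipeline depth P_U.
order = {4: 0, 2: 1, 3: 2, 1: 3, 5: 0, 8: 1, 6: 2, 7: 3}
stages = {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2}

# Folding equations of the retimed biquad: (source, destination) -> D_F.
folding_delays = {
    (1, 2): 1, (1, 5): 0, (1, 6): 2, (1, 7): 3, (1, 8): 5,
    (3, 1): 0, (4, 2): 0, (5, 3): 0, (6, 4): 0, (7, 3): 1, (8, 4): 1,
}

# Step 3: lifetime table, node -> (T_input, T_output).
lifetimes = {}
for (src, _dst), delay in folding_delays.items():
    start = order[src] + stages[src]
    previous_end = lifetimes.get(src, (start, start))[1]
    lifetimes[src] = (start, max(previous_end, start + delay))

# Step 4: circular lifetime analysis with period N.
live = [0] * N
for start, end in lifetimes.values():
    for cycle in range(start + 1, end + 1):
        live[cycle % N] += 1

min_regs = max(live)

for node in sorted(lifetimes):
    print(f"node {node}: {lifetimes[node][0]} -> {lifetimes[node][1]}")
print("live variables per cycle (mod N):", live)
print("minimum number of registers:", min_regs)
```

With these inputs the count is 2, which agrees with the register count stated for the biquad example.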

Biquad Filter Example (3/4)
Step 4: Draw the lifetime chart
Step 5: Register allocation
Folding factor = 4. The minimum number of registers is 2.

Biquad Filter Example (4/4) Step 6: Folded Architecture

IIR Filter Example (1/4)
Step 1: Retiming
Folding the DFG before retiming is invalid because these folding equations are negative:
- $D_F(3 \to 1) = -3$
- $D_F(4 \to 1) = -2$

IIR Filter Example (2/4)
Step 2: Folding equations, $D_F(U \to V) = N w(e) - P_U + v - u$
- $D_F(1 \to 2) = 0$
- $D_F(2 \to 3) = 5$
- $D_F(2 \to 4) = 2$
- $D_F(3 \to 1) = 1$
- $D_F(4 \to 1) = 0$
Step 3: Construct the lifetime table
- $T_{input} = u + P_U$
- $T_{output} = u + P_U + \max_V \{D_F(U \to V)\}$

IIR Filter Example (3/4)
Step 4: Draw the lifetime chart
Step 5: Register allocation
Folding factor = 2. The minimum number of registers is 3.

IIR Filter Example (4/4) Step 6: Folded Architecture

Conclusions
- Presented a systematic transformation for time-multiplexed architectures.
- Explored folding techniques to reduce the number of functional units.
- Explored register minimization techniques to reduce the number of registers.

References
- K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, Wiley, 1999.
- S. Y. Huang, Handout of text book, 2004.