VLSI Signal Processing

Similar documents
An Efficient VLIW DSP Architecture for Baseband Processing

General Purpose Signal Processors

DSP Platforms Lab (AD-SHARC) Session 05

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

Vertex Shader Design I

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

Chapter 4. The Processor

The ARM10 Family of Advanced Microprocessor Cores

Divide: Paper & Pencil

EITF20: Computer Architecture Part2.2.1: Pipeline-1

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

EITF20: Computer Architecture Part2.2.1: Pipeline-1

Lecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S

SH4 RISC Microprocessor for Multimedia

Multiple Instruction Issue. Superscalars

DSP VLSI Design. Instruction Set. Byungin Moon. Yonsei University

Design of Transport Triggered Architecture Processor for Discrete Cosine Transform

High performance, power-efficient DSPs based on the TI C64x

05 - Microarchitecture, RF and ALU

Pipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1

EITF20: Computer Architecture Part2.2.1: Pipeline-1

PIPELINING AND VECTOR PROCESSING

TSEA 26 exam page 1 of Examination. Design of Embedded DSP Processors, TSEA26 Date 8-12, G34, G32, FOI hus G

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Better sharc data such as vliw format, number of kind of functional units

02 - Numerical Representation and Introduction to Junior

1. Micro Architecture and Finite Length. Olle Seger Andreas Ehliar Dake Liu, Rizwan Azhgar

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

COMPUTER ORGANIZATION AND DESI

REAL TIME DIGITAL SIGNAL PROCESSING

LOW-COST SIMD. Considerations For Selecting a DSP Processor Why Buy The ADSP-21161?

Vector IRAM: A Microprocessor Architecture for Media Processing

SH-5: A First 64-bit SuperH Core with Multimedia Extension. Fumio Arakawa Hitachi, Ltd.

Pipelining and Vector Processing

Embedded Systems Ch 15 ARM Organization and Implementation

COMPUTER ARCHITECTURE AND ORGANIZATION. Operation Add Magnitudes Subtract Magnitudes (+A) + ( B) + (A B) (B A) + (A B)

Number Systems and Computer Arithmetic

Reconfigurable Cell Array for DSP Applications

CSE 141 Computer Architecture Summer Session Lecture 3 ALU Part 2 Single Cycle CPU Part 1. Pramod V. Argade

Nam Sung Kim. w/ Syed Zohaib Gilani * and Michael J. Schulte * University of Wisconsin-Madison Advanced Micro Devices *

Math 230 Assembly Programming (AKA Computer Organization) Spring MIPS Intro

COMPUTER ORGANIZATION AND DESIGN

Chapter 4. The Processor

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

One instruction specifies multiple operations All scheduling of execution units is static

Chapter 13 Reduced Instruction Set Computers

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

M.Tech. credit seminar report, Electronic Systems Group, EE Dept, IIT Bombay, Submitted: November Evolution of DSPs

DUE to the high computational complexity and real-time

Chapter 5 : Computer Arithmetic

The Processor: Instruction-Level Parallelism

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

Flexible wireless communication architectures

Chapter 3: Arithmetic for Computers

EC 513 Computer Architecture

Vertex Shader Design II

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Implementation of DSP Algorithms

Multithreaded Processors. Department of Electrical Engineering Stanford University

Computer Systems Architecture Spring 2016

MIPS Integer ALU Requirements

Digital Signal Processor Core Technology

Dynamic Control Hazard Avoidance

Instruction Set Principles and Examples. Appendix B

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

Contents. Chapter 9 Datapaths Page 1 of 28

General Purpose Processors

CS222: Processor Design

omputer Design Concept adao Nakamura

Processors for Embedded Digital Signal Processing

Single-Cycle Examples, Multi-Cycle Introduction

CS 101: Computer Programming and Utilization

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

PART A (22 Marks) 2. a) Briefly write about r's complement and (r-1)'s complement. [8] b) Explain any two ways of adding decimal numbers.

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

ARM Processors for Embedded Applications

CS146 Computer Architecture. Fall Midterm Exam

Orange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining

Chapter 10 - Computer Arithmetic

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

ISA and RISCV. CASS 2018 Lavanya Ramapantulu

Computer Organisation CS303

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

PowerPC 740 and 750

UC Berkeley CS61C : Machine Structures

Course Description: This course includes concepts of instruction set architecture,

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Head, Dept of Electronics & Communication National Institute of Technology Karnataka, Surathkal, India

,1752'8&7,21. Figure 1-0. Table 1-0. Listing 1-0.

Advanced processor designs

(+A) + ( B) + (A B) (B A) + (A B) ( A) + (+ B) (A B) + (B A) + (A B) (+ A) (+ B) + (A - B) (B A) + (A B) ( A) ( B) (A B) + (B A) + (A B)

SAE5C Computer Organization and Architecture. Unit : I - V

Transcription:

VLSI Signal Processing Programmable DSP Architectures Chih-Wei Liu VLSI Signal Processing Lab Department of Electronics Engineering National Chiao Tung University

Outline DSP Arithmetic Stream Interface Unit & A Simple DSP Core Register Organization for DSP

DSP Arithmetic Dynamic range Precision Floating-point (FP) arithmetic Fractional (1.x) operations with automatic rounding Automatic radix-point tracking in exponent with hardware to use the full precision of the mantissa Unnecessary dynamic range for most DSP applications with poor speed, silicon area, & power consumption Integer arithmetic Programmer must estimate the dynamic range to prevent overflow Tradeoffs between quality (precision) & speed (for explicit exponent tracking & data rounding)

Integer Arithmetic Assume a 256-level grayscale image is going to be filtered with a 7-tap FIR with symmetric coefficients [-0.0645, -0.0407, 0.4181, 0.7885] For 24-bit integer units, 8 bits reserved for inputs & 3 bits for 7-element accumulations Thus, the coefficients has at most 13-bit precision The fractional coefficients are multiplied by 2 13 (8192) to integer numbers [-528-333 3,425 6,459] After the inner product (i.e. multiply-accumulate), the result is divided by 8192 (arithmetically right shifted by 13 bits) to normalize the results (with radix point identical to inputs) If overflow occurs, the result is saturated (optional) Note: It is possible to use the 24 bits more efficiently for higher precision with more complicated data manipulations

Static FP Arithmetic Effective compromise between FP and integer units Fractional operations with automatic rounding as FPU Shrunk (1-bit) aligners & normalizers from FPU Static exponent tracking (radix point) as integer units Almost FP quality (precision) with much simpler hardware (thus speed, area, & power) as the integer units SFP arithmetic utilizes the bits more efficiently for precision No exponent (more free bits for precision) Automatic rounding with fractional multiplications & more frequent normalization than integer units to reduce leading sign-bits

SFPU for Linear Transforms For N-bit data samples (N+1)-bit adder/subtractor with input aligners & output normalizer (in our embodiment, all are 1-bit right shifters) N-bit fractional multiplier with 1-bit output normalizer (left shifter) >>1 >>1 (N+1)-bit >>1 sign bit <<1 radix point fraction N-bit fractional N-bit barrel shifter with arithmetic right shifts FP arithmetic Floating-point adder & multiplier only Integer unit Integer adder, multiplier & barrel shifter >> N

PEV Analysis Static simulation of FP arithmetic Each edge in the FP SDFG is associated with a peak estimation vector (PEV) [M r] M denotes the (worst-case) maximum magnitude r denotes the radix point (similar to the exponent in FP) PEV calculation rules Keep M should be in the range between 1 and 0.5 by carrying out M is divided (or multiplied) by 2 and r minus (or plus) 1 simultaneously r should be identical before addition or subtraction [M 1 r 1 ] [M 2 r 2 ] = [M 1 M 2 r 1 +r 2 ]

Example Align the radix point (r) before summation Keep the maximum magnitude (M) in the range between 0.5 & 1 to prevent overflow & maximize precision [1 0] [0.5-1] [1-1] [1.5-1] [0.75-2] [0.8 0] [0.6-1] [0.48-1] [0.96 0]

Over Estimation & Affine Arithmetic x y z min: -1 max: 1 min: -1 max: 1 min: -1 max: 1 x - y min: -2 max: 2 min: -2 max: 2 y + z min: -4 max: 4 over-estimated x + z min: -2 max: 2 Represent the intermediates in the affine form to record the respective contribution of each independent variable [1 0] [1 0] [1 0] Ignoring the signal correlation may over-estimate the worst-case magnitude [1 0 0] [0 1 0] [0 0 1] affine form [x y z] [1-1 0] [0 1 1] [1-1] [1-1] [1 0 1] [1-1] PEV [M r]

Results 2-D DCT Kernel Single-precision FP 16-bit fixed-point (integer) 32-bit fixed-point (integer) 16-bit SFP 24-bit SFP Cycle Count 672 848 672 720* 720* PSNR (db) - 33.22 36.10 38.12** 62.14 * 1,120 cycles without embedded 1-bit aligners & normalizers ** 36.58dB without affine arithmetic The first three SDFG are derived from the C source codes from IJG The two SFP SDFG are derived from the single-precision FP SDFG (through PEV analysis)

Outline DSP Arithmetic Stream Interface Unit & A Simple DSP Core Register Organization for DSP

Review Folding Transform Procedure Operation scheduling Delay calculation IN REG SIU REG OUT D F e ( U V ) = N w( e) P + v u U ADD MUL Dataflow optimization REG REG N = 4 W, w = 0 V, v = 2 W iteration 0 iteration 1 iteration 2 D F = 0 V W V W V U, u = 1 w(e)=1d U D F = 3 U U 0 1 2 3 4 5 6 7 8 9 time

Stream Interface Unit (SIU) Data generation by stream interface unit (SIU) Routing (interconnection) Buffering (storage) i.e. functional units perform data manipulations only Basic SIU architectures Queue length: L SIU-based DSP Engine Stream Inteface Unit (SIU) I/O N (L+1) Queue length: L M-Bus N functional units N functional units A-Bus IN N Reg OUT CLK

Memory-based SIU Static queue Using pointer instead of data movement Queue length: L N N N N Reg Reg Reg Reg pointer log L MUX Multi-port memory Output-queue: N read; 1 write Input-queue: 1 read; N write

DSP-lite Core source image 1 2 3 DSP-lite Core ping pong image block # 3 image block # 2 I DCT coef. DCT coef block # 1 DCT coef block # 2 O 1 Main Memory I/O Buffer Embedded DMA Controller 4-by-4 switch 2R/1W memory 1R/1W memory coefficient memory AMBA AHB Microinstruction Memory 1R/1W memory > SIU-based DSP Datapath

Memory Remapping 0 1 2 3 4 5 6 7 1 2 3 4 5 6 7 0 Physical 1 2 6 7 0 1 2 6 7 0 1 2 6 7 0 3 3 4 5 3 3 4 5 3 4 5 3 1 2 6 7 0 4 5 3 Iteration # 0 1 2 3 4 Virtual Addresses

Software Generation Floating-Point (FP) SDFG FP to SFP Converter SFPU Configuration Static FP (SFP) SDFG ILP-based Scheduler Datapath Configuration Scheduled Fixed-point Scheduled SFP SDFG SDFG (for debug) Micro-code Generator Memory Configuration Micro-code

Scheduling ASAP, ALAP, and scheduling ranges time x 0 x 1 x 2 x 3 x 4 x 5 x 6 0 I 0 I 1 1 A 0 A 1 I 0 I 1 2 O 0 M 0 I 0 I 1 A 0 A 1 M0 3 O 1 A 1 O 0 4 A 0 M 0 O1 5 O 0 O 1 (a) (b) (c)

ILP-based Scheduler The Boolean variable x i.j indicates if the vertex i is scheduled at time j x 0 x 1 x 2 x 3 x 5 x 4 x 6 x 0 x 1 x 2 x 3 x 4 x 5 x 6 I 0 I 1 A 1 A 0 M0 O 0 O1 time 0 1 2 3 4 5 Resource constraints (operations cannot exceed the resources) x 0.0 + x 1.0 1; x 0.1 + x 1.1 1; x 0.2 + x 1.2 1 (for input) x 2.1 + x 3.1 1; x 2.2 + x 3.2 1; x 2.3 + x 3.3 1 (for adder) x 5.3 + x 6.3 1; x 5.4 + x 6.4 1; x 5.5 + x 6.5 1 (for output) Allocation constraints (each node executes only once) x 0.0 + x 0.1 + x 0.2 = 1 x 1.0 + x 1.1 + x 1.2 = 1 x 2.1 + x 2.2 + x 2.3 + x 2.4 = 1 x 3.1 + x 3.2 + x 3.3 = 1 x 4.2 + x 4.3 + x 4.4 = 1 x 5.2 + x 5.3 + x 5.4 + x 5.5 = 1 x 6.3 + x 6.4 + x 6.5 = 1 Dependency constraints (for each edge) x 0.0 + 2 x 0.1 + 3 x 0.2-2 x 2.1-3 x 2.2-4 x 2.3-5 x 2.4-1 x 1.0 + 2 x 1.1 + 3 x 1.2-2 x 2.1-3 x 2.2-4 x 2.3-5 x 2.4-1 x 0.0 + 2 x 0.1 + 3 x 0.2-2 x 3.1-3 x 3.2-4 x 3.3-1 x 1.0 + 2 x 1.1 + 3 x 1.2-2 x 3.1-3 x 3.2-4 x 3.3-1 2 x 2.1 +3 x 2.2 +4 x 2.3 +5 x 2.4-3 x 5.2-4 x 5.3-5 x 5.4-6 x 5.5-1 2 x 3.1 +3 x 3.2 +4 x 3.3-3 x 4.2-4 x 4.3-5 x 4.4-1 3 x 4.2 +4 x 4.3 +5 x 4.4-4 x 6.3-5 x 6.4-6 x 6.5-1

Port Constraints To prevent memory conflict in the 1R/1W SIU memory in the DSP-lite core For adder memory, I 0 & I 1 can not be scheduled at the same time x 0.0 + x 1.0 1, x 0.1 + x 1.1 1 x 0.2 + x 1.2 1 x 0 x 1 x 2 x 3 For output memory, A 0 & M 0 cannot be schedule at the same time x 2.2 + x 4.2 1, x 2.3 + x 4.3 1 x 2.4 + x 4.4 1 x 5 x 4 x 6 x 0 x 1 x 2 x 3 x 4 x 5 x 6 I 0 I 1 A 1 A 0 M0 O 0 O1

Performance Evaluation DSP-lite I/O registered to allow full-cycle routing in SIU Latency: 2 cycles for registered I/O ADSP-218x Single-cycle multiplier, adder, and barrel shifter with distributed register files # of cycles 3rd-order Lattice Filter 2nd-order Biquad Filter 16-point complex FFT 8-point 1-D DCT 8x8 2-D DCT ADSP-218x (80MHz) 32 13 874 154 2,452 DSP-lite (100MHz) 13 14 268 47 720

Silicon Implementation TSMC 0.35um 1P4M CMOS Technology Area: 2.8mm 2 Freq: 100MHz Power: 122mW Microinstruction Memory I/O Buffer DMA controller AMBA Slave Static Floating-Point (SFP) Units SIU Memory Multi-Processor Interface

Outline DSP Arithmetic Stream Interface Unit & A Simple DSP Core Register Organization for DSP

Centralized Register File (RF) N functional units () including load/store units (L/S) & arithmetic units (AU) P access ports (~3N) n registers (αn) CRF Hardware complexity Area: n P 2 Delay: n 1/2 P word lines Power: n P 2 bit lines

Banking Mapping of logical to physical ports P-port register file can be constructed with p-port banks (smaller SRAM blocks) p = P centralized RF p < P and B banks; arbitrary port mapping requires a P-to-Bp crossbar Examples RF organized as even/odd banks to support multi-word data operands p = P and non-accessed banks are powered off to save energy p < P and port conflicts are resolved either in HW (stall) or SW (carefully operation scheduling & register allocation) Bank 0 Bank 1 Bank 2 Bank 3

Clustering Explore the spatial locality of computations Each cluster has its RF with minimal inter-cluster communication RF0 RF1 L/S L/S L/S Hierarchical RF L/S units and AU have distinct RFs The RF for L/S can be regarded as an additional memory hierarchy Distributed RF (DSP-lite) RF1 RF0 AU AU AU

Inter-Cluster Communication (1/2) Copy operations STM & HP: Lx BOPS: ManArray (dedicated instruction slot) RF0 RF1 For hierarchical RF, copy operations can be viewed as special load/store instructions Extended accesses TI C 6x (extended reads) RF0 RF1 C C For hierarchical RF, extended accesses can be viewed as has limited operands from memory (similar to CISC) RF0 RF1

Inter-Cluster Communication (2/2) Shared storage Sun MAJC (with replicates for separate reads) For hierarchical & clustered RF, the cache RF may work as the shared storage Register permutation Special case of shared storage Each cluster access an exclusive bank Shared RF RF0 RF1 L/S L/S RF0 RF1 RF RF0 RFn For hierarchical RF with permutation functions AU AU AU AU as a ping-pong buffer

Ring-structure RF (1/2) Instruction Dispatcher 24 24 24 24 Control / LS LS ALU / MAC ALU / Enhanced MAC local registers R0~R7 (private, 32-bit) R0~R7 (private, 32-bit) R0~R7 (private, 40-bit) R0~R7 (private, 40-bit) ring registers R8~R15 (ring, 32-bit) R8~R15 (ring, 32-bit) R8~R15 (ring, 32-bit) R8~R15 (ring, 32-bit) Ring-structure Register File 8 RF (each with 8 elements) are implemented, 4 local (private) for each functional unit 32-bit address registers & GPR in control & LS units 40-bit accumulators in ALU/MAC units 4 shared are concatenated as a ring with dynamic port mapping Otherwise, a 16-port (8R/8W) centralized register file is required

64-tap FIR Example Syntax: #, ring offset, instr0, instr1, instr2, instr3 (memory is half-word addressed) i0 0; MOV r0,coef; MOV r0,coef; MOV r0,0; MOV r0,0; i1 0; MOV r1,x; MOV r1,x+1; NOP; NOP; i2 0; MOV r2,y; MOV r2,y+2; NOP; NOP; // assume half-word (16-bit) input & word (32-bit) output i3 RPT 512,8; // 2 outputs per iteration & total 1024 outputs i4 0; LW_D r8,r9,(r0)+2; LW_D r8,r9,(r0)+2; MOV r1,0; MOV r1,0; i5 RPT 15,2; // loop kernel: 60 MAC_V, including 120 multiplication (2 output) i6 2; LW_D r8,r9,(r0)+2; LW_D r8,r9,(r0)+2; MAC_V r0,r8,r9; MAC_V r0,r8,r9; i7 0; LW_D r8,r9,(r0)+2; LW_D r8,r9,(r0)+2; MAC_V r0,r8,r9; MAC_V r0,r8,r9; i8 2; LW_D r8,r9,(r0)+2; LW_D r8,r9,(r0)+2; MAC_V r0,r8,r9; MAC_V r0,r8,r9; i9 0; MOV r0,coef; MOV r0,coef; MAC_V r0,r8,r9; MAC_V r0,r8,r9; i10 0; ADDI r1,r1,-60; ADDI r1,r1,-60; ADD r8,r0,r1; ADD r8,r0,r1; i11 2; SW (r2)+4,r8; SW (r2)+4,r8; MOV r0,0; MOV r0,0; Remarks: 35 instruction cycles for 2 output; i.e. 17.5 cycle/output or 3.66 taps/cycle SIMD MAC: MAC_V r0,r8,r9; r0=r0+r8.hi*r9.hi & r1=r1+r8.lo*r9.lo

Ring-structure RF (2/2) Control / LS LS ALU / MAC ALU / Enhanced MAC local registers ring registers R0~R7 (private, 32-bit) R8~R15 (ring, 32-bit) R0~R7 (private, 32-bit) R8~R15 (ring, 32-bit) R0~R7 (private, 40-bit) R8~R15 (ring, 32-bit) R0~R7 (private, 40-bit) R8~R15 (ring, 32-bit) L/S R0~R7 (addr) Ring-structure Register File R8~R15 (ping) R0~R7 (addr) L/S R8~R15 (ping) R8~R15 (pong) L/S R8~R15 (ping) R8~R15 (pong) R0~R7 (addr) Each cluster has a pingpong hierarchical RF R8~R15 (pong) AU R0~R7 (accum.) R0~R7 (accum.) AU AU R0~R7 (accum.)

Performance Evaluation Implementation Results Centralized RF Ring-Structure RF Delay 38.46 ns 8.71 ns Gate Count 591K 48K 32Kbyte Instruction Memory (256 1024-bit bundles) Area 17.76mm 17.76mm 1.9mm 1.9mm (2.34mm 1.53mm) Instruction Dispatcher Power N.A. 356mW @3.3V 100MHz Instruction Decoder 32Kbyte Data Memory (4 banks) TI C 55x TI C 64x NEC SPXK5 Intel/ADI MSA Proposed L/S L/S ALU/MAC ALU/eMAC FIR NT/2 NT/4 NT/2 NT/2 NT/4 Local Registers 4-by-4 Switch Network FFT 4,768 2,403 2,944 3,176 2,340 Ring Registers Viterbi 1 (0.4) N.A. 1 (1) 1 (N.A.) 1 (0.84) ME N.A. 2 2 4 2

Summary Static FP arithmetic Software (i.e. static) tracks the exponents of the intermediate variables to maximize the precision while preventing overflow 62.14dB & 38.12dB for 24-bit & 16-bit SFPU respectively Simple DSP core DSP-lite Software techniques are extensively investigated to reduce the hardware complexity SIU-based computing engine has about 3X performance over conventional DSP with similar functional units Ring-structure RF for N functional units Partitions the 4N-port CRF into 2N sub-blocks, and each has 2R2W ports only Has comparable performance with state-of-the-art DSP processors (estimated in cycles) when N=4,and saves 91.88% area & 77.35% access time