INTRODUCTION TO DIGITAL SIGNAL PROCESSORS

Size: px

Start display at page:

Download "INTRODUCTION TO DIGITAL SIGNAL PROCESSORS"

Evan Quinn
5 years ago
Views:

1 INTRODUCTION TO DIGITAL SIGNAL PROCESSORS Accumulator architecture Memoryregister architecture Prof. Brian L. Evans in collaboration with Niranjan DameraVenkata and Magesh Valliappan Loadstore architecture Embedded Signal Processing Laboratory The University of Texas at Austin Austin, T

2 Outline n Signal processing applications n Conventional DSP architecture n Pipelining in DSP processors n RISC vs. DSP processor architectures n TI TMS320C6x VLIW DSP architecture n Signal and image processing applications n Signal processing on generalpurpose processors n Conclusion 2

3 Signal Processing Applications n Lowcost embedded systems 4 Modems, cellular telephones, disk drives, printers n Highthroughput applications 4 Halftoning, radar, highresolution sonar, tomography n PC based multimedia 4 Compression/decompression of audio, graphics, video n Embedded processor requirements 4 Inexpensive with small area and volume 4 Deterministic interrupt service routine latency 4 Low power: ~50 mw (TMS320C5402/20: 0.32 ma/mip) 3

4 Conventional DSP Architecture n High data throughput 4 Harvard architecture n n Separate data memory/bus and program memory/bus Three reads and one or two writes per instruction cycle 4 Short deterministic interrupt service routine latency 4 Multiplyaccumulate (MAC) in a single instruction cycle 4 Special addressing modes supported in hardware n n Modulo addressing for circular buffers (e.g. FIR filters) Bitreversed addressing (e.g. fast Fourier transforms) 4 Instructions to keep the 34 stages of the pipeline full n n Zerooverhead looping (one pipeline flush to set up) Delayed branches 4

5 Conventional DSP Architecture (con t) n Modulo addressing 4 implementing circular buffers and delay lines Datashifting Time Buffer contents Next sample n=n x NK+1 x NK+1 x N1 x N n=n+1 x NK+2 x NK+3 x N x N+1 n=n+2 x NK+3 x NK+4 x N+1 x N+2 x N+1 x N+2 x N+3 Modulo addressing n Bit reversed addressing 4 used to implement the radix2 FFT Time Buffer contents Next sample n=n x N2 x N1 x N x NK+1 x NK+2 n=n+1 x N2 x N1 x N x N+1 x NK+2 n=n+2 x N2 x N1 x N x N+1 x N+2 x NK+3 x NK+3 x NK+4 x NK+4 x N+1 x N+2 x N+3 5

6 Conventional DSP Architecture (con t) FixedPoint FloatingPoint Cost/Unit $5 $10 Architecture accumulator loadstore or memoryregister Registers 2 4 data 8 address 8 or 16 data 8 or 16 address Data Words 16 or 24 bit integer and fixedpoint 32 bit integer and fixed/floatingpoint OnChip Memory 2 64 kwords data * 2 64 kwords program 8 64 kwords data 8 64 kwords program Address Space 16 kw 8 Mw data kw program ** 16 Mw 4Gw data 16 Mw 4 Gw program Compilers C compilers; poor code generation C, C++ compilers; better code generation Examples TI TMS320C5x; Motorola TI TMS320C3x; Analog Devices SHARC 6

7 Conventional DSP Architecture (con t) n Market share: 95% fixedpoint, 5% floatingpoint n Each processor has dozens of configurations 4 Size and map of data and program memory 4 A/D, input/output buffers, interfaces, timers, and D/A n Drawbacks to conventional DSP processors 4 No byte addressing (desirable for image and video) 4 Limited onchip memory 4 Limited addressable memory on fixedpoint DSPs: except Motorola (16 Mw data; 32 Mw program) and C548/C549/C54xx (8 Mw data; 256 kw program) 4 Nonstandard C extensions to support fixedpoint data 7

8 Pipelining Sequential (Motorola 56000) Fetch Decode Read Execute Pipelined (Most conventional DSP processors) Fetch Decode Read Execute Superscalar (Pentium, MIPS) Fetch Decode Read Execute Superpipelined (CDC7600) Managing Pipelines compiler or programmer interlocking hardware instruction scheduling Fetch Decode Read Execute 8

9 Pipelining: Operation n Timestationary pipeline model 4 Programmer controls each cycle 4 Motorola DSP56001 MAC 0,Y0,A :(R0)+,0 Y:(R4),Y0 n Datastationary pipeline model 4 Programmer specifies data operations 4 TMS320C30/40 MPYF *++AR0(1),*++AR1(IR0),R0 n Interlocked pipeline 4 Programmer is protected from pipeline effects Fetch F D E F G H I J K L L Decode D R E C D E F G H I J K L B C D E F G H I J K L Read Execute A B C D E F G H I J K L 9

10 Pipelining: Hazards n A control hazard occurs when a branch instruction is decoded 4 Flush the pipeline 4 or: Delayed branch (expose pipeline) n A data hazard occurs because an operand cannot be read yet 4 Intended by programmer 4 or: Interlock hardware inserts bubble TMS320C5x example LAC #064h SAMM AR2 NOP LACC * LAR AR2, DATA LACC * Fetch F D R E D E F br G Y Y Z Decode C D E F br Y Z B C D E F br Y Z Read Execute A B C D E F br Y Z 10

11 Pipelining: Avoiding Control Hazards A key factor in the numeric performance of DSPs is the provision of special hardware to perform looping. RPT COUNT TBLR *+ n A repeat instruction repeats one instruction or a block of instructions after repeat n The pipeline is filled with repeated instruction (or block of instructions) n Cost: one pipeline flush only Fetch F D R E D E F rpt Decode C D E F rpt Read B C D E F rpt A B C D E F rpt Execute 11

12 RISC vs. DSP: Instruction Encoding n RISC: Superscalar Reorder Load/store FP Unit n DSP: Horizontal microcode Integer Unit Load/store Load/store ALU Multiplier Address 12

13 RISC vs. DSP: Memory Hierarchy n RISC Registers Out of order I/D Cache Physical memory TLB TLB: Translation Lookaside Buffer n DSP I Cache Internal memories Registers External memories DMA Controller DMA: Direct Memory Access 13

14 TI TMS320C6x VLIW DSP Architecture Simplified Architecture External Memory Sync Async Addr Data Program RAM Data RAM or Cache Internal Buses Regs (A0A15).D1.D2.M1.M2.L1.L2.S1.S2 Control Regs CPU Regs (B0B15) DMA Serial Port Host Port Boot Load Timers Pwr Down 14

15 TI TMS320C6x VLIW DSP Architecture n One instruction cycle per clock cycle n Two parallel data paths with singlecycle units: 4 Data unit 32bit address calculations (modulo, linear) 4 Multiplier unit 16 bit x 16 bit with 32bit result 4 Logical unit 40bit (saturation) arithmetic & compares 4 Shifter unit 32bit integer ALU and 40bit shifter n 16 32bit general purpose registers in each path 4 40 bits can be stored in adjacent even/odd registers n 32bit addressing of 8/16/32 bit data n Fixedpoint (C62x) and floatingpoint (C67x) n C67x computes floatingpoint multiply in 4 cycles 15

16 TI TMS320C6x VLIW DSP Architecture n TMS320C6211: $21 in volume MHz, 300 million MACs/sec, 1200 RISC MIPS 4 onchip: 4k x 8 bits program and 4k x 8 bits data (plus 64k x 8 bits L2 cache) n Deep pipeline stages in C62x: fetch 4, decode 2, execute stages in C67x: fetch 4, decode 2, execute If a branch is in the pipeline, interrupts are disabled (the latency of a branch instruction is 5 cycles) 4 Avoid branches by using conditional execution n No hardware protection against pipeline hazards 4 Compiler and assembler must prevent pipeline hazards 16

17 C5x and C6x Addressing Modes n Immediate 4 The operand is part of the instruction n Register 4 The operand is specified in a register n Direct 4 The address of the operand is part of the instruction (added to imply memory page) n Indirect 4 The address of the operand is stored in a register TMS320C5x TMS320C6x ADD #0FFh add.l1 13,A1,A6 (implied) add.l1 A7,A6,A7 ADD 010h not supported ADD * ldw.l1 *A5++[8],A1 17

18 TMS320C6x vs. Pentium MM Processor Peak MIPS BDTI marks ISR latency Power Unit Price Area Volume Pentium MM 233 Pentium MM 266 C62x 150 MHz C62x 200 MHz µs 4.25 W $ x in µs 4.85 W $ x in µs 1.45 W $ x in µs 1.94 W $ x in 3 BDTImarks: Berkeley Design Technology Inc. DSP benchmark results (larger means better)

19 n Each tap requires Application: FIR Filter 4 Fetching one data sample 4 Fetching one operand 4 Multiplying two numbers 4 Accumulating multiplication result 4 Shifting one sample in the delay line z 1 z 1 z 1 n Computing an FIR tap in one instruction cycle 4 Three data memory accesses 4 Autoincrement or decrement addressing modes 4 Modulo addressing to implement delay line as circular buffer 4 Eleven RISC instructions 19

20 Application: FIR Filter on a TMS320C5x Coefficients Data COEFFP.set 02000h ; Program mem address.set 037Fh ; Newest data sample LASTAP.set 037FH ; Oldest data sample LAR AR3, #LASTAP ; Point to oldest sample RPT #127 MACD COEFFP, * ; Do the thing APAC SACH Y,1 ; Store result note shift 20

21 Application: FIR Filter on a TMS320C62x Coefficients Data SingleCycle Loop... C7: ldh.d1 *A1++, A2 ; Read coefficient ldh.d2 *B1++, B2 ; Read data [B0] sub.l2 B0, 1, B0 ; Decrement counter [B0] B.S2 c7 ; Branch if not zero mpy.m1x A2, B2, A3 ; Form product add.l1 A4, A3, A4 ; Accumulate result... 21

22 Ordered Dithering on a TMS320C62x 1/8 5/8 7/8 3/8 SingleCycle Loop Array of thresholds... C7: ldb.d1 *A1++, A2 ; Read pixel ldb.d2 *B1++, B2 ; Read threshold [B0] sub.l2 B0, 1, B0 ; Decrement counter [B0] B.S2 c7 ; Branch if not zero cmpgtu.l1x A2, B2, A3 ; Threshold and store stb.d1 A3, *A5++ ; Accumulate result... 22

23 DSP Cores n ASIC with: 4 Programmable DSP 4 RAM 4 ROM 4 Standard cells 4 Codec 4 Peripherals 4 Gate array 4 Microcontroller 23

24 DSP on General Purpose Processors n Multimedia applications on PCs 4 Video, audio, graphics and animation 4 Repetitive parallel sequences of instructions n Native signal processing examples 4 Sun Visual Instruction Set (UltraSPARC 1/2) 4 Intel MM (Pentium I/II/III) 4 Intel Concurrent SIMDFP (Pentium III) n Single Instruction Multiple Data (SIMD) 4 One instruction acts on multiple data in parallel 4 Wellsuited for graphics 24

25 DSP on General Purpose Processors (con t) n Programming is considerably tougher 4 C/C++ compilers do not generate native signal processing code except Metrowerks CodeWarrior 4 gives MM code 4 Libraries of routines using native signal processing 4 Hand code using inline assembly for best performance 4 Pack/unpack data not aligned on SIMD word boundaries 4 50cycle penalty to switch out of MM; 0 penalty for VIS 4 Saturation arithmetic in MM; not supported in VIS 4 Extendedprecision accumulation in MM; none in VIS n Speedup for applications 4 Signal and image processing 1.5:1 to 2:1 4 Graphics 4:1 to 6:1 (no packing/unpacking) 25

26 Intel MM Instruction Set n 64bit SIMD register (4 data types) 4 64bit quad word 4 Packed byte (8 bytes packed into 64 bits) 4 Packed word (4 16bit words packed into 64 bits) 4 Packed double word (2 double words packed into 64 bits) n 57 new instructions 4 Pack and unpack 4 Add, subtract, multiply, and multiply/accumulate n Saturation and wraparound arithmetic n Maximum parallelism possible 4 8:1 for 8bit additions 4 4:1 for 8 x 16 multiplication or 16bit additions 26

27 Concluding Remarks n Conventional digital signal processors 4 High performance vs. power consumption/cost/volume 4 Excel at onedimensional processing 4 Have instructions tailored to specific applications n TMS320C6x VLIW DSP 4 High performance vs. cost/volume 4 Excel at multidimensional signal processing 4 A maximum of 22 RISC instructions per cycle n Native Signal Processing 4 Available on desktop computers 4 Excels at graphics 4 A maximum of 8 RISC instructions per cycle n Inline assembly code for best performance 27

28 Concluding Remarks n Digital signal processor market 4 40% annual growth rate since $3.5 billion revenue in % TI, 25% Lucent, 10% Motorola, 8% Analog Devices n Independent benchmarking by industry 4 Berkeley Design Technology Inc. 4 EDN Embedded Microprocessor Benchmark Consortium n Web resources 4 comp.dsp newsgroup: FAQ 4 embedded processors and systems: 4 online courses and DSP boards: 28

29 References 1 G. E. Allen, B. L. Evans, and D. C. Schanbacher, RealTime Sonar Beamforming on a Unix Workstation, Proc. IEEE Asilomar Conf. On Signals, Systems, and Computers, pp , R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, Evaluating MM Technology Using DSP and Multimedia Applications, Proc. IEEE Sym. On Microarchitecture, pp. 3746, W. Chen, H. J. Reekie, S. Bhave, and E. A. Lee, Native Signal Processing on the UltraSPARC in the Ptolemy Environment, Proc. IEEE Asilomar Conf. On Signals, Systems, and Computers, B. L. Evans, EE379K17 RealTime DSP Laboratory, UT Austin. 5 B. L. Evans, EE382C Embedded Software Systems, UT Austin. 6 A. Kulkarni and A. Dube, Evaluation of the Code Generation Domain in Ptolemy, 7 P. Lapsley, J. Bier, A. Shoham, and E. A. Lee, DSP Processor Fundamentals, IEEE Press,

INTRODUCTION TO DIGITAL SIGNAL PROCESSORS

INTRODUCTION TO DIGITAL SIGNAL PROCESSORS Accumulator architecture Memoryregister architecture Prof. Brian L. Evans in collaboration with Niranjan DameraVenkata and Magesh Valliappan Loadstore architecture