General Purpose Signal Processors

Ezra Palmer

1 General Purpose Signal Processors
First announced in 1978 (AMD) for peripheral computation such as in printers; matured in the early 1980s (TMS320 series).
General purpose vs. dedicated architectures:
Pros:
- Cheap
- Software support
- Short development time
- Easy to modify
Cons:
- Power
- Board real estate
- I/O and memory limitations
- Speed

2 History in a Nutshell
First generation: AMD2900, NEC7720, TMS32010 (TI)
- Harvard architecture
- Hardwired multiplier/accumulator
Second generation: TMS320C25, MC56001, DSP16 (AT&T)
- Concurrency, multiple busses, on-chip memory
Third generation: TMS320C30, MC96002, DSP32C (AT&T)
- On-chip floating point operations
Fourth generation:
- Multi-processing features
- Image and video processors
- Low-power DSPs (AT&T)

3 How is DSP Different?
- Infinite streams of data processed in real time.
- Relatively small programs and data storage requirements.
- Intensive arithmetic processing, with a low amount of control and branching.
- High amount of I/O, with A/D and D/A conversion often required.
- Loosely coupled multi-processor operation.
Alternatives to DSP processors:
- General purpose micro-processors
- ASICs with a general purpose processor core
- Special purpose chips

4 Advantages of GP Micro-processor over DSP
- Lower cost because of higher volume.
- Excellent high-level language programming and tool support.
- Able to efficiently perform non-DSP tasks.
- Higher performance:

                               DSP (TMS320C40)    Micro (DEC Alpha)
  Clock frequency              25 MHz             [missing] MHz
  FIR filter, length N         7+N                23+2N
  [missing]-point FFT (µsec)   [missing]          [missing]
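
The FIR row of the comparison can be turned into a rough timing estimate. The sketch below assumes that 7+N and 23+2N are counts of clock periods at the listed clock frequency, which the slide does not state explicitly. Because the Alpha clock frequency is missing from the slide, it stays a free parameter here, and the function names are illustrative only.

```python
def fir_cycles_dsp(n_taps: int) -> int:
    """Count quoted on the slide for the TMS320C40: 7 + N."""
    return 7 + n_taps


def fir_cycles_micro(n_taps: int) -> int:
    """Count quoted on the slide for the DEC Alpha: 23 + 2N."""
    return 23 + 2 * n_taps


def time_us(cycles: int, clock_mhz: float) -> float:
    """Time in microseconds, ASSUMING one count per clock period."""
    return cycles / clock_mhz


def break_even_clock_ratio(n_taps: int) -> float:
    """How much faster the micro's clock must be to match the DSP."""
    return fir_cycles_micro(n_taps) / fir_cycles_dsp(n_taps)


if __name__ == "__main__":
    n = 64
    print("DSP time at 25 MHz (us):", time_us(fir_cycles_dsp(n), 25.0))
    print("Break-even clock ratio:", break_even_clock_ratio(n))
```

Under these assumptions the ratio tends towards 2 for long filters, so the general purpose processor wins on this kernel only when its clock is more than about twice as fast.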

5 Advantages of DSP over Micro-processors
- Software and development system support for signal processing applications.
- Variety of versions allows cost/performance/power trade-offs.
- Low-cost versions have adequate performance.
- Single-cycle multiply-accumulate (multiple data busses and array multiplier).
- Complex instructions for standard DSP functions (FIR filter, 2nd order section, FFT).
- Specialized memory addressing: bit reversed (FFT), circular buffers (delay lines).
- Zero-overhead loops.
- I/O support: serial and parallel ports, DMA, A/D and D/A interface.
- Limited use of data and instruction caches.
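
The addressing modes listed above can be illustrated in software. The following is a minimal sketch, not any vendor's implementation: `CircularFIR` keeps its delay line in a circular buffer using modulo indexing, which a DSP address unit does in hardware at no cycle cost, and `bit_reverse` computes the bit-reversed index used to reorder FFT data. Both names are hypothetical.

```python
class CircularFIR:
    """FIR filter whose delay line is a circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)
        self.head = 0  # index of the newest sample

    def step(self, x):
        n = len(self.coeffs)
        # Move the write pointer backwards with wraparound (modulo addressing).
        self.head = (self.head - 1) % n
        self.delay[self.head] = x
        acc = 0.0
        idx = self.head
        for c in self.coeffs:
            acc += c * self.delay[idx]  # one multiply-accumulate per tap
            idx = (idx + 1) % n         # step to the next older sample
        return acc


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (FFT reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


if __name__ == "__main__":
    fir = CircularFIR([1.0, 2.0, 3.0])
    print([fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]])
    print([bit_reverse(i, 3) for i in range(8)])
```

Feeding a unit impulse returns the coefficients in order, which is a quick check that the pointer arithmetic is right. No samples are ever moved in memory; only the pointer changes.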

6 Architectural Features
- High-performance datapaths
- High-throughput I/O bandwidth
- Controller optimized to implement fast iteration
Performance depends on:
- Flexibility of communication/data access
- Amount of parallel operations
- Cycle time
How to compare? There is a set of benchmarks:
- FIR filter
- Biquad filter
- Adaptive LMS filter
- FFT (or DCT for image processors)
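
As an illustration of one of these benchmark kernels, here is a minimal sketch of a biquad (second-order section) in direct form I. The function name and coefficient convention are assumptions made for this example: the difference equation used is y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2].

```python
def biquad(samples, b, a):
    """Direct form I second-order section.

    b = (b0, b1, b2) are the feed-forward coefficients,
    a = (a1, a2) are the feedback coefficients (a0 is taken as 1).
    """
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0  # delayed inputs and outputs
    out = []
    for x in samples:
        # Five multiply-accumulates per output sample.
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


if __name__ == "__main__":
    # One-pole low-pass expressed as a biquad: y[n] = x[n] + 0.5 * y[n-1]
    print(biquad([1.0, 0.0, 0.0, 0.0], b=(1.0, 0.0, 0.0), a=(-0.5, 0.0)))
```

The kernel is five multiply-accumulates plus four state updates per sample, which is why it is a useful measure of how well a datapath overlaps arithmetic with data movement.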

7 Harvard Architecture
Two simultaneous fetches.
[Figure: block diagram with blocks PC, ACU, PROG. MEM, DATA MEM, REG, IR and EXU, divided into a control path and a data path.]

8 Modified Harvard Architecture
[Figure: block diagram with blocks PC, two ACUs, PROG. MEM, two DATA MEMs, IR and EXU, divided into a control path and a data path.]
Two- or three-operand instructions executed in one instruction cycle.

9 Architecture of Motorola DSP 56001
[Figure: block diagram with blocks External address, Program MEM (12K), X MEM (6K RAM, 6K ROM), Y MEM (6K RAM, 6K ROM), Address ALU, Internal data bus switch, X data / Y data / P data / Global busses, External data bus switch, I/O controller, Program controller and Data ALU.]

10 Scaling
- Overflow: many successive multiply and accumulate operations can cause an overflow. Many DSPs put extra bits in the accumulator to prevent this.
  [Example: figure not recovered]
- The position of the binary point is not set by the hardware! The user must keep track of it.
- The assembler has a format for getting data into memory, but this does not imply a position for the binary point.
- Working with integers is difficult. Repeated multiplications imply increasing precision.
- What about rounding? It requires an extra operation in the TMS320, but is part of the multiply/accumulate operation in MC
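
The overflow and binary-point remarks can be made concrete with a small fixed-point model. This sketch assumes 16-bit Q15 operands (15 fractional bits), so each product is in Q30 format. It compares a 32-bit accumulator with a 40-bit one, where the 8 guard bits are an illustrative choice. None of the names correspond to a specific processor.

```python
def wrap(value, bits):
    """Two's complement wraparound of `value` to a `bits`-bit register."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def mac(xs, ys, acc_bits):
    """Multiply-accumulate Q15 integers into an accumulator of `acc_bits` bits.

    Each Q15 * Q15 product is a Q30 number: the hardware only sees
    integers, and the programmer keeps track of the binary point.
    """
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap(acc + x * y, acc_bits)
    return acc


def round_q30_to_q15(acc):
    """Round a Q30 accumulator value to Q15 by adding half an LSB first."""
    return (acc + (1 << 14)) >> 15


if __name__ == "__main__":
    x = [32767] * 4  # four samples just below +1.0 in Q15
    print("32-bit accumulator:", mac(x, x, 32))  # wraps around
    print("40-bit accumulator:", mac(x, x, 40))  # guard bits absorb the growth
```

With four near-full-scale products the true sum is close to 4.0, which no longer fits in a 32-bit Q30 register, while the guard bits of the wider accumulator hold it without wraparound.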

11 Floating Point DSPs
Data representation: sfff...f (24 bits) | eee...e (8 bits)
$N = M \times 2^{e-128}$
Advantages:
- Large dynamic range
- No scaling required
Problems:
- Cost
- Power
- Slower
Wide range of processors available:
- AT&T DSP 32C
- Motorola
- TI TMS320C30
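
The formula above can be sketched as a decoder. The slide does not say how the 24-bit field sfff...f maps to M, so this example assumes M is a two's complement fraction in [-1, 1). Real devices such as the DSP32C or TMS320C30 define their own formats in detail, and this sketch is not a model of either one.

```python
def decode(mantissa_bits, exponent_bits):
    """Decode N = M * 2**(e - 128).

    ASSUMPTION: the 24-bit field is a two's complement fraction,
    the 8-bit exponent field e is unsigned (excess-128).
    """
    m = mantissa_bits & 0xFFFFFF
    if m & 0x800000:           # sign bit set: negative mantissa
        m -= 1 << 24
    mantissa = m / float(1 << 23)
    e = exponent_bits & 0xFF
    return mantissa * 2.0 ** (e - 128)


if __name__ == "__main__":
    print(decode(0x400000, 129))   # 0.5 * 2**1
    print(decode(0xC00000, 128))   # -0.5 * 2**0
    # Scale factors run from 2**-128 to 2**127.
    print(2.0 ** (0 - 128), 2.0 ** (255 - 128))
```

The 8-bit exponent alone spans 255 powers of two, which is the large dynamic range the slide lists as the main advantage and the reason no manual scaling is needed.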

12 TMS320C30
[Figure: block diagram with Program Cache, two RAM Blocks, ROM Block, Address Registers, Address Arithmetic, Multiplier, ALU and General Purpose Registers.]
- 33 MFLOPS / 16.7 MIPS.
- 60 nsec single-cycle instruction time.
- 32-bit instruction and data word.
- 24-bit addresses: 16M-word address space.
- 40/32-bit floating point multiplier and ALU.
- 180-pin package in 1 µm CMOS technology.
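
The headline figures on this slide follow from each other by simple arithmetic, as the sketch below checks. The one assumption, not stated on the slide, is that the MFLOPS figure counts two floating point operations per instruction cycle (a multiply and an ALU operation in parallel).

```python
CYCLE_NS = 60        # single-cycle instruction time from the slide
ADDRESS_BITS = 24    # address width from the slide
FLOPS_PER_CYCLE = 2  # ASSUMPTION: one multiply plus one ALU operation


def mips(cycle_ns):
    """Million instructions per second for a given cycle time in ns."""
    return 1e3 / cycle_ns


def mflops(cycle_ns, flops_per_cycle):
    """Peak million floating point operations per second."""
    return flops_per_cycle * mips(cycle_ns)


def address_space_words(address_bits):
    """Number of addressable words."""
    return 2 ** address_bits


if __name__ == "__main__":
    print("MIPS:", round(mips(CYCLE_NS), 1))
    print("MFLOPS:", round(mflops(CYCLE_NS, FLOPS_PER_CYCLE)))
    print("Words:", address_space_words(ADDRESS_BITS))
```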

13 Pipelined Processing
[Figure: pipeline diagram with stages PC / instruction fetch (+1), decode, fetch operands, multiplier and ALU, accumulator / write result. A conditional branch introduces pipeline bubbles.]

14 Pipeline Hazards: DSP32
- When an accumulator is used as an operand to the multiplier, the value of the accumulator is that established three instructions earlier.
- When a result is written to memory, the updated value of the memory location cannot be accessed until four instructions later.
- When any branch control instruction (if, call, return, goto) is executed, the instruction immediately following is also executed before the branch occurs.
- A DA condition tested by a conditional branch instruction will be established by the last DA instruction four instructions prior to the test.
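
The delayed-branch rule above can be demonstrated with a toy interpreter. This is not a DSP32 simulator: the two-instruction "ISA" (`add`, `goto`) and the `run` function are invented for this sketch, and only the single-delay-slot behaviour is modelled.

```python
def run(program, max_steps=100):
    """Execute a toy program with one branch delay slot.

    program is a list of (op, arg) tuples:
      ("add", k)       adds k to the accumulator
      ("goto", target) branches to index `target`, but only after the
                       instruction immediately following has executed
    Returns the final accumulator and the list of executed indices.
    """
    acc = 0
    pc = 0
    pending = None  # branch target waiting for its delay slot
    trace = []
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, arg = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:
            # The instruction at pc is the delay slot; branch after it.
            next_pc = pending
            pending = None
        if op == "add":
            acc += arg
        elif op == "goto":
            pending = arg
        pc = next_pc
        steps += 1
    return acc, trace


if __name__ == "__main__":
    program = [
        ("add", 1),
        ("goto", 4),
        ("add", 10),    # delay slot: still executed
        ("add", 100),   # skipped by the branch
        ("add", 1000),
    ]
    print(run(program))
```

The instruction after the `goto` still contributes to the result, so a programmer (or compiler) must either place useful work there or fill the slot with a no-op.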

15 Programming Styles
Interlocking: TMS320
- Store data to RAM immediately followed by fetching data from the same memory.
- Control hardware delays the execution of instructions without the programmer being aware of it.
Time-Stationary: MC96000, AT&T DSP16
- Instruction specifies the operations that occur simultaneously in one instruction cycle.
    MAC X0,Y0,A  X:(R0)-,X0  Y:(R4)-,Y0
    a0=a0+p  p=x*y  y=*r0++  x=*r1++
- The programmer has more explicit control over the pipeline stages.
Data-Stationary: AT&T DSP32C
- Instruction specifies all of the operations performed on a set of operands from memory.
    r5++ = a1 = a0 + *r7**r10++ r17
- Possible pipeline hazards!

16 Reservation Table for DSP 32C
[Table: reservation table, cell contents not recovered.
 Pipeline steps: Instr. Fetch Addr. Transfer, Write Result, PC Transfer, Instr. Transfer, Fetch Operand Addr. Transfer.
 Resources: Low Memory, High Memory, Data Bus, Addr Bus, Mult-stage1, Mult-stage2, Adder, Registers, PC.
 Operations: Fetch, Register Access, Data Transfer, Multiply, Add, Data Transfer.]

18 Pipelined Interleaved Processor
[Figure: pipeline diagram with stages PC / instruction fetch (+1), decode, fetch operands, multiplier and ALU, write result / accumulator.]

19 Software Support
Present:
- Simulator and assembler
- Real-time emulator
- Prototyping board (in-system testing)
- Macro library
Coming up:
- C compiler (~50% efficiency)
- Adapt architectures to allow easier support?
- Competitiveness with GP micro-processors?
Hope for a better future!
