An introduction to DSPs
- Examples of DSP applications
- Why a DSP?
- Characteristics of a DSP
- Architectures
DSP example: mobile phone
DSP example: mobile phone with video camera
DSP: applications
Why a DSP?
It's easy: we want an architecture optimized for digital signal processing.
Some versions are further optimized for specific applications, e.g. very low power consumption for mobile phones.
What is the difference between a DSP and a general-purpose processor? (1/4)
Memory architecture and buses
- The first processors (in the 1940s) had a Harvard architecture: separate memories for program and data
- But it is complex -> soon replaced by the von Neumann architecture: no real difference between program and data (an instruction has two fields: operation and data)
- Problem: the processor cannot access instructions and data simultaneously
- To improve performance: Harvard architecture again! In particular:
  - separate memories and buses for program and data
  - possibly, another separate bus for the DMA
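The bus bottleneck described on this slide can be illustrated with a toy cycle count. This is only a sketch under simplifying assumptions: each bus serves exactly one access per cycle, and one filter tap needs one instruction fetch plus two operand fetches. The function name `cycles_per_step` and the dual-data-bus case are hypothetical and do not come from the slides.

```python
def cycles_per_step(accesses_per_bus):
    """Toy model: each bus serves one access per cycle, buses work in parallel,
    so the busiest bus determines the number of cycles."""
    return max(accesses_per_bus.values())


# One filter tap: fetch the instruction, one sample and one coefficient.
von_neumann = cycles_per_step({"shared": 3})               # everything on one bus
harvard = cycles_per_step({"program": 1, "data": 2})       # separate program/data buses
# Hypothetical variant with two data buses (an assumption, not from the slides).
harvard_dual_data = cycles_per_step({"program": 1, "data_x": 1, "data_y": 1})

print(von_neumann, harvard, harvard_dual_data)  # 3 2 1
```

Real processors add caches, wait states and prefetching, so these numbers only show the trend.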
What is the difference between a DSP and a general-purpose processor? (2/4)
A DSP is often used to implement a linear filter.
In discrete time the convolution integral becomes a sum: $y_n = \sum_i x_{n-i} h_i$
- if the number of terms is finite: FIR filter (finite impulse response)
- otherwise: IIR filter (infinite impulse response), which can be implemented using two finite sums: $y_n = \sum_i x_{n-i} b_i + \sum_i y_{n-i} a_i$
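The two sums on this slide can be written out directly. The sketch below is a plain reference version, not an optimized one. It assumes that samples before n = 0 are zero and that the feedback sum starts at i = 1 (so `a[0]` is ignored), which the slide does not state explicitly. The function names `fir` and `iir` are illustrative.

```python
def fir(x, h):
    """y[n] = sum_i x[n-i] * h[i], assuming x[k] = 0 for k < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, hi in enumerate(h):
            if n - i >= 0:
                acc += x[n - i] * hi
        y.append(acc)
    return y


def iir(x, b, a):
    """y[n] = sum_i x[n-i]*b[i] + sum_{i>=1} y[n-i]*a[i].
    Assumption: a[0] is unused, since y[n] cannot depend on itself."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):
            if n - i >= 0:
                acc += x[n - i] * bi
        for i in range(1, len(a)):
            if n - i >= 0:
                acc += y[n - i] * a[i]
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
fir_out = fir(impulse, [0.5, 0.25, 0.25])   # response ends after 3 samples
iir_out = iir(impulse, [1.0], [0.0, 0.5])   # response halves forever
print(fir_out)  # [0.5, 0.25, 0.25, 0.0, 0.0]
print(iir_out)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

The impulse input makes the difference visible: the FIR output reproduces the coefficients and then stops, while the IIR output never reaches exactly zero.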
What is the difference between a DSP and a general-purpose processor? (3/4)
A common operation in a FIR or IIR filter is A = B·C + D. We need:
- a hardware multiplier (introduced in DSPs in the 1970s)
- a multiply and accumulate in a single clock cycle: the MAC instruction
The MAC actually sits inside a loop, so we also need a zero-overhead loop:
- hardware for address generation (memory is accessed in a regular pattern, not randomly)
- loop management
- auto-increment; circular addressing
Other possible hardware:
- hardware saturation
- instructions to perform a division quickly
- bit reversal for the FFT
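To make these hardware features concrete, here is a software sketch of what they compute: a MAC loop over a circular delay line, 16-bit saturation, and bit-reversed indexing. On a DSP each of these is done by dedicated hardware at no extra cycle cost; the Python below only models the behaviour. All names (`CircularFir`, `saturate16`, `bit_reverse`) are hypothetical.

```python
def saturate16(value):
    """Clamp to the signed 16-bit range instead of wrapping around."""
    return max(-32768, min(32767, value))


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT data reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


class CircularFir:
    """FIR filter whose delay line is addressed modulo its length."""

    def __init__(self, h):
        self.h = list(h)
        self.buf = [0] * len(self.h)  # delay line, overwritten in place
        self.pos = 0                  # index of the newest sample

    def step(self, sample):
        n = len(self.buf)
        self.pos = (self.pos + 1) % n        # circular pointer update
        self.buf[self.pos] = sample          # overwrite the oldest sample
        acc = 0
        idx = self.pos
        for coeff in self.h:                 # one MAC per tap
            acc += coeff * self.buf[idx]     # A = B*C + D
            idx = (idx - 1) % n              # modulo (circular) addressing
        return acc


filt = CircularFir([1, 2, 3])
mac_out = [filt.step(s) for s in [1, 0, 0, 0]]
print(mac_out)                                # [1, 2, 3, 0]
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
print(saturate16(40000))                      # 32767
```

The circular buffer avoids shifting the whole delay line on every new sample, which is why DSPs provide modulo addressing in the address generation unit.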
What is the difference between a DSP and a general-purpose processor? (4/4)
Other possible features:
- Data are often 16 or 8 bits wide (e.g. audio or images): a 32-bit ALU can be split into two 16-bit ALUs or four 8-bit ALUs -> 2 or 4 operations in parallel
- several ALUs working in parallel
- fixed-point ALUs, or 16-bit ALUs, to reduce power consumption and cost
- optimized versions:
  - cost: for consumer applications
  - power: for mobile applications
  - for specific applications, e.g. electric motor control
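The split-ALU idea can be sketched in software: two 16-bit values are packed into one 32-bit word, and the addition is done per lane so that a carry never crosses the lane boundary. This is an illustrative model with wrap-around arithmetic, not the behaviour of any specific processor; the helper names are hypothetical.

```python
MASK16 = 0xFFFF


def pack16(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16(word):
    """Return the (high, low) 16-bit lanes of a 32-bit word."""
    return (word >> 16) & MASK16, word & MASK16


def add_2x16(a, b):
    """Add two 32-bit words as two independent 16-bit lanes (wrapping)."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16                  # carry out of the low lane is dropped
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


a = pack16(1, 0xFFFF)
b = pack16(2, 0x0001)
split_sum = unpack16(add_2x16(a, b))           # lanes stay independent
plain_sum = unpack16((a + b) & 0xFFFFFFFF)     # carry leaks into the high lane
print(split_sum)  # (3, 0)
print(plain_sum)  # (4, 0)
```

The difference between the two results is exactly the carry that a split ALU suppresses at the lane boundary.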
Example: C30 (Texas Instruments, 1982)
Example: FIR filter using a C30
Note: several of these features, which originated on DSPs, have been carried over to general-purpose processors. E.g. the cache in the Pentium processor is Harvard-like.
Another example: several units working in parallel, and splittable ALUs (see the MMX extensions) in the Pentium 4 processor.
Pipeline
Example of a 4-stage pipeline (TI C30): each instruction takes 4 clock cycles to execute, but (normally) it can be issued just 1 cycle after the previous one (data are needed only 3 cycles later).
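The benefit of the pipeline can be quantified with a simple count. The sketch below assumes an ideal pipeline in which one instruction is issued per cycle unless stalls are added explicitly; the function names and the `stall_cycles` parameter are hypothetical.

```python
def cycles_unpipelined(n_instructions, stages=4):
    """Each instruction must finish all stages before the next one starts."""
    return n_instructions * stages


def cycles_pipelined(n_instructions, stages=4, stall_cycles=0):
    """The first instruction takes `stages` cycles; each following one
    completes one cycle later, plus any stall cycles."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stall_cycles


no_pipe = cycles_unpipelined(100)
with_pipe = cycles_pipelined(100)
print(no_pipe, with_pipe)  # 400 103
```

For long instruction sequences the throughput approaches one instruction per cycle, even though each instruction still takes 4 cycles from start to finish.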
Pipeline: branch (e.g. on the C30)
- Standard branch: the pipeline is flushed to handle the PC correctly -> 4 cycles
- Delayed branch: the pipeline is not flushed, and the 3 following instructions are loaded before the PC is modified -> only 1 cycle needed!
        BRD   label   ; delayed branch
        MPYF          ; executed
        ADDF          ; executed
        SUBF          ; executed
        AND           ; not executed
label   MPYF          ; fetched after SUBF
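The semantics of the delayed branch can be checked with a toy interpreter. It is a sketch that uses the cycle costs quoted on this slide (4 cycles for a standard branch, 1 cycle for a delayed branch with 3 delay slots) and assumes 1 cycle for every other instruction. The opcode `BR` for the standard branch and the tuple-based program format are assumptions made for illustration.

```python
def run(program, delay_slots=3, branch_penalty=4):
    """program: list of (label, opcode, target).
    Returns (list of executed non-branch opcodes, total cycles)."""
    assert delay_slots >= 1
    labels = {lab: i for i, (lab, _, _) in enumerate(program) if lab}
    executed, cycles = [], 0
    pc = 0
    pending = None  # [slots_left, target_index] while a delayed branch is in flight
    while pc < len(program):
        _, op, target = program[pc]
        pc += 1
        if op == "BR":                    # standard branch: pipeline flushed
            cycles += branch_penalty
            pc = labels[target]
            continue
        cycles += 1
        if op == "BRD":                   # delayed branch: keep fetching
            pending = [delay_slots, labels[target]]
            continue
        executed.append(op)
        if pending is not None:
            pending[0] -= 1
            if pending[0] == 0:           # delay slots used up: now jump
                pc = pending[1]
                pending = None
    return executed, cycles


delayed = [
    (None, "BRD", "label"),
    (None, "MPYF", None),
    (None, "ADDF", None),
    (None, "SUBF", None),
    (None, "AND", None),      # skipped
    ("label", "MPYF", None),
]
standard = [
    (None, "MPYF", None),
    (None, "ADDF", None),
    (None, "SUBF", None),
    (None, "BR", "label"),
    (None, "AND", None),      # skipped
    ("label", "MPYF", None),
]
delayed_result = run(delayed)
standard_result = run(standard)
print(delayed_result)   # (['MPYF', 'ADDF', 'SUBF', 'MPYF'], 5)
print(standard_result)  # (['MPYF', 'ADDF', 'SUBF', 'MPYF'], 8)
```

Both programs do the same work; the delayed version saves 3 cycles because the three instructions after `BRD` fill the slots that a flush would waste.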
Two architectures
To exploit instruction-level parallelism (ILP), two architectures are possible:
- Superscalar: the parallelism is managed dynamically by the hardware
- Very Long Instruction Word (VLIW): the parallelism is managed statically by the compiler
What is the problem? Data or control dependences can generate conflicts:
- on data (an instruction needs the result of a previous instruction, but the result is not ready yet), or
- on control (conditional jump, but the condition is not ready yet)
-> pipeline stall
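The data conflicts mentioned on this slide are read-after-write dependences, and finding them is the first step for both the hardware (superscalar) and the compiler (VLIW). The sketch below lists the dependences inside a small window of instructions. Whether a given dependence really causes a stall depends on the pipeline, so the `window` parameter is only an assumption; the `Instr` type and register names are illustrative.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "op dest srcs")


def raw_dependences(program, window=3):
    """Return (producer_index, consumer_index, register) for every instruction
    that reads a register written by one of the previous `window` instructions."""
    found = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            if producer.dest is not None and producer.dest in consumer.srcs:
                found.append((i, j, producer.dest))
    return found


program = [
    Instr("MPYF", "R2", ("R0", "R1")),
    Instr("ADDF", "R3", ("R2", "R4")),   # needs R2 from instruction 0
    Instr("SUBF", "R5", ("R6", "R7")),   # independent of 0 and 1
    Instr("ADDF", "R6", ("R3", "R5")),   # needs R3 and R5
]
deps = raw_dependences(program)
print(deps)  # [(0, 1, 'R2'), (1, 3, 'R3'), (2, 3, 'R5')]
```

Instruction 2 has no dependence on instructions 0 and 1, so it is a candidate for parallel or reordered execution.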
Superscalar
- Independent instructions are identified dynamically by the hardware (which is complex!)
- Instructions can be executed out of order; their completion (commit) is then done in order, so that the state of the CPU is updated correctly
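The distinction between out-of-order execution and in-order commit can be shown with a highly idealized model. The sketch assumes unlimited functional units, instructions that start as soon as their source registers are ready, and no write-after-read or write-after-write conflicts (as if registers were renamed). None of these assumptions comes from the slides, and the latencies are invented for the example.

```python
def simulate(program):
    """program: list of (name, dest, srcs, latency).
    Returns (finish_cycles, commit_cycles), both in program order."""
    ready_at = {}          # register -> cycle at which its value is available
    finish, commit = [], []
    last_commit = 0
    for name, dest, srcs, latency in program:
        start = max([ready_at.get(r, 0) for r in srcs], default=0)
        done = start + latency
        ready_at[dest] = done
        finish.append(done)
        last_commit = max(done, last_commit)   # cannot commit before older instructions
        commit.append(last_commit)
    return finish, commit


program = [
    ("MPYF", "R2", ("R0", "R1"), 4),   # slow multiply (latency is an assumption)
    ("ADDF", "R3", ("R2", "R4"), 1),   # waits for R2
    ("SUBF", "R5", ("R6", "R7"), 1),   # independent: finishes first
]
finish, commit = simulate(program)
print(finish)  # [4, 5, 1]
print(commit)  # [4, 5, 5]
```

`SUBF` finishes at cycle 1, before the two older instructions, but its result only becomes part of the architectural state at cycle 5, after they have committed.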
VLIW
Very Long Instruction Word (VLIW): the parallelism is managed statically by the compiler
- Independent instructions are identified statically, at compile time
- Instructions that can be executed in parallel are packed into long instructions and sent to the various functional units in order
- A convenient solution for DSP programs (fixed-length loops, few conditional operations); less convenient for general-purpose applications
- Simpler hardware! But a specific compilation is needed for each platform
- Deterministic behaviour -> execution times can be computed exactly
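The compile-time packing step can be sketched as a greedy, in-order bundler: an instruction joins the current long instruction unless the bundle is full or it conflicts with something already in it. This is a deliberately conservative toy (it also splits on write-after-read, which real VLIW machines may allow within a bundle) and is not how any particular compiler works. The bundle `width` and the instruction format are assumptions.

```python
def pack_bundles(program, width=4):
    """program: list of (name, dest, srcs). Returns a list of bundles,
    each a list of instruction names that may run in parallel."""
    bundles, current = [], []
    written, read = set(), set()
    for name, dest, srcs in program:
        conflict = (
            any(r in written for r in srcs)   # read-after-write
            or dest in written                # write-after-write
            or dest in read                   # write-after-read (conservative)
        )
        if current and (len(current) == width or conflict):
            bundles.append(current)           # close the long instruction
            current, written, read = [], set(), set()
        current.append(name)
        written.add(dest)
        read.update(srcs)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("MPYF R2", "R2", ("R0", "R1")),
    ("SUBF R5", "R5", ("R6", "R7")),
    ("ADDF R3", "R3", ("R2", "R4")),
    ("ADDF R8", "R8", ("R3", "R5")),
]
bundles = pack_bundles(program, width=2)
print(bundles)       # [['MPYF R2', 'SUBF R5'], ['ADDF R3'], ['ADDF R8']]
print(len(bundles))  # 3
```

Because the schedule is fixed at compile time, the number of bundles gives the issue time directly, which is the deterministic behaviour noted on the slide.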