VLSI Programming 2016: Lecture 6
1 VLSI Programming 2016: Lecture 6. Course: 2IMN35. Teachers: Kees van Berkel c.h.v.berkel@tue.nl, Rudolf Mak r.h.mak@tue.nl. Lab: Kees van Berkel, Rudolf Mak, Alok Lele. www: http://www.win.tue.nl/~wsinmak/Education/2IMN35/ Lecture 6: T3, T4, digital signal processors
2 VLSI Programming (2IMN35): time table. Tue: h5-h8, MF.07; Thu: h1-h4, Gemini-Z3A-08/10/13. 19-Apr: introduction, DSP graphs, bounds. 21-Apr: pipelining, retiming, transposition, J-slow, unfolding; tools. 26-Apr: introductions to L1 (audio filter). 28-Apr: T1 unfolding, look-ahead; L1 cntd; FPGA and Verilog simulation; L2 + T2 strength reduction. 3-May: folding; L2 (audio filter on XUP board). 5-May: L2 cntd. 10-May: T3 + T4, DSP processors; L3. 12-May: L3 (sequential FIR + strength-reduced FIR). 17-May: L3 cntd. 19-May: L3 cntd; L4. 24-May: systolic computation; T5. 26-May: L3; L4. 31-May: T5; L4 (audio sample rate convertor). 2-Jun: L4 cntd; L5. 7-Jun: L5 (1024x audio sample rate convertor). 9-Jun: L4; L5 cntd. 14-Jun: L5 cntd. 16-Jun: L5 deadline report.
3 Outline Lecture 6: T3, T4; the SW-HW performance spectrum; an architecture-morphing exercise. Mandatory reading (reminder): Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proc. of the IEEE, Vol. 75, No. 9, Sept 1987.
4 T3: Parallel IIR assignment. Consider IIR: y(n) = x(n) + a*y(n-2). Assume add and multiply times: 2 and 5 ns, respectively. 1. Derive the parallel look-ahead IIR, L=4. 2. Pipeline and retime for maximal throughput using a minimum number of D-elements. 3. Include throughput and latency calculations. Return deadline: Tuesday May 10.
5 IIR assignment 4 (from lecture 3). 4. Pipeline and retime the unfolded IIR; draw DFG; throughput? No pipelining possible: same DFG, same throughput. f_sample = 2/(T_M + T_A) = 2/7 ns^-1 ≈ 286 MHz. (DFG: x(2k) and x(2k+1) each feed an adder whose outputs y(2k), y(2k+1) are fed back through a D-element and multiplier a as y(2(k-1)), y(2(k-1)+1).)
6 Parallel IIR assignment: unrolling 3x (n → n+1, n+2, n+3). Note: rewrite à la Parhi with u(n) = x(n+2): y(n+2) = a*y(n) + u(n); y(n+3) = a*y(n+1) + u(n+1); y(n+4) = a*y(n+2) + u(n+2) = a^2*y(n) + a*u(n) + u(n+2); y(n+5) = a*y(n+3) + u(n+3) = a^2*y(n+1) + a*u(n+1) + u(n+3). Unfolding (L=4: n → 4k): y(4k+2) = a*y(4k) + u(4k); y(4k+3) = a*y(4k+1) + u(4k+1); y(4k+4) = a^2*y(4k) + a*u(4k) + u(4k+2); y(4k+5) = a^2*y(4k+1) + a*u(4k+1) + u(4k+3).
7 Parallel IIR assignment. f_sample = 4/(T_M + 2T_A) = 4/9 ns^-1 ≈ 444 MHz. (DFG: inputs u(4k), u(4k+1), u(4k+2), u(4k+3); coefficients a and a^2; outputs y(4k), y(4k+1), y(4k+2), y(4k+3), y(4k+5); critical path: one multiply plus two adds.)
8 Parallel IIR assignment. Pipelined: f_sample = 4/(T_M + T_A) = 4/7 ns^-1 ≈ 571 MHz; +4 D-elements. (Same DFG with pipeline registers inserted; outputs delayed to y(4k-4), y(4k+1-4), y(4k+2-4), y(4k+3-4).)
9 Parallel IIR assignment: 2-slow! f_sample = 2/(T_M + 0*T_A) = 2/5 ns^-1 = 400 MHz. (DFG: inputs u(4k), u(4k+1), u(4k+2), u(4k+3); coefficients a and a^2; outputs y(4k), y(4k+1), y(4k+2), y(4k+3).)
10 T4: Strength-reduced FIR assignment. Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-6) + d*x(n-7). Assume add and multiply times: 2 and 5 ns, respectively. 1. Draw the DFG of the FIR, calculate throughput. 2. Apply strength reduction, L=2. 3. Pipeline and retime for maximal throughput using a minimum number of D-elements. 4. Include throughput and latency calculations. Return deadline: Tuesday May 10.
11 Assignment T4: Strength-reduced FIR 1a. Consider FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-6) + d*x(n-7). Assume add and multiply times: 2 and 5 ns, respectively. 1. Draw the DFG of the FIR, calculate throughput. Transposed form (for high throughput), with a 5D delay between the b and c taps: f_sample = 1/(T_M + T_A) = 1/(5+2) = 1/7 ns^-1 ≈ 143 MHz.
12 Assignment T4: Strength-reduced FIR 1b. 2. Pipeline and retime the FIR for maximal throughput: f_sample = 1/(T_M + 0*T_A) = 1/5 ns^-1 = 200 MHz; +4 D-elements. (Transposed DFG with coefficients d, c, b, a, the 5D delay, and the output delayed to y(n-1).)
13 Assignment T4: Strength-reduced FIR 2. y(n) = a*x(n) + b*x(n-1) + c*x(n-6) + d*x(n-7). y(2k) = a*x(2k) + b*x(2k-1) + c*x(2k-6) + d*x(2k-7) = a*x(2k) + b*x(2(k-1)+1) + c*x(2(k-3)) + d*x(2(k-4)+1). y(2k+1) = a*x(2k+1) + b*x(2k) + c*x(2k-5) + d*x(2k-6) = a*x(2k+1) + b*x(2k) + c*x(2(k-3)+1) + d*x(2(k-3)). Hence: y(2k) = (a+b)*x(2k) + (c+d)*x(2(k-3)) + b*[x(2(k-1)+1) - x(2k)] + d*[x(2(k-4)+1) - x(2(k-3))]; y(2k+1) = (a+b)*x(2k) + (c+d)*x(2(k-3)) + a*[x(2k+1) - x(2k)] + c*[x(2(k-3)+1) - x(2(k-3))].
14 Assignment T4: Strength-reduced FIR 2. y(2k) = (a+b)*x(2k) + (c+d)*x(2(k-3)) + b*[x(2(k-1)+1) - x(2k)] + d*[x(2(k-4)+1) - x(2(k-3))]; y(2k+1) = (a+b)*x(2k) + (c+d)*x(2(k-3)) + a*[x(2k+1) - x(2k)] + c*[x(2(k-3)+1) - x(2(k-3))]. When we assume (a+b) and (c+d) are pre-computed constants: 3 × 2 multipliers for the sub-FIRs, 3 adders for the sub-FIRs, plus 2 adders + 2 subtractors of strength-reduction overhead. Total: 6 multipliers + 7 adds/subs (versus 8 multipliers + 6 adds/subs).
15 Assignment T4: Strength-reduced FIR 2. f_sample = 2/(T_M + 3T_A) = 2/11 ns^-1 ≈ 182 MHz. (DFG: inputs x(2k), x(2k+1); subtractors feeding multipliers a, b, c, d; a shared branch through a+b and c+d with delays D and 3D; adders producing y(2k) and y(2k+1).)
16 Assignment T4: Strength-reduced FIR 3,4. Pipelining: +9 D-elements. f_sample = 2/(T_M + 0*T_A) = 2/5 ns^-1 = 400 MHz. (Same DFG, with outputs delayed to y(2k-4) and y(2k+1-4).)
17 Assignment T4: Strength-reduced FIR 3,4 y(2k+1-6) c a 3D x(2k+1) y(2k-6) + + D - c+d a+b + 3D + Transposition: 12 D-elements d b 3D x(2k) 17
18 DIGITAL SIGNAL PROCESSORS 18
19 The SW-HW Spectrum: outline FIR on a microprocessor (MIPS, DLX) FIR on a DSP DSP Arithmetic DSP memory addressing and organization DSP control Zürich zip and Eindhoven zip REAL and Motorola DSP programming FIR on a Vector DSP FIR in VLSI or on an FPGA 19
20 Typical DSP algorithms: FIR Filters. Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies. Finite Impulse Response (FIR) filters compute y[i]: y(i) = sum_{k=0}^{N-1} h(k) x(i-k) = h(n) * x(n), where x is the input sequence, y is the output sequence, h is the impulse response (filter coefficients), and N is the number of taps (coefficients) in the filter. The output sequence depends only on the input sequence and the impulse response.
21 FIR filter in ANSI C:
#define N 16
int X[N];      /* circular buffer of input samples */
int C[N];      /* filter coefficients */
int sum;

int FIRstep(int b)      /* b = base index into the tap buffer */
{
    int s = 0;
    int i;
    for (i = 0; i < N; i++) { s = s + C[i]*X[(b+i)%N]; }
    return s;
}

main()
{
    int base = 0;
    Cinit();
    while (1) {
        scanf("%d", &X[base]);
        sum = FIRstep(base);
        printf("%d\n", sum);
        base = (base+1)%N;
    }
}
22 FIRSTEP in MIPS assembler (from C-compiler):
# 16  s = s + C[i]*X[(b+i)%N];
  lw   $14, 0($sp)
  addu $15, $4, $14
  rem  $24, $15, 8
  mul  $25, $24, 4
  la   $8, X
  addu $9, $25, $8
  lw   $10, 0($9)
  mul  $11, $14, 4
  la   $12, C
  addu $13, $11, $12
  lw   $15, 0($13)
  mul  $24, $10, $15
  lw   $25, 4($sp)
  addu $8, $24, $25
  sw   $8, 4($sp)
# 15  for (i=0; i<N; i++) {
  sw   $0, 0($sp)
  lw   $9, 0($sp)
  addu $14, $9, 1
  sw   $14, 0($sp)
  blt  $14, 8, $33
# 17  }
# 18  return s;
  lw   $2, 4($sp)
  addu $sp, 8
23 FIR filter on a MIPS micro processor: 15 instructions per tap per sample, of which 7 are load/store; with 2-cycle load/store instructions this gives 22 clock cycles/tap. What does this mean? How to appreciate this? A brush-up on computer architecture: Computer Architecture, Hennessy and Patterson.
24 Instruction Set Architecture An Instruction Set Architecture (ISA) = interface between hardware and software. Hence, a good ISA: allows easy programming (compilers, OS,..); allows efficient implementations (hardware); has a long lifetime (survives many HW generations); is general purpose. 24
25 Reduced Instruction Set Computer 1980: Patterson and Ditzel: The Case for RISC fixed 32-bit instruction set, with few formats load-store architecture large register bank (32 registers), all general purpose On processor organization: hard-wired decode logic pipelined execution single clock-cycle execution 25
26 DLX [MIPS-like] instruction formats (bit fields 31-26, 25-21, 20-16, 15-11, 10-0): R-type: Opcode | rs1 | rs2 | rd | function — reg-reg ALU operations. I-type: Opcode | rs1 | rd | Immediate — loads, stores, conditional branch, ... J-type: Opcode | offset — jump, jump and link, trap, return from exception.
27 Example DLX instructions (instruction, type, name, meaning): LW R1, 30(R2) — I, Load Word: Reg[R1] := Mem[30 + Reg[R2]]. SW 500(R4), R3 — I, Store Word: Mem[500 + Reg[R4]] := Reg[R3]. ADD R1, R2, R3 — R, Add: Reg[R1] := Reg[R2] + Reg[R3]. ADDI R1, R2, #3 — I, Add Immediate: Reg[R1] := Reg[R2] + 3. BEQZ R4, imm. — I, Branch Equal Zero: if Reg[R4] = 0 then pc := imm. fi. J offset — J, Jump: pc := offset.
28 DLX instruction mixes [from H&P, Figs 2.26, 2.27]. SPECint92 average freq. [%]: load 26, cond branch 16, add 14, compare 13, store 9, or 5, shift 4, load imm. 3 — subtotal 90. SPECfp92 average freq. [%]: load FP 23, add 14, mul FP 13, store FP 9, cond branch 8, add FP 8, sub FP 6, compare FP 6 — subtotal 84.
29 DLX interface, state. (Diagram: DLX CPU with pc and register file r0-r31, connected to an instruction memory (address/instruction) and a data memory (address/data/r-w), plus clock and interrupt inputs.)
30 DLX: 5-step sequential execution. IF (any instruction): read instruction, update PC (depending on branch condition). ID (any instruction): read register values, sign-extend immediate. EX: ALU instruction — do ALU operation; load/store — compute address; branch — compute branch condition. MM: read data memory (load) or write data memory (store). WB: write back into Reg the ALU result (ALU instruction) or the loaded value (load).
31 DLX: 5-step sequential execution. (Datapath diagram: pc and npc, instruction memory, ir, register file read values A and B, sign-extended Imm, ALU output aluo, branch condition cond, data memory, load memory data lmd, write-back to Reg.)
32 DLX: pipelined execution. (Diagram, program execution [instructions] versus time [clock cycles]: successive instructions enter the pipeline one cycle apart, each passing through IF, ID, EX, MM, WB in consecutive cycles.) NB: this ignores dependencies among successive instructions.
33 DLX: pipelined execution. (Pipelined datapath diagram: Instruction Fetch, Instruction Decode, EXecute, Memory, Write Back stages, with pipeline registers between the instruction memory, register file, ALU, and data memory.)
34 FIRSTEP in MIPS assembler:
# 16  s = s + C[i]*X[(b+i)%N];
  lw   $14, 0($sp)
  addu $15, $4, $14
  rem  $24, $15, 8
  mul  $25, $24, 4
  la   $8, X
  addu $9, $25, $8
  lw   $10, 0($9)
  mul  $11, $14, 4
  la   $12, C
  addu $13, $11, $12
  lw   $15, 0($13)
  mul  $24, $10, $15
  lw   $25, 4($sp)
  addu $8, $24, $25
  sw   $8, 4($sp)
# 15  for (i=0; i<N; i++) {
  sw   $0, 0($sp)
  lw   $9, 0($sp)
  addu $14, $9, 1
  sw   $14, 0($sp)
  blt  $14, 8, $33
# 17  }
# 18  return s;
  lw   $2, 4($sp)
  addu $sp, 8
35 FIR on DLX: manually optimized assembly.
# 16  s = s + C[i]*X[(b+i)%N];   (pseudo assembly)
$33: addu $12, $8, $10      ca := &C + ci
     addu $13, $9, $11      xa := &X + xi
     lw   $14, 0($12)       cv := M[ca]
     lw   $15, 0($13)       xv := M[xa]
     mul  $16, $14, $15     p  := cv * xv
     addu $17, $17, $16     s  := s + p
     add  $10, $10, 4       ci := ci + 4
     add  $11, $11, 4       xi := xi + 4
     blt  $11, 40, $34      if xi < 40 goto $34
     addi $11, $0, 0        xi := 0
$34: blt  $10, 40, $33      if ci < 40 goto $33
     sw   $17, 4($sp)       M[$sp+4] := s
36 FIR on DLX: manually optimized assembly. 11 instructions per tap; 2 loads, 0 stores (a load/store takes 2 cycles); 2 branches (branch-delay slots can be used). From 22 clock cycles down to 13 clock cycles per tap. Aim of classical DSPs: reduce the cycle count to 1 clock cycle per tap on a fully programmable processor.
37 Basic features of DSP processors. Compared to a micro processor, a DSP has: fixed/floating-point arithmetic; fast multiply-accumulate; specialized addressing modes; a multiple-access memory architecture; specialized program control. DSP ICs (versus embedded DSPs) also have: on-chip peripherals and I/O devices.
38 Fixed-point arithmetic. Representation for fractions in the range [-1 .. 1): a sign bit, then the radix point, then bits with weights 2^-1, 2^-2, ... (the slide's worked bit-pattern examples were lost in transcription).
39 Fixed-point arithmetic (cntd) Fixed point word size: typically 16 or 24 bits. Hardware for integer and fixed-point arithmetic is very similar (details on multiplication differ). Support for saturation, rounding, scaling, etc. Most DSPs support both integer and fixed-point; some also support floating point arithmetic. Fixed-point is the norm for consumer applications; algorithms are often converted from floating to fixed point. 39
40 Multiply-Accumulate instruction. Operand registers x0, x1, y0, y1 feed a multiplier; a shifter and ALU accumulate into Accumulator A or Accumulator B. The instruction A := A + x*y executes in a single clock cycle.
41 Multiply-Accumulate:
FS:  s  := 0
     ci := 0
     xi := 4*b
$33: ca := &C + ci
     xa := &X + xi
     cv := M[ca]
     xv := M[xa]
     s  := s + cv*xv
     ci := ci + 4
     xi := xi + 4
     if xi < 40 then goto $34
     xi := 0
$34: if ci < 40 then goto $33
42 Memory addressing. Specialized addressing modes to support: register-indirect addressing with post-increment / post-decrement; modulo-N addressing (the start address must be a power of 2; N must be stored in a dedicated modulo register).
43 Memory addressing: post-increment, modulo.
     xa := &X    {must be a 16-fold}
     xm := 10    {modulo register}
FS:  s  := 0
     ci := 0
     ca := &C
$33: cv := M[ca], ca := ca+1
     xv := M[xa], xa := (xa+1) mod xm
     s  := s + cv*xv
     ci := ci + 1
$34: if ci < 10 then goto $33
     xa := (xa+1) mod xm
44 Memory organizations. Von Neumann architecture: processor connected to a single memory by one address bus and one data/instruction bus. Harvard architecture: processor with separate address and data/instruction buses to a D(ata) memory and an I(nstruction) memory.
45 Memory Organizations (cntd) Dual Harvard (or Modified Harvard): 1 instruction bus + 2 data buses (named X, Y). X,Y buses connected to separate memories (RAM). More difficult to program: programmer must specify which RAM for each variable. Instruction bus to program memory (ROM, Flash). 45
46 Dual Harvard memory: 7 → 4.
     xa := &X    {must be a 16-fold}
     xm := 10    {modulo register}
FS:  ci := 0
     ca := &C
$33: cv := Y[ca], ca := ca+1, xv := X[xa], xa := (xa+1) mod xm
     s  := s + cv*xv
     ci := ci + 1
$34: if ci < 10 then goto $33
     xa := (xa+1) mod xm
47 Control: zero-overhead looping. Zero-overhead (hardware) looping: repeat L instructions N times. Sometimes L is restricted to 1; usually N < 2^16; usually some form of nesting is allowed; N must be stored in a dedicated register.
48 Zero-overhead looping: 4 → 2.
     xa := &X    {must be a 16-fold}
     xm := 10    {modulo register}
FS:  s  := 0
     ca := &C
     repeat 10 times {
        cv := Y[ca], ca := ca+1, xv := X[xa], xa := (xa+1) mod xm
        s  := s + cv*xv
     }
     xa := (xa+1) mod xm
49 Pipelining: 2 → 1.
     xa := &X    {must be a 16-fold}
     xm := 10    {modulo register}
FS:  s  := 0
     ca := &C
     cv := Y[ca], ca := ca+1, xv := X[xa], xa := (xa+1) mod xm
     repeat 9 times {
        s := s + cv*xv, cv := Y[ca], ca := ca+1, xv := X[xa], xa := (xa+1) mod xm
     }
     s := s + cv*xv
50 Programming a DSP. Today: usually in assembler. Most popular high-level language: C. However, C lacks common DSP data types (fixed point, complex numbers), and compilers are rather inefficient (see next slide). Emerging practice: use optimized hand-coded kernels (e.g. from libraries); use compiled code for non-critical parts; use real-time operating systems to schedule multiple tasks.
51 DSPs and compilers. DSP architectures are compiler-unfriendly: multiple memory spaces, a small number of dedicated registers, non-orthogonal instruction sets, no hardware support for stacks. Furthermore, it is hard for a compiler to make good use of: multi-operation instructions, parallel data moves, hardware looping, small local data memories. Newer DSPs, based on Very Long Instruction Words (VLIW), are more compiler-friendly, at the expense of area and code size.
52 SIMD parallelism (vector processing). A single ADD instruction computes A_1+B_1, A_2+B_2, ..., A_N+B_N element-wise on vectors A and B. SIMD = Single Instruction stream, Multiple Data stream. + only 1 {program memory, instruction decoder, L1 controller}; + no/less SRAM fragmentation; + simple single-thread program model (e.g. task switch); ? less general (how much SIMD parallelism is in the application?); ? suffers from Amdahl's Law. Efficient! Flexible?
53 Amdahl's Law. Overall speedup S = (1 - f + f/P)^-1, where f is the fraction vectorized and P the number of lanes. (Plot of S versus f for P = 32: small f is easy but gives a small benefit; large f is desired — but is it feasible?)
54 EVP = (scalar + vector) VLIW. (Diagram: VLIW parallelism across functional units — mul, shuffle — combined with vector parallelism over the vector memory, plus scalar operations.) Operations/clock cycle: 50 typical, 100 max.
55 EVP architecture [ST-Ericsson]. Program memory, ACU, VLIW controller; 16-words-wide vector memory; vector registers [16] with Load/Store Unit, ALU, MAC/Shift Unit, Shuffle Unit, Intra-Vector Unit, Code-Generation Unit, AXU; 1-word-wide scalar regs [32] with Load/Store Unit, ALU, MAC Unit. EVP16 in 90 nm CMOS: 600 k gates, 2.5 mm2, ... MHz (worst case), 0.5 mW/MHz core only, 1 mW/MHz typical memory configuration.
56 Many algorithms can be vectorized. Communication algorithms: rake receiver, UMTS acquisition, cordic, (I)FFT, Fast Hadamard Transform, OFDM symbol (de)mapping, 16-QAM equalization, symbol-timing estimation, interference cancellation, Viterbi decoder, etc. Media algorithms: DCT, SAD (incl. bilinear interpolation), motion estimation (feasible), video scaling, vertical peaking, disparity matching for mobile, RGB2YUV, RGB rendering, color segmentation, noise filtering (morphology), object filtering (i.e. aspect ratio), color interpolation, etc. Performance typically scales well with vector size.
57 Hardware FIR filters. (Diagram: tap chain with inputs x1 ... xN, coefficients c1 ... cN, delay elements D, and an adder chain producing y.) Depending on performance requirements: (up to) N multiplier-accumulators, hence N taps per clock cycle.
58 Hardware FIR filters. (Same tap-chain diagram.) Depending on performance requirements: compute M output samples in parallel (block processing) — M outputs per clock cycle, i.e. M × N taps per clock cycle.
59 FIR taps per clock cycle. (Chart, ordered from specific/hardwired to generic/programmable: VLSI, FPGA / reconfigurable HW, vector DSP, conventional DSP, embedded CPU, micro processor — down to 0.01 taps per clock cycle at the generic end.)
60 System = HW + vector processor + DSP + GP. (Chart: load [ops], 100M-100G, versus code size [Bytes], 100-100M. Hardware has the best efficiency (area, power); vector processor, DSP, and micro-controller offer increasing application scope and generality; functions migrate over time toward the generic end.)
61 DSP references. Keshab K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, Wiley Inter-Science. Richard G. Lyons, Understanding Digital Signal Processing (2nd edition), Prentice Hall. John G. Proakis and Dimitris K. Manolakis, Digital Signal Processing (4th edition), Prentice Hall. Simon Haykin, Neural Networks: a Comprehensive Foundation (2nd edition), Prentice Hall.
62 Computer Architecture and DSP references. Hennessy and Patterson, Computer Architecture: a Quantitative Approach (6th edition), Morgan Kaufmann. Phil Lapsley, Jeff Bier, Amit Shoham, Edward Lee, DSP Processor Fundamentals, Berkeley Design Technology, Inc. Jennifer Eyre and Jeff Bier, The Evolution of DSP Processors, IEEE Signal Processing Magazine. Kees van Berkel et al., Vector Processing as an Enabler for Software-Defined Radio in Handheld Devices, EURASIP Journal on Applied Signal Processing 2005:16.
63 2IN35: reporting guidelines 2013 (1). 1. Submit one report per team (1 or 2 students). 2. Respect deadlines: assignment L3: Tuesday June 4, 2013; assignment L4: Tuesday June 11, 2013; assignment L5: Tuesday June 18, 2013. 3. Make sure that assignments L3, L4, and L5 are demonstrated to and signed off by Alok, Hrishikesh, Rudolf, or Kees. 4. Submit two printed copies on paper (electronic copies will not be accepted). 5. Report on lab assignments L3, L4, and L5.
64 2IN35: reporting guidelines 2013 (2). General guidelines (each assignment), to be followed strictly: 6. Analyze the specifications and requirements. 7. Present/motivate key ideas/decisions, design options, alternatives, trade-offs. 8. Draw an architecture block diagram (= picture!). 9. Explain the functional correctness of your Verilog programs (include your complete Verilog programs in an appendix). 10. Report, analyze & explain FPGA-resource usage and utilization {#multipliers, #BRAMs, #LUTs} in relation to your design. 11. Report, analyze & explain (min) sample time T_s and (max) sample frequency f_s, both after synthesis and after placement & routing. 12. Include simulation results: both waveforms in the time domain and in the frequency domain (apply FFT) (assignments 3 and 4 only). 13. Include answers to the inline questions.
65 THANK YOU
More informationCS3350B Computer Architecture Quiz 3 March 15, 2018
CS3350B Computer Architecture Quiz 3 March 15, 2018 Student ID number: Student Last Name: Question 1.1 1.2 1.3 2.1 2.2 2.3 Total Marks The quiz consists of two exercises. The expected duration is 30 minutes.
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationModern Computer Architecture
Modern Computer Architecture Lecture2 Pipelining: Basic and Intermediate Concepts Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationECE232: Hardware Organization and Design. Computer Organization - Previously covered
ECE232: Hardware Organization and Design Part 6: MIPS Instructions II http://www.ecs.umass.edu/ece/ece232/ Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Computer Organization
More informationECE331: Hardware Organization and Design
ECE331: Hardware Organization and Design Lecture 15: Midterm 1 Review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Basics Midterm to cover Book Sections (inclusive) 1.1 1.5
More informationControl Instructions. Computer Organization Architectures for Embedded Computing. Thursday, 26 September Summary
Control Instructions Computer Organization Architectures for Embedded Computing Thursday, 26 September 2013 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationProcessor (I) - datapath & control. Hwansoo Han
Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two
More informationAn introduction to DSP s. Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures
An introduction to DSP s Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures DSP example: mobile phone DSP example: mobile phone with video camera DSP: applications Why a DSP?
More informationDesign of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1
Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later
More informationProgrammable Machines
Programmable Machines Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Quiz 1: next week Covers L1-L8 Oct 11, 7:30-9:30PM Walker memorial 50-340 L09-1 6.004 So Far Using Combinational
More informationProgrammable Machines
Programmable Machines Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Quiz 1: next week Covers L1-L8 Oct 11, 7:30-9:30PM Walker memorial 50-340 L09-1 6.004 So Far Using Combinational
More informationChapter 4. The Processor. Computer Architecture and IC Design Lab
Chapter 4 The Processor Introduction CPU performance factors CPI Clock Cycle Time Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS
More informationInstr. execution impl. view
Pipelining Sangyeun Cho Computer Science Department Instr. execution impl. view Single (long) cycle implementation Multi-cycle implementation Pipelined implementation Processing an instruction Fetch instruction
More informationCache Justification for Digital Signal Processors
Cache Justification for Digital Signal Processors by Michael J. Lee December 3, 1999 Cache Justification for Digital Signal Processors By Michael J. Lee Abstract Caches are commonly used on general-purpose
More informationGeneral Purpose Processors
Calcolatori Elettronici e Sistemi Operativi Specifications Device that executes a program General Purpose Processors Program list of instructions Instructions are stored in an external memory Stored program
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationPipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science
Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationInstructions: Language of the Computer
CS359: Computer Architecture Instructions: Language of the Computer Yanyan Shen Department of Computer Science and Engineering 1 The Language a Computer Understands Word a computer understands: instruction
More informationThe Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture
The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count
More informationEfficient filtering with the Co-Vector Processor
Efficient filtering with the Co-Vector Processor BL Dang yy, Nur Engin y, GN Gaydadjiev yy y Philips Research Laboratories, Eindhoven,The Netherlands yy EEMCS Faculty, Delft University of Technology, The
More informationAnne Bracy CS 3410 Computer Science Cornell University. See P&H Chapter: , , Appendix B
Anne Bracy CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, and Sirer. See P&H Chapter: 2.16-2.20, 4.1-4.4,
More informationREAL TIME DIGITAL SIGNAL PROCESSING
REAL TIME DIGITAL SIGNAL PROCESSING UTN-FRBA 2010 Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable. Reproducibility. Don t depend on components
More informationREAL TIME DIGITAL SIGNAL PROCESSING
REAL TIME DIGITAL SIGNAL PROCESSING SASE 2010 Universidad Tecnológica Nacional - FRBA Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.
More information4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?
Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide
More informationIntroduction to Field Programmable Gate Arrays
Introduction to Field Programmable Gate Arrays Lecture 2/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Digital Signal
More informationComputer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: MIPS Instruction Set Architecture
Computer Science 324 Computer Architecture Mount Holyoke College Fall 2009 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture. Idea:
More informationComputer Architecture
CS3350B Computer Architecture Winter 2015 Lecture 4.2: MIPS ISA -- Instruction Representation Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and Design,
More informationEC-801 Advanced Computer Architecture
EC-801 Advanced Computer Architecture Lecture 5 Instruction Set Architecture I Dr Hashim Ali Fall 2018 Department of Computer Science and Engineering HITEC University Taxila!1 Instruction Set Architecture
More informationProcessor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationCS3350B Computer Architecture Winter 2015
CS3350B Computer Architecture Winter 2015 Lecture 5.5: Single-Cycle CPU Datapath Design Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and Design, Patterson
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More information04 - DSP Architecture and Microarchitecture
September 11, 2015 Memory indirect addressing (continued from last lecture) ; Reality check: Data hazards! ; Assembler code v3: repeat 256,endloop load r0,dm1[dm0[ptr0++]] store DM0[ptr1++],r0 endloop:
More informationChapter 4 The Processor 1. Chapter 4A. The Processor
Chapter 4 The Processor 1 Chapter 4A The Processor Chapter 4 The Processor 2 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationDetermined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version
MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationECE 1160/2160 Embedded Systems Design. Midterm Review. Wei Gao. ECE 1160/2160 Embedded Systems Design
ECE 1160/2160 Embedded Systems Design Midterm Review Wei Gao ECE 1160/2160 Embedded Systems Design 1 Midterm Exam When: next Monday (10/16) 4:30-5:45pm Where: Benedum G26 15% of your final grade What about:
More informationLECTURE 10. Pipelining: Advanced ILP
LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction
More informationInstruction Set Architecture. "Speaking with the computer"
Instruction Set Architecture "Speaking with the computer" The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture Digital Design
More informationCS31001 COMPUTER ORGANIZATION AND ARCHITECTURE. Debdeep Mukhopadhyay, CSE, IIT Kharagpur. Instructions and Addressing
CS31001 COMPUTER ORGANIZATION AND ARCHITECTURE Debdeep Mukhopadhyay, CSE, IIT Kharagpur Instructions and Addressing 1 ISA vs. Microarchitecture An ISA or Instruction Set Architecture describes the aspects
More informationChapter 13 Reduced Instruction Set Computers
Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining
More informationCN310 Microprocessor Systems Design
CN310 Microprocessor Systems Design Micro Architecture Nawin Somyat Department of Electrical and Computer Engineering Thammasat University 28 August 2018 Outline Course Contents 1 Introduction 2 Simple
More information