Architecture Techniques


EE29A Spring 2008: Special Topics in Circuits and Signal Processing
Lecture 3: Architecture Techniques
Dejan Markovic, dejan@ee.ucla.edu

Announcements
Class wiki is up and running. Go to: EEWeb / Online Lab.
Please sign up using your UCLA EE user name (I need this for verification purposes).
Homework # coming up on Wed: background, circuit and microarchitecture techniques.
Slide 2

Real-Life Example: 802.11a Baseband
[Figure: chip block diagram with Viterbi decoder, MAC core, DMA, PCI, time/frequency synchronization, ADC/DAC, FSM, AGC, and FFT blocks]
Direct-mapped architecture: 200 MOPS/mW at an 80 MHz clock! 40 GOPS, power = 200 mW, 0.25 µm CMOS.
The architecture has to track technology.
Atheros 802.11a baseband processor.
Slide 3

Wireless Baseband Chip Design
Direct mapping is the most energy-efficient.
Technology is too fast for dedicated hardware.
Opportunity to further reduce energy and area.
[Figure: energy efficiency versus clock period (speed of technology), with microprocessors at GHz, programmable DSPs at 100s of MHz, and hardwired logic at 10s of MHz]
Slide 4

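A Number of Variables to Consider
How to optimally combine all variables?
[Equations: leakage energy $E_{Lk}$, switching energy $E_{Sw}$, and delay $D$ as functions of gate size $W$, supply voltage $V_{DD}$, and threshold voltage $V_{th}$; exact forms not recovered]
Variables: pipelining, parallelism, time-multiplexing, $V_{DD}$, sizing, $V_{th}$.
[Figure: clock period axis showing the speed of technology (fast) and the required speed (slow), 2 orders of magnitude apart (and growing)]
Slide 5

Introduction to Architecture Optimization
[Figure: design flow from algorithm modeling (Simulink) through DSP architectures (RTL) to circuit optimization (Cadence), linked by E-A-D and E-D tradeoffs, with $T_{sample}$, $T_{clk}$, power, area, and timing as parameters]
Slide 6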

Architectural Feedback from Technology
The Simulink hardware library implicitly carries information only about latency and wordlength (we can later choose the sample period when targeting an FPGA).
For the ASIC flow, block characterization also has to include technology features such as speed, power, and area.
But technology parameters scale each generation.
Need a general and quick characterization methodology.
Propagate results back to Simulink to avoid iterations.
Slide 7

Architecture-Circuit Co-design
[Figure: co-design flow from the behavioral level (E-D, DSP architectures, circuit optimization, $T_{clk}$) through HDL to logical and physical design; architectural feedback of speed, power, and area comes from pre-layout and post-layout results, with re-synthesis]
Slide 8

Starting Point: Datapath Characterization
Balance tradeoffs due to gate size ($W$) and supply ($V_{DD}$).
[Figure: circuit-level energy versus delay, showing the $W$ curve from minimum delay, the $V_{DD}$-scaling curve from the reference $V_{DD}$, the target delay, and the optimal design point]
At the optimal design point, the curves due to $W$ and $V_{DD}$ are tangent (equal sensitivity).
Goal: keep all pipelines at the same E-D point.
Slide 9

Cycle Time is Common for All Blocks
[Figure: flow from Simulink through RTL and Synopsys to a netlist simulated in HSPICE (switch-level accuracy), yielding speed, power, and area; plot of latency versus normalized cycle time for add and mult blocks]
Slide 10

Next Step: Block Characterization
Goal: balance logic depth within a block.
[Figure: micro-architecture level plot of latency versus cycle time for add and mult blocks, with the target $T_{Clk}$ marked]
Select block latency to achieve the target $T_{Clk}$; this balances pipeline logic depth.
Apply $W$ and $V_{DD}$ scaling to the underlying pipelines.
Slide 11

Architectural Feedback to Simulink
Characterize blocks with predetermined wordlength.
Translate the timing specification to a target supply voltage.
Determine the optimal latency for a given cycle time.
[Figure: (a) normalized energy versus normalized delay for a simulated FO4 inverter under $V_{DD}$ scaling, marking the target speed at nominal $V_{DD}$ and the desired point at the optimal $V_{DD}$ (a 0.6 V label appears); (b) latency versus normalized cycle time for synthesized add and mult blocks at nominal $V_{DD}$, with labels m=8 and a=2 at the target speed]
Slide 12

Basic Micro-Architectural Techniques
Parallelism, pipelining, time-multiplexing.
[Figure: block diagrams of (a) reference, (b) parallel, (c) pipeline, (d) reference for time-mux, and (e) time-multiplex datapaths]
Slide 13

Architecture Trade-Offs: Reference Datapath
Critical path delay: $T_{adder} + T_{comparator}$ (= 25 ns).
Total capacitance being switched = $C_{ref}$.
$V_{DD} = V_{DD,ref} = 5\,\mathrm{V}$.
Power for the reference datapath: $P_{ref} = C_{ref} V_{DD,ref}^2 f_{ref}$.
[A. Chandrakasan, S. Sheng, R. Brodersen, JSSC 4/92]
Slide 14

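Parallel Datapath
The clock rate can be reduced by half with the same throughput: $f_{par} = f_{ref}/2$.
$V_{DD,par} = V_{DD,ref}/1.7$, $C_{par} = 2.15\,C_{ref}$.
$P_{par} = (2.15\,C_{ref})(V_{DD,ref}/1.7)^2 (f_{ref}/2) \approx 0.36\,P_{ref}$.
Slide 15

Parallelism Adds Latency
[Figure: timing diagrams comparing the reference datapath (register plus adder at the full clock rate, inputs A1 to A5 producing Z1 to Z5) with a datapath at level of parallelism P = 2, where two register-adder branches run at half the clock rate on alternating inputs (A1, A3, A5 and A2, A4)]
Slide 16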

Increasing Level of Parallelism
Area: $A_N \approx N \cdot A_{ref}$.
[Figure: normalized energy per operation ($E_{Op}$) versus throughput (1/FO4) for increasing levels of parallelism]
Parallelism improves throughput for the same energy, and improves energy for the same throughput.
Cost: increased area.
The more parallel the better?
Slide 17

The More Parallel the Better?
[Figure: total energy versus supply voltage $V_{DD}$ for the reference and parallel designs]
Leakage and overhead start to dominate at high levels of parallelism, causing the minimum energy to increase.
The optimum voltage also increases with parallelism.
Slide 18

Pipelined Datapath
Critical path delay is less: $\max(T_{adder}, T_{comparator})$.
Keeping the clock rate constant: $f_{pipe} = f_{ref}$.
Voltage can be dropped: $V_{DD,pipe} = V_{DD,ref}/1.7$.
Capacitance is slightly higher: $C_{pipe} = 1.15\,C_{ref}$.
$P_{pipe} = (1.15\,C_{ref})(V_{DD,ref}/1.7)^2 f_{ref} \approx 0.39\,P_{ref}$.
Slide 19

Pipelining: Real-Life Example
Superscalar processor: determine the optimal pipeline depth and target frequency.
Power model: the PowerTimer toolset developed at IBM T.J. Watson Research Center, a methodology to build energy models based on the results of a circuit-level power analysis tool.
[V. Srinivasan et al., MICRO 02]
Slide 20

Timing Model
Analytical pipeline model.
Time per stage of pipe i: $T_i = t_i/s_i + c_i$.
Time to complete an FXU operation in the presence of stalls:
$T_{fxu} = T_1 + Stall_{fxu-fxu} \cdot T_1 + Stall_{fxu-fpu} \cdot T_2 + \dots + Stall_{fxu-bru} \cdot T_4$
$Stall_{fxu-fxu} = f_1 (s_1 - 1) + f_2 (s_1 - 2) + \dots$
where $f_i$ is the conditional probability that an FXU instruction m depends on FXU instruction (m-i).
Throughput = $u_1/T_{fxu} + u_2/T_{fpu} + u_3/T_{lsu} + u_4/T_{bru}$
where $u_i$ is the fraction of time pipe i has instructions arriving from the front end (FE) of the machine; $u_i = 0$ is an unutilized pipe and $u_i = 1$ is a fully utilized pipe.
[V. Srinivasan et al., MICRO 02]
Slide 21

Simulation Results: SPEC 2000
More stages for lower power!
[Figure: optimal pipeline depth of 18 FO4 for power and 10 FO4 for performance]
Slide 22
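The analytical timing model above can be sketched in a few lines of Python. This is a minimal illustration, not the PowerTimer implementation: the function names, the interpretation of $t_i$ as total logic delay, $s_i$ as stage count, and $c_i$ as per-stage latch overhead, the choice to drop stall terms once $s - i$ reaches zero, and all example numbers are assumptions.

```python
def stage_time(t_logic, n_stages, c_latch):
    """Time per stage, T_i = t_i / s_i + c_i (units are arbitrary, e.g. FO4)."""
    return t_logic / n_stages + c_latch


def same_pipe_stall(dep_probs, n_stages):
    """Stall term f_1*(s-1) + f_2*(s-2) + ... for dependencies within one pipe.

    dep_probs[i-1] is the (hypothetical) probability that instruction m
    depends on instruction m-i. Terms with s - i <= 0 are dropped, which is
    an assumption about where the series ends.
    """
    return sum(
        f * (n_stages - i)
        for i, f in enumerate(dep_probs, start=1)
        if n_stages - i > 0
    )


def throughput(utilizations, op_times):
    """Throughput = sum of u_i / T_i over all pipes."""
    return sum(u / t for u, t in zip(utilizations, op_times))


# Hypothetical example: 100 units of logic split into 5 stages, 2 units of latch overhead.
t1 = stage_time(100.0, 5, 2.0)
stall = same_pipe_stall([0.2, 0.1], 3)
t_op = t1 + stall * t1
print(t1, stall, t_op, throughput([1.0], [t_op]))
```

Increasing `n_stages` shortens `stage_time` but raises the stall term, which is the tension that produces an optimal pipeline depth in this kind of model.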

Simulation Results: TPC-C
Optimal pipeline depth is application dependent.
[Figure: optimal pipeline depth of 23 FO4 for power and 10 FO4 for performance]
Slide 23

Choosing a Pipeline Register
Faster latch = shallower pipe = higher performance.
Slide 24

Conclusions from the Paper
Performance-driven design leads to short pipelines.
Optimal pipeline depth for a superscalar processor:
Power: around 20 FO4.
Performance: around 10 FO4.
Reference: Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip N. Strenski, and Philip G. Emma, "Optimizing Pipelines for Power and Performance," in Proc. 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), November 2002.
Slide 25

Architecture Summary (Simple Datapath)
[Figure: summary of the simple datapath architectures; contents not recovered]
[A. Chandrakasan, S. Sheng, R. Brodersen, JSSC 4/92]
Slide 26
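The power estimates for the simple datapath (reference, parallel, pipelined) all follow from $P = C \cdot V_{DD}^2 \cdot f$. The sketch below recomputes them relative to the reference. It assumes the ratios from the cited Chandrakasan et al. example: $C_{par} = 2.15\,C_{ref}$, $C_{pipe} = 1.15\,C_{ref}$, and a supply reduced by a factor of 1.7. The function and variable names are illustrative.

```python
def relative_power(cap_ratio, vdd_divisor, freq_ratio):
    """Return P / P_ref for P = C * V_DD^2 * f.

    cap_ratio   : C / C_ref
    vdd_divisor : V_DD,ref / V_DD (1.7 means the supply is divided by 1.7)
    freq_ratio  : f / f_ref
    """
    return cap_ratio * (1.0 / vdd_divisor) ** 2 * freq_ratio


designs = {
    "reference": relative_power(1.0, 1.0, 1.0),
    "parallel": relative_power(2.15, 1.7, 0.5),   # half clock rate, two datapaths
    "pipelined": relative_power(1.15, 1.7, 1.0),  # same clock rate, extra registers
}

for name, ratio in designs.items():
    print(f"{name:10s} P/P_ref = {ratio:.2f}")
```

With the rounded divisor of 1.7 the ratios come out near 0.37 and 0.40, slightly above the ~0.36 and ~0.39 quoted on the slides, presumably because the slides use an unrounded supply voltage.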

Summary: Parallelism and Pipelining
[Figure: (a) reference, (b) parallel, and (c) pipeline block diagrams, with an energy/op versus time/op plot locating the reference and the parallel/pipeline designs]
It is important to link back to the E-D tradeoff.
Slide 27

Minimum Energy: $E_{Lk}/E_{Sw} \approx 0.5$
[Figure: normalized energy per operation versus $E_{Leakage}/E_{Switching}$ for nominal, parallel, and pipeline designs at different $V_{th}$ and $V_{DD}$ settings; table of $(E_{Lk}/E_{Sw})_{opt}$ by topology (inverter, adder, decoder); values not recovered]
Large $(E_{Lk}/E_{Sw})_{opt}$; flat $E_{Op}$ minimum.
Optimal designs have high leakage.
Must adapt to process and activity variations.
Slide 28

Time Multiplexing
[Figure: (a) reference for time-mux and (b) time-multiplex block diagrams, with an energy/op versus time/op plot locating the reference and time-mux designs]
Slide 29

Data Stream Interleaving
PE = recursive operation.
Direct approach: N blocks (for example, SVD), one PE per symbol stream. The PE is too fast and the area is large.
Interleaved architecture: a single PE placed between a parallel-to-serial (P/S) and a serial-to-parallel (S/P) converter serves all N symbol streams. Reduced area, P/S overhead, highly pipelined.
Slide 30

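PE Performs Recursive Operation
Interleave = upsample & pipeline.
[Figure: recursive PE before and after interleaving; labels not recovered]
Slide 31

Data Stream Interleaving Example
Recursive operation: $z(k) = x(k) + c \cdot z(k-1)$.
N data streams: $x_1, x_2, \dots, x_N$, interleaved by time index k and processed at a clock rate of $N \cdot f_{Clk}$.
The pipeline registers in the loop, a, b, and m, satisfy $a + b + m = N$.
[Figure: interleaved inputs $x_1 \dots x_N$, the recursive loop with coefficient c, and interleaved outputs $y_1 \dots y_N$ versus time index k]
Slide 32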
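A behavioral Python sketch of the interleaving example follows. It checks that one shared recursive PE with N registers in its feedback loop (modeled here as a length-N delay line) produces the same outputs as N independent copies of $z(k) = x(k) + c \cdot z(k-1)$. The function names and test data are illustrative, and the model ignores how the N registers are split between adder and multiplier.

```python
from collections import deque


def recursive_reference(x, c):
    """One dedicated PE: z(k) = x(k) + c * z(k-1), with z(-1) = 0."""
    z_prev = 0.0
    out = []
    for xk in x:
        z_prev = xk + c * z_prev
        out.append(z_prev)
    return out


def interleaved_pe(streams, c):
    """One shared PE running N times faster, with N registers in the loop."""
    n = len(streams)
    length = len(streams[0])
    loop_regs = deque([0.0] * n)        # N pipeline registers in the feedback path
    outputs = [[] for _ in range(n)]
    for k in range(length):             # slow (per-stream) time index
        for i in range(n):              # fast clock: one stream per cycle
            z_old = loop_regs.popleft() # value written N fast cycles ago (same stream)
            z_new = streams[i][k] + c * z_old
            loop_regs.append(z_new)
            outputs[i].append(z_new)
    return outputs


streams = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
c = 0.5
assert interleaved_pe(streams, c) == [recursive_reference(s, c) for s in streams]
print(interleaved_pe(streams, c))
```

The delay line is what makes the loop "upsampled and pipelined": each stream sees its own previous output because exactly N fast cycles separate two samples of the same stream.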

Folding
PE = recursive operation.
Direct approach: N PE blocks per symbol. The PE is too fast and the area is large.
Folded architecture: a single PE with an input multiplexer is reused N times per symbol. Reduced area, highly pipelined.
Slide 33

Folding Example
16 data streams, data sorting.
[Figure: folded PE clocked at $4 \cdot f_{Clk}$ with select signal s, producing $y_1(k)$ to $y_4(k)$ from streams $c_1$ to $c_{16}$ over 64 clock cycles; details not recovered]
Folding = upsampling & pipelining.
Reduced area (shared datapath logic).
Slide 34

Area Benefit of Interleaving and Folding
Area: $A = A_{logic} + A_{registers}$.
Interleaving or folding of level N: $A = A_{logic} + N \cdot A_{registers}$.
Timing and energy stay the same.
[Figure: energy/op versus time/op, with upsample and pipeline steps]
Slide 35

Architectural Transformations
Procedure: move toward the desired E-D point while minimizing area.
[Figure: energy versus delay and area versus delay, showing the reference point and $V_{DD}$ scaling]
Slide 36
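The area benefit can be made concrete with a small comparison. The shared-PE area follows the slide's formula $A = A_{logic} + N \cdot A_{registers}$. The direct-mapped area of N full copies, $N (A_{logic} + A_{registers})$, is an assumption used here as the baseline, and the numbers are hypothetical.

```python
def direct_area(a_logic, a_registers, n):
    """Assumed baseline: N separate PEs, each with its own logic and registers."""
    return n * (a_logic + a_registers)


def shared_area(a_logic, a_registers, n):
    """Interleaving or folding of level N: logic is shared, registers are replicated."""
    return a_logic + n * a_registers


# Hypothetical example: logic is 10x the register area, N = 8.
a_logic, a_registers, n = 10.0, 1.0, 8
print(direct_area(a_logic, a_registers, n))  # 88.0
print(shared_area(a_logic, a_registers, n))  # 18.0
```

The saving grows with the ratio of logic area to register area, which is why the technique pays off most for logic-heavy recursive PEs.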

Architectural Transformations
Parallelism & pipelining: reduce energy, increase area.
[Figure: energy versus delay and area versus delay, locating the reference, parallel, and pipeline designs under $V_{DD}$ scaling]
Slide 37

Architectural Transformations
Time-multiplexing: increase energy, reduce area.
[Figure: the same energy and area plots with the time-mux design added]
Slide 38

Architectural Transformations
Interleaving & folding: constant energy, reduce area.
[Figure: the same energy and area plots with the interleaved (intl) and folded (fold) designs added]
Slide 39

Back to Sensitivity Analysis
[Figure: energy-delay curve with one region where sensitivity > 1 and one where sensitivity < 1; the annotations relating changes in $T_{Op}$ and $E_{Op}$ were not recovered]
Parallelism is good to save energy; time-mux is good to save area.
Slide 40

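Energy-Area Tradeoff
High throughput: parallelism = large area.
Low throughput: time-mux = small area.
[Figure: maximum $E_{op}$ versus $T_{target}$ for an ALU at several levels of parallelism and time-multiplexing, with area annotations relative to $A_{ref}$; values not recovered]
Slide 41

It is Basically a Time-Space Tradeoff
[Figure: normalized energy per operation versus normalized area per operation, with curves for operation times from $T_{op,ref}/4$ to $4\,T_{op,ref}$; the direction of higher throughput is marked]
Slide 42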

Another Requirement: Flexibility
Determining how much to include and how to do it in the most efficient way possible.
Claims (to be shown):
There are good reasons for flexibility.
The cost of flexibility is orders of magnitude of inefficiency over an optimized solution.
There are many different ways to provide flexibility.
[Remaining slides: courtesy of Prof. Bob Brodersen, UCB]
Slide 43

Good Reasons for Flexibility
One design for a number of SoC customers: more sales volume.
Customers are able to provide added value and uniqueness.
Unsure of the specification, or can't make a decision.
Backwards compatibility with debugged software.
Risk, cost, and time of implementing hardwired solutions.
Important to note: these are business, not technical, reasons.
Slide 44

So, What is the Cost of Flexibility?
We need technical metrics that we can use to compare flexible and non-flexible implementations:
A power metric, because of thermal limitations.
An energy metric, for portable operation.
A cost metric, related to the area of the chip.
Performance (computational throughput).
Let's use metrics normalized to the amount of computation being performed, so now let's define computation.
Slide 45

Definitions
Computation:
Operation = OP = algorithmically interesting computation (e.g., multiply, add, delay).
MOPS = millions of OPs per second.
$N_{op}$ = number of parallel OPs in each clock cycle.
Power:
$P_{chip}$ = total power of chip = $A_{chip} \cdot C_{sw} \cdot V_{DD}^2 \cdot f_{clk}$.
$C_{sw}$ = switched capacitance per mm² = $P_{chip} / (A_{chip} \cdot V_{DD}^2 \cdot f_{clk})$.
Area:
$A_{chip}$ = total area of chip.
$A_{op}$ = average area of each operation = $A_{chip} / N_{op}$.
Slide 46

Energy Efficiency Metric: MOPS/mW
How much computing (number of operations) can we do with a finite energy source (e.g., a battery)?
Energy efficiency = (number of useful operations) / (energy required) = (number of operations) / nanojoule = OP/nJ.
Power efficiency = (OP/sec) / (nJ/sec) = MOPS/mW.
Energy efficiency = power efficiency.
Slide 47

Energy and Power Efficiency
OP/nJ = MOPS/mW
Interestingly, the energy efficiency metric for energy-constrained applications (OP/nJ), for a fixed number of operations, is the same as that for thermal (power) considerations when maximizing throughput (MOPS/mW).
So let's look at a number of chips to see how these efficiency numbers compare.
Slide 48

ISSCC Chips (0.18 µm to 0.25 µm)
[Table: chip number, year, ISSCC paper, and description, grouped as microprocessors, general-purpose DSPs, and dedicated designs; recoverable descriptions include StrongARM, PPC, Alpha, communications, graphics, multimedia, MPEG decoder, encryption, hearing aid, and FIR; individual entries not recovered]
Slide 49

Energy Efficiency (MOPS/mW or OP/nJ)
[Figure: energy (power) efficiency in MOPS/mW versus chip number for microprocessors, general-purpose DSPs, and dedicated designs]
3 orders of magnitude!
Slide 50

Why Such a Big Difference?
Let's look at the components of MOPS/mW.
The operations per second: MOPS = $f_{clk} \cdot N_{op}$.
The power: $P_{chip} = A_{chip} \cdot C_{sw} \cdot V_{DD}^2 \cdot f_{clk}$.
The ratio (MOPS / $P_{chip}$) gives MOPS/mW = $(f_{clk} \cdot N_{op}) / (A_{chip} \cdot C_{sw} \cdot V_{DD}^2 \cdot f_{clk})$.
Simplifying, MOPS/mW = $1 / (A_{op} \cdot C_{sw} \cdot V_{DD}^2)$.
So let's look at the 3 components: $V_{DD}$, $C_{sw}$, and $A_{op}$.
Slide 51

Supply Voltage, $V_{DD}$
MOPS/mW = $1 / (A_{op} \cdot C_{sw} \cdot V_{DD}^2)$
[Figure: $V_{DD}$ in volts versus chip number for microprocessors, general-purpose DSPs, and dedicated designs]
Supply voltage isn't the cause of the difference (it's actually a bit higher for the dedicated chips).
Slide 52
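The cancellation of the clock frequency in the MOPS/mW expression can be checked numerically. The sketch below is illustrative: the function names and chip parameters are hypothetical, and it assumes $C_{sw}$ in pF/mm², area in mm², and frequency in MHz, which introduces a factor of 1000 that the slide's unit-free formula leaves implicit.

```python
def mops_per_mw_from_chip(f_clk_mhz, n_op, a_chip_mm2, c_sw_pf_per_mm2, vdd):
    """MOPS / P_chip, with P_chip = A_chip * C_sw * V_DD^2 * f_clk."""
    mops = f_clk_mhz * n_op
    # pF * V^2 * MHz = 1e-12 * 1e6 W = 1e-6 W = 1e-3 mW
    p_chip_mw = a_chip_mm2 * c_sw_pf_per_mm2 * vdd ** 2 * f_clk_mhz * 1e-3
    return mops / p_chip_mw


def mops_per_mw_from_area(a_op_mm2, c_sw_pf_per_mm2, vdd):
    """Simplified form: 1 / (A_op * C_sw * V_DD^2), scaled to MOPS/mW (= OP/nJ)."""
    energy_per_op_pj = a_op_mm2 * c_sw_pf_per_mm2 * vdd ** 2
    return 1e3 / energy_per_op_pj  # 1 OP/pJ = 1000 OP/nJ


# Hypothetical chip: 10 mm^2, 20 parallel ops, 50 pF/mm^2, 1.0 V.
a_chip, n_op, c_sw, vdd = 10.0, 20, 50.0, 1.0
for f_clk in (50.0, 200.0):
    print(f_clk, mops_per_mw_from_chip(f_clk, n_op, a_chip, c_sw, vdd))
print(mops_per_mw_from_area(a_chip / n_op, c_sw, vdd))
```

Both clock rates give the same efficiency as the simplified form, which is the point of the derivation: only $A_{op}$, $C_{sw}$, and $V_{DD}$ matter.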

Switched Capacitance, $C_{sw}$ (pF/mm²)
MOPS/mW = $1 / (A_{op} \cdot C_{sw} \cdot V_{DD}^2)$
[Figure: $C_{sw}$ in pF/mm² versus chip number for microprocessors, general-purpose DSPs, and dedicated designs]
$C_{sw}$ is lower for dedicated designs, but only by a factor of 2-3.
Slide 53

$A_{op}$ = Area per Operation ($A_{chip}/N_{op}$)
MOPS/mW = $1 / (A_{op} \cdot C_{sw} \cdot V_{DD}^2)$
[Figure: $A_{op}$ in mm² per operation versus chip number for microprocessors, general-purpose DSPs, and dedicated designs]
$A_{op}$ explains the difference: more parallelism (higher $N_{op}$) in a smaller chip area (less overhead).
Slide 54

Let's Look at Some Chips to Actually See the Different Architectures
We'll look at one from each category.
[Figure: energy (power) efficiency in MOPS/mW versus chip number, highlighting one microprocessor (PPC), one general-purpose DSP (NEC DSP), and one dedicated design (MUD)]
Slide 55

Microprocessor: MOPS/mW = 0.13
[Figure: annotation marking the only circuitry which supports useful operations]
All the rest is overhead to support the time multiplexing.
$N_{op}$ = 2, $f_{clock}$ = 450 MHz (2-way) => 900 MIPS.
Two operations each clock cycle, so $A_{op} = A_{chip}/2$ = 42 mm².
Power = 7 W.
Slide 56

General Purpose DSP: MOPS/mW = 7
Same granularity (a datapath), more parallelism.
4 parallel processors (4 ops each).
$N_{op}$ = 16, $f_{clock}$ = 50 MHz => 800 MOPS.
Sixteen operations each clock cycle, so $A_{op} = A_{chip}/16$ = 5.3 mm².
Power = 110 mW.
Slide 57

Dedicated Design: MOPS/mW = 200
Complex mult/add (8 ops).
Fully parallel mapping of the adaptive correlator algorithm. No time multiplexing.
$N_{op}$ = 96, $f_{clock}$ = 25 MHz => 2400 MOPS.
$A_{op}$ = 5.4 mm²/96 = 0.5 mm².
Power = 12 mW.
Slide 58
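The three headline efficiencies can be recomputed from the clock rate, the operations per cycle, and the chip power. The sketch below assumes powers of 7 W, 110 mW, and 12 mW, which are the values consistent with the quoted 0.13, 7, and 200 MOPS/mW; the tuple layout and names are illustrative.

```python
# (name, clock in MHz, operations per cycle N_op, power in mW)
chips = [
    ("microprocessor", 450.0, 2, 7000.0),
    ("general-purpose DSP", 50.0, 16, 110.0),
    ("dedicated design", 25.0, 96, 12.0),
]


def mops(f_clk_mhz, n_op):
    """Throughput in MOPS = f_clk (MHz) * N_op."""
    return f_clk_mhz * n_op


def efficiency(f_clk_mhz, n_op, power_mw):
    """Energy (power) efficiency in MOPS/mW, equivalently OP/nJ."""
    return mops(f_clk_mhz, n_op) / power_mw


for name, f_clk, n_op, power in chips:
    print(f"{name:20s} {mops(f_clk, n_op):7.0f} MOPS  "
          f"{efficiency(f_clk, n_op, power):8.2f} MOPS/mW")
```

The dedicated design runs at the lowest clock rate of the three yet delivers the highest throughput, because $N_{op}$ is far larger; that is the parallelism argument of the preceding slides in numbers.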

The Basic Problem is Time Multiplexing
Processor architectures obtain performance by increasing the clock rate, because the parallelism is low.
This results in ever-increasing memory on the chip, high control overhead, and fast, area-consuming logic.
But doesn't time multiplexing give better area efficiency?
Slide 59

Area Efficiency
SoC-based devices are often very cost sensitive.
So we need a dollar-cost metric; for SoCs, that is equivalent to the efficiency of area utilization.
Area efficiency metric: computation per unit area = MOPS/mm².
How much of a dollar-cost (area) penalty will we have if we put down many parallel hardware units and have limited time multiplexing?
Slide 60

Surprisingly, the Area Efficiency Roughly Tracks the Energy Efficiency
[Figure: area efficiency in MOPS/mm² versus chip number for microprocessors, general-purpose DSPs, and dedicated designs; the gap is ~2 orders of magnitude]
The overhead of flexibility in processor architectures is so high that there is even an area penalty.
Slide 61

Hardware / Software
Conclusion: there is no software/hardware tradeoff.
The difference between hardware and software in performance, power, and area is so large that there is no tradeoff.
It is reasons other than power, energy, performance, or cost that drive a software solution (e.g., business, legacy, ...).
The cost of flexibility is extremely high, so the other reasons had better be good!
Slide 62
