Adaptive Robustness Tuning for High Performance Domino Logic

Size: px

Start display at page:

Download "Adaptive Robustness Tuning for High Performance Domino Logic"

Marvin Rose
5 years ago
Views:

1 Adaptive Robustness Tuning for High Performance Domino Logic Bharan Giridhar 1, David Fick 1, Matthew Fojtik 1, Sudhir Satpathy 1, David Bull 2, Dennis Sylvester 1 and David Blaauw 1 1 niversity of Michigan, SA 2 ARM, nited Kingdom Symposia on VLSI Technology and Circuits

2 Motivation Targeted use of Domino logic remains the designer s choice to implement speed critical paths in power constrained designs Samsung Hummingbird, AMD Bulldozer Golden et al., ISSCC 2011 With increasing process variation, domino logic is getting less beneficial Increasing margins for leakage, charge sharing and noise Slide 1

3 Normalized Delay Normalized Keeper Size Motivation Margining the keeper for robustness under worstcase PVT can degrade performance by ~32% A0 Keeper Margining A Normalized Delay Normalized Keeper Size ~32% performance degradation CLK 0.77 Nominal Process Process, Temp Process, Temp, Noise ART Domino tracks PVT and trades extra robustness and timing margins for performance 1 Slide 2

4 ART Domino Gate Fast, speculative evaluation followed by fully margined (safe) evaluation to detect errors A B CLK Slide 3

5 ART Domino Gate Fast, speculative evaluation followed by fully margined (safe) evaluation to detect errors A B CLK Slide 4

6 ART Domino Gate Fast, speculative evaluation followed by fully margined (safe) evaluation to detect errors Headers/footers shared across gates to minimize overhead T V X EC ECB CLK Speculative Precharge () Speculative Evaluation () Checker Precharge () Checker Evaluation () A B EC CLK CHECK (ECB) V Y V X V Y ECB EC A V DYN T ( ) Y Slide 5

7 ART Domino Gate V X < and V Y > speed critical transitions at both nodes by reducing voltage swings V Y > speeds the following gate by trading its noise margin for speed T Remove margins EC V X (<) A CLK B ECB Leakier Stack CLK EC CHECK (ECB) Speculative Precharge () Speculative Evaluation () Checker Precharge () Checker Evaluation () V Y (>) V X V Y ECB EC A V DYN T ( ) Y Slide 6

8 ART Domino Gate V X < and V Y > speed critical transitions at both nodes by reducing voltage swings V Y > speeds the following gate by trading its noise margin for speed T Fast, speculative evaluation EC V X (<) A CLK B ECB Leakier Stack Y (Speculate) CLK EC CHECK (ECB) Speculative Precharge () Speculative Evaluation () Checker Precharge () Checker Evaluation () V Y (>) V X V Y ECB EC A V DYN T ( ) Y Slide 7

9 ART Domino Gate Slower, safe evaluation performed in the background No impact on computation latency T Restore margins EC V X (=) ECB Y (Speculate) CLK Speculative Precharge () Speculative Evaluation () Checker Precharge () Checker Evaluation () A CLK B Leakier Stack EC CHECK (ECB) V Y (=) V X V Y ECB EC A V DYN T ( ) Y Slide 8

10 ART Domino Gate Slower, safe evaluation performed in the background No impact on computation latency T Slower, safe evaluation EC V X (=) ECB Y (Speculate) CLK Speculative Precharge () Speculative Evaluation () Checker Precharge () Checker Evaluation () A CLK B Leakier Stack Y (Safe) EC CHECK (ECB) V Y (=) V X V Y ECB EC A V DYN T ( ) Y Slide 9

11 ART Domino Gate In case of errors, the errant computation is flushed from the pipeline The result of safe evaluation is propagated, guaranteeing forward progress T Error Detection EC ECB V X Y (Speculate) V Y A CLK B Y (Safe) Compare and Detect Errors ECB EC T ( ) Slide 10

12 ART Pipeline Overlapping clocks Latchless hand-off Pipe Stage N Pipe Stage N+1 Conventional Domino Pipeline Slide 11

13 ART Pipeline Pipe Stage N Pipe Stage N+1 Slide 12

14 ART Pipeline Pipe Stage N X DOMBF M X B F 1 X Pipe Stage N+1 DOMBF M X B F 1 Slide 13

15 ART Pipeline Speculative Evaluation () Pipe Stage N DOMBF M X B F 1 Stores speculative result of half-stage Overlapping clocks Latchless hand-off Pipe Stage N+1 DOMBF M X B F 1 During, DOMBF snoops on the value propagated forward through the mux Overlapping clocks eliminate latches between pipe stages, provide skew tolerance Slide 14

16 ART Pipeline Speculative Evaluation () Pipe Stage N DOMBF M X B F 1 Stores speculative result of half-stage Overlapping clocks Latchless hand-off Pipe Stage N+1 DOMBF M X B F 1 During, DOMBF snoops on the value propagated forward through the mux Overlapping clocks eliminate latches between pipe stages, provide skew tolerance Slide 15

17 ART Pipeline Checking Evaluation () Pipe Stage N DOMBF Pipe-stage Split: Parallel evaluations during phase M X B F 1 DOMBF value from propagated Latchless hand-off Pipe Stage N+1 DOMBF M X B F 1 To allow the slower safe evaluation to complete, each pipe stage is split in half during Value stored on DOMBF propagated forward cutting the stage depth by half Slide 16

18 ART Pipeline Checking Evaluation () Pipe Stage N DOMBF M X B F 1 Latchless hand-off Pipe Stage N+1 DOMBF M X B F 1 Both halves of each pipe stage perform safe evaluations simultaneously Slide 17

19 Checking Evaluation () Pipe Stage N DOMBF M X B F 1 ART Pipeline Latchless hand-off Pipe Stage N+1 Result Result B F 1 M X DOMBF Results from and to check for errors in the segment Both halves of each pipe stage perform safe evaluations simultaneously Error detector at each DOMBF checks the segment till the preceding DOMBF for errors Slide 18

20 DOMBF BF1 ART Pipeline Fully margined Domino XOR Error Detection Error BIT0 BF2 (identical to BF1) Bit1 Bit0 BitN Fully Margined Domino OR tree Stage_N error SYSCLK STG_1 Stage_1 error Error Detection in next clock cycle in parallel with next set of gate evaluations GATE1 GATE2 GATE3 GATE4 GATE5 GATE6 GATE7 GATE8 Pipe Stage N : Precharge error logic (in red) : Copy Gate4 TO BF2 and DOMBF to BF1 +: Evaluate Domino XOR, Domino OR tree Slide 19

21 ART Pipeline Pipe Stage N DOMBF M X B F 1 Pipe Stage N+1 DOMBF M X B F 1 SIG N T A0 SIG N V X A1 ECB_N V Y SIG_B_N BFIN BF1 EC_N BFOT DOMBF, BF1(2) are fully margined EC_N+1 V X V Y A0 STG_N+1 ECB_N+1 ECB_N+1 A1 EC_N+1 STG_N T ( ) Slide 20

22 ART Clock Generation GCLK ( from clock ring) Ф1 (= STG_N) Ф2 ( = STG_N+1) STG_1 STG_2 Domino gates clocked by the global clock generator Ф3 ( = EC_N) EC_N STG_N SIG N SN SN D Q D Q GCLK Ф1 Ф2 CK QN RN CK QN RN SN D Q CK QN RN Ф4 ( = EC_N+1) ECB_N STG_N SIG N Power gates and other locally derived signals clocked using locally generated clocks ECB_N STG_N EC_N STG_N SIG N SIG N Slide 21

23 System Implementation X[15:0] NEG[15:0] BOOTH ENCODER ZERO[15:0] SHIFT[15:0] (T 1,T 1 ) (T 2,T 2 ) (T 3,T 3 ) (T 4,T 4 ) Pipe Stage 1 Pipe Stage 2 Y[15:0] BOOTH SIGN[15:0] DECODER PARTIAL PRODCTS[15:0][32:0] TEST HARNESS A[63:0] MLTIPLIER ARRAY RADIX-2 KOGGE STONE PRODCT[63:0] 32 32b multiplier in 65nm CMOS 4 T/T voltage domains, 2 pipeline stages Slide 22

24 Die Micrograph MLTIPLIER ARRAY KOGGE STONE BOOTH DECODER BOOTH DECODER TEST HARNESS BOOTH ENCODER 1.5mm CLKGEN + CLK MESH BFFERS 1.5mm SCAN 65nm CMOS process Die Size: 1496μm X 1496μm Slide 23

25 1.2-T (V) Measured Results Measured Max. Performance: 1192 MHz Error region (Robustness failure) 1040 MHz 1072 MHz 1136 MHz 1104 MHz Performance with ART up by 34% by eliminating robustness margins at nominal PVT 0.1 No Adapting 890 MHz 944 MHz 976 MHz 1008 MHz MHz T (V) Slide 24

26 Required T (V) Power (mw) Required T (V) No Adapting Measured Results ~34% performance gain with 47% Power increase Simulated Power Standard Domino ~17% Performance gain with 12% Power reduction Performance (MHz) Minimum Power at each frequency Method applicable to performance constrained logic where performance cannot be obtained by any other means Power increases sharply at higher frequencies due to increased short circuit current on the output inverter T T 1, T 2, T 3, T 4 T 1, T 2, T 3 Performance (MHz) Performance (MHz) ΔT/ΔT at each measured power-frequency point Slide 25

27 Number of Chips % Performance Gain (Robustness speculation) ART Performance Gain 10 8 At 27C, 1.2V Slow Die Fast Die Typical Die 6 20 Dies % Performance Gain (Robustness Speculation) Temperature (C) Measured average % gain due to robustness tuning across 20 dies is ~28% % Gain decreases at higher temperatures as gates become less robust Slide 26

28 Performance (MHz) 1300 Overall Performance Gain Worst Process,Voltage,Temp* Process Gain Temp Gain Voltage Gain ART Gain X X X ART gain Robustness tuning X 1.04X 1X X 1.12X 1.07X X 1.17X 1.12X Voltage gain Temp gain Process gain Timing Speculation Total gains of 49% to 71% over conventionally margined designs 300 Slow Nominal Fast * Worst process was set by the slowest die. Temperature was set to 85C and Supply was degraded by 10% to 1.08V Slide 27

29 % Error (robustness failures) % Error (timing failures) Error Rate T tuning T tuning T[1-4] = 1.2V T[1-4] = 0V MHz T 1-4 = 0.9V T 1-3 = 0.15V, T 4 = 0.17V T (V), T (V)* * Δ T = 0.9V-T, Δ T = T-0.17V Operating Frequency (MHz) Error rate is more sensitive to T than T T tuning also affects robustness of following gate Slide 28

30 Conclusion A new high speed design style called ART Domino was presented ART Domino provides overall gain of up to 1.71x over conventional domino 3.2x faster than static CMOS Design overhead is amortized over pipeline stage depth Slide 29

31 Thank You Questions? Slide 30

32 BACKP SLIDES Symposia on VLSI Technology and Circuits Slide 31

33 ART Clock Generation Ф1 (= STG_N) Example 4-Stage Pipeline Operation Phases GCLK ( from clock ring) Ф2 ( = STG_N+1) SYSCLK (GCLK 2) STG_1 STG_2 Globally generated clocks with relaxed skew constraints STG_3 (=STG_1 CLK) STG_4 (=STG_2 CLK) Ф3 ( = EC_N) EC_N STG_N SIG N SN SN D Q D Q GCLK Ф1 Ф2 CK QN RN CK QN RN SN D Q CK QN RN Ф4 ( = EC_N+1) ECB_N STG_N SIG N Locally generated clocking signals with stricter skew constraints ECB_N STG_N EC_N STG_N SIG N SIG N Slide 32

34 System Implementation Conventional Domino Other Logic FF Conventional Domino Logic FF Other Logic Slide 33

35 System Implementation Conventional Domino STG1 Eval Prchg Slide 34

36 System Implementation Conventional Domino STG2 Prchg Eval Slide 35

37 System Implementation ART Domino Other Logic FF ART Domino Logic FF Other Logic Slide 36

38 System Implementation ART Domino STG1 Slide 37

39 System Implementation ART Domino STG2 Slide 38

40 System Implementation ART Domino STG3 Slide 39

41 System Implementation ART Domino STG4 Slide 40

42 System Implementation ART Domino SYSCLK (GCLK 2) STG_1 STG_2 STG_3 (=STG_1 CLK) Error Detect STG1 Error Detect STG1 Error Detect STG1 Error Detect STG1 STG_4 (=STG_2 CLK) Slide 41

43 Architectural Replay Cycle0 Cycle1 Cycle2 SYSCLK (GCLK 2) STG_1 Error detected in STG2 STG_2 STG_3 STG_4 Slide 42

44 Architectural Replay Cycle0 Cycle1 Cycle2 SYSCLK (GCLK 2) STG_1 Error detected in STG2 STG_2 STG_3 STG_4 Copy safe value Slide 43

45 Architectural Replay Cycle0 Cycle1 Cycle2 SYSCLK (GCLK 2) STG_1 Error detected in STG2 STG_2 STG_3 STG_4 Copy safe value Slide 44

46 Error Detection Scenarios Slide 45

Centip3De: A 64-Core, 3D Stacked, Near-Threshold System

1 1 1 Centip3De: A 64-Core, 3D Stacked, Near-Threshold System Ronald G. Dreslinski David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman