Fast Dual-V dd Buffering Based on Interconnect Prediction and Sampling

Size: px

Start display at page:

Download "Fast Dual-V dd Buffering Based on Interconnect Prediction and Sampling"

Byron O’Brien’
5 years ago
Views:

1 Based on Interconnect Prediction and Sampling Yu Hu King Ho Tam Tom Tong Jing Lei He Electrical Engineering Department University of California at Los Angeles System Level Interconnect Prediction (SLIP), 2007

2 Outline Introduction

3 Motivation Motivation Major Contributions Aggressive buffering increases interconnect power 35% cells are buffers at 65nm technology [Saxena, TCAD 04] 2 Previous work for singal-v dd buffering Power-optimal single-v dd buffer insertion [Lillis, JSSC 96] Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02] 3 Previous work for dual-v dd buffering Power-optimal dual-v dd buffer insertion and buffered tree construction [Tam, DAC 05] 4 Problem remains Power aware buffering suffers from the expensive computational cost Dual-V dd buffers dramatically increase computational complexity

4 Motivation Motivation Major Contributions Aggressive buffering increases interconnect power 35% cells are buffers at 65nm technology [Saxena, TCAD 04] 2 Previous work for singal-v dd buffering Power-optimal single-v dd buffer insertion [Lillis, JSSC 96] Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02] 3 Previous work for dual-v dd buffering Power-optimal dual-v dd buffer insertion and buffered tree construction [Tam, DAC 05] 4 Problem remains Power aware buffering suffers from the expensive computational cost Dual-V dd buffers dramatically increase computational complexity

5 Motivation Motivation Major Contributions Aggressive buffering increases interconnect power 35% cells are buffers at 65nm technology [Saxena, TCAD 04] 2 Previous work for singal-v dd buffering Power-optimal single-v dd buffer insertion [Lillis, JSSC 96] Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02] 3 Previous work for dual-v dd buffering Power-optimal dual-v dd buffer insertion and buffered tree construction [Tam, DAC 05] 4 Problem remains Power aware buffering suffers from the expensive computational cost Dual-V dd buffers dramatically increase computational complexity

6 Motivation Motivation Major Contributions Aggressive buffering increases interconnect power 35% cells are buffers at 65nm technology [Saxena, TCAD 04] 2 Previous work for singal-v dd buffering Power-optimal single-v dd buffer insertion [Lillis, JSSC 96] Delay-optimal buffered tree generation [Cong, DAC 00; Alpert, TCAD 02] 3 Previous work for dual-v dd buffering Power-optimal dual-v dd buffer insertion and buffered tree construction [Tam, DAC 05] 4 Problem remains Power aware buffering suffers from the expensive computational cost Dual-V dd buffers dramatically increase computational complexity

7 Motivation Major Contributions Major Constributions Focus on speedup for power-aware dual-v dd buffer insertion and buffered tree construction 2 Propose three speedup techniques for power optimized dual-v dd buffer insertion based on interconnect prediction and sampling Pre-buffer Slack Pruning (PSP) extended from the one presented in [Shi, DAC 03] Predictive Min-delay Pruning (PMP) 3D sampling Runtime grows linearly w.r.t. the tree-size and we achieve 50x speedup compared with [Tam, DAC 05] 3 Incorporate the fast buffer insertion and grid reduction in power optimized buffered tree construction algorithm

8 Motivation Major Contributions Major Constributions Focus on speedup for power-aware dual-v dd buffer insertion and buffered tree construction 2 Propose three speedup techniques for power optimized dual-v dd buffer insertion based on interconnect prediction and sampling Pre-buffer Slack Pruning (PSP) extended from the one presented in [Shi, DAC 03] Predictive Min-delay Pruning (PMP) 3D sampling Runtime grows linearly w.r.t. the tree-size and we achieve 50x speedup compared with [Tam, DAC 05] 3 Incorporate the fast buffer insertion and grid reduction in power optimized buffered tree construction algorithm

9 Motivation Major Contributions Major Constributions Focus on speedup for power-aware dual-v dd buffer insertion and buffered tree construction 2 Propose three speedup techniques for power optimized dual-v dd buffer insertion based on interconnect prediction and sampling Pre-buffer Slack Pruning (PSP) extended from the one presented in [Shi, DAC 03] Predictive Min-delay Pruning (PMP) 3D sampling Runtime grows linearly w.r.t. the tree-size and we achieve 50x speedup compared with [Tam, DAC 05] 3 Incorporate the fast buffer insertion and grid reduction in power optimized buffered tree construction algorithm

10 Modeling Dual-V dd Buffering Problem Formulation Introduction 2 Modeling Dual-V dd Buffering Problem Formulation 3 4 5

11 Delay, Slew and Power Modeling Elmore Delay Model The delay of wire with length l is The delay of a buffer d buf is Modeling Dual-V dd Buffering Problem Formulation d(l) = ( 2 c w l + c load ) r w l () d buf = d int + r o c load (2) Bakoglus slew metric (Elmore delay times ln9) 2 Power Model Wire power dissipation E w = 0.5 c w l V 2 dd (3) Lumped buffer dynamic/short-circuit power Can be easily extended to leakage power

12 Dual-V dd Buffering Modeling Dual-V dd Buffering Problem Formulation Intuitions for power aware dual-v dd buffering High V dd buffers drive critical interconnect edges for timing optimization 2 Low V dd buffers drive non-critical interconnect edges for power 3 Achieves power saving since power is proportional to α V 2 dd 4 Suffer no loss of delay optimality Constraints in dual-v dd buffering Disallowing low V dd drives high V dd Not affect optimality [Tam, DAC 05] Exclude power and delay overhead of level converters

13 Problem Formulation Modeling Dual-V dd Buffering Problem Formulation Dual-V dd buffer insertion and sizing (dbis) Given An interconnect fanout tree with source n src, sink nodes n s and Steiner points n p Buffer stations n b 2 Find A buffer size and V dd level assignment 3 Such that (objective) The RAT qn src at the source n src is met The power consumed by the interconnect tree is minimized Slew rate at every input of the buffers and the sinks n s is upper bounded by the slew rate bound s

14 Problem Formulation Modeling Dual-V dd Buffering Problem Formulation Dual V dd Buffered Tree Construction (dtree) Given An un-routed net with source n src and sink nodes n s Buffer stations n b 2 Find A routing for the buffered tree A buffer size and V dd level assignment 3 Such that (objective) The RAT qn src at the source n src is met The power consumed by the interconnect tree is minimized Slew rate at every input of the buffers and the sinks n s is upper bounded by the slew rate bound s

15 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Introduction 2 3 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for 4

16 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Baseline Algorithm Flow Based on [Lillis, JSSC 96] 2 Dynamic programming with partial solution (option) pruning 3 Options must now record downstream V dd levels for buffering to prevent V dd L V dd H, which removes unnecessary search in solution space 4 Still quite slow for large nets Definition (Domination) In node n, option Φ = (rat, cap, pwr, θ) dominates Φ 2 = (rat 2, cap 2, pwr 2, θ), if rat rat 2, cap cap 2, and pwr pwr 2

17 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Options are indexed by capacitive values 2 Few options left after power-delay sampling [Tam, DAC 05] under the same capacitive index 3 A linear list is used to store options under the same capacitive value 4 Capacitive values are organized by a binary search tree

18 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Options are indexed by capacitive values 2 Few options left after power-delay sampling [Tam, DAC 05] under the same capacitive index 3 A linear list is used to store options under the same capacitive value 4 Capacitive values are organized by a binary search tree Percentage of options (00%) > Number of options indexed by the same capative value

19 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Options are indexed by capacitive values 2 Few options left after power-delay sampling [Tam, DAC 05] under the same capacitive index 3 A linear list is used to store options under the same capacitive value 4 Capacitive values are organized by a binary search tree 40 Percentage of options (00%) > Number of options indexed by the same capative value C 0 = (p, q ), (p 2, q 2 ),... C = 8 C 2 = (p, q ), (p 2, q 2 ),... (p, q ), (p 2, q 2 ),... C 3 = 5 C 4 = 9 C 5 = 2 (p 3, q 3 ), (p 3, q 3 ),... (p 4, q 4 ), (p 4, q 4 ),... (p 5, q 5 ), (p 5, q 5 ),

20 Motivation for 3D Sampling Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Power-delay sampling has shown effectiveness [Tam, DAC 05] 2 The size of the third dimenstion (capacitive values) increases significantly for large testcases node# sink# > 00 > % 62% % 64% % 65% % 7% 3 Sampling on all dimensions in power-delay-capacitance solution space is necessary

21 Motivation for 3D Sampling Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Power-delay sampling has shown effectiveness [Tam, DAC 05] 2 The size of the third dimenstion (capacitive values) increases significantly for large testcases node# sink# > 00 > % 62% % 64% % 65% % 7% 3 Sampling on all dimensions in power-delay-capacitance solution space is necessary

22 Motivation for 3D Sampling Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Power-delay sampling has shown effectiveness [Tam, DAC 05] 2 The size of the third dimenstion (capacitive values) increases significantly for large testcases node# sink# > 00 > % 62% % 64% % 65% % 7% 3 Sampling on all dimensions in power-delay-capacitance solution space is necessary

23 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for The idea of 3D sampling is to pick only a certain number of options among all options uniformly over the power-delay-capacitance space for upstream propagation power dissipation power dissipation required arrival time capacitance required arrival time capacitance 50 (a) 2D sampling (b) 3D sampling

24 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Pre-buffer Slack Pruning (PSP) [Shi, ASPDAC 05] Suppose R min is the minimal resistance in the buffer library. For two non-redundant options Φ = (rat, cap, pwr, θ ) and Φ 2 = (rat 2, cap 2, pwr 2, θ 2 ), where rat < rat 2 and cap < cap 2, then Φ 2 is pruned, if (rat 2 rat )/(cap 2 cap ) R min Extension for Dual-V dd PSP Choose a proper high / low V dd buffer resistance R H / R L for PSP to avoid overly aggressive pruning 2 If θ = true, R min = R H. Otherwise, R min = R L.

25 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Assume a continuous number of buffers and buffer sizes 2 The optimum unit length delay, delay opt, is given by [Bakoglou, book, 990] delay opt = 2 r s c o rc( + 2 ( + c p )) (4) c o 3 (4) calculates the lower bound of delay from any node to the source Predictive min-delay pruning (PMP) A newly generated option Φ = (rat, cap, pwr, θ) is pruned if rat delay opt dis(v) < RAT 0, where RAT 0 is the target RAT at the source, dis(v) is the distance of the node-to-source path

26 Experimental Settings Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Extract technical parameters by BSIM4 and SPICE under 65nm Table: Settings for the 65nm global interconnect. Settings Values Interconnect r w = 0.86Ω/µm, c w = 0.059fF/µm V dd H Buffer c in = 0.47fF, Vdd H =.2V (min size) ro H = 4.7kΩ, dh b = 72ps, EH b = 84fJ V dd L Buffer c in = 0.47fF, Vdd L = 0.9V (min size) Level converter ro L = 5.4kΩ, dl b = 98ps, EL b = 34fJ c in = 0.47fF, E LC = 5.7fJ (min size) d LC = 220ps 2 Randomly generate and route 0 nets by GeoSteiner

27 Experimental Settings Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for Extract technical parameters by BSIM4 and SPICE under 65nm Table: Settings for the 65nm global interconnect. Settings Values Interconnect r w = 0.86Ω/µm, c w = 0.059fF/µm V dd H Buffer c in = 0.47fF, Vdd H =.2V (min size) ro H = 4.7kΩ, dh b = 72ps, EH b = 84fJ V dd L Buffer c in = 0.47fF, Vdd L = 0.9V (min size) Level converter ro L = 5.4kΩ, dl b = 98ps, EL b = 34fJ c in = 0.47fF, E LC = 5.7fJ (min size) d LC = 220ps 2 Randomly generate and route 0 nets by GeoSteiner

28 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for runtime (s) delay power net DVB sam pmp+sam psp+sam all sam pmp+sam psp+sam all sam pmp+sam psp+sam all s s s s s s s s s ave Sampling grid size is 20x20x20 for 3D sampling and 20x20 for DVB 2 Accompanied with substantial speedup, 3D sampling brings significant error for large testcases 3 Combining PSP/PMP with 3D sampling improves runtime and solution quality 4 PSP / PMP prunes many redundant options and 3D sampling can always select option samples from a good candidate pool

29 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for runtime (s) delay power net DVB sam pmp+sam psp+sam all sam pmp+sam psp+sam all sam pmp+sam psp+sam all s s s s s s s s s ave Sampling grid size is 20x20x20 for 3D sampling and 20x20 for DVB 2 Accompanied with substantial speedup, 3D sampling brings significant error for large testcases 3 Combining PSP/PMP with 3D sampling improves runtime and solution quality 4 PSP / PMP prunes many redundant options and 3D sampling can always select option samples from a good candidate pool

30 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for runtime (s) delay power net DVB sam pmp+sam psp+sam all sam pmp+sam psp+sam all sam pmp+sam psp+sam all s s s s s s s s s ave Sampling grid size is 20x20x20 for 3D sampling and 20x20 for DVB 2 Accompanied with substantial speedup, 3D sampling brings significant error for large testcases 3 Combining PSP/PMP with 3D sampling improves runtime and solution quality 4 PSP / PMP prunes many redundant options and 3D sampling can always select option samples from a good candidate pool

31 Baseline Algorithm Data Structure for Pruning 3D Sampling Pre-buffer Slack Pruning Predictive Min-delay Pruning Experimental Results for runtime (s) delay power net DVB sam pmp+sam psp+sam all sam pmp+sam psp+sam all sam pmp+sam psp+sam all s s s s s s s s s ave Sampling grid size is 20x20x20 for 3D sampling and 20x20 for DVB 2 Accompanied with substantial speedup, 3D sampling brings significant error for large testcases 3 Combining PSP/PMP with 3D sampling improves runtime and solution quality 4 PSP / PMP prunes many redundant options and 3D sampling can always select option samples from a good candidate pool

32 Buffered Tree Construction Baseline Algorithm Routing Grid Reduction Experimental Results Introduction Buffered Tree Construction Baseline Algorithm Routing Grid Reduction Experimental Results 5

33 Buffered Tree Construction Baseline Algorithm Routing Grid Reduction Experimental Results Baseline Algorithm Extend the delay optimization buffered tree construction algorithm [Cong, DAC 00] Build Hanan Graph w / buffer insertion nodes according to locations of buffer stations Path search on the grid by option propagation 2 Option growth is exponential 3 Power and dual-v dd buffers further accelerate option growth Definition (Domination) In node n, option Φ = (S, R, rat, cap, pwr, θ ) dominates Φ 2 = (S 2, R 2, rat 2, cap 2, pwr 2, θ 2 ), if S S 2, rat rat 2, cap cap 2, and pwr pwr 2

34 Routing Grid Reduction Buffered Tree Construction Baseline Algorithm Routing Grid Reduction Experimental Results Option growth is exponential w.r.t. routing grid size 2 Restrict the progation direction always towards the source (c) Before reduction (d) After reduction

35 Experimental Results Buffered Tree Construction Baseline Algorithm Routing Grid Reduction Experimental Results Our Fast stree / dtree algorithm runs over 00x faster than S-Tree/D-Tree [Tam, DAC 05] within % power overhead 2 Be able to solve the buffered tree construction problem on 0 sinks with 400+ buffer stations test cases runtime(s) RAT*(ps) power(fj) name n# s# nl# S-Tree stree D-Tree dtree S-Tree stree S-Tree stree D-Tree dtree grid grid grid grid grid grid < 00 < 00 > 99% < 0% < 0%

36 Conclusions Future Work Introduction Conclusions Future Work

37 Conclusions Future Work Conclusions Presented efficient algorithms to dual-v dd buffering for power optimization Studied three pruning techniques including PSP, PMP and 3D sampling Proposed grid reduction for buffered tree construction speedup Experimental results show that we obtain over 50x and 00x speedup compared with the most efficient existing algorithms [Tam, DAC 05] for dual-v dd buffer insertion and buffered tree construction, respectively

38 Conclusions Future Work Future Work Further improve the efficiency of buffered tree construction by adapting hierarchical tree generation algorithm 2 Slack allocation for more power reduction Chip level FPGA dual-v dd assignment [Lin, DAC 05] Fix buffer location, assign V dd levels Consider multiple critical path Solve as a linear programming problem More freedom of ASIC buffering introduces more challenge to algorithms

Fast Algorithms for Power-optimal Buffering

Submission under Review, Please DO NOT Distribute Fast Algorithms for Power-optimal Buffering Yu Hu and Jing Tong Electrical Engineering Dept. Tsinghua University, Beijing, P.R. China {matrix98, jingtong}@tsinghua.edu.cn