Time Algorithm for Optimal Buffer Insertion with b Buffer Types *

Similar documents
Making Fast Buffer Insertion Even Faster Via Approximation Techniques

Fast Dual-V dd Buffering Based on Interconnect Prediction and Sampling

PERFORMANCE optimization has always been a critical

An Efficient Routing Tree Construction Algorithm with Buffer Insertion, Wire Sizing and Obstacle Considerations

A Novel Performance-Driven Topology Design Algorithm

AN OPTIMIZED ALGORITHM FOR SIMULTANEOUS ROUTING AND BUFFER INSERTION IN MULTI-TERMINAL NETS

An Efficient Algorithm For RLC Buffer Insertion

Path-Based Buffer Insertion

Physical Design of Digital Integrated Circuits (EN0291 S40) Sherief Reda Division of Engineering, Brown University Fall 2006

Buffered Steiner Trees for Difficult Instances

Fast Algorithms For Slew Constrained Minimum Cost Buffering

Interconnect Delay and Area Estimation for Multiple-Pin Nets

Fast Interconnect Synthesis with Layer Assignment

Routing Tree Construction with Buffer Insertion under Buffer Location Constraints and Wiring Obstacles

Chapter 28: Buffering in the Layout Environment

S 1 S 2. C s1. C s2. S n. C sn. S 3 C s3. Input. l k S k C k. C 1 C 2 C k-1. R d

A Faster Approximation Scheme For Timing Driven Minimum Cost Layer Assignment

Timing-Driven Maze Routing

Porosity Aware Buffered Steiner Tree Construction

[14] M. A. B. Jackson, A. Srinivasan and E. S. Kuh, Clock routing for high-performance ICs, 27th ACM

Symmetrical Buffer Placement in Clock Trees for Minimal Skew Immune to Global On-chip Variations

Fast Algorithms for Power-optimal Buffering

Symmetrical Buffer Placement in Clock Trees for Minimal Skew Immune to Global On-chip Variations

Fast Delay Estimation with Buffer Insertion for Through-Silicon-Via-Based 3D Interconnects

λ-oat: λ-geometry Obstacle-Avoiding Tree Construction With O(n log n) Complexity

CATALYST: Planning Layer Directives for Effective Design Closure

ALGORITHMS FOR THE SCALING TOWARD NANOMETER VLSI PHYSICAL SYNTHESIS. A Dissertation CHIN NGAI SZE

A buffer planning algorithm for chip-level floorplanning

Buffered Routing Tree Construction Under Buffer Placement Blockages

Problem Formulation. Specialized algorithms are required for clock (and power nets) due to strict specifications for routing such nets.

An Interconnect-Centric Design Flow for Nanometer. Technologies

Algorithms for Non-Hanan-Based Optimization for VLSI Interconnect under a Higher-Order AWE Model

Interconnect Design for Deep Submicron ICs

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment

Convex Hull Algorithms

IN recent years, interconnect delay has become an increasingly

A Routing Approach to Reduce Glitches in Low Power FPGAs

Pyramids: An Efficient Computational Geometry-based Approach for Timing-Driven Placement

Buffer Block Planning for Interconnect Planning and Prediction

Place and Route for FPGAs

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization

How Much Logic Should Go in an FPGA Logic Block?

University of California at Berkeley. Berkeley, CA the global routing in order to generate a feasible solution

An Efficient Data Structure for Maxplus Merge in Dynamic Programming

Linking Layout to Logic Synthesis: A Unification-Based Approach

PushPull: Short Path Padding for Timing Error Resilient Circuits YU-MING YANG IRIS HUI-RU JIANG SUNG-TING HO. IRIS Lab National Chiao Tung University

Clock Tree Resynthesis for Multi-corner Multi-mode Timing Closure

Very Large Scale Integration (VLSI)

Efficient Static Timing Analysis Using a Unified Framework for False Paths and Multi-Cycle Paths

A Path Based Algorithm for Timing Driven. Logic Replication in FPGA

A Survey on Buffered Clock Tree Synthesis for Skew Optimization

An Exact Algorithm for the Statistical Shortest Path Problem

Leakage Efficient Chip-Level Dual-Vdd Assignment with Time Slack Allocation for FPGA Power Reduction

Crosstalk Noise Optimization by Post-Layout Transistor Sizing

Fast, Accurate A Priori Routing Delay Estimation

Basic Idea. The routing problem is typically solved using a twostep

a) wire i with width (Wi) b) lij C coupled lij wire j with width (Wj) (x,y) (u,v) (u,v) (x,y) upper wiring (u,v) (x,y) (u,v) (x,y) lower wiring dij

Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures

Concurrent Flip-Flop and Repeater Insertion for High Performance Integrated Circuits

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation

Clock Skew Optimization Considering Complicated Power Modes

L14 - Placement and Routing

Incremental Layer Assignment for Critical Path Timing

Buffer Insertion with Adaptive Blockage Avoidance

Power Optima l Dual-V dd Bu ffered Tre e Con sid erin g Bu f fer Stations and Blockages

Crosslink Insertion for Variation-Driven Clock Network Construction

Cluster-based approach eases clock tree synthesis

Timing Optimization of FPGA Placements by Logic Replication

Circuit Model for Interconnect Crosstalk Noise Estimation in High Speed Integrated Circuits

Variation Tolerant Buffered Clock Network Synthesis with Cross Links

An Interconnect-Centric Design Flow for Nanometer Technologies

Post-Route Gate Sizing for Crosstalk Noise Reduction

Performance Analysis of Storage-Based Routing for Circuit-Switched Networks [1]

[Ba] Bykat, A., Convex hull of a finite set of points in two dimensions, Info. Proc. Lett. 7 (1978),

Lecture 13 Geometry and Computational Geometry

4. Conclusion. 5. Acknowledgment. 6. References

Performance-Preserved Analog Routing Methodology via Wire Load Reduction

Unit 5A: Circuit Partitioning

Portable Parallel Algorithms for Geometric Problems

Timing-Constrained I/O Buffer Placement for Flip- Chip Designs

Monotone Paths in Geometric Triangulations

CIRCUIT clustering (or partitioning) is often employed between

AS VLSI technology scales to deep submicron and beyond, interconnect

Retiming and Clock Scheduling for Digital Circuit Optimization

Lecture 3: Art Gallery Problems and Polygon Triangulation

Vdd Programmability to Reduce FPGA Interconnect Power

CS 410/584, Algorithm Design & Analysis, Lecture Notes 8!

IN HIGH-SPEED digital VLSI design, bounding the load capacitance

CIRCUIT PARTITIONING is a fundamental problem in

Timing-Driven Interconnect Synthesis

Budget Management with Applications

High-Speed Clock Routing. Performance-Driven Clock Routing

Algorithms on Minimizing the Maximum Sensor Movement for Barrier Coverage of a Linear Domain

EE582 Physical Design Automation of VLSI Circuits and Systems

Lecture 12 March 4th

Large Scale Circuit Partitioning

Connected Components of Underlying Graphs of Halving Lines

Optimal Multi-Domain Clock Skew Scheduling

c 2006 Society for Industrial and Applied Mathematics

Architecture Evaluation for

Transcription:

An Time Algorithm for Optimal Buffer Insertion with b Buffer Types * Zhuo Dept. of Electrical Engineering Texas University College Station, Texas 77843, USA. zhuoli@ee.tamu.edu Weiping Shi Dept. of Electrical Engineering Texas University College Station, Texas 77843, USA. wshi@ee.tamu.edu Abstract Buffer insertion is a popular technique to reduce the interconnect delay. The classic buffer insertion algorithm of van has complexity where n is the number of buffer positions. Cheng and Lin extended van Ginneken s algorithm to allow b buffer types in time For modern design libraries that contain hundreds of buffers, it is a serious challenge to balance the speed andperformance of the buffer insertion algorithm. In thispaper; we present a new algorithm that computes the optimal buffer insertion in time. The reduction is achieved by the observation that the C)pairs of the candidates that generate the new candidates must form a hull. On industrial test cases, the new algorithm is faster than the previous best buffer insertion algorithms by orders of magnitude. 1. Introduction Delay optimization techniques for interconnect are increasingly important for achieving timing closure of high performance designs. One popular technique for reducing interconnect delay is buffer insertion. A recent study by Saxena et al. projects that 35% of all cells will be intrablock repeaters for the 45 node. Consequently, algorithms that can efficiently insert buffers are essential for the design automation tools. In 1990,van proposed an optimal buffer insertion algorithm for one buffer type. His algorithm has time complexity where n is the number of candidate buffer positions. Lillis, Cheng and Lin [7] extended van Ginneken s algorithm to allow buffer types in time This research was supported by the NSF grants CCR0098329, 01 13668, 51202662001. Recently, Shi and Li presented a new algorithm with time complexity O(nlog n)for 2pin nets, and O(n n)for multipin nets, for one buffer type. Several works have built upon van Ginneken s algorithmand its extension for multiple buffer types to include wire sizing simultaneous tree construction [8, 6, 5, 9, noise constraints [2] and resource minimization [7, Modern design libraries may contain hundreds of different buffers with different input capacitances, driving resistance, intrinsic delay, power level, etc. Ifevery buffer available for the given technology is supplied, it is stated in [3] that the current algorithms could possibly take days or even weeks for large designs since all these algorithms are quadratic in terms of b. Alpert et. [3] studied how to reduce the size of the buffer library with a clustering algorithm. Though the buffer library size is reduced, the solution quality is often degraded accordingly. In this paper, we propose a new algorithm that performs optimal buffer insertion with buffer types in time. Our speedup is achieved by the observation that the candidates that generate new buffered candidates must lie on the convex hull of (Q,C). Experimental results show that our algorithm is significantly faster than previous best algorithms. Section 2 formulates the problem. Section 3 describes the new algorithm. Simulation results are given in Section4 and conclusions are given in Section 5. 2. Preliminary A net is given as a routing tree T = (V,E), where V = and E V x V. Vertex is the source vertex and alsothe root of T, is the set vertices, and is the set of internal vertices. Each sink vertex is associated with sink capacitance and required arrival time A buffer library contains differenttypes of buffers and its size is representedby b. For

each buffer type the intrinsic delay is driving resistance is and input capacitance is A function f : specifies the types of buffers allowed at each internal vertex. Each edge e E is associated with lumped resistance and capacitance Following previous researchers [14, 7, 9, 15, we use the Elmore delay for the interconnect and the linear delay for buffers. For each edge e = signals travel from to The delay of is where is the downstream capacitance at For any buffer type at vertex the buffer delay is. + where is the downstream capacitance at When a buffer is inserted, the capacitance viewed from the upper stream is For any vertex V,let be the downstream from and with being the root. Once we decide where to insert buffers in we have a candidate a for The delay from to sink under a is = + where the sum is over all edges e in the path from to s. If is a buffer in then is the buffer delay. If is not a buffer in a,then = The slack of v under a is a)= min a)}. Buffer Insertion Problem: Given routing tree T = (V,E),sink capacitance and for each sink capacitance and resistance for each edge e, possible buffer position and buffer library find a candidatea fort that maximizes a). The effect of a candidate to the upstream is described by slack Q and downstream capacitance Define as the downstream capacitance at node under candidate a. For any two candidates and of we say dominates if and The set of nonredundant candidates of which we denote as is the set of candidates such that no candidate in dominates any other candidate in and every candidate of is dominated by some candidates in Once we have the candidate that gives the maximum a)can be found easily. The number of total nonredundant candidates is at most 1for one buffer type and 1for bbuffer types [ where n is the number of candidatebuffer positions. 3. New Algorithm The previous best algorithm for multiple buffer types by Lillis, Cheng and Lin consists of three major operations: 1) adding buffers at a buffer position in time, 2) adding a wire in time, and 3) merging two branches in + time, where and are the numbers of buffer positions in the two branches. As a result, their algorithm has time complexity In this section, we show that the time complexity of the first operation, addingbuffers at a buffer position, can be reducedto and thus our algorithm can achieve total time complexity Assume we have computed the set of nonredundant candidates for and now reach a buffer position see Fig. 1. Wire (v, has 0 resistance and capacitance. Define as the slack if we add a buffer type at v for any candidatea in If we do not insert any buffer at then every candidate for is a candidate for If we insert a buffer at then for every buffer type =..., b, there will be a new candidate Note that some of the new candidates could be redundant. The algorithm of Lillis, Cheng and Lin takes time to generate all and time to insert dant ones into the list of nonredundant candidates Figure 1. and consists of buffer position We show how to generate all in time. Since all candidates discussed in this section are in we will write for a),and for a).suppose buffers in the buffer library are sorted according to its driving resistance in nonincreasingorder,... If some buffer types are not allowed at we simply omit them without affecting the rest of the algorithm. For any buffer type B,define the best candidate for as the candidate such that maximizes among all candidates of If there are multiple a s that maximize (a),the one with minimum is chosen.

Lemma 1 For any two buffer types and where j,let their best candidates be and respectively.then we must have Proof: From the definition of we have and Consequently, Therefore, Since > j, If 0. > If = then it is easy to get = and = From the definition, when there are multiple a's that maximize the one with minimum is chosen. Thus and should be the same candidate, which means = Lemma 1 implies that the best candidates..., for buffer types..., are in increasing order of However,this is not enough for an time algorithm. In the following, we define the concept of convex pruning,which is important in generating new candidates Convex pruning: Let and be three dant candidates of such that If then we prune candidatea Convex pruning can be explained by Figure 2. Consider Q as the Yaxis and as the Xaxis. Then the set of dundant candidate are a set of points in the twodimensional plane. Candidate in the above definition is shown in Figure and is pruned in Figure Call the candidates after convex pruning It can be seen that is a monotonically increasing sequence, while is a convexhull.. Function Convexpruning performs convex pruning for any list of nonredundant candidates sorted in increasing and order. The following C code the data structure for each candidate in the list: typedef struct Candidate Q, C; struct Candidate *next, double link list } Candidate; Let the candidate with minimum C be We add a dummy candidate pointed by header to simplify the algorithm. Function Lef checks if a2 and a3 form a left turn on the plane. It is the same as the condition in Eq. (2). void Candidate = header; a2 = a3 = *header) while (a3!= NULL) { a2, prune a2 and move backward free (a2 ; = a3; = al; a2 al; = } else move forward a3 = a2 = = Lemma2 Given any set of k nonredundant candidates sorted in increasing Q and convexpruning in time. Figure 2. (a) Nonredundant candidates on (Q,C) plane. (b) Nonredundantcandidates after convex pruning. Proof: This procedure is known as Graham's scan in computational geometry It finds the convex hull of a set of points in sorted order in linear time. It is well known that a set of points form a convex hull if and only if there are no consecutive a and that satisfy Eq. (2). Therefore, Convexpruning is correct since it checks all consecutive candidates. To analyze the time complexity, consider the number of forward and backward moves. Each time

moves backward, it deletes a candidate. Therefore, there can be at most k backward moves. The number of forwardmoves is the size of the list plus the number of backward moves. Therefore the number of forward moves is at most 2k. Hence the time complexity is Lemma3 For any type B, its best candidate that maximizes is not pruned by Proof: Consider any candidate with > According to the definition of we have Therefore, we have I < Therefore, Lemma 4 impliesthat for any buffer type if candidate a maximizes among its previous and next consecutive candidates in then a maximizes among all candidates in Function identifies from and generates new candidates = 1,...,b. Nonredundant candidates in are stored in increasing order using a double link list pointed by header.buffer types are sorted in nonincreasing driver resistance order and stored in array B. Function P a) computes as defined in Eq. (1). Candidate *header) Therefore, where is any candidateswith and is any candidates with > Accordingto the definition of convex pruning, is not pruned. Lemma4 Let the set of nonredundant candidates ter Convexpruning be and assume are sorted in increasingq and order. Considerany three candidates a, in such that < < For any buffer type B, if then 2 Proof: From the definition of convex pruning, we have If then Candidate Candidate *beta ; int i; Convexpruning (header); = header; a2 = for = i i ++) { while (a2 NULL) { if c = al>next; a2 = al>next; } else break; generate new candidate = beta = B sort beta s in nondecreasing C order; return beta s; Theorem 1 is a bufferposition, wire is a wire with zero resistance and capacitance,nonredundantcandidates of N are stored in increasingq and order, then

function AddBuff er generates all new candidates in time. Proof: Let the set of nonredundant candidates after Convexpruning be From Lemma 3, we know that all best candidates are in From Lemma and Lemma 4, starting from the first candidates in function AddBuf fer can find all in the increasing order of i. Now consider the time complexity. Function Convexpruning takes time according to Lemma 2. The for loop takes + b) = time. It takes only time to sort the entire buffer library in terms of the input capacitance and establish the order from buffer index i to the order in Each time function AddBuf f er is called, the new candidates can be sorted in nondecreasing order by using the index in time. Theorem 2 Given a set sorted in increasing Q and can be inserted in nonredundantcandidates all b new candidates time. Proof: Since are in the nondecreasing order of capacitance and the given set of nonredundant candidates are in nondecreasing order of it takes + = time to merge the two sorted lists. Since the operation of adding a buffer is reduced to time from Theorem 1 and 2, it is easy to see that buffer insertion with b buffer types can be done in worst case time with our new algorithm. 4. Simulation Both the algorithm of Lillis et al. [7] and the new algorithm are implemented in C and run on a Sun SPARC workstations with 400 and 2 GB memory. The device and interconnect parameters are based on TSMC 180 nm technology. We have 4 different buffer libraries, with the size 8, 16, 32 and 64 respectively. is chosen from 180 to 7000 Q, is chosen from 0.7 to 23 and is chosen from 29 ps to 36.4 ps. The sink capacitances range from 2 to 41 The wire resistance is 0.076 and the wire capacitance is 0.118 Table shows for large industrial circuits, the new algorithm is up to 11times faster than Lillis The memory usage is not shown in the table, but there is only almost 2% memory overhead due to the double linked list used by the new algorithm. When b is small, algorithm has a little time overhead compared to Lillis algorithm. due to function Fig. 3 shows the time complexity curve of two algorithms for the net with 1944 sinks and 33133 buffer positions with respect to the size of buffer library b. In the figure, the y axis is normalized to the running time of the case when the buffer library size is 8.Though the worst case time complexity of Lillis algorithm is quadratic in terms of b, it behaves more like a linear function of as observed in The time complexity curve of our algorithm is also linear, but has a much smaller slope. Normalized Running Time 12 10 Figure 3. Comparison of normalized running time with respect to buffer library size among two algorithms. Number of sink is and number of buffer positions is33133. Fig. 4 shows the time complexity of the two algorithms for the net with 1944 sinks, with respect to the number of buffer positions n. The buffer library size is 32. In the figure, the y axis is normalized to the running time of the case with 1943 buffer positions. We can see that while Lillis and our algorithms both behave quadratically, our algorithm shows much slower growing trend since the operation of adding a buffer becomes more dominant among three major operations when n increases. 5. Conclusion 8 6 4 2 We presented a new algorithm for optimal buffer insertion with buffer types of worst case time This is an improvement of the previous best algorithm Simulation results show our new algorithm is significantly faster than algorithms for large industrial circuits with large buffer libraries. Our algorithm can also be applied to reduce buffer cost. We leave the details to the journal version. References O(bn 2 ) O(b 2 n 2 ) 0 0 10 20 30 40 50 60 70 Buffer Library Size C. Alpert and A. Devgan. Wire segmenting for improved buffer insertion. In DAC, pages 588593, 1997. [2] C. J. Alpert, A. Devgan, and S.T. Quay. Buffer insertion for noise and delay optimization. In DAC, pages 362367, 1998.

180 160 O(bn 2 ) O(b 2 n 2 ) Normalized Running Time 140 120 100 80 60 40 m 337 20 0 0 1 2 3 4 5 6 7 Buffer Positions x 10 4 Figure 4. Comparison of normalized running time with respect to buffer positions among two algorithms. Number of sink is and number of buffer types is 32. [3] C. J. Alpert, R. G. Gandham, J. L. Neves, and S. T. Quay. Buffer library selection. In pages 221226,2000. [4] R. L.Graham. An efficient algorithm for determining the convex hull of a fiite planar set. Information Processing Letters, 1972. [5] M. Hrkic and J. Lillis. Buffer tree synthesis with consideration of temporal locality, sink polarity requirements, solution cost and blockages. In ISPD,pages 98103,2002. [6] M. Hrkic and J. Lillis. Stree: a technique for buffered routing tree synthesis. In DAC, pages 578583,2002. [7] J. Lillis, C. K. Cheng, and T.T. Y. Lin. Optimal wire sizing and buffer insertion for low power and a generalized delay model. IEEE Trans. Solidstate Circuits, 3 1996. J. Lillis, C.K. Cheng, T.T. Y.Lin, and C.Y. Ho. New performance driven routing techniques with explicit and simultaneous wire sizing. In DAC, pages 400,1996. [9] T. Okamoto and J. Cong. Buffered steiner tree construction with wire sizing for interconnect layout optimization. In CAD,pages 4449, 1996. P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick. Repeater scaling and its impact on CAD. IEEE Trans. CAD, [ J W. Shi and Z. Li. A fast algorithm for opitmal buffer insertion. IEEE Trans. CAD, to appear. W. Shi and Z. Li. An time algorithm for optimal buffer insertion. In pages 580585,2003. W. Shi, Li, and C. J. Alpert. Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost. In ASPDAC, pages 609414,2004. [ L. P. P. van Buffer placement in distributed tree network for minimal delay. In ISCAS, pages 868, 1990. H. Zhou, D. F. Wong, I. M. Liu, and A. Aziz. Simultaneous routing and buffer insertion with restrictions on buffer locations. IEEE Trans. CAD, 2000. 1944 2676 Table Simulation results for industrial test cases, where m is the number of sinks, is the number of buffer positions,and is the library size.