An Efficient Algorithm For RLC Buffer Insertion

An Efficient Algorithm For RLC Buffer Insertion Zhanyuan Jiang, Shiyan Hu, Jiang Hu and Weiping Shi Texas A&M University, College Station, Texas 77840 Email: {jerryjiang, hushiyan, jianghu, wshi}@ece.tamu.edu Abstract Traditional buffer insertion algorithms neglect the impact of inductance effect, which often introduces large error in circuit optimization. On the other hand, ultra-fast buffering techniques are always desirable as buffering is such a widely used technique in industry. It is a challenge to design an RLC buffering algorithm which excels in both runtime and solution quality. In this paper, such an algorithm is proposed. The new algorithm wors under the dynamic programming framewor and runs in provably linear time for multiple buffer types due to two novel techniques: restrictive cost buceting and efficient delay update. Experiment results on industrial netlists demonstrate that the new algorithm consistently outperforms van Ginneen/Lillis algorithm [], [2] for RC buffering and all nown RLC buffering algorithms. Without buffer cost minimization, the new algorithm saves up to 8.5% buffer area and provides up to 4 speedup over Ismail s algorithm [3]. When buffer cost minimization is handled, the new algorithm uses 33.4% fewer buffers than van Ginnenen- Lillis s algorithm, and saves up to 5.3% buffer area and gives up to 5 speedup compared to the algorithm in [4]. I. INTRODUCTION As VLSI technology moves into the nanoscale regime, interconnect delay becomes a dominant constraint in circuit design. A great amount of effort has been made to reduce interconnect delay and buffer insertion appears as a very effective technique. It is witnessed in [5] that a large number of buffers are needed with current IC technology. In two recent IBM ASIC designs, 25% gates are buffers [6]. Due to fast scaling of technology, inductance effect in circuit performance causes increasing research attention. With higher operating frequencies, the quadratic delay-length dependance in RC model is approaching linear [7]. Thus, RC model often overestimates circuit delay which results in excessive buffers inserted. Realizing this, new RLC buffering algorithms are proposed in [3] and [4], which are able to save more than 30% buffers compared to the minimum cost RC buffering algorithms. In [3], a top-down greedy style RLC algorithm is proposed. At each candidate buffer position, the delay at a driver is evaluated to decide whether a buffer is inserted. This method gives a great reduction of the buffer area. Due to the topdown nature of the algorithm, a tree traversal is needed to collect RLC information for delay evaluation, which is the bottlenec for the efficiency of the algorithm. Since only a single solution is maintained, this algorithm still runs fast, however, it sacrifices solution quality. In [4], a different RLC algorithm based on dynamic programming is designed. The main contributions of [4] are the This research was supported in part by SRC grant 2004-TJ-205, NSF grant EIA-0223785, and IBM faculty research award. concept of downstream impedance and new pruning condition to speed up the solution propagation. However, the approach is not efficient. Experimental results in [4] show that the algorithm runs significantly slower than RC timing buffering algorithms. In reality, buffering techniques are often applied to huge volume of nets and thus fast buffering techniques are highly desired. This imposes great challenge on designing state-of-the-art buffer insertion algorithm and motivates this wor. In this paper, we propose the fastest RLC buffer insertion algorithm. The new algorithm wors under the dynamic programming framewor with two new features: a highly effective restrictive cost buceting technique for solution pruning and an O()-time efficient delay evaluation procedure. The whole algorithm runs in provably linear time in terms of candidate buffer positions. Because RLC circuit behavior is not well understood and only approximate delay model exists, there is no surprise as we have to use some approximation techniques to mae our RLC buffering algorithm run in linear time. The experimental results demonstrate that our linear time algorithm consistently outperforms all nown RLC buffering algorithms in terms of both solution quality and runtime. That is, the new algorithm uses fewer buffers, runs in shorter time and the buffered tree has better timing. In particular, the new algorithm givesupto8.5% buffer saving and 4 speedup over [3]. When buffer cost minimization is handled, 5.3% fewer buffers and 5 speedup is obtained over [4]. The rest of the paper is organized as follows: Section II describes the delay model. Section III describes the proposed RLC buffering algorithm. Section IV presents the experimental results with analysis. A summary of wor is given in Section V. II. DELAY MODEL Since the prevailing RC Elmore delay model does not catch the actual performance considering inductance effect, various authors claim 30% to 00% timing errors [7], [8]. Thus, more accurate delay models are necessary for accurate timing analysis and buffer insertion. Such a model will be introduced in this section. A. Inductance Impact on Delay To accurately investigate the inductance impact on delay, it is important to have the realistic range of unit parasitic resistance, capacitance and inductance. We set up a realistic environment to accurately extract the parasitics based on the model of [9]. Since the inductance effect depends on current return path, the capacitance crosstal path which is shown in Figure

.8.6 RLC Output RC Output.4 Voltage (volts).2 0.8 0.6 Input 0.4 0.2 0 0 50 00 50 200 Time (ps) Fig.. Capacitance crosstal path of a signal line. Fig. 3. wires. Comparison of output signals between RLC and RC in intermediate Voltage (volts).8.6.4.2 0.8 0.6 0.4 Input RLC Output RC Output which is given by where ω 2 n h(s) = s 2 + s2ζω n + ωn 2, () ζ = RC and ω n =. (2) 2 LC LC Fig. 2. 0.2 0-0.2 0 00 200 300 400 500 Time (ps) Comparison of output signals between RLC and RC in global wires. is crucial. The signal line is assumed to have two neighboring lines which remain static during the entire analysis period. Then, FastCap [0] and FastHenry [] are used to extract capacitance and inductance. For MOSIS 30nm technology [2], the following parameters are obtained. From metal layer one to six, unit resistance varies from 350Ω/mm to 0Ω/mm, unit capacitance from 380fF/mm to 80fF/mm and unit inductance from 0.6nH/mm to.3nh/mm. We investigated inductance effect in global wires and intermediate wires both. SPICE simulation results demonstrate that differences between applications of RLC model and RC model to global wires on metal layer five and six (Figure 2) and intermediate wires on metal layer three and four (Figure 3) are quite significant. The inductance introduces 20% additional delay to global wires and also causes large overshoot, while relatively less impact has been observed on intermediate wires. The reason behind this phenomenon is that unit resistance of global wires is much less than that of intermediate wires. B. Interconnect Delay Model An equivalent Elmore delay model of an RLC tree from [3] is adopted in this paper. The model comes from a second order approximation of transfer function, which preserves the same accuracy for RLC trees as that of Elmore delay with respect to RC trees. A single line segment of an RLC circuit is shown in Figure 4. This circuit has a second order transfer function For a general RLC tree shown in Figure 5, the voltage drop at node N i compared with the input voltage is V in (s) V i (s) = C V (s)s(r i + L i s), (3) where is the index of all branches along the path from N 0 to N i. The normalized transfer function h i (s) at node N i is h i(s) = X C V (s)s(r i + L i s)=+m i s + m i 2s 2 +. The first and second moments at node N i are approximated by m i = C R i, (4) m i 2 =( C R i ) 2 C L i. (5) Then ζ and ω n that characterize a second order approximation of the transfer function at node N i are ζ i = C R i 2 and ω ni =. (6) C L i C L i Eqn. () and Eqn. (6) can be used to determine the time domain signal at node i for an arbitrary input. In [3], a curve fitting method is applied to compute 50% delay of a step input and the delay is t =(.047e ζ i 0.85 )/ωni +0.695 C R i. (7) The calculation of ζ i and ω ni in Eqn. (6) requires the calculation of the two summations, C R i and C L i. The former is the RC Elmore delay, which can be calculated efficiently with linear complexity, and the latter is used to characterize inductance effects, which is also calculated with linear complexity.

Fig. 4. A single segment of an RLC circuit. by area or any other metric, depending on the optimization objective. Without loss of generality, we assume that the driver at source s 0 is also a buffer. A function f : V n 2 B specifies the types of buffers allowed at each internal vertex. A buffer assignment γ is a mapping γ : V n B { } where denotes that no buffer is inserted. The cost of a solution γ is W (γ) = b γ W b. With the above notations, our RLC buffering problem can be formulated as follows. RLC Minimum Cost Buffer Insertion Problem: Givena routing tree T =(V,E), possible buffer positions defined by f, and a buffer library B, find a buffer assignment γ such that the total cost W (γ) is minimized, the RLC required arrival time at the driver is no less than a given constant α. Fig. 5. An RLC tree. C. CMOS Gate Delay Model The computation of CMOS gate delay t g is adopted from [4]. CMOS gate delay is a combination of the linear approximation t lin and the saturation approximation t sat, t g = t lin + t sat exp(. t lin ). (8) t sat The calculation of t lin and t sat is related to Eqn. (7) [3]. According to [3], the error of this method is within 3% from actual CMOS gate delay. Eight netlists with tree topology each having around 00 nodes are used to test the accuracy of adopted RLC Elmore delay model and CMOS gate delay model. One can see that the maximum error between RLC delay and SPICE delay is just 7.%. TABLE I COMPARISON BETWEEN RLC ELMORE DELAY AND SPICE DELAY. TIME UNIT IS ns. Tree Unbuffered Tree Buffered Tree Cases RLC SPICE Error RLC SPICE Error 4.63 4.89 5.3 %.4.07 6. % 2 3.78 3.95 4.3 %.2.04 7. % 3 3.09 3.25 4.9 %.8.0 6.8 % 4 3.06 3.22 5.0 %.09.02 6.4 % 5 2.85 2.97 4.0 %..04 6.3 % 6 2.66 2.76 3.6 %.08.0 6.5 % 7 3.20 3.34 4.2 %.20.3 5.8 % 8.07.3 5.3 % 0.63 0.59 6.3 % III. RLC BUFFERING ALGORITHM A. Preliminaries The basic buffering problem includes a routing tree T = (V,E),whereV = {s 0 } V s V n,ande V V.Vertexs 0 is the source vertex, V s is the set of sin vertices and V n is the set of internal vertices. Each sin vertex s V s is associated with sin capacitance C s, and each edge e E is associated with lumped resistance R e and capacitance C e. A buffer library B contains different types of buffers. Each type of buffer b is associated with an output resistance R b, input capacitance C b, intrinsic delay K b and a cost W b. W b can be measured B. Algorithm Our algorithm shares the same dynamic programming framewor as van Ginneen/Lillis s algorithm. In our new algorithm, candidate solutions eep being updated during the procedure from leaf towards the driver, where each solution γ is characterized by two tuples. The first tuple is (Q, W ), where Q refers to the required arrival time (RAT) and W denotes the cost of the solution. The first tuple is used in pruning. The second tuple is (C, CR, CL), wherec denotes the downstream capacitance, CR and CL represent the largest C R i and C L i in all downstream branches, respectively, where represents the index of all branches along the path from the node i to its immediate buffered descendant node (which can also be a sin). The second tuple is used for accurately calculating delay in Eqn. (7) and Eqn. (8) but not pruning. During the solution propagation, significant amount of solutions are pruned by our new restrictive cost buceting technique. Restrictive Cost Buceting is an effective pruning technique with slight impact on solution quality. In the technique, all solutions with the same buffer cost are placed in a bucet. Each bucet has a bucet capacity. It is the maximum number of solutions a bucet can hold. In this paper, all bucets have the same bucet capacity P. The main power of the restrictive cost buceting lies in that it imposes some restrictions on the bucet capacity. Through these restrictions, we are able to control the number of solutions in the solution set and thus the overall complexity of the algorithm. If there are excessive solutions to be inserted to a bucet, the solution selection procedure will be carried out. The preference of solutions could be based on various criteria and in our experiments, large slac solutions are preferred. The procedure of the new buffering algorithm is as follows. ) Sin: At a sin node, we create a candidate solution set in which each bucet is associated with a range of buffer costs. A solution is added to a cost bucet associated to zero buffer cost which can be computed since that Q is equal to the required arrival time at that sin and C is equal to the sin capacitance. W =0and CL =0, (9) CR =0. After this operation, the size of zero cost bucet is equal to and the size of other cost bucet is zero.

2) Wire Insertion: Consider to propagate solutions from a node v to its parent node u through edge e =(u, v). A solution γ v at v becomes solution γ u at u, C(γ u )=C(γ v )+C e and W (γ u )=W (γ v ). CR(γ u )=CR(γ v )+R e C(γ u ), CL(γ u )=CL(γ v )+L e C(γ u ), (0) where R e and L e are wire resistance and inductance, respectively. Since our delay model is not additive, Q is not updated in the operation of wire insertion. No new solution is added in the solution set, and the size of each cost bucet does not change. 3) Buffer Insertion: In addition to eeping the unbuffered solution γ u,abufferb (b B) can be inserted at u to generate a buffered solution γ u,buf. The required arrival time of new solution is computed as Q(γ u,buf )=Q(γ u ) D(γ buf ), () where D(γ buf ) is the total downstream delay computed in Eqn. (7) and Eqn. (8) from node u to its immediate buffered descendant node (which can also be a sin). D(γ buf ) is computed using delay re-evaluation, which is necessary as our delay model is not additive. In [3], when inserting buffers, delay re-evaluation is performed at the driver. For this, an entire tree traversal is needed to get the summation of CR and CL. This step is very time consuming and is the bottlenec for the efficiency of their algorithm. In our algorithm, an efficient delay update technique is used to calculate D(γ buf ) in O()-time. Our idea is to propagate those summation of CR and CL along with the bottom up solution propagation. At each node, RAT value can be easily computed by Eqn. (7) and Eqn. (8) without the need to bactrac the downstream subtree. Our experiment indicates that this method significantly saves runtime compared to Ismail s entire tree traversal for delay update. After buffer insertion, C(γ u,buf )=C b and W (γ u,buf )= W (γ v )+W b. CR(γ u,buf )=0, (2) CL(γ u,buf )=0. The new solution γ u,buf is inserted to a cost bucet associated with cost W (γ u,buf ). According to our solution selection procedure, P maximum slac solutions are selected where P is bucet capacity. 4) Branch Merge: When two sets of solutions are propagated through left child branch and right child branch to reach a branching node, they are merged. Denote the leftbranch solution set and the right-branch solution set by Γ l and Γ r, respectively. For each solution γ l Γ l and each solution γ r Γ r, the corresponding merged solution γ can be obtained according to Q(γ )=min{q(γ l ),Q(γ r )}. Note that before merging, Q(γ l ) and Q(γ r ) need to be re-evaluated by efficient update technique. C(γ )=C(γ l )+C(γ r ) and W (γ )=W(γ l )+W(γ r ). To ensure in the worst case, all downstream branches still satisfy the timing constraint, the largest CR and CL between two branches are chosen to propagate. The new CR(γ ) and CL(γ ) are computed as CR(γ )=max{cr(γ l ),CR(γ r )}, CL(γ (3) )=max{cl(γ l ),CL(γ r )}. The new solution γ may be inserted in the buffer cost bucet associated to the cost W (γ ). After merging, the same solution selection procedure is performed on each solution bucet. 5) Linear Time Complexity: Due to introducing restrictive cost buceting and efficient delay update techniques, the complexity of the whole algorithm breas down to linear time. Denote by n the number of nodes of the routing tree and by B the number of buffers in the buffer library. Suppose that our solution set has m bucets and bucet capacity is P. It is then clear that at any node, there can be at most mp solutions. Given this fact, we are to analyze the complexity due to wire insertion, buffer insertion and branch merging operations. For wire insertion, the number of solutions do not increase. For buffer insertion, at most B mp could be generated (of course, only mp can be selected). For branch merging, at most m 2 P 2 can be generated. Thus, at any time during the solution propagation, we can have at most max{ B mp, m 2 P 2 } solutions. By building a balanced binary search tree on each bucet in the solution set, solution pruning by restrictive cost buceting can be certainly completed in max{ B mp, m 2 P 2 } O(log P ) time and each delay evaluation only taes O() time since we do not need to traverse any subtree. In practice, we set m, P to small constants. Since there are at most n candidate buffer positions, it immediately follows that our algorithm runs in O(n B ) time. IV. EXPERIMENTAL RESULTS Our new algorithms and algorithm in [4] are implemented in C++. Implementation of Ismail s RLC algorithm is borrowed from the author s webpage. All algorithms are tested on a Pentium IV computer with a 3.2GHz CPU and GB memory. Our test cases are extracted from an industrial ASIC chip, which consists of 000 nets with more than 50000 nodes including sins, branching nodes and buffer positions. Among them, 682 nets have 5 sins and all the remaining nets have 20 sins. The sin capacitances range from 2.5fF to 200fF. The unit resistance is 6.4Ω/mm, the unit capacitance is 94.2fF/mm and the unit inductance is.0nh/mm. The buffer library consists of 2 buffers. Buffer resistances range from 30Ω to 360Ω and input capacitances range from 3.3fF to 40.0fF. The time unit for this section is ps if not specified. SPICE simulation is based on RLC model in all the experiments below. For convenience, all algorithms in comparison are listed below together with their abbreviations. van Ginneen-Lillis: van Ginneen/Lillis s min-cost timing buffering based on the RC Elmore delay [], [2]. Ismail: Ismail s top-down RLC timing buffering algorithm based on RLC Elmore delay [3]. Impedance: RLC min-cost timing buffering based on impedance delay [4]. NEW: RLC timing buffering without cost minimization based on RLC Elmore delay. In NEW, the number of bucet m is set as. NEW+COST: RLC min-cost timing buffering based on RLC Elmore delay. In NEW+COST, the number of bucet m is set as 7 and P is equal to 5.

TABLE II COMPARISON OF BUFFERING BETWEEN ISMAIL AND NEW ON FIVE SETS OF TEST CASES.EACH SET CONSISTS OF RANDOMLY SELECTED 200 NETS. Ismail NEW (P =30) NEW (P =40) NEW (P =50) Test # SPICE CPU # SPICE CPU # SPICE CPU # SPICE CPU Sets Buf. Delay (s) Buf. Delay (s) Buf. Delay (s) Buf. Delay (s) S 270 965 96 95 95 50 236 944 64 242 939 82 S2 008 809 24 980 806 45 992 805 5 986 802 70 S3 99 675 06 85 665 40 848 665 50 84 665 6 S4 80 645 03 766 643 40 769 644 47 772 643 58 S5 640 568 66 627 557 35 63 555 42 628 555 50 TABLE III COMPARISON BETWEEN VAN GINNEKEN-LILLIS,IMPEDANCE AND NEW+COST ON FIVE SETS OF TEST CASES, EACH HAVING RANDOMLY SELECTED 200 NETS. van Ginneen-Lillis Impedance NEW+COST (m =7, P =5) Test # SPICE CPU # SPICE CPU # SPICE CPU Saving to Saving to Sets Buf. Delay (s) Buf. Delay (s) Buf. Delay (s) van Ginneen-Lillis Impedance S 635 979 330 32 935 825 089 933 67 33.4 % 3.8 % S2 428 843 280 982 802 660 965 802 35 32.4 %.7 % S3 2 688 69 879 667 42 832 658 22 3.3 % 5.3 % S4 020 672 52 764 643 388 738 64 0 27.6 % 3.4 % S5 86 583 23 627 555 27 624 555 03 27.5 % 0.5 % A. Comparison between Ismail and NEW We compare the number of buffer, SPICE delay and CPU time between Ismail and NEW. The results on total 000 nets are summarized in Table II. In NEW algorithm, P is set to 30, 40 and 50, respectively. From Table II, we mae the following observations: The solutions from NEW always have less SPICE delay and less buffer area than those returned by Ismail. Especially, NEW (P =50) can save up to 8.5% buffer area than Ismail. In addition to returning high quality solutions, NEW also is efficient in terms of runtime. Especially, NEW (P =30) can provide up to 4 speedup over Ismail. The reason behind this phenomenon is that our restrictive cost buceting and efficient delay updating techniques tremendously reduce computation complexity. As P increases, the solution quality in NEW also increases and so does the runtime. This property maes NEW greatly suitable for pratical use since a tradeoff between solution quality and runtime is easily achieved. B. Comparison between van Ginneen-Lillis, Impedance and NEW+COST We compare buffer reduction between van Ginneen-Lillis, Impedance and NEW+COST algorithms since all these three handle cost minimization. The results on total 000 nets are summarized in Table III. The solution satisfying the minimum delay condition with minimum number of buffers is chosen. Saving refers to percentage difference in the number of buffers. We mae the following observations: NEW+COST saves up to 33.4% buffers over van Ginneen-Lillis. One reason for huge buffer saving is that the delay of RLC is approaching linear with inductance effect. Another reason is that delay is overestimated using the traditional Elmore delay model and thus excessive buffers are inserted. NEW+COST saves up to 5.3% buffers over Impedance, and in the meantime, NEW+COST achieves up to 5 speedup. NEW+COST provides the best solution quality in the least CPU time, which means that NEW+COST consistently outperforms van Ginneen- Lillis and Impedance. V. CONCLUSION This wor proposes a new linear approximation RLC buffering algorithm to meet the demand of accurate timing optimization in high frequency digital circuits. Experiments on industrial netlists show that buffer insertion considering inductance effects can save up to 33.4% over van Ginneen- Lillis s algorithm. Meanwhile, the new algorithm achieves 4 speedup over Ismail s algorithm and 5 over algorithm based on impedance. REFERENCES [] L.P.P.P. van Ginneen, Buffer placement in distributed RC-tree networs for minimal Elmore delay, ISCAS, pp. 865 868, 990. [2] J. Lillis and C.-K. Cheng and T.-T.Y. Lin, Optimal wire sizing and buffer insertion for low power and a generalized delay model, Journal of Solid State Circuits, vol. 3, no. 3, pp. 437 447, 996. [3] Y. Ismail, E. Friedman, and J. Neves, Repeater insertion in tree structured inductive interconnect, ICCAD, pp. 420 424, 200. [4] Z. Jiang, S. Hu, J. Hu, Z. Li, and W. Shi, A new RLC buffer insertion algorithm, ICCAD, 2006. [5] P. Saxena and N. Menezes and P. Cocchini and D.A. Kirpatric, Repeater scaling and its impact on CAD, TCAD, vol. 23, no. 4, pp. 45 463, 2004. [6] P.J. Osler, Placement driven synthesis case studies on two sets of two chips: hierarchical and flat, ISPD, pp. 90 97, 2004. [7] Y.I. Ismail and E.G. Friedman, Effects of inductance on the propagation delay and repeater insertion in VLSI circuits, DAC, pp. 72 724, 999. [8] C. Alpert, A. Devgan, and S. Quay, Buffer insertion with accurate gate and interconnect delay computation, DAC, pp. 479 484, 999. [9] K. Gala, V. Zolotov, R. Panda, B. Young, J. Wang, and D. Blaauw, On-Chip inductance modeling and analysis, DAC, pp. 63 68, 2000. [0] K. Narbos and J. White, FastCap: A multipole accelerated 3D capacitance extraction program, TCAD, vol. 0, no., pp. 447 459, 99. [] M. Kamon and M.J. Ttsu and J.K. White, FASTHENRY: A multipole accelerated 3-D inductance extraction program, IEEE Transactions on Microwave Theory and Techniques, vol. 42, no. 9, pp. 750 758, 994. [2] http://www.mosis.com. [3] Y. Ismail, E. Friedman, and J. Neves, Equivalent Elmore delay for RLC trees, DAC, pp. 75 720, 999. [4] Y. Ismail and E. Friedman, Optimal repeater insertion based on a CMOS delay model for On-Chip RLC interconnect, ASIC, pp. 369 373, 998.