Fast Delay Estimation with Buffer Insertion for Through-Silicon-Via-Based 3D Interconnects


Young-Joon Lee and Sung Kyu Lim
Electrical and Computer Engineering, Georgia Institute of Technology

Abstract

For successful adoption of through-silicon-via-based 3D ICs, delay estimation techniques for 3D interconnects are required in early design stages. 3D nets may connect gates/macros placed far apart, and through-silicon-vias (TSVs) have large parasitic capacitances. Thus, buffers are inserted to reduce interconnect delay. To make good decisions in early design stages, the estimation of buffered delay should be fast and reasonably accurate. However, there has been no buffered delay estimation work for 3D ICs that considers proper delay models and TSV RC parasitics. In this work, we investigate several analytical delay models for 3D net delay estimation. Then, based on analytical formulas and our heuristic algorithm, we propose how to estimate the buffered delay for movable and fixed TSV cases. The effectiveness of our delay estimation technique is demonstrated with various 3D nets. Compared with van Ginneken buffer insertion based delay estimation, our estimation provides solutions about 750 times faster with almost the same estimated delay.

Index Terms: 3D IC, through-silicon-via, delay estimation, buffer insertion.

I. INTRODUCTION

As the physical limit for technology scaling approaches and the cost of IC fabrication soars, the 3D IC is considered a viable way to preserve Moore's law. The benefits of 3D ICs have been advocated, and much research has been done on materials, fabrication, design methodology, testing, and so on, for successful commercialization. For fast adoption, designers may choose to partition existing designs into blocks and then place them on different dies along with IP blocks.
To enable this so-called block-level 3D IC design methodology, a reasonably good timing estimation technique is required in early design stages, such as architectural design space exploration, floorplanning, or timing-driven placement.

Buffer insertion for 2D ICs was studied in closed analytical formulations [4]. After the pioneering work of van Ginneken [16], which adopted dynamic programming, efforts toward generalization [9], speed-up [15], and higher accuracy [2] followed. However, the van Ginneken algorithm usually takes considerable computation time, especially when wires are segmented for candidate buffer locations [1]. In early design stages, we need a fast yet reasonably good delay estimation. The buffer locations do not have to be very accurate, because the actual buffers may be placed at different locations as more layout information becomes available in later design stages.

For 3D ICs, a buffer planning algorithm at the floorplanning stage was recently proposed in [5]. However, buffer delay was assumed to be constant, irrespective of RC load, and TSV parasitics were not considered. Since TSVs have large parasitic capacitance, it is natural to consider it in the delay estimation. Furthermore, depending on the design flow, design constraints, and manufacturing issues, we may or may not be allowed to move TSVs in the 3D nets. In this work, we demonstrate how to perform buffered delay estimation for TSV-based 3D interconnects, for both movable and fixed TSV cases. Our buffered delay estimation technique takes two steps. First, we quickly determine the number of buffers and their locations using simple analytical delay models. Then we evaluate the buffered delay with more elaborate delay models.

This material is based upon work supported by the Semiconductor Research Corporation (SRC) under the Integrated Circuit & Systems Sciences (ICSS, Task ID: & ) and the Interconnect Focus Center (IFC, Theme ID: ).
The major contributions of this work are as follows:

• We compare existing analytical delay models for various 3D nets with TSVs. For gate delay, the linear model and the k-factor-equation-based model are examined, along with lumped and effective capacitance models. For net delay, the Elmore model, a moment-based model (WED), and a technique for using step-input-based net delay models for ramp inputs (PERI) are explored. We discuss which models are suitable for buffered delay estimation.
• We provide fast delay estimation techniques for buffered 3D nets in TSV-based 3D ICs. Two different cases of 3D nets are discussed: the movable TSV case and the fixed TSV case. For the movable case, buffer locations and TSV locations are determined, whereas for the fixed case, buffer locations are determined with consideration of the given TSV locations. Our delay estimation technique can be used for architecture exploration, floorplanning, or timing-driven placement.
• We demonstrate the effectiveness of our delay estimation technique with layout experiments on various 3D nets. Compared with the widely used van Ginneken style buffer insertion based delay estimation method, our estimation provides solutions about 750 times faster, with a delay that is almost the same or a few percent higher.

The remainder of this paper is organized as follows. Our 3D IC structure and TSV parasitics are introduced, and gate and net delay models are investigated, in Section II. Then analytical buffer insertion for movable and fixed cases, as well as our heuristic buffer insertion algorithm for delay estimation, is presented in Section III. Experimental results are shown in Section IV, followed by conclusions in Section V.

II. ANALYTICAL DELAY MODELS

Our 3D IC structure is shown in Fig. 1(a). Although only two dies are depicted, the whole chip may have multiple dies stacked, in which TSVs go through thinned substrates. Based on the Nangate 45nm standard cell library [12], our TSV macro occupies four standard cell rows, as shown in Fig. 1(b). The TSV diameter is 2µm.
It is well known that TSVs have large parasitics that affect timing. Depending on TSV dimensions, materials, and manufacturing process, the magnitude of TSV parasitics may vary [8]. Each TSV has a parasitic capacitance ($C_{TSV}$) and a resistance ($R_{TSV}$) and is represented by a π-model with two capacitors and a resistor, as shown in Fig. 1(c). The inductance of the TSV is ignored because it is not dominant at signal speeds below a few GHz. Due to the TSV parasitics, 1) gate and net delays are affected by the number of TSVs and their locations, and 2) nets with TSVs no longer have uniform unit-length R and C along the path, which differs from the uniform net RC assumption in previous 2D analytical buffer insertion works [4][1]. For multi-fanout nets, assuming that non-critical paths can be decoupled by inserting infinitesimally small offloading buffers [3], we have two-pin 3D nets that connect the source gate to the critical sink gate, with TSVs along the path. Thus, for the rest of this work, we focus on two-pin nets.

/12/$ IEEE th Int'l Symposium on Quality Electronic Design

Fig. 1. (a) Side view of the 3D IC, (b) top view of a TSV, and (c) π-model of TSV parasitics. PP (M1) and LP (M6) represent pin pad on metal1 and landing pad on metal6, respectively. Backside metallization on Die 0 is hidden for simplicity. Dashed lines in (b) denote standard cell row boundaries. Dimensions are in µm.

TABLE I. Parameters used in this work. The buffer cell BUF X8 is used. $C_g$ is the gate input capacitance.
    $K_g$       ps
    $R_g$       0.303 kΩ
    $C_g$       6.585 fF
    $k_1$
    $k_2$       E-5
    $k_3$       E-8
    $k_4$
    $k_5$
    $C$         0.102 fF/µm
    $R$         1.5 Ω/µm
    $C_{TSV}$   25 fF
    $R_{TSV}$   1 Ω

A. Gate Delay Model

The linear gate delay model has been used extensively in timing optimization works. Linear gate delay is expressed as follows:

$$D_{g,linear} = K_g + R_g C_L$$

where $K_g$ and $R_g$ are the intrinsic delay and intrinsic resistance, and $C_L$ is the lumped load capacitance at the output pin of the gate. We fit the above equation to the actual gate delay and obtain the parameters $K_g$ and $R_g$. Note that these parameters change with the slew (transition time) at the input pin of the gate; however, this dependence is usually ignored. As discussed in [2], the linear gate delay model is inaccurate because 1) due to resistive shielding [13], the lumped load capacitance is an overestimate of the effective capacitance [14] seen at the gate output, and 2) gate delay is not a linear function of load capacitance.
The first problem can be solved by adopting an effective capacitance model, while the second is addressed by the k-factor equation gate delay model [14]. In the effective capacitance calculation, the RC network is reduced to a π-model ($C_n$, $R_\pi$, $C_f$) in which $R_\pi$ models the resistive shielding effect. Then the effective capacitance ($C_{eff}$) at the gate output is computed as in [14]. The k-factor equation for gate delay is:

$$D_{g,k\text{-}factor} = (k_1 + k_2 C_{eff}) S_{g,in} + k_3 C_{eff}^3 + k_4 C_{eff} + k_5$$

where $S_{g,in}$ is the signal slew at the gate input and $k_1$ to $k_5$ are curve-fitting parameters.

B. Net Delay Model

The basic net delay model is given by the Elmore delay equation [6], which corresponds to the first moment of the impulse response. The Elmore delay has been used in various timing optimization works because it is easy to compute and the delay is additive, meaning that the delay from A to C is the delay from A to B plus the delay from B to C. This additive characteristic supports the optimal substructure of dynamic programming in the original van Ginneken buffer insertion [16]. The shortcoming of Elmore delay is that it may deviate from the actual delay by orders of magnitude [2].

For higher accuracy, we may use a moment-matching based delay metric. In this study, we evaluate WED [11], because it requires a small amount of computation and is reported to be more accurate than the Elmore delay model. Two moments are computed in the bottom-up traversal of van Ginneken dynamic programming as in [2], then two simple 1D table lookups are needed to obtain the estimated net delay. The problem with WED is that the model works only for step inputs. In real circuits, an input signal has a finite slew, which makes WED significantly underestimate the actual delay. Thus we also adopt the PERI method [7], which converts the delay from delay models for step inputs to the delay for ramp inputs.
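The linear gate delay, the k-factor equation, and the Elmore net delay described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, `K_G` is a placeholder because its value is not recoverable from Table I, and the k-factor coefficients must be supplied by the caller. The net is assumed to be a single two-pin segment with one TSV (modeled as a π-model) at distance `x` from the gate output.

```python
# Units: resistance in ohm, capacitance in fF, length in um.
# 1 ohm * 1 fF = 1e-3 ps, hence the conversion factor below.
OHM_FF_TO_PS = 1e-3

R_G, C_G = 303.0, 6.585      # buffer resistance and input capacitance (Table I)
R_W, C_W = 1.5, 0.102        # wire resistance and capacitance per um (Table I)
R_TSV, C_TSV = 1.0, 25.0     # TSV parasitics (Table I)
K_G = 30.0                   # ps; HYPOTHETICAL placeholder, not taken from the paper


def lumped_load(length):
    """Total capacitance on the gate output, ignoring resistive shielding."""
    return C_W * length + C_TSV + C_G


def linear_gate_delay(c_load):
    """D_g,linear = K_g + R_g * C_L, in ps."""
    return K_G + R_G * c_load * OHM_FF_TO_PS


def kfactor_gate_delay(c_eff, slew_in, k):
    """k-factor equation; k = (k1, ..., k5) are caller-supplied fit parameters."""
    k1, k2, k3, k4, k5 = k
    return (k1 + k2 * c_eff) * slew_in + k3 * c_eff ** 3 + k4 * c_eff + k5


def elmore_net_delay(length, x):
    """Elmore delay (ps) from gate output to sink with the TSV at distance x."""
    rest = length - x
    d = R_W * x * (C_W * x / 2 + C_TSV + C_W * rest + C_G)   # wire before the TSV
    d += R_TSV * (C_TSV / 2 + C_W * rest + C_G)              # TSV pi-model
    d += R_W * rest * (C_W * rest / 2 + C_G)                 # wire after the TSV
    return d * OHM_FF_TO_PS


if __name__ == "__main__":
    for x in (0.0, 500.0, 1000.0):
        gate = linear_gate_delay(lumped_load(1000.0))
        net = elmore_net_delay(1000.0, x)
        print(f"x={x:6.0f} um  gate={gate:7.2f} ps  net={net:7.2f} ps")
```

Under these assumptions the lumped-capacitance gate delay does not depend on `x` at all, while the Elmore net delay grows as the TSV moves toward the sink.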
However, since the WED and PERI methods depend on signal slew, and in early design stages we usually cannot determine signal slews on nets, it may not be appropriate to use WED and PERI in early design stages.

C. Experiments for Delay Models

We perform two experiments to compare the delay metrics: (1) for a 3D net of wirelength L = 1000µm (= distance from source to sink gate) with one TSV, vary the TSV location, and (2) for 3D nets of various wirelengths with two TSVs, vary the TSV locations. For all experiments we use the Nangate 45nm standard cell library. For the same reason as in [3], we assume that all gates (sources, sinks, and buffers) are the same (BUF X8). We also assume that the router uses metal5 with unit-length capacitance C and resistance R. The parameters used in this work are summarized in Table I. As a reference, delay values from PrimeTime are shown. The layout of each net is performed in Cadence Encounter, followed by RC extraction using Cadence QRC. Then, combined with the RC modeling of TSVs in Fig. 1, we run 3D static timing analysis in Synopsys PrimeTime to obtain the delay values.

1) A 3D Net with One TSV: For a 3D net of wirelength L = 1000µm, we vary the TSV location F (measured from the source gate) from 0 to 1000µm. As shown in Fig. 2, when the TSV is moved away from the source gate (= increasing F), the gate delay decreases while the net delay increases. Gate delay decreases because the effective load capacitance decreases, as shown on the bottom right of Fig. 2. Due to the resistive shielding effect on the TSV (the resistance of the wire between the source gate and the TSV), the TSV capacitance seen by the source gate decreases as the TSV moves towards the sink gate. The linear gate delay model with lumped capacitance fails to follow this trend. Note that the gate intrinsic delay ($K_g$ in Table I) is comparable to the gate delay, which suggests that ignoring the intrinsic delay may incur a large error in the estimated delay. Elmore follows the PrimeTime net delay closely, while WED constantly underestimates it.
For total delay, linear gate delay with lumped capacitance and Elmore net delay (lin+clu+elm) substantially overestimates the delay, while k-factor gate delay with effective capacitance and Elmore net delay (kfa+cef+elm) follows the PrimeTime delay closely.

2) 3D Nets with Two TSVs: For various 3D nets with different wirelengths and two TSVs, we pick the TSV locations randomly. Table II summarizes the delay from the analytical delay models and PrimeTime. Again, the same trend as in the previous subsection is observed. The linear+clump model substantially overestimates the gate delay, while kfactor+ceff shows values close to the PrimeTime results. This is mainly because of the difference between Clump and Ceff, shown in the rightmost column. Depending on L, F, and G, Elmore under- or overestimates the net delay (compared with the PrimeTime delay), while WED+PERI always underestimates it.

From the above experiments, it may seem that the simple linear gate delay with lumped capacitance and Elmore net delay are not suitable for buffered delay estimation. However, because of their fidelity, we claim that these models are still useful for determining a buffer solution, as discussed in the following section.

Fig. 2. Source gate delay ($D_g$), net delay ($D_n$), segment delay ($D_{seg} = D_g + D_n$), and load capacitance ($C_{load}$) for the 3D net with single TSV case. Clump/Clu and Ceff/Cef mean lumped capacitance and effective capacitance. WE, PE, and PT stand for WED, PERI, and PrimeTime.

TABLE II. Comparison of source gate and net delay values by analytical delay models and PrimeTime. The lumped and effective capacitances of the net ($C_n$) are also shown. L, F, and G represent wirelength, distance from source gate to the first TSV, and distance from the first TSV to the second TSV. The length, delay, and capacitance values are in µm, ps, and fF, respectively.
Columns: L; F; G; $D_g$ (linear+clump, kfactor+ceff, PT); $D_n$ (Elmore, WED+PERI, PT); $D_{seg} = D_g + D_n$ (lin+clu+elm, kfa+cef+elm, PT); $C_n$ (Clump, Ceff). The last row is the average (ave).

III. ANALYTICAL BUFFER INSERTION

Our buffered delay estimation technique takes two steps. First, we quickly determine the number of buffers and their locations with simple analytical delay models. Then we evaluate the buffered delay with more elaborate delay models. This section explains the first step. Note that the buffers are not actually inserted into the netlist; we assume the buffers are temporarily inserted for evaluating the buffered delay.
The final outcome of our technique is the estimated buffered delay, not the exact buffer locations.

Our 3D buffer insertion problem is described as follows: During an early design stage (e.g., floorplanning), for given unbuffered 3D nets with estimated routes (before detailed routing), we estimate the buffered delay of the 3D nets. The estimated 3D routes are constructed by adopting the 3D rectilinear Steiner tree [10] or the 3D rectilinear minimum spanning tree algorithm. Depending on the design flow and other design constraints, we may be allowed to move TSVs in the 3D nets. Thus we categorize the problem into two cases:

1) Movable case: We determine/suggest the number of buffers and their locations as well as the optimal TSV locations to estimate the lowest achievable buffered 3D net delay.
2) Fixed case: Given the TSV locations, we carefully insert buffers to estimate a reasonably low buffered 3D net delay.

In this section, we provide the optimal buffer insertion solutions for several 3D net cases by solving analytical formulations. An analytical buffer insertion solution provides the optimal number of buffers and their locations. Furthermore, from analytical solutions we gain insights on how the buffers should be inserted for optimal buffered delay. Note that the analytical formulations are based on linear gate delay with lumped load capacitance and Elmore net delay.

Fig. 3. Terminologies for analytical formulations.

Fig. 4. A TSV segment and its RC-tree model: (a) a TSV segment, (b) RC-tree model.

We define the terminologies used in our analytical formulations. As shown in Fig. 3, a segment is a buffer and the wire to the downstream buffer, and a TSV segment is a segment with TSVs in it. A segment length is defined as the wirelength of the segment, excluding the TSV size. A TSV interval is the interval between two adjacent TSVs, or the interval between the leftmost (or rightmost) TSV and the source (or sink) gate. Considering a buffer (shown in shaded blue), left is towards upstream and right is towards downstream. The current TSV interval is the TSV interval that the buffer belongs to, and the left/right TSV interval is the TSV interval to the left/right. The left/right TSV is the left/right TSV of the current TSV interval.

A. Movable Case

For movable cases, our suggestions are as follows. Perform 2D buffer insertion without considering TSVs, then place the TSVs close to any buffer output. When the number of TSVs is more than the number of buffers, place some of the TSVs back to back near a buffer output. Now we explain the reasons for these suggestions.
Theorem 1: The delay of a TSV segment is minimized by placing the TSV right after the driving buffer.

Proof: Consider the following problem shown in Fig. 4: Given the wirelength between two buffers ($L$), determine the TSV location ($x$) so that the delay is minimized. The segment delay is formulated as follows:

$$D_{seg} = K_g + R_g(Cx + C_{TSV} + C(L-x) + C_g) + Rx(Cx/2 + C_{TSV} + C(L-x) + C_g) + R_{TSV}(C_{TSV}/2 + C(L-x) + C_g) + R(L-x)(C(L-x)/2 + C_g)$$

To find the optimal $x$, we differentiate $D_{seg}$ with respect to $x$:

$$dD_{seg}/dx = R C_{TSV} - R_{TSV} C$$

Since $R C_{TSV} \gg R_{TSV} C$, $dD_{seg}/dx > 0$. Thus, $D_{seg}$ is minimum when $x = 0$, i.e., the TSV is placed right after the driving buffer, regardless of $L$.

Theorem 2: The path delay is optimal when non-TSV segments are of the same length.

Proof: Consider the path delay excluding the TSV segments. This path delay is optimal when the distances between buffers are all the same, as shown in [1]. Thus, the non-TSV segments are of the same length.

Theorem 3: The path delay does not change regardless of the TSV segment locations.

Proof: From Theorem 2, we may assume that all non-TSV segments have the same length ($= l_s$), and each TSV segment has a different length ($= l_t$). Let the delays of a non-TSV segment and a TSV segment be $D_s$ and $D_t$, respectively. When $k$ buffers are inserted for a 3D net with $n$ TSVs, there are $(k - n + 1)$ non-TSV segments and $n$ TSV segments. The path delay is the sum of the delays of the segments, i.e., $D_{path} = nD_t + (k - n + 1)D_s$, regardless of the TSV segment locations.

From Theorem 1, the optimal solution should place each TSV right after a buffer. Furthermore, by Theorem 3, the optimal delay does not change regardless of the TSV segment location. Now we find the optimal buffer insertion solution in a similar way as in [1]. With the assumptions in the proof of Theorem 3, when the total length is $L$, $l_t = (L - (k - n + 1)l_s)/n$.
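The derivative in the proof of Theorem 1 can be checked numerically. The sketch below evaluates the $D_{seg}$ expression directly with the Table I wire, buffer, and TSV values and compares a finite-difference slope against $R C_{TSV} - R_{TSV} C$. The function names are hypothetical, and `K_G` is set to zero as a placeholder because it only shifts the curve and does not affect where the minimum lies.

```python
# Units: ohm, fF, um; 1 ohm * 1 fF = 1e-3 ps.
OHM_FF_TO_PS = 1e-3

R_G, C_G = 303.0, 6.585      # Table I
R_W, C_W = 1.5, 0.102        # Table I (per um)
R_TSV, C_TSV = 1.0, 25.0     # Table I
K_G = 0.0                    # placeholder: a constant offset, irrelevant to the optimum


def segment_delay(x, length):
    """D_seg (ps) of a TSV segment of wirelength `length`, TSV at distance x."""
    rest = length - x
    d = R_G * (C_W * x + C_TSV + C_W * rest + C_G)            # driving buffer
    d += R_W * x * (C_W * x / 2 + C_TSV + C_W * rest + C_G)   # wire before the TSV
    d += R_TSV * (C_TSV / 2 + C_W * rest + C_G)               # TSV
    d += R_W * rest * (C_W * rest / 2 + C_G)                  # wire after the TSV
    return K_G + d * OHM_FF_TO_PS


def numeric_slope(x, length, h=1.0):
    """Central finite difference of D_seg with respect to x (ps/um)."""
    return (segment_delay(x + h, length) - segment_delay(x - h, length)) / (2 * h)


# Closed-form slope from the proof: R*C_TSV - R_TSV*C, converted to ps/um.
ANALYTIC_SLOPE = (R_W * C_TSV - R_TSV * C_W) * OHM_FF_TO_PS

if __name__ == "__main__":
    length = 700.0
    print("analytic slope:", ANALYTIC_SLOPE)
    print("numeric slope :", numeric_slope(300.0, length))
    best_x = min(range(0, 701, 50), key=lambda x: segment_delay(float(x), length))
    print("best x on grid:", best_x)
```

Because the quadratic terms in $x$ cancel, $D_{seg}$ is linear in $x$ with a positive slope, so the grid search returns $x = 0$ for any segment length.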
With these definitions, the path delay is:

$$D_{path}(k) = nD_t + (k-n+1)D_s = n\{K_g + R_g(C_{TSV} + Cl_t + C_g) + R_{TSV}(C_{TSV}/2 + Cl_t + C_g) + Rl_t(Cl_t/2 + C_g)\} + (k-n+1)\{K_g + R_g(Cl_s + C_g) + Rl_s(Cl_s/2 + C_g)\}$$

which is quadratic in $l_s$ when $l_t$ is substituted in. To find the optimal $l_s$, we differentiate it and set it to zero: $dD_{path}/dl_s = 0$. The solution is:

$$l_s = \frac{1}{k+1}\left(L + n\frac{R_{TSV}}{R}\right), \quad l_t = \frac{1}{k+1}\left(L - (k-n+1)\frac{R_{TSV}}{R}\right)$$

Substituting $l_s$ and $l_t$ into $D_{path}(k)$, we find the optimal $k$ by requiring $D_{path}(k-1) > D_{path}(k)$, i.e., $k$ is the largest buffer count for which adding one more buffer to the net with $(k-1)$ buffers still reduces the path delay. The solution is the largest integer $k$ satisfying:

$$k(k+1) < \frac{RC(L + nR_{TSV}/R)^2}{2(K_g + R_gC_g)}$$

Since $R_{TSV}/R \ll L$, by approximation we get:

$$l_s \approx l_t \approx \frac{L}{k+1}, \quad k(k+1) < \frac{RCL^2}{2(K_g + R_gC_g)}$$

Note that there is no TSV-related term in the solution equations. In fact, the approximated solution is the same as the 2D solution without TSVs. Thus, we can insert buffers with the above equations and then relocate TSVs to minimize delay.

B. Fixed Case

For fixed cases, it is not easy to find the optimal solution. We may formulate the delay equations in matrix form using the linear gate delay model and the Elmore net delay model and solve for minimum delay. However, this is computationally expensive, and, as shown in Section II, the actual delay differs from the simple delay models, so the calculation effort is not worthwhile. In the early design stage, we prefer a quick buffer insertion solution that gives a delay reasonably close to the optimum delay. We first show the analytical solutions for several example cases, then we propose a heuristic algorithm that is simple and fast yet gives reasonably good delay estimates.

Fig. 5. A TSV right after source, and the experimental results with wirelength (L) = 2, 4mm. $D_{path}$ is the delay from the source to the sink gate. PT means PrimeTime.

Fig. 6. A TSV right before sink.

1) A TSV Right After Source: As one extreme, we assume the fixed TSV location is right after the source gate. We find the number of buffers ($k$) and their locations between the TSV and the sink gate. The path delay from the source gate to the sink gate is:

$$D_{path}(k) = K_g + R_g(C_{TSV} + C(L-kd) + C_g) + R_{TSV}(C_{TSV}/2 + C(L-kd) + C_g) + R(L-kd)(C(L-kd)/2 + C_g) + k\{K_g + R_g(Cd + C_g) + Rd(Cd/2 + C_g)\}$$

This is quadratic in terms of $d$. To find the optimal $d$, we differentiate it and set it to zero: $dD_{path}/dd = 0$. The solution is:

$$d = \frac{L + R_{TSV}/R}{k+1}$$

We substitute $d$ into $D_{path}(k)$. The delay with $k$ buffers is smaller than the delay with $k-1$ buffers when $D_{path}(k) < D_{path}(k-1)$. That is, $D_{path}(k)$ decreases with larger $k$ as long as:

$$k(k+1) < \frac{RC(L + R_{TSV}/R)^2}{2(K_g + R_gC_g)}$$

Since $L \gg R_{TSV}/R$, we may ignore the $R_{TSV}/R$ term in the above solution. Then,

$$k(k+1) < \frac{RCL^2}{2(K_g + R_gC_g)}, \quad d = \frac{L}{k+1} \qquad (1)$$

which is the same as the 2D solution [1]. Thus, when a TSV (or stacked TSVs) is placed right after the source gate, we can insert $k$ buffers from the above equations at equal distances, separated by $d$.

As shown on the right of Fig. 5, we performed layout experiments for the case with a TSV right after the source gate. For 3D nets of wirelength (L) 2 and 4mm, we increase the number of buffers ($k$) and place them at equal distances. For L = 2mm, the minimum delay occurs at $k = 2$ for PrimeTime, linear+clump+elmore, and kfactor+ceff+elmore. The linear+clump+elmore model overestimates the buffered delay; however, the optimal number of buffers and their locations are the same as those by PrimeTime.
For L = 4mm, the optimal $k$ for PrimeTime and linear+clump+elmore is 5, while that of kfactor+ceff+elmore is 6. Yet, the PrimeTime delay difference between $k = 5$ and $k = 6$ is small. In fact, values of $k$ from 3 to 7 are all good solutions in terms of buffered delay. In sum, although linear+clump+elmore overestimates the delay, it gives the same optimal $k$ as PrimeTime does. This suggests that the simple linear+clump+elmore model has good fidelity in terms of buffer count and location. On the other hand, kfactor+ceff+elmore may give a slightly higher $k$ than PrimeTime does. However, with the optimal $k$, the estimated delay is very close to the PrimeTime delay. From this experiment, we conclude that (1) for determining the number of buffers and their locations, we may use the linear+clump+elmore delay models for simplicity, and (2) for delay calculation after a buffer solution is obtained, we need to use more elaborate delay models (such as kfactor+ceff+elmore) for accurate delay estimation.

2) A TSV Right Before Sink: As another extreme, we assume the fixed TSV location is right before the sink gate, as shown in Fig. 6. The path delay from the source gate to the sink gate is:

$$D_{path}(k) = k\{K_g + R_g(Cd + C_g) + Rd(Cd/2 + C_g)\} + K_g + R_g(C(L-kd) + C_{TSV} + C_g) + R(L-kd)(C(L-kd)/2 + C_{TSV} + C_g) + R_{TSV}(C_{TSV}/2 + C_g)$$

By the same method as in the previous subsection, we get:

$$k(k+1) < \frac{RC(L + C_{TSV}/C)^2}{2(K_g + R_gC_g)}, \quad d = \frac{L + C_{TSV}/C}{k+1}$$

The $C_{TSV}/C \approx 123$µm is not negligible when compared with $L$ (usually around 700µm after buffer insertion). Thus the optimal $k$ and $d$ are different from the 2D analytical solution. In fact, as shown in Section II, the delay of a TSV segment increases as the TSV moves downstream. With this extreme TSV location, the delay of the TSV segment is larger than that of the other segments, so it is intuitive to move buffers towards the TSV.
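The buffer count and spacing for the two extreme TSV locations can be sketched as follows. This is an illustration under stated assumptions and not the authors' code: the buffer count is taken as the largest $k$ with $D_{path}(k) < D_{path}(k-1)$, which for equal spacing works out to $k(k+1) < RC L_{eff}^2 / (2(K_g + R_g C_g))$ with $L_{eff} = L + R_{TSV}/R$ (TSV after source) or $L_{eff} = L + C_{TSV}/C$ (TSV before sink). `K_G = 30.0` ps is a hypothetical placeholder because the Table I value is not recoverable, and all function names are invented for this example.

```python
# Units: ohm, fF, um; 1 ohm * 1 fF = 1e-3 ps.
OHM_FF_TO_PS = 1e-3

R_G, C_G = 303.0, 6.585      # Table I
R_W, C_W = 1.5, 0.102        # Table I (per um)
R_TSV, C_TSV = 1.0, 25.0     # Table I
K_G = 30.0                   # ps; HYPOTHETICAL placeholder, not from the paper


def max_buffer_count(eff_length):
    """Largest k with k(k+1) < R*C*L_eff^2 / (2*(K_g + R_g*C_g))."""
    bound = (R_W * C_W * eff_length ** 2 * OHM_FF_TO_PS
             / (2 * (K_G + R_G * C_G * OHM_FF_TO_PS)))
    k = 0
    while (k + 1) * (k + 2) < bound:
        k += 1
    return k


def tsv_after_source(length):
    """(k, d) when the TSV sits right after the source gate."""
    eff = length + R_TSV / R_W
    k = max_buffer_count(eff)
    return k, eff / (k + 1)


def tsv_before_sink(length):
    """(k, d) when the TSV sits right before the sink gate."""
    eff = length + C_TSV / C_W
    k = max_buffer_count(eff)
    return k, eff / (k + 1)


def path_delay_tsv_before_sink(length, k):
    """Direct evaluation of D_path(k) (ps) for the TSV-before-sink case."""
    d = (length + C_TSV / C_W) / (k + 1)
    last = length - k * d                      # length of the final TSV segment
    seg = K_G + (R_G * (C_W * d + C_G)
                 + R_W * d * (C_W * d / 2 + C_G)) * OHM_FF_TO_PS
    tail = K_G + (R_G * (C_W * last + C_TSV + C_G)
                  + R_W * last * (C_W * last / 2 + C_TSV + C_G)
                  + R_TSV * (C_TSV / 2 + C_G)) * OHM_FF_TO_PS
    return k * seg + tail


if __name__ == "__main__":
    for length in (2000.0, 4000.0):
        print(length, tsv_after_source(length), tsv_before_sink(length))
```

With this placeholder intrinsic delay, the TSV-after-source case yields 2 and 5 buffers for L = 2mm and 4mm, which coincides with the linear+clump+elmore counts quoted in the text, but that agreement depends entirely on the guessed `K_G`. The brute-force function gives an independent check of the closed-form count for the TSV-before-sink case.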
Comparing this $k$ and $d$ with the 2D analytical solution in the previous subsection, we see that it is equivalent to the solution for a 2D net with wirelength $L + C_{TSV}/C$, except for the last segment.

C. 3D Heuristic Buffer Insertion for Fixed Case

Since it is hard to find analytical solutions for general 3D nets with multiple fixed TSVs, we propose the 3D heuristic buffer insertion algorithm (3Dheu), which is outlined in Algorithm 1. Our algorithm starts from the sink and performs a single bottom-up traversal to determine the locations of buffers. For explanation we use the terminologies defined at the beginning of Section III. As we traverse up towards the source gate, each buffer location is determined. The current buffer location serves as an anchor for the next buffer upstream. For determining the current buffer location, we first use the wirelength from the right (downstream, already determined) buffer to the source gate to determine $d$ in Eq. 1 (Algorithm 1, Line 4). Note that we ignore TSVs in calculating this $d$. Then the current buffer is temporarily placed with a segment length of $d$ ($v$ in Algorithm 1, Line 9). Then, depending on how many TSVs are between the current buffer and the right buffer, we adjust the current buffer location. How much we need to adjust is discussed below. After adjustment, we continue the process until no more buffers are needed (Algorithm 1, Line 6).

Algorithm 1: The proposed 3Dheu algorithm.
Input: a two-pin net with locations of sink gate and TSVs
Output: buffer location list BLlist
     1  u ← sink gate location;
     2  while u > 0 do
     3      L ← u;
     4      calculate d using Eq. 1;
     5      if d = u then
     6          break;  // no more buffers needed
     7      end
     8      M ← 2d;
     9      v ← u − M + d;
    10      count N_rTSV between u and v;
    11      calculate x using Eq. 2;
    12      w ← u − M + x;
    13      while v and w are in different TSV intervals do
    14          if w is in the right TSV interval of v then
    15              move v to the center of the right TSV interval;
    16              w ← w − C_TSV/(2C);
    17          end
    18          else if w is in the left TSV interval of v then  // move direction changed
    19              move v to the rightmost of the left TSV interval;
    20              w ← w + C_TSV/(2C);
    21              if v and w are in different TSV intervals then
    22                  w ← v;
    23              end
    24              break;  // no further adjustment needed
    25          end
    26      end
    27      u ← w;
    28      add u into BLlist;
    29  end

We now provide the details on how to adjust the current buffer location. In Fig. 7(a), a TSV exists on the right of the current buffer. Here, $M$ is the distance spanned by three adjacent buffers under 2D analytical buffer insertion ($M = 2d$, where $d$ is from Eq. 1). The delay from the left (yet to be determined) buffer to the right (already determined) buffer is:

$$D_{buf} = K_g + R_g(Cx + C_g) + Rx(Cx/2 + C_g) + K_g + R_g(C(M-x-F) + C_{TSV} + CF + C_g) + R(M-x-F)(C(M-x-F)/2 + C_{TSV} + CF + C_g) + R_{TSV}(C_{TSV}/2 + CF + C_g) + RF(CF/2 + C_g)$$

This is quadratic in terms of $x$. To find the optimal $x$, we differentiate it and set it to zero: $dD_{buf}/dx = 0$.
The solution is:

$$x = M/2 + C_{TSV}/(2C)$$

That is, we need to move the current buffer from the temporary location ($x = M/2$) towards the right by $C_{TSV}/(2C)$. In Fig. 7(b), two TSVs exist on the right of the current buffer. By a similar method, the optimal $x = M/2 + (C_{TSV}/(2C)) \cdot 2$. In general, when there are $N_{rTSV}$ TSVs between the current buffer and the right buffer, the optimal location of the current buffer is:

$$x = M/2 + (C_{TSV}/(2C)) \cdot N_{rTSV} \qquad (2)$$

Fig. 7. Adjusting current buffer location with (a) a TSV and (b) two TSVs on the right.

Fig. 8. Adjusting current buffer location with (a) a TSV on the left. In (b), a more general case is shown.

In Fig. 8(a), a TSV exists on the left of the current buffer. The delay from the left buffer to the right buffer is:

$$D_{buf} = K_g + R_g(CF + C_{TSV} + C(x-F) + C_g) + RF(CF/2 + C_{TSV} + C(x-F) + C_g) + R(x-F)(C(x-F)/2 + C_g) + K_g + R_g(C(M-x) + C_g) + R(M-x)(C(M-x)/2 + C_g)$$

The optimal $x = M/2$. That is, the TSV on the left (= upstream) does not affect the optimal location of the current buffer. This is because the current buffer does not see the (upstream) TSV as load. In Fig. 8(b), we show a general case where both left and right TSVs exist. In determining the current buffer location, we just need to count the TSVs on the right of the current buffer, then multiply the count by $C_{TSV}/(2C)$ to get the adjustment distance, as in Eq. 2.

Fig. 9. Adjusting current buffer location across (a) the right TSV and (b) the left TSV.

TABLE IV. Comparison of the estimated delay with fixed TSV case and movable TSV case. The delay is in ps.
Columns: net; Fixed TSV (#buf, delay); Movable TSV (#buf, delay). The last row is the average.

Depending on the amount of adjustment, the current buffer may cross the right TSV, sometimes multiple TSVs. In that case, we do not move it to the target location at once, to avoid possible ping-pong situations. Instead, we move the current buffer across one TSV at a time, as shown in Fig. 9. When the current buffer is moved to the neighboring TSV interval, the optimal location changes because the buffer now sees a different number of TSVs on the right. If it moved towards the right TSV interval, $N_{rTSV}$ in Eq. 2 decreases by one, thus the optimal location moves towards the left by $C_{TSV}/(2C)$ (Algorithm 1, Line 16). On the other hand, if it moved towards the left TSV interval, the optimal location moves towards the right by $C_{TSV}/(2C)$ (Algorithm 1, Line 20). If the adjusted optimal location is in the new TSV interval, the buffer location is finalized (exit condition of the while loop in Algorithm 1, Line 13). If the new location is outside the new TSV interval and the move direction is the same, we move the temporary buffer location into the new TSV interval (Algorithm 1, Line 15). However, if the new location is outside the new TSV interval and the move direction changes, i.e., the new optimal buffer location is in the right (previously visited) TSV interval, we push the buffer towards the right extreme of the TSV interval (Algorithm 1, Line 19) so that the distance between the buffer and the TSV is minimum, which reduces the delay of the TSV segment.
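One possible reading of Algorithm 1 is sketched below. It is not the authors' implementation, and several details are assumptions made for this example: positions are measured in µm from the source gate (at 0); a point that coincides with a TSV is treated as belonging to the TSV interval on its left; "the rightmost of the left TSV interval" is taken to be the TSV location itself; Eq. 1 is evaluated as the largest $k$ with $k(k+1) < RCL^2/(2(K_g + R_gC_g))$; and `K_G = 30.0` ps is a hypothetical placeholder. The names `eq1` and `heuristic_3d` are invented here.

```python
from bisect import bisect_left

# Units: ohm, fF, um; 1 ohm * 1 fF = 1e-3 ps.
OHM_FF_TO_PS = 1e-3
R_G, C_G = 303.0, 6.585          # Table I
R_W, C_W = 1.5, 0.102            # Table I (per um)
C_TSV = 25.0                     # Table I
K_G = 30.0                       # ps; HYPOTHETICAL placeholder
HALF_SHIFT = C_TSV / (2 * C_W)   # C_TSV/(2C) in um, the per-TSV shift of Eq. 2


def eq1(length):
    """Buffer count k and spacing d = L/(k+1) for a TSV-free wire (Eq. 1)."""
    bound = (R_W * C_W * length ** 2 * OHM_FF_TO_PS
             / (2 * (K_G + R_G * C_G * OHM_FF_TO_PS)))
    k = 0
    while (k + 1) * (k + 2) < bound:
        k += 1
    return k, length / (k + 1)


def heuristic_3d(net_length, tsvs):
    """Buffer locations from sink towards source, following Algorithm 1."""
    tsvs = sorted(tsvs)

    def interval(p):             # index of the TSV interval containing p
        return bisect_left(tsvs, p)

    def bounds(i):               # (left end, right end) of TSV interval i
        lo = tsvs[i - 1] if i > 0 else 0.0
        hi = tsvs[i] if i < len(tsvs) else net_length
        return lo, hi

    buffers = []
    u = net_length                                # Line 1
    while u > 0:                                  # Line 2
        k, d = eq1(u)                             # Lines 3-4
        if k == 0:                                # Lines 5-6: d equals u
            break
        m = 2 * d                                 # Line 8
        v = u - m + d                             # Line 9
        n_right = interval(u) - interval(v)       # Line 10
        x = m / 2 + HALF_SHIFT * n_right          # Line 11 (Eq. 2)
        w = u - m + x                             # Line 12
        iv = interval(v)
        while iv != interval(w):                  # Line 13
            if interval(w) > iv:                  # Line 14
                iv += 1
                lo, hi = bounds(iv)
                v = (lo + hi) / 2                 # Line 15
                w -= HALF_SHIFT                   # Line 16
            else:                                 # Line 18: direction changed
                iv -= 1
                v = bounds(iv)[1]                 # Line 19
                w += HALF_SHIFT                   # Line 20
                if interval(w) != iv:             # Line 21
                    w = v                         # Line 22
                break                             # Line 24
        u = w                                     # Line 27
        buffers.append(u)                         # Line 28
    return buffers


if __name__ == "__main__":
    print(heuristic_3d(2000.0, []))
    print(heuristic_3d(2000.0, [1500.0]))
    print(heuristic_3d(2000.0, [1400.0]))
```

With no TSVs the sketch reduces to equally spaced 2D buffering. A TSV at 1500µm pulls the nearest upstream buffer towards it by $C_{TSV}/(2C)$, and a TSV at 1400µm, which lies within that shift, ends with the buffer placed immediately before the TSV.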
When the adjusted buffer location and the temporary buffer location are in different TSV intervals, we take the temporary buffer location (Algorithm 1, Line 22). Then the buffer location is finalized (Algorithm 1, Line 24), because we found that further adjustments do not reduce delay by much. At most, we need to move a buffer by the number of TSVs between the buffer and the right buffer, which is bounded by $N_{TSV}$. Thus the runtime complexity of the algorithm is $O(N_{TSV} \cdot k)$, where $k$ is the number of buffers inserted.

IV. EXPERIMENTAL RESULTS

To demonstrate the effectiveness of our proposed delay estimation technique, we perform buffer insertion on example nets. The same parameters as in Table I are used.

A. 3D Nets with Fixed TSVs

The following buffer insertion methods are compared:

1) 2D analytical (2Dana): A 2D analytical buffer insertion, ignoring TSV locations.
2) Proposed 3D heuristic (3Dheu): Our proposed 3D buffer insertion heuristic.
3) Ginneken (Gin): van Ginneken buffer insertion with linear gate delay, lumped capacitance, and Elmore net delay models. Our van Ginneken implementation is similar to the VG in [2], with extensions for 3D IC handling. A single buffer type (BUF X8) is used.

The example nets are a mixture of purely random nets and nets from 3D design layouts (modified into two-pin form). All the experiments are based on layouts as explained in Section II-C. After buffer insertion, we evaluate the buffered delay using PrimeTime (which is accurate), to avoid unfair comparisons due to the inaccuracy of analytical delay models. For delay estimation during early design stages, we may use fast and reasonably accurate analytical delay models instead.

Table III shows the comparison of the estimated delay with different buffer insertion methods. In column 2Dana, we show two delay values: 1) 2D delay = delay without TSV parasitics, 2) TSV delay = delay with TSV parasitics.
The 2D delay underestimates the (actual 3D) delay by about 23%, which indicates the importance of including TSV parasitics in delay estimation. We observe that 2Dana uses fewer buffers than the other methods, because it ignores TSV parasitics. As a result, compared with Gin, 2Dana overestimates the achievable delay of n2 and n4 by about 12% and 11%, respectively. Our 3Dheu uses about the same number of buffers as Gin does and its average estimated delay is only 2.5% higher, which demonstrates the effectiveness of our algorithm. The average runtimes of 3Dheu and Gin per net are about 8 µs and 6 ms, so 3Dheu is about 750 times faster.

Fig. 10 shows the buffer insertion results for n4. In 2Dana, the second buffer from the left drives three TSVs, resulting in a large delay. Our 3Dheu placed one more buffer than Gin did. Although the buffer locations chosen by 3Dheu and Gin are different, the buffered delay values differ by only 3.0%, as shown in Table III.

B. 3D Nets with Movable TSVs

Table IV compares the estimated delay for the fixed TSV case and the movable TSV case. We use the same target nets as in Table III. For the movable TSV case, we place the buffers based on 2D analytical buffer insertion (Eq. 1), then move the TSVs as discussed in Section III-A. The fixed column data is from Gin in Table III. By moving TSVs as well as buffers, we can reduce the number of buffers and the delay by 29% and 5.8%, respectively. This demonstrates the effectiveness of moving TSVs to reduce the number of buffers as well as the buffered delay.

V. CONCLUSIONS

We presented fast buffered delay estimation techniques for TSV-based 3D interconnects. Analytical delay models were applied to the 3D interconnects, and analytical buffer insertion formulations were developed to derive optimal buffer insertion solutions. For fixed TSV cases, we proposed a fast buffer insertion heuristic, which produced only a few percent of error in delay estimation

compared with van Ginneken buffer insertion based delay estimation. We also showed that for movable TSV cases, TSV relocation may improve delay. As follow-up work, we plan to extend our algorithm to handle multi-fanout nets and to perform buffered delay estimation on 3D interconnects of block-level 3D IC designs with buffer blockages.

TABLE III. COMPARISON OF THE ESTIMATED DELAY WITH DIFFERENT BUFFER INSERTION METHODS. THE TSV LOCATION IS THE DISTANCE FROM THE SOURCE GATE TO THE TSV, EXCLUDING THE LANDING PAD DIAMETER. THE LENGTH AND DELAY ARE IN µm AND ps, RESPECTIVELY. Columns: net; wirelen.; #TSV; TSV location; 2Dana (#buf, 2D delay, delay); 3Dheu (#buf, delay); Gin (#buf, delay).

Fig. 10. Buffer insertion results for n4: (a) TSV locations in n4, (b) 2Dana, (c) 3Dheu, (d) Gin. Numbers represent locations of TSVs and buffers in µm.

REFERENCES

[1] C. Alpert and A. Devgan. Wire Segmenting for Improved Buffer Insertion. In Proc. ACM Design Automation Conf.
[2] C. J. Alpert, A. Devgan, and S. T. Quay. Buffer Insertion With Accurate Gate and Interconnect Delay Computation. In Proc. ACM Design Automation Conf.
[3] C. J. Alpert, J. Hu, S. S. Sapatnekar, and C. N. Sze. Accurate Estimation of Global Buffer Delay Within a Floorplan. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 25(6), June.
[4] H. B. Bakoglu and J. D. Meindl. Optimal Interconnection Circuits for VLSI. IEEE Trans. on Electron Devices, 32(5), May.
[5] S. Dong, H. Bai, X. Hong, and S. Goto. Buffer Planning for 3D ICs. In Proc. IEEE Int. Symp. on Circuits and Systems.
[6] W. C. Elmore. The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers. J. Applied Physics, 19:55-63, Jan.
[7] C. V. Kashyap, C. J. Alpert, F. Liu, and A. Devgan. Closed-Form Expressions for Extending Step Delay and Slew Metrics to Ramp Inputs for RC Trees. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 23(4), Apr.
[8] G. Katti, M. Stucchi, K. D. Meyer, and W. Dehaene. Electrical Modeling and Characterization of Through Silicon via for Three-Dimensional ICs. IEEE Trans. on Electron Devices, 57(1), Jan.
[9] J. Lillis, C.-K. Cheng, and T.-T. Y. Lin. Optimal Wire Sizing and Buffer Insertion for Low Power and a Generalized Delay Model. IEEE Journal of Solid-State Circuits, 31(3).
[10] Chung-Wei Lin, Shih-Lun Huang, Kai-Chi Hsu, Meng-Xiang Lee, and Yao-Wen Chang. Multilayer Obstacle-Avoiding Rectilinear Steiner Tree Construction Based on Spanning Graphs. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 27(11), Nov.
[11] F. Liu, C. Kashyap, and C. J. Alpert. A Delay Metric for RC Circuits Based on the Weibull Distribution. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 23(3), Mar.
[12] Nangate. Nangate 45nm Open Cell Library.
[13] P. R. O'Brien and T. L. Savarino. Modeling the Driving-Point Characteristic of Resistive Interconnect for Accurate Delay Estimation. In Proc. IEEE Int. Conf. on Computer-Aided Design.
[14] J. Qian, S. Pullela, and L. Pillage. Modeling the Effective Capacitance for the RC Interconnect of CMOS Gates. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 13(12).
[15] W. Shi, Z. Li, and C. J. Alpert. Complexity Analysis and Speedup Techniques for Optimal Buffer Insertion with Minimum Cost. In Proc. Asia and South Pacific Design Automation Conf.
[16] L. P. P. P. van Ginneken. Buffer Placement in Distributed RC-tree Networks for Minimal Elmore Delay. In Proc. IEEE Int. Symp. on Circuits and Systems, 1990.
