Porosity Aware Buffered Steiner Tree Construction

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 100 Porosity Aware Buffered Steiner Tree Construction Charles J. Alpert, Gopal Gandham, Milos Hrkic, Jiang Hu, Stephen T. Quay and C. N. Sze Abstract In order to achieve timing closure on increasingly complex IC designs, buffer insertion needs to be performed on thousands of nets within an integrated physical synthesis system. Modern designs may contain large blocks which severely constrain the buffer locations. Even when there may appear to be space for buffers in the alleys between large blocks, these regions are often densely packed or may needed later to fix critical paths. Therefore, within physical synthesis, a buffer insertion scheme needs to be aware of the porosity of the existing layout to be able to decide when to insert buffers in dense regions to achieve critical performance improvement and when to utilize the sparser regions of the chip. This work addresses the problem of finding porosity-aware buffering solutions by constructing a smart Steiner tree to pass to van Ginneken s topology based algorithm. This flow allows one to fully integrate the algorithm into a physical synthesis system without paying an exorbitant runtime penalty. We show that significant improvements on timing closure are obtained when this approach is integrated into a physical synthesis system. Index Terms Buffer insertion, deep submicron, interconnect, layout, physical design, VLSI. I. INTRODUCTION It has been widely recognized that interconnect becomes a dominating factor for modern VLSI circuit designs. A key technology to improve interconnect performance is buffer insertion. Cong [4] speculates that close to 800,000 buffers will be required for 50 nanometer technologies. A. Previous Work Early works on buffer insertion are mostly focused on improving interconnect timing performance. The most influential pioneer work is van Ginneken s dynamic programming algorithm [16] that achieves polynomial time optimal solution on a given Steiner tree under Elmore delay model [7]. In [13], Lillis et al. extended van Ginneken s algorithm by using a buffer library with inverting and non-inverting buffers, while also considering power consumptions. The major weakness of the van Ginneken approach is that it requires a fixed Steiner tree topology has to be provided in advance which makes the final buffer solution quality dependent on the input Steiner tree. Even though it is optimal for a given topology, the van Ginneken algorithm will yield poor solutions when fed a poor topology. To overcome this problem, several works have proposed simultaneously constructing a Steiner This work is supported in part by the SRC under contract 2003-TJ-1124. C. J. Alpert and S. T. Quay are with the IBM Corporation, Austin, TX 78758 USA. G. Gandham is with the IBM Corporation, Hopewell Junction, NY 12533 USA. M. Hrkic is with the Department of Computer Science, University of Illinois, Chicago, IL 60607 USA. J. Hu and C. N. Sze are with the Department of Electrical Engineering, Texas A&M University, College Station, TX 77843 USA. (a) Buffer Buffer (b) Fig. 1. An arbitrary Steiner tree as shown in (a) may force buffers being inserted at dense regions. A porosity aware buffered Steiner tree, which is shown in (b), will enable buffers solution at sparse regions. tree while performing buffer insertion. Lillis et al. proposed the buffered P-Tree algorithm which integrates buffer insertion into the P-Tree Steiner tree algorithm [14]. Buffered P-Tree generally yields high quality solution, but its time complexity is also high because candidate solutions are explored on almost every node in the Hanan grid. In recently reported S-Tree [9] algorithm, alternative abstract topologies for a given Steiner tree are explored to promote solutions that are better at dealing with sink criticalities. This technique is integrated with P-Tree as SP-Tree algorithm in [8]. A different approach to solve the weakness of van Ginneken s algorithm is proposed by Alpert et al [3]. They construct a buffer-aware Steiner tree, called C- Tree for van Ginneken s algorithm. Despite being a two-stage sequential method, it yields solutions comparable in quality to simultaneous methods, while consuming significantly less CPU time. Recent trends toward hierarchical (or semi-hierarchical) chip design and system-on-chip design force certain regions of a chip to be occupied by large building blocks or IP cores so that buffer insertion is not permitted. These constraints on buffer locations can severely hamper solution quality, and these effects need be considered. Thus buffer blockages are considered in the buffered path [11], [12], [17] class of algorithms. Though optimal, they are only applicable to two pin nets. Works that handle restrictions on buffer locations while performing simultaneous Steiner tree construction and buffer insertion are proposed in [5], [15]. Like buffered P-Tree, these approaches can provide high quality solutions though at runtimes too exorbitant to be used in a physical synthesis system. In [1], a Steiner tree is rerouted to avoid buffer blockages before conducting buffer insertion. This sequential approach is fast, but sometimes unnecessary wiring detours may result in poor solutions. An adaptive tree adjustment technique is proposed in [10] to obtain good solution results efficiently.

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 101 0000000 1111111 0000000 1111111 000 111 00 11 000000 111111 Buffer (a) 0000000 1111111 0000000 1111111 000 111 00 11 0 1 000000 111111 Buffer (b) Fig. 2. An arbitrary Steiner tree as shown in (a) may force wires to pass through routing congested regions. A congestion aware buffered Steiner tree, which is shown in (b), will may let wires go through sparse regions. B. The Importance of Porosity In tradition design flows, buffer insertion is applied after placement to optimize interconnect timing. As interconnect effects become increasingly severe, buffer insertion needs to be pushed up earlier in the design flow to help logic optimization and placement algorithms on interconnect estimation and to reserve buffering resources. The sheer number of nets that require buffering means that resources need to be allocated intelligently. For example, large blocks placed close together create narrow alleys that are magnets for buffers since they are the only locations that buffers can be inserted for routes that cross over these blocks. But competition for resources for these routes is very fierce. Inserting buffers for less critical nets can eliminate space that is needed for more critical nets that may require gate sizing or other logic transforms. Further, though no blockages lie in the alleys, these regions could already be packed with logic and no feasible space for the buffer might exist. If the buffer insertion algorithm cannot recognize this scenario then inserting a buffer into this space may cause placement legalization to spiral it out to a region too far from its original location. Hence, whenever possible one should avoid denser regions unless absolutely critical. For example, Figure 1(a) shows a net which is routed through a dense region or hot spot, and Figure 1(b) shows the net routed differently to avoid this congested region. In addition to porosity (or placement congestion), a buffer solution may generate remarkable impact on wiring congestion as well. Figure 2 (a) and (b) show an example for a multi-pin net where the Steiner point needs to be moved outside of the dense region to yield less wiring congestion. The literature contains no buffering approach which addresses these issues. Note that several algorithms (such as S-Tree and SP-Tree) could be used to address porosity constraints since they can be run on arbitrary grid graphs. However, the extensions are hardly straightforward. One has to carefully consider the weighting scheme for edges in the grid graph, how to sparsify the graph to give reasonable runtimes, the appropriate cost functions, etc. Further, any simultaneous approach is likely to require too much runtime to be practically used within physical synthesis. This work proposes a porosity aware buffered Steiner tree construction and is the first buffer insertion work to address porosity in the layout directly (as opposed to simply specifying a handful of allowable buffer insertion locations). The approach begins with a timing-driven but porosity-ignorant Steiner tree, then applies a plate-based adjustment guided by length based buffer insertion. After performing localized blockage avoidance, the resulting tree is then passed to van Ginneken s algorithm to obtain a porosity aware buffered Steiner tree. Physical synthesis experiments on real industrial circuits confirms the effectiveness of this framework. II. PROBLEM FORMULATION The porosity is represented through a tile graph G(V G,E G ) where a node g V G corresponds to a tile and an edge e uv E G represents a boundary between two neighboring tiles u and v V G. A high porosity implies a low congestion or density, vice versa. If a tile g V G has an area of A(g) and its area occupied by placed cells are a(g), the placement density is defined as the area usage density d(g) = a(g) A(g) of tile g. For the boundary between two adjacent tiles g i and g j, its wire density d(g i,g j ) is the number of wires across it divided by the number of wiring tracks across this boundary. We take d 2 (g i,g j ) as wire density cost in our implementation. We address the following problem: Porosity-aware Buffered Steiner Tree Problem: Given a net N = {v 0,v 1,...,v n } with source v 0 and sinks {v 1,...,v n }, load capacitance c(v i ) and required arrival time q(v i ) for 1 i n, tile graph G(V G,E G ), and a buffer type b, construct a Steiner tree T (V,E), in which V = N V Steiner and edges in E span every node in V, such that a buffer insertion solution that satisfies q(v i ) may be obtained with a minimal porosity cost(wire density cost). For porosity cost, one can sum the costs of the tiles that the routes cross. We use the following function, though certainly others can be used as well, depending on the application. For each buffer inserted in a tile g, the porosity cost p(g) is defined as: 0.5 : d(g) 0.5 p(g) = 4d 2 (g) 2d(g)+0.5 : 0.5 <d(g) 0.8 11.5d 2 (g) 8d(g)+0.5 : 0.8 <d(g) According to this porosity cost definition, any difference between low placement density (below 0.5) does not matter. However, when density is high (greater than 0.8), the cost penalty is much heavier. Thus, the more buffers inserted, the higher the cost, though buffers are close to free in sparser regions. The coefficients for terms with d(g) and d 2 (g) are selected in a way to make this cost function continuous. Note, that we do not recommend using an infinite penalty if inserting a buffer exceeds the density target. It may be better to insert a buffer into an over dense region and move other cells out of the region than to accept a significantly inferior delay result. Ultimately the right trade-off between porosity cost and timing depends on the criticality of the net and the design characteristics. III. ALGORITHM DESCRIPTION A. Overview Since simultaneous Steiner tree construction and buffer insertion is computationally expensive for practical circuit designs, we prefer to first construct a Steiner tree followed by

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 102 v1 v5 v1 v5 v1 v5 v4 v3 v4 v3 v4 v3 v2 v2 v2 (a) (b) (c) Fig. 3. (a) Candidate solutions are generated from v 2 and v 3 and propagated to every tile, which is shaded, in the plate for v 4. Solution search is limited to the bounding boxes indicated by the thickened dashed lines. (b) Solutions from v 1 and every tile in the plate for v 4 are propagated to the plate for v 5. (c) Solutions from plate of v 5 are propagated to the source and the thin solid lines indicate an alternative tree that may result from this process. a van Ginneken style buffer insertion algorithm that employs higher order delay models, handles slew and load capacitance constraints, manages a library of inverting and non-inverting buffers and trades off buffer resource utilization with solution quality. Integrating a Steiner tree construction into this algorithm while maintaining all its features would be prohibitive. However, just constructing a Steiner tree without performing buffer insertions cannot possibly yield the best topology. Therefore, we propose to solve the porosity aware buffered Steiner tree problem through the following four stages: 1) Initial porosity ignorant timing-driven Steiner tree construction 2) Plate-based adjustment for porosity improvement 3) Local blockage avoidance 4) Van Ginneken style buffer insertion For stage one, one can construct a timing-driven Steiner tree through any heuristics. We choose to apply the buffer aware C-Tree algorithm [3]. The C-Tree algorithm first clusters the sinks according to their timing criticality, signal polarity and location proximity. Then a timing driven Steiner tree algorithm is applied to obtain the intra-cluster and inter-cluster routings. Stage 2 contains the key ideas behind the algorithm. The plate-based adjustment phase modifies the existing timingdriven Steiner in an effort to reduce congestion cost while maintaining the tree s high performance. One of the key elements is that it allows Steiner points to migrate outside of highporosity tiles into lower-porosity tiles to reduce congestion while maintaining (if not improving) performance. After Stage 2, the Steiner tree is correct in that it goes through the tiles that optimize performance while minimizing porosity cost, though the routes may overlap blockages. Performing local blockage avoidance in Stage 3, manipulates the routes within each tile to avoid blockages, thereby allowing more buffer insertion candidates. However, this stage does not disturb the tree topology uncovered from Stage 2. Finally, in Stage 4 the resulting tree topology is fixed for van Ginneken style buffer insertion [16]. Van Ginneken s algorithm is a dynamic programming based bottom-up approach. A set of candidate solutions are propagated from leaf nodes toward the source and the optimal solution is selected at the source. Since we use known algorithms for Stages 1 and 4, the rest of the discussion focuses on the other two stages. B. Stage 2: Plate-based Adjustments The basic idea for the plate-based adjustment is to perform a simplified simultaneous buffer insertion and local tree adjustment so that the Steiner nodes and wiring paths can be moved to regions with greater porosity without significant disturbance on the timing performance obtained in Stage 1. The buffer insertion scheme employed here is similar to the length based buffer insertion in [2]. However, we allow the Steiner points and wiring paths to be adjusted. This approach is also distinctive from a simultaneous buffer insertion and Steiner tree construction approach [14] in which the Steiner tree is built from scratch. The plate-based adjustment traverses the given Steiner topology in a bottom-up fashion similar to van Ginneken s algorithm. During this process, Steiner nodes and wiring paths may be adjusted together with buffer insertion to generate multiple candidate solutions. The range of the moves are restrained by the plates defined as follows. For a node v i T (V,E) which is located in a tile g k,aplate P (v i ) for v i is a set of tiles in the neighborhood of g k including g k itself. During the plate-based adjustment, we confine the location change for each Steiner node within its corresponding plate. If v i is a sink or the source node we let P (v i )={g k }. For example, when v i is a branch node, we set P (v i ) to be the 3 3 array of tiles centered at g k. Of course, if g k lies on the border of the entire tile graph, P (v i ) may include fewer tiles. Figure 3(a) gives an example of the plate corresponding to Steiner node v 4. The plate indicates any of the possible location which the Steiner node may be moved to. This example shows the Steiner node being moved up one tile and right one tile. The search for alternate wiring paths is limited to the minimum bounding box covering the plates of two end nodes. In Figure 3, such bounding boxes are indicated by the thickened dashed lines. Therefore, the size of plates define the search range for both Steiner nodes and wiring paths.

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 103 Certainly, a plate may be defined to include more tiles or to even be a irregular shape. For example, if the plate size is 1 1, our algorithm will basically reduce to a length based/van Ginneken style buffer insertion algorithm along a fixed topology. However, one could also choose the plate to be the entire tile graph this yields an algorithm similar to the S-Tree embedding [9]. By choosing a plate of size larger than one but smaller than the entire grid graph, we still obtain the ability to modify the topology to move critical Steiner points into low-porosity regions while also capping the runtime penalty. Hence, our experiments use a 3 3 tile size to capture this tradeoff. An example of how a new Steiner topology might be constructed from an existing topology is demonstrated in Figures 3(a)-(c). Adopting a length-based buffer insertion scheme [2] makes the adjustment simpler and thereby faster. Since this stage only produces a Steiner tree before buffering, the purpose of including buffering at this stage is for estimation purposes, hence a simplified scheme should be adequate. Alpert et al. [2] suggest that buffer insertion may be performed in a simple way by just following a rule of thumb: the maximal interval between two neighboring buffers is no greater than certain upper bound. The interval between neighboring buffers is specified in term of number of tiles. In this work, this constraint is modified to be the maximum load capacitance U a buffer/driver may drive, so that sink/buffer capacitance can be incorporated. To keep the succinctness of the tile-based interval metric in [2], we discretize the load capacitance in units equivalent to the capacitance of wire with average tile size. If the average tile boundary length is λ and the wire capacitance per unit length is ĉ, then a load capacitance is expressed in the number of θ = λĉ. This modification allows us to use non-uniform sized tiles, even though we still assume such non-uniformity is very limited. Fig. 4. Procedure: F indcandidates(v) Input: Current node v to be processed Output: Candidate solution set S(P (v)) Global: Steiner tree T (V,E) Tile graph G(V G,E G ) 1. If v is a sink S(v) {(v, 0, 0)} S(P (v)) {S(v)} Return S(P (v)) 2. v l left child of v S(P (v l )) F indcandidates(v l ) 3. S l (P (v)) P ropagate(s(p (v l )),P(v)) 4. If v has only one child Return S l (P (v)) 5. v r right child of v S(P (v r )) F indcandidates(v r ) 6. S r (P (v)) P ropagate(s(p (v r )),P(v)) 7. S(P (v)) Merge(S l (P (v)),s r (P (v)) 8. Return S(P (v)) Core algorithm. Also for the sake of simplification, we assume a single typical buffer type, the Elmore delay model for interconnect and a switch level RC gate delay model for this plate-based adjustment. Each intermediate buffer solution is characterized by a 3- tuple s(v, c, w) in which v is the root of the subtree, c is the discretized load capacitance seen from v, and w is the accumulated porosity cost. A solution s i (v, c i,w i ) is said to be dominated by another solution s j (v, c j,w j ),ifc i c j and w i w j. A set of buffer solutions S(v) at node v is a nondominating set when there is no solution in S(v) dominated by another solution in S(v). Usually S(v) is arranged as an array {s 1 (v, c 1,w 1 ),s 2 (v, c 2,w 2 )...} in the ascending order of load capacitance. The basic operations of the length-based buffer insertion are: Procedure: P ropagate(s(p (v i )),P(v j )) Input: Candidate solutions at P (v i ) Expansion region P (v j ) Output: Candidate solution set S(P (v j )) 0. S(P (v j )) B bounding box of P (v i ) and P (v j ) Q 1. For each tile g k P (v i ) Q Q S(g k ) 2. While Q 3. s min cost solution in Q g k tile where s locates 4. For each tile g l adjacent to g k 5. If g l B Expand(s(g k ),g l ) Prune Q 6. Return S(P (v j )) Fig. 5. Subroutine of propagating candidate solutions from one plate to another plate. AddW ire(s i (v, c i,w i ),u): grow solution s i at v to node u by adding a shortest distance wire between them. Node u is either within the same tile as v or in the neighboring tile of v. If the wire has capacitance C, we can get c j (u) =c i (v)+ C θ and w j(u) =w i (v)+γl(v, u) where l(v, u) is the rectilinear distance between v and u divided by the average tile size and γ is a weighting factor. Since the expansion step is one tile, l(v, u) is normally a value around 1. When the wiring congestion is considered, we will count the wire density d(u, v) which is the ratio of the number of wires across the boundary between u and v to the number of wiring tracks there. Thus, an additional term βd 2 (u, v) is added to w j (u). The coefficient β is an weighting factor. AddBuffer(s i (v, c i,w i )): insert buffer at v. If buffer b has an input capacitance c b, output resistance r b and intrinsic delay t b, then the buffered solution s j (v, c j,w j ) is characterized by c j (v) =c b /θ and w j (v) =w i (v) + αp(g) in which p(g) is the porosity cost defined in Section II and α is its weighting factor. Prune(S(v)): remove any solution s i S(v) that is dominated by another solution s j S(v).

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 104 to be removed from Q as well. When the candidate solution expansion reaches node v j, solution set S(P (v j )) is updated. (a) (c) Fig. 6. A path overlaps with a buffer blockage as in (a) can be moved out of the blockage with a small detour as in (b) with a small penalty coefficient on cost. Only a greater penalty coefficient can allow a large detour like from (c) to (d). Expand(s i (v, c i,w i ),u): grow s i (v) to node u by AddW ire to get solution s j (u, c j,w j ), insert buffer for s j (u) to obtain s k (u, c k,w k ). Add the unbuffered solution s j and buffered solution s k into solution set S(u) and prune the solutions in S(u). Merge(S l (v),s r (v)): merge solution set from left child of v to the solution set from the right child of v to obtain a merged solution set S(v). For a solution s i,l (v, c i,l,w i,l ) from the left child and a solution s j,r (v, c j,r,w j,r ), they are merged to s k (v, c k = c i,l + c j,r,w k = w i,l + w j,r ) only when c k U. Similar to the van Ginneken s algorithm, starting from the leaf nodes, candidate solutions are generated and propagated toward the source in a bottom-up manner. Before we propagate candidate solutions from node v i to its parent node v j,wefirst find both plate P (v i ) and plate P (v j ) and define a bounding box which is the minimum sized array of tiles covering both P (v i ) and P (v j ). Then we propagate all the candidate solutions from each tile of P (v i ) to each tile of P (v j ) within this bounding box. Note that branch nodes are allowed to be moved in a neighborhood defined by the plate. Since the branch nodes are more likely to be buffer sites due to the demand on decoupling non-critical branch load from the critical path, allowing branch nodes to be moved to less congested area is especially important. Moreover, such move is a part of a candidate solution, thus such move will be committed only when its corresponding candidate solution is finally selected at the driver. Therefore, such move is dynamically generated and selected according to the request of the final minimal porosity cost solution. The generality of this adjustment algorithm allows its application on alleviating wiring congestion simply through incorporating a cost reflecting the wire density in AddW ire(). The complete description on the core algorithm is given in Figure 4 and the subroutine of solution propagation is in Figure 5. In the process of candidate solution propagation (Figure 5), a priority queue Q is employed to maintain the wavefront solutions. In line 5 of Figure 5, if a solution at g l is in Q and is pruned out in Expand(s(g k ),g l ), it needs (b) (d) C. Stage 3: Local Blockage Avoidance After the tree has been modified to route through the tiles with high porosity, the tree still may overlap blockages. This is especially true if one uses a sparse grid graph to begin with. This occurs because the plate-based adjustment in Stage 2 is performed with respect to a given tile graph, but large blockages themselves are managed as part of a tile. For example, if a blockage occupies half the area of a given tile, but the remaining half of the tile is empty, the tile s density is 50%. Of course, a net that traverses through this tile should be routed in the space not occupied by the blockage wherever possible. Thus, this stage performs local cleanup within tiles to achieve blockage avoidance without changing tile assignment. The local blockage avoidance is similar to the algorithm proposed in [1]. The main idea is to divide each Steiner tree into a set of non-overlapping 2-paths and then perform local blockage avoidance on each 2-path. A 2-path is a path of nodes for a given Steiner tree such that both endpoints are either the source, a sink, or a Steiner node with degree greater than two. Each 2-path of a given Steiner tree is rerouted over an extended Hanan grid generated by drawing horizontal and vertical lines through each pin and extending the borders of each rectangle shaped buffer blockage. The extended Hanan grids are shown by the dashed lines in Figure 6. In this grid, if an edge does not overlap with any blockage, its cost is its geometric length. Otherwise, edge cost will be geometric length timed by a penalty coefficient greater than one. A minimal cost path is searched over this grid to replace original 2-path to achieve blockage avoidance. The value of the penalty coefficient controls the tradeoff between blockage avoidance and total wirelength. A greater value of the penalty coefficient implies a stronger desire to avoid the blockage and a greater wiring detour is allowed. A small penalty coefficient will restrict the total wirelength with a reduced blockage avoidance ability. Such tradeoff is illustrated in Figure 6. During seeking the minimal cost path in [1], the entire region covering the two endpoints are expanded properly and are searched. For the example in Figure 7(a), when the 2-path between v 3 and v 5 is rerouted, the search region is indicated by the thickened dashed lines in Figure 7(a). In our case, since the Steiner tree after stage 2 is already in high porosity regions, we may refine the search region to be the tiles the 2-path passing through as shown by the thickened dashed lines in Figure 7(b). In this work, we define the penalty coefficient to be 1.01 so that the wire detour from this rerouting will be very small (no more than 1% wirelength increase) and the disturbance to the timing performance of the tree is very limited. In Figure 7, the rerouted path after blockage avoidance is shown in (b). IV. EXPERIMENTAL RESULTS In order to test the effectiveness of our algorithm, we perform experiments in three scenarios: (1) experiments on several large nets to observe the effect on porosity; (2) experiments with consideration on wiring congestion; (3) applying

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 105 v4 v1 v4 v1 v5 v2 v5 v2 v3 v3 (a) (b) Fig. 7. An example of local blockage avoidance. Shaded rectangles represent buffer blockages. The Steiner tree before blockage avoidance (a) and after blockage avoidance (b). our algorithm into an industrial design flow to evaluate impact on overall circuit timing performance. A. Experiments Considering Porosity 7 6 stages 1 4 stages 1 3 4 stages 1 2 3 4 TABLE I BENCHMARK CIRCUITS.THE SLACK UNIT IS ps. Net # sinks Max slack Min slack Grid size mcu0s5 18 6596.0 6195.3 29 31 mcu1s9 19 6507.3 6247.7 40 30 n1071 17 2560.4 1902.4 26 42 n18905 29 6650.3 609.5 58 46 n313 19 6704.4 1232.5 52 65 n7866 32 6703.5 96.9 88 36 n8692 21 6390.1 1053.9 74 26 n8702 43 6588.5 739.3 64 41 n8730 20 6656.2 729.5 58 35 netbig1 88 159564.7 1974 65 90 netbig2 79 65837.9 104.2 47 31 netbig3 63 40643.2 1327.4 42 36 Total porosity cost 5 4 3 2 1 100 150 200 250 300 350 400 450 Slack improvement/ps Fig. 8. The cost-slack trade-off curves for net mcu0s5. The statistics for the benchmark circuits are shown in Table I. These nets are extracted from industrial circuits with randomly generated wire congestion map. The maximum slack and the minimum slack among all sinks in the initial C-Tree is given for each net. The algorithms are implemented in C++ and the experiments are performed on a SUN Ultra-4 workstation of 400MHz and 2Gb memory. In each set of experiments, we compare three variants: (1) 1-4 in which only stage 1 and 4 are performed; (2) 1-3-4 in which stage 3 of blockage avoidance is run between stage 1 and 4; (3) 1-2-3-4 which the complete four-stage algorithm. In stage 2, the weighting factor for porosity cost and wirelength is α =1and γ =0.5, respectively. The metric we observe includes: Slack improvement: the change of the minimum slack with respect to initial C-Tree result in ps. #bufs: number of buffers inserted. Average(P ave ) and the maximum(p max ) porosity cost for the inserted buffers. Total wirelength. CPU time in seconds. TABLE III STATISTICS FOR THE POROSITY COST ONLY EXPERIMENTAL RESULTS. THE SLACK IMPROVEMENT IS THE DIFFERENCE FROM 1-4 FLOW. POROSITY COST IS NORMALIZED WITH RESPECT TO THE POROSITY COST FROM 1-4 FLOW. THE MIN-AVE-MAX IS THE MINIMUM, THE AVERAGE AND MAXIMUM VALUE AMONG ALL NETS. 1-3-4 1-2-3-4 Min Ave Max Min Ave Max Slack improve(ps) -9.7 81.4 408.3-19.3 200.7 814.0 Avg porosity cost 0.74 1.12 1.66 0.79 0.91 1.04 Max porosity cost 0.63 1.68 4.80 0.45 0.79 1.00

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 106 TABLE II EXPERIMENTAL RESULTS WHEN ONLY POROSITY COST IS MINIMIZED.POROSITY COST IS DENOTED AS P. Net Flow Slack impr P ave #buf P max Wirelen CPU(s) 1-4 330.80 0.75 7 1.25 10830 0.91 mcu0s5 1-3-4 378.58 0.73 9 1.16 10830 3.75 1-2-3-4 417.97 0.66 9 0.85 12513 8.82 1-4 865.50 0.57 12 1.02 12600 4.60 mcu1s9 1-3-4 876.40 0.52 13 0.64 12600 17.36 1-2-3-4 911.90 0.50 14 0.50 13726 74.41 1-4 126.02 2.56 2 2.56 3837 0.10 n1071 1-3-4 189.23 1.90 4 2.56 3837 0.39 1-2-3-4 203.21 2.03 5 2.56 3997 4.66 1-4 1927.62 0.60 9 1.38 16461 2.63 n18905 1-3-4 1964.95 0.94 12 3.25 16461 5.64 1-2-3-4 2230.13 0.51 14 0.61 18999 23.17 1-4 382.02 1.89 5 2.09 16245 0.22 n313 1-3-4 386.39 1.94 9 2.09 16245 1.95 1-2-3-4 452.01 1.65 10 2.09 17555 11.57 1-4 2332.62 1.05 5 1.67 16617 0.82 n7866 1-3-4 2358.14 0.94 3 1.67 16617 23.98 1-2-3-4 2365.44 0.94 3 1.67 16379 61.65 1-4 1912.96 0.55 19 1.02 13929 149.33 n8692 1-3-4 1907.25 0.72 23 3.28 14373 67.92 1-2-3-4 1893.71 0.52 21 0.90 14978 83.24 1-4 4322.21 2.09 9 3.42 14496 1.07 n8702 1-3-4 4430.78 2.12 14 3.42 14202 25.36 1-2-3-4 4599.57 1.80 19 2.40 14967 89.14 1-4 1760.20 0.62 11 1.84 15363 3.52 n8730 1-3-4 1830.23 0.73 14 3.69 15363 9.29 1-2-3-4 1945.03 0.60 17 1.06 17211 22.38 1-4 6209.30 0.58 28 1.67 59004 35.67 netbig1 1-3-4 6436.58 0.70 38 3.11 59004 39.57 1-2-3-4 6653.41 0.60 40 1.46 64830 79.10 1-4 873.35 2.18 8 2.76 44106 0.74 netbig2 1-3-4 1281.61 2.40 16 2.76 44106 3.25 1-2-3-4 1687.39 2.12 23 2.76 48911 30.79 1-4 330.23 1.54 2 1.67 29370 0.81 netbig3 1-3-4 401.03 1.59 3 1.67 29370 1.09 1-2-3-4 552.99 1.45 7 1.67 31115 17.71 The results for set 1 are shown in Table II with its statistics in Table III. The data in Table III show that the complete 1-2-3-4 flow can achieve 200ps slack improvement over 1-4 flow and 120ps improvement over 1-3-4 flow on average. The average porosity cost P ave is an average value for all buffers inserted in a net and P max is the maximum porosity among all buffers. The average porosity cost is reduced by 9% from 1-4 flow. Since 1-3-4 flow does not consider porosity, its rerouting in stage 3 increases the average porosity cost by 12%. The effect is more significant for the maximum porosity cost P max. The 1-2-3-4 flow achieves 21% maximum porosity cost reduction while the 1-3-4 flow results in 68% greater cost. Therefore, the stage 2 is very necessary on improving both timing and porosity cost. In order to further illustrate the effect of our algorithm, the cost-slack trade-off curves of net mcu0s5 for different flows are shown in Figure 8. These curves come from the final candidate solutions at the root node in Van Ginneken s algorithm at stage 4. Obviously, the solutions from our proposed 1-2-3-4 flow is better on both timing and cost. The CPU time is concluded in the rightmost column of Table II. B. Experiments Considering Wiring Congestion We also tested the effect of our algorithm on wiring congestion for the same set of testcases in Section IV-A. When wiring congestion is considered, only the cost function in stage 2 is changed by augmenting wire density cost βd 2 where d is the wire density. Average(D ave ) and the maximum(d max ) wire density over all tile boundaries the edges of the Steiner tree crossing are reported in the experimental results. Table IV and Table V show results from experiments in which wire density is considered but porosity cost is ignored. In these experiments, the weighting factors for porosity cost, wire density and scaled wirelength are α = 0,β = 1 and γ =0.25, respectively. The statistics in Table V shows that the average wire density, which is the average of the wire density among all the tile boundaries the routing tree across, is improved by 30%. The peak wire density is reduced by 19% as well. The combination of porosity and wiring congestion is tested and the results are listed in Table VI and Table VII. In these experiments, the weighting factors for porosity cost, wire density and scaled wirelength are α =1,β =0.5and γ =0.1, respectively. Since the only different part is in stage 2, the results of flow 1 4 and flow 1 3 4 are the same as those in Table II and Table IV. However, the statistics for flow 1-3-4 is shown in Table VII for the ease of comparison. Sometimes wire density is increased by our algorithm. However, this normally happens when the wire density of the initial C-Tree is low. For example, the average wire density is increased from

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 107 TABLE VI EXPERIMENTAL RESULTS OF 1-2-3-4 FLOW WHEN BOTH POROSITY COST AND WIRE DENSITY COST ARE CONSIDERED.POROSITY COST AND WIRE DENSITY ARE DENOTED AS P AND D, RESPECTIVELY. Net Slack impr P ave #buf P max D ave D max Wirelen CPU(s) mcu0s5 412.93 0.59 9 0.85 0.19 0.85 12937 11.42 mcu1s9 903.92 0.50 15 0.50 0.20 0.79 14264 45.16 n1071 186.97 1.90 4 2.56 0.08 0.32 4094 4.25 n18905 2254.63 0.50 13 0.50 0.44 0.95 19564 24.38 n313 450.87 1.69 11 2.09 0.41 0.88 18165 10.71 n7866 2300.07 0.50 5 0.50 0.35 0.85 16993 41.42 n8692 1894.26 0.50 20 0.55 0.16 0.79 14876 83.95 n8702 4567.38 1.50 12 2.40 0.29 0.83 15329 34.81 n8730 1936.88 0.57 16 1.06 0.41 0.86 17376 25.55 netbig1 6649.48 0.58 40 1.46 0.44 0.97 67647 97.00 netbig2 1676.72 1.88 24 2.76 0.47 0.97 52825 32.00 netbig3 579.06 1.49 7 1.67 0.45 0.99 33351 18.20 TABLE IV EXPERIMENTAL RESULTS WHEN ONLY WIRE DENSITY COST IS MINIMIZED.WIRE DENSITY IS DENOTED AS D. Net Flow Slk impr D ave D max Wirelen CPU(s) 1-4 330.80 0.27 0.93 10830 0.91 mcu0s5 1-3-4 378.58 0.29 0.93 10830 3.75 1-2-3-4 413.57 0.18 0.62 12646 7.79 1-4 865.50 0.26 0.80 12600 4.60 mcu1s9 1-3-4 876.40 0.26 0.79 12600 17.36 1-2-3-4 863.58 0.15 0.59 13699 32.97 1-4 126.02 0.10 0.41 3837 0.10 n1071 1-3-4 189.23 0.08 0.40 3837 0.39 1-2-3-4 203.21 0.06 0.31 4020 2.99 1-4 1927.62 0.49 0.99 16461 2.63 n18905 1-3-4 1964.95 0.49 0.99 16461 5.64 1-2-3-4 2045.26 0.37 0.97 18881 17.58 1-4 382.02 0.45 0.87 16245 0.22 n313 1-3-4 386.39 0.44 0.88 16245 1.95 1-2-3-4 371.80 0.35 0.85 18457 7.61 1-4 2332.62 0.31 0.79 16617 0.82 n7866 1-3-4 2358.14 0.34 0.85 16617 23.98 1-2-3-4 2391.25 0.22 0.57 17441 28.79 1-4 1912.96 0.17 0.80 13929 149.33 n8692 1-3-4 1907.25 0.15 0.79 14373 67.92 1-2-3-4 1886.53 0.08 0.40 15124 55.68 1-4 4322.21 0.42 0.73 14496 1.07 n8702 1-3-4 4430.78 0.54 0.99 14202 25.36 1-2-3-4 4584.66 0.23 0.68 15195 23.98 1-4 1760.20 0.41 0.87 15363 3.52 n8730 1-3-4 1830.23 0.44 0.87 15363 9.29 1-2-3-4 1952.44 0.31 0.86 17050 22.21 1-4 6209.30 0.47 0.97 59004 35.67 netbig1 1-3-4 6436.58 0.47 0.97 59004 39.57 1-2-3-4 6333.22 0.41 0.97 66536 82.45 1-4 873.35 0.47 0.97 44106 0.74 netbig2 1-3-4 1281.61 0.48 0.97 44106 3.25 1-2-3-4 1351.06 0.41 0.95 50636 18.12 1-4 330.23 0.47 0.99 29370 0.81 netbig3 1-3-4 401.03 0.47 0.99 29370 1.09 1-2-3-4 530.92 0.41 0.89 33835 13.67 0.31 to 0.35 for net n7866. When the initial wire density is low, moderate wire density increase does not matter and can be utilized to trade for porosity cost reduction. For net n7866, the average porosity cost is reduced from 1.05 to 0.5. The total runtime over experiments in both Section IV-A and Section IV-B on all nets for each stage is shown in Table VIII. The CPU time on stage 2 is about the same as the CPU time on stage 4. The average wirelength increase over C-Tree due to our algorithm is about 10%. TABLE V STATISTICS FOR THE WIRE DENSITY COST ONLY EXPERIMENTAL RESULTS. THE SLACK IMPROVEMENT IS THE DIFFERENCE FROM 1-4 FLOW. WIRE DENSITY IS NORMALIZED WITH RESPECT TO THE WIRE DENSITY FROM 1-4 FLOW. THE MIN-AVE-MAX IS THE MINIMUM, THE AVERAGE AND THE MAXIMUM VALUE AMONG ALL NETS. 1-3-4 1-2-3-4 Min Ave Max Min Ave Max Slack improve(ps) -9.7 81.4 408.3-26.4 125.2 477.7 Avg wire density 0.81 1.05 1.34 0.50 0.70 0.88 Max wire density 0.98 1.03 1.36 0.44 0.81 1.00 TABLE VII STATISTICS FOR EXPERIMENTS WHEN BOTH POROSITY COST AND WIRE DENSITY COST ARE CONSIDERED.THE SLACK IMPROVEMENT IS THE DIFFERENCE FROM 1-4 FLOW. POROSITY COST AND WIRE DENSITY ARE NORMALIZED WITH RESPECT TO RESULT OF 1-4 FLOW. THE MIN-AVE-MAX IS THE MINIMUM, THE AVERAGE AND THE MAXIMUM VALUE AMONG ALL NETS. 1-3-4 1-2-3-4 Min Ave Max Min Ave Max Slack improve (ps) -9.7 81.4 408.3-32.6 192.8 803.4 Avg porosity cost 0.74 1.12 1.66 0.48 0.84 0.99 Max porosity cost 0.63 1.68 4.80 0.30 0.71 1.00 Avg wire density 0.81 1.05 1.34 0.69 0.90 1.12 Max wire density 0.98 1.03 1.36 0.66 0.96 1.14 C. Experiments in Industrial Physical Synthesis System We also incorporate our porosity-aware buffered Steiner tree algorithm into an industrial physical synthesis system called PDS [6]. The system begins with a placed netlist and performs several operations on critical paths to reduce timing closure, such as gate sizing, pin swapping, logic transforms, and of course buffer insertion. At the end of physical synthesis, PDS reports the following statistics measuring the success of a run: TABLE VIII TOTAL RUNTIME FOR EACH STAGE IN SECONDS. Flow Stage 1 Stage 2 Stage 3 Stage 4 Total 1-4 9 - - 592 601 1-3-4 9-58 531 599 1-2-3-4 9 572 129 539 1249

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 108 1) Slack: the minimum slack among all the timing paths, 2) # Neg: the number of timing paths with negative slack, 3) FOM(Figure of Merit): a measure of the cumulative slack of all negative slack cells in the design and a greater value of FOM is desired, 4) TWL: total wirelength, For each of three industrial circuits, we ran physical synthesis in two different ways: 1) Baseline: just Stages 1 and 4 in Section 3 are run. 2) Porosity: the porosity-based modifications of Stage 2 and 3 are included on top of the baseline. Only porosity cost is considered here. We make the following observations. For each case, using the porosity algorithm reduced the number of nets with negative slack and also the FOM. For two of the three test cases, some difference in the worst slack in the design was observed, though not enough to be significant. The total penalty for wirelength is less than half a percent. In some sense, the ability to measure the effectiveness of the algorithm is only as good as the design data. For truly chunky semi-hierarchical designs, this technology should have a much bigger impact than for totally flat designs, since it is the alley space problems that can significantly impact the ability to close timing. Finally, Figure 9 and 10 illustrate the kind of impact the algorithm can have. Figure 9 shows the baseline for an example 17-pin net for test2, and 10 shows the net layout for the porosity-aware algorithm. The circles along the topology indicate possible buffer insertion points. Observe that the layouts are similar, but the porosity-aware layout is able to better avoid blockages while maintaining the same integrity of the timing-driven Steiner tree. After inserting buffers, the worst case slack on the tree in Figure 10 is 150ps better than that of the tree in Figure 9. [4] J. Cong. Challenges and opportunities for design innovations in nanometer technologies. SRC Design Sciences Concept Paper, 1997. [5] J. Cong and X. Yuan. Routing tree construction under fixed buffer locations. In Proceedings of the ACM/IEEE Design Automation Conference, pages 379 384, 2000. [6] W. Donath, P. Kudva, L. Stok, P. Villarrubia, L. Reddy, A. Sullivan, and K. Chakraborty. Transformational placement and synthesis. In Proceedings of Design, Automation and Test in Europe Conference, pages 194 201, 2000. [7] W. C. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55 63, January 1948. [8] M. Hrkic and J. Lillis. Buffer tree synthesis with consideration of temporal locality, sink polarity requirements, solution cost and blockages. In Proceedings of the ACM International Symposium on Physical Design, pages 98 103, 2002. [9] M. Hrkic and J. Lillis. S-tree: A technique for buffered routing tree synthesis. In Proceedings of the ACM/IEEE Design Automation Conference, pages 578 583, 2002. [10] J. Hu, C.J. Alpert, S.T. Quay, and G. Gandham. Buffer insertion with adaptive blockage avoidance. In Proceedings of the ACM International Symposium on Physical Design, pages 92 97, 2002. [11] A. Jagannathan, S.-W. Hur, and J. Lillis. A fast algorithm for contextaware buffer insertion. In Proceedings of the ACM/IEEE Design Automation Conference, pages 368 373, 2000. [12] M. Lai and D.F. Wong. Maze routing with buffer insertion and wiresizing. In Proceedings of the ACM/IEEE Design Automation Conference, pages 374 378, 2000. [13] J. Lillis, C. K. Cheng, and T. Y. Lin. Optimal wire sizing and buffer insertion for low and a generalized delay model. IEEE Journal of Solid- State Circuits, 31(3):437 447, March 1996. [14] J. Lillis, C. K. Cheng, and T. Y. Lin. Simultaneous routing and buffer insertion for high performance interconnect. In Proceedings of the Great Lake Symposium on VLSI, pages 148 153, 1996. [15] X. Tang, R. Tian, H. Xiang, and D.F. Wong. A new algorithm for routing tree construction with buffer insertion and wire sizing under obstacle constraints. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 49 56, 2001. [16] L. P. P P. van Ginneken. Buffer placement in distributed RC-tree networks for minimal Elmore delay. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 865 868, 1990. [17] H. Zhou, D. F. Wong, I-M. Liu, and A. Aziz. Simultaneous routing and buffer insertion with restrictions on buffer locations. In Proceedings of the ACM/IEEE Design Automation Conference, pages 96 99, 1999. V. CONCLUSION This paper addresses the problem that not just blockages, but the porosity and wiring congestion of the entire layout affecting the quality of buffer insertion algorithms, especially within a physical synthesis environment. We presented a four stage algorithm to address buffered Steiner tree design. The key idea in Stage 2 was to use a plate to limit the solution space exploration for Steiner nodes during tree adjustment, thereby permitting a fairly efficient algorithm. We demonstrated the effectiveness of this technique on benchmark circuits as well as in an industrial physical synthesis system. REFERENCES [1] C. J. Alpert, G. Gandham, J. Hu, J. L. Neves, S. T. Quay, and S. S. Sapatnekar. A Steiner tree construction for buffers, blockages, and bays. IEEE Transactions on Computer-Aided Design, 20(4):556 562, April 2001. [2] C. J. Alpert, J. Hu, S. S. Sapatnekar, and P. G. Villarrubia. A practical methodology for early buffer and wire resource allocation. In Proceedings of the ACM/IEEE Design Automation Conference, pages 189 194, 2001. [3] C.J. Alpert, G. Gandham, M. Hrkic, J. Hu, A.B. Kahng, J. Lillis, B. Liu, S.T. Quay, S.S. Sapatnekar, and A.J. Sullivan. Buffered Steiner trees for difficult instances. IEEE Transactions on Computer-Aided Design, 21(1):3 14, January 2002.

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. XX, NO. Y, MONTH 2003 109 TABLE IX PHYSICAL SYNTHESIS ON INDUSTRIAL DESIGNS. Testcase #cells #blockages Grid size Algorithm Slack(ns) #Neg FOM TWL test1 155K 209 24 24 baseline -1.60 36827-10995 122.5 porosity -1.50 29112-8224 122.7 test2 334K 848 32 32 baseline -0.96 14885-5209 205.4 porosity -0.98 14115-4896 206.2 test3 293K 18 32 32 baseline -1.32 59525-25870 111.4 porosity -1.29 51743-22213 111.7 Fig. 9. An example of Steiner tree from baseline methodology. The shaded rectangles are buffer blockages. The source is indicated by a cross. Fig. 10. The Steiner tree obtained through porosity aware method on the same net as in Figure 9.