Techniques for Fast Physical Synthesis


INVITED PAPER

Techniques for Fast Physical Synthesis

Fast, efficient buffer design, logic transformations, and clustering components for placement are some of the techniques being used to reduce design turnaround for large, complex chips.

By Charles J. Alpert, Fellow IEEE, Shrirang K. Karandikar, Zhuo Li, Member IEEE, Gi-Joon Nam, Member IEEE, Stephen T. Quay, Haoxing Ren, Member IEEE, C. N. Sze, Member IEEE, Paul G. Villarrubia, and Mehmet C. Yildiz, Member IEEE

ABSTRACT: The traditional purpose of physical synthesis is to perform timing closure, i.e., to create a placed design that meets its timing specifications while also satisfying electrical, routability, and signal integrity constraints. In modern design flows, physical synthesis tools hardly ever achieve this goal in their first iteration. The design team must iterate by studying the output of the physical synthesis run and then potentially massaging the input, e.g., by changing the floorplan, timing assertions, pin locations, logic structures, etc., in order to achieve a better solution in the next iteration. The complexity of physical synthesis means that systems can take days to run on designs with multiple millions of placeable objects, which severely hurts design productivity. This paper discusses some newer techniques deployed within IBM's physical synthesis tool, PDS [1], that significantly improve throughput. In particular, we focus on some of the biggest contributors to runtime: placement, legalization, buffering, and electrical correction, and present techniques that generate significant turnaround time improvements.

KEYWORDS: Circuit optimization; circuit synthesis; CMOS integrated circuits; design automation

Manuscript received March 8, 2006; revised October 20. C. J. Alpert, S. K. Karandikar, Z. Li, G.-J. Nam, and C. N. Sze are with the IBM Austin Research Laboratory, Austin, TX, USA (e-mail: alpert@us.ibm.com; akkarand@us.ibm.com; lizhuo@us.ibm.com; gnam@us.ibm.com; csze@us.ibm.com). S. T. Quay, H. Ren, P. G. Villarrubia, and M. C. Yildiz are with the IBM Corporation, Austin, TX, USA (e-mail: quayst@us.ibm.com; haoxing@us.ibm.com; pgvillar@us.ibm.com; mcan@us.ibm.com). Digital Object Identifier: /JPROC. ©2007 IEEE.

I. INTRODUCTION

Physical synthesis has emerged as a critical component of modern design methodologies. The primary purpose of physical synthesis is to perform timing closure. Several technology generations ago, back when wire delay was insignificant, synthesis provided an accurate picture of the timing of the design. However, technology scaling has caused wire delay to increase steadily relative to gate delay. Consequently, a design that meets timing requirements in synthesis likely will not close once its physical footprint is realized, due to the wire delays. The purpose of physical synthesis is to place the design, recognize the delays and signal integrity issues introduced by the wiring, and fix the problems. It may also need to locally resynthesize pieces of the design that no longer meet timing constraints. That new logic then needs to be placed again, which causes iterations between synthesis and placement, until hopefully the design closes on timing. Unfortunately, more often than not, the design will not close on timing without manual designer intervention. Perhaps the designer needs to modify the floorplan or restructure certain sets of paths. This causes the designer to iterate between manual design work and automatic physical synthesis. The turnaround time of the physical design stage critically depends on the efficiency (and quality) of the physical synthesis system. On large, multimillion-object ASIC parts, physical synthesis can take several days to complete, even on the best hardware available. This trend is only getting worse, as designs seem to scale faster than the hardware improves to optimize them.
While hierarchical or system-on-a-chip (SoC) methodologies can be used to handle the large complexities, performing timing closure on a flat part is always preferable if at all possible [2], since it avoids all the complexities of hierarchical design. Of course, there are many newer challenges that the physical synthesis system needs to handle besides traditional timing closure.

Vol. 95, No. 3, March 2007 Proceedings of the IEEE 573

Some examples include lowering power using a

technology library with multiple threshold voltages, fixing noise violations that show up after routing, and handling the timing variability and uncertainty introduced by modern design processes. Inserting techniques for analysis and optimization of these more complex problems only adds to the runtime of the entire system. Thus, the turnaround time for the core system needs to be as fast as possible. This work surveys some of the recent techniques introduced into PDS [1] to improve turnaround time.

A. Buffering Trends

Much of the paper focuses on innovation in buffering techniques, since buffering is perhaps the most important challenge for physical synthesis as it moves beyond 90 nm technologies. As technology scales, wires become thinner, which causes their resistance to increase. The result is that wire delays increasingly dominate gate delays, and the problem only becomes worse with each advance from the 65 to the 45 to the 32 nm node. Saxena et al. [3] predict a "buffering explosion" whereby over half of all the logic will consist of buffers, which are essentially performing no useful computation: they are merely helping move signals from one part of the chip to another. Even in today's 90 nm designs, we commonly see 20%-25% of the logic consisting of buffers and/or inverters; some of the larger designs that PDS optimizes end up with over a million buffers. Given these trends, there are several challenges to achieving a fast and effective physical synthesis result. 1) One has to be able to perform buffer insertion incredibly quickly. If one is going to insert over a million buffers, and may then have to rip them up and redo trees to improve timing and routability, the underlying algorithm must be efficient. 2) Area and power are big concerns.
Smart floorplanning and logic coding from the designer can help mitigate the buffering effects, but still, one should insert buffers so as to minimize both total area (so that they can be easily incorporated into the design) and power. 3) Buffering algorithms need to understand where the free space is in the layout to be effective and not overfill areas that cannot handle the buffers. Some methodologies invoke buffer block planning to drive buffer locations to preallocated areas (e.g., [4], [5]). 4) Buffering constricts, or seeds, global routing. Because the distance between buffers continues to decrease, a long net may need perhaps ten stages of buffering to get from point A to point B. The locations of those ten buffers force the global router to route from A to the first buffer, then from the first buffer to the second buffer, etc., instead of finding the best direct route from A to B. Essentially, the routing problem is pushed up to be handled by buffering. A good buffering solution can make the global router's job easy, while a bad one makes it more difficult.

B. Major Phases of Physical Synthesis

The authors of [1] present seven primary stages of PDS: 1) initial placement and optimization; 2) timing-driven placement and optimization; 3) timing-driven detailed placement; 4) optimization techniques; 5) clock insertion and optimization; 6) routing and post-routing optimization; 7) early-mode timing optimization. Before running physical synthesis, at the very least one should achieve timing closure with a zero wireload (ZWL) timing model. If one cannot close on the design with ZWL, then one certainly will not be able to once the design is realized physically. In fact, since wire delays are increasingly significant, a ZWL model may be hopelessly optimistic, and a designer may want to achieve timing closure with a more pessimistic model.
As examples, one could multiply each gate delay by a constant factor and/or use a linear optimally buffered delay model for logic that is restricted by the designer's floorplan. Thus, before proceeding with physical synthesis, the designer should iterate on the architecture, synthesis, and floorplan to achieve a closed design under some type of physically ignorant timing model, so that the design is in a reasonably good state. Similarly, if one cannot close on timing before clock insertion and routing, then it is unlikely one will be able to close after these steps. Thus, the first four stages of the above flow can be considered the core physical synthesis operations. The designer will typically iterate with physical synthesis runs in this part of the flow before proceeding to steps 5, 6, and 7. Hence, the focus of this paper will be on the first four stages. The purpose of initial placement (e.g., [6]-[8], mFAR [9], [10]) is to place the cells such that they do not overlap with each other or with existing fixed objects from the floorplan. At this point, the timing of the design will have degraded completely from the ZWL timing due to the introduction of long wires. The optimization steps then buffer and repower the design so that the timing looks quite reasonable. From the timing analysis, one can then draw conclusions as to which nets must be shortened by placement and which need not be, or, for that matter, could even afford to be longer. The purpose of timing-driven placement (e.g., [11]-[13]) is to use timing analysis to drive the placement toward a good timing result, at the possible expense of wirelength. Probably the easiest (and certainly the fastest) way to achieve this is to perform net weighting [14]-[16], whereby the nets that need to be shorter are assigned a high weight, and the nets that can afford to be longer are assigned a low weight. Any placement algorithm can be modified to handle net weights. For example, a net with

integer weight n can be replaced with n identical nets of weight one. Coming up with a good mapping of nets to weights is a difficult problem. The approach of Pan et al. [15] is advantageous in that it figures out which nets can influence the most possible critical paths and gives these nets higher weight; since nets that are both influential and negative are emphasized, the wirelength degradation from timing is minimized. The mechanism with which the given placer handles net weights certainly affects the quality; a particular net weighting algorithm may work splendidly with Placer A but not with Placer B. In general, net weighting actually causes the total wirelength in the design to increase, though it will cause the timing to be significantly better. After net weighting, the entire placement is performed again from scratch, though repowering levels and buffering structures may remain from the previous phase. Once again, the timing picture will look quite grim immediately after placement due to new long wires. Another round of buffering and repowering optimizations can then be applied to get the timing into reasonable shape for the next phase.

Fig. 1. Examples of direct logic transforms. (a) Initial gate. (b) Logic decomposition. (c) Connection reordering.

After timing-driven placement and optimization, many cells may be placed in locations that are locally suboptimal. Timing-driven detailed placement makes local moves and swaps to try to improve both wirelength and the global timing of the design. The detailed placement is timing driven in that it can also use net weights to guide its solution. Constraints to limit cell movement may be used to prevent global moves that might undo the placement achieved by the previous phase. The final phase of core timing closure is pure optimization. At this point, the timing is hopefully reasonably close, but buffering and repowering alone do not suffice to fix the critical paths.
Direct logic transforms can be applied at this point [1], [17]-[19]. Examples include the following. 1) Cloning takes a cell that may be driving a large number of pins and duplicates it so that the load can be divided between the new cell and the original. This may or may not reduce delay, depending on the increased load caused by the new cell. It certainly will increase area. 2) Inverter manipulation takes inverters that are driving or driven by a cell and absorbs them into the cell. For example, an and gate driving an inverter can become a nand gate. The reverse can happen as well, whereby inverters are pulled out of the logic of a cell. 3) Logic decomposition breaks apart a single logic cell into several cells. For example, Fig. 1(b) shows how the 4-input nand gate of Fig. 1(a) can be decomposed into two 2-input and gates, each driving a third 2-input nand gate. 4) Connection reordering rewires commutative connections in fan-in trees. Fig. 1(c) shows an example reordering of the inputs to derive a different physical solution. 5) Cell movement picks a cell along a critical path and tries to find a new location for the cell that improves timing. These optimizations can be deployed on critical paths, along with incremental timing analysis, to push the design closer to timing closure.

C. A Closer Look at Optimization

While placement is relatively straightforward, the pieces that constitute optimization may not be so clear. Optimization can be broken down into the following phases: 1) electrical correction; 2) critical path optimization; 3) histogram compression; 4) legalization. The purpose of electrical correction [20] is to fix capacitance and slew violations, usually through buffering and repowering. Most of these violations are introduced by the placement stage. In general, one wants to perform electrical correction first in order to get the design into a reasonable state for the subsequent

optimizations. Electrical correction is potentially a big runtime hog. The reason is that designs need more buffers than ever to fix slew violations, due to the ever-decreasing ratio between gate and wire delays. Some older designs [21] may require buffers, and some newer designs today require over a million buffers. The trend toward larger and more complex designs has turned a relatively simple and fast phase into a complex and slow one. During critical path optimization, one examines a small subset of the most critical paths and performs optimization specifically to improve timing on those paths. This phase needs to be intertwined with incremental timing so that one can see the impact of logic changes right away and then find the next set of critical paths to work on. Here one can afford to throw "the kitchen sink" at the problem; any optimization, such as the direct logic transforms described above, that may potentially improve the timing is fair game and can be attempted. A continuing challenge in the field is to derive more complex transforms that involve the interaction of multiple gates and potential cell movements. For example, one may wish to "straighten" all the gates in a path, simultaneously repower them, and perform buffer insertion on the fly. Unlike electrical correction, the runtime of this phase does not scale nearly as much with increasing design size, since the number of paths worked on in this phase is a user parameter that is independent of the design size. The bottlenecks for runtime here are how far the critical paths are from closure and the time it takes to update the timing. Critical path optimization certainly can fail to close on timing. There could be a path (or paths) in the design that is completely incapable of meeting timing requirements, e.g., due to the floorplanning of fixed blocks.
At this point physical synthesis could return with its best solution found so far, though there might still be thousands of paths that do not meet their timing targets. The purpose of the histogram compression phase is to perform optimization on these less critical, but still negative, paths. This is analogous to pushing down on the timing histogram returned from timing analysis. The size of the histogram after this phase gives the designer an indication of how much work remains to close on timing. This phase helps the designer distinguish between a few really poor paths and thousands of systemic problems. Throughout all of the above phases, every optimization disrupts the placement. One can choose to always find a legal location for every buffer or piece of logic during optimization; however, this is very expensive. In PDS, optimizations are allowed to make changes and place cells in ways that may cause the placement to have overlaps, potentially in the thousands. Periodically, a phase of placement legalization needs to be called to resolve these overlaps and once again make the placement viable. The frequency with which this step needs to be called (along with the size of its task) can be a major contributor to the total runtime of the system.

Fig. 2. Major phases of physical synthesis.

Fig. 2 gives an example of how the four major phases of core physical synthesis may be broken down further. For example, observe how no legalization occurs at the end of the first phase. Since the entire design will be placed again in phase 2, legalization at this point can be considered unnecessary. Also observe that in phase 4, critical path optimization and legalization are run after each other three times. In practice, this loop can be made even tighter, so that any timing disturbances caused by legalization are quickly reflected in the list of most critical paths. The flow shown in this figure is just an example of how the different phases may operate together.
Many different combinations can be employed (such as more or fewer placements, optimizations before initial placement, etc.) that may achieve better results. It remains a challenge of physical synthesis to find flows that achieve excellent results across a wide range of design styles.

D. Achieving Fast Physical Synthesis

In order to make physical synthesis as fast as possible, we have focused on a variety of techniques that can be deployed throughout the flows. A key philosophy for achieving both a fast and a high-quality result is to do the optimizations as fast as possible, even if some optimality is sacrificed. As long as the design is in a reasonably good

state after applying fast optimization, one can always apply slower, but more accurate, optimization to further polish the design. In other words, one can break a few eggs while making the cake, as long as there is a way to clean them up (but if one is careless and breaks too many eggs, the cake will never be completed). This paper presents some of the major algorithmic techniques that have been discovered or utilized. They include the following.

Clustering for multilevel placement. While it is well established that multilevel partitioning [22] gives superior runtime and quality of results for the circuit partitioning problem, achieving a similar result for placement has been much more elusive. For placement, this requires clustering with a bit more care; we have been able to use clustering to achieve speedups of a factor of 3 to 5 versus flat placement while obtaining similar placement quality. This result can be applied to both the initial and timing-driven placement phases. Details of this technique can be found in [23].

Fast timing-driven buffering. It is well known that van Ginneken's algorithm [24] can achieve an optimal buffering result for a given tree topology. When one extends it to handle a large buffer library and to control the total buffer area, the runtime increases significantly. This work shows how one can add new pruning and estimation techniques to improve runtime without any measurable degradation in solution quality. This result can be applied to any of the buffering phases. Details of this work can be found in [25].

Integrated electrical correction. As mentioned above, electrical correction consumes an increasingly large percentage of the runtime of physical synthesis. This work proposes integrating buffering and repowering into a single engine that recognizes which optimization is best to perform for a given net. Details of the scheme can be found in [20].

Timerless buffering. For electrical correction, one does not require the best solution in terms of timing.
Any suboptimal timing solution that proves critical can always be rebuffered later. When potentially inserting a million buffers for electrical correction, it is essential to fix slew and capacitance violations as fast as possible while using the minimum buffer area. This section describes a new algorithm for solving this problem that is an order of magnitude faster than timing-driven buffering. Details of the algorithms can be found in [26].

Layout-aware buffer trees. When performing buffer insertion, one can run into danger by ignoring the density of placed objects and the design's routability, because buffers placed in dense locations may have to be moved later by legalization. Often one may find locations that are in sparser regions of the chip but are no worse than the locations in dense regions. This work presents a generalized fast technique for constructing a Steiner tree for buffering, via either timing-driven or timerless buffering. Details of the work can be found in [27].

Diffusion-based legalization. The danger of legalization is that it can potentially degrade timing by moving a timing-critical cell to a legal location that is far away from its optimal location. To avoid this, one can run legalization very frequently to keep it from doing too much work in any iteration. As an alternative, diffusion-based legalization spreads cells more smoothly, using the paradigm of the physical process of diffusion. Consequently, timing degradations are less frequent, and legalization can be run less often in between optimizations. Further, this technique can be used to alleviate local routing congestion hot spots. Details of the algorithm can be found in [28].

The remainder of the paper discusses each of these technical contributions in more detail.

II. CLUSTERING FOR FAST GLOBAL PLACEMENT

Global placement is perhaps the most independent and well-defined component of physical synthesis that is a major contributor to the total runtime of the system.
Global placement algorithms can generally be categorized as simulated annealing, top-down cut-based partitioning, analytic placement, or some combination thereof. Simulated annealing [29] is an iterative optimization method that refines a placement solution using a stochastic algorithm. Although it is an effective method for integrating nonconventional, multidimensional objective functions into global placement, it is known to be slow and not scalable compared to other global placement algorithms. Recent years have seen the emergence of several new academic placement tools, especially in the top-down partitioning and analytic domains. With the advent of multilevel partitioning [22], [30] as a fast and effective algorithm for min-cut partitioning, new generations of top-down cut-based placers such as Capo [31], Feng Shui [32], and Dragon2000 [33] have appeared in recent years. A placer in this class partitions the cells into two (bisection) or four (quadrisection) regions of the chip, then recursively partitions each region until a coarse global placement is achieved. In general, recursive cut-based placement approaches perform quite well when designs are dense, but rather poorly when they are sparse. Analytic placers typically solve a relaxed placement formulation (such as minimum total squared wirelength) optimally, allowing cells to temporarily overlap. Legalization is achieved by removing overlaps via either

partitioning or by introducing additional forces/constraints to generate a new optimization problem. The recent placement contest [6] shows that analytic placement algorithms can produce high-quality placement solutions on modern real circuits. This has helped spur a renaissance of analytic placers, e.g., APlace [7], mPL [8], mFAR [9], and FastPlace [10]. The genesis of this analytic placement movement began with [34], and it has very recently been significantly improved in [35]. Analytic placers tend to find better global placement solutions, particularly when designs have nontrivial white space. In other words, analytic placers seem to have an advantage in managing white space during the global placement process. For any placer, clustering can be used to make it faster. Clustering groups cells into fewer clusters; placement can then be run directly on the clusters. However, for any placer of reasonable quality, the challenge lies in using clustering to maintain, and perhaps even enhance, solution quality. The particular clustering technique needs to be adapted to the placer to which it is being applied. The remainder of this section discusses how hierarchical clustering and unclustering techniques are integrated into a top-down analytic placer that exists in PDS, though the approach is general enough to be applied to any placer. This placer was chosen since it has been proven effective in the design of several hundred real ASIC parts and is flexible enough to handle a wide variety of special user constraints, like bounds on cell movements. Further, clustering can also help improve timing-driven placement under a net-weighting paradigm. By grouping cells with high weights into clusters, these cell groups will likely be placed close together in the final placement. The hierarchical analytic placement is the integration of three key components: analytic top-down placement, best-choice clustering, and area-based unclustering.
First, we briefly review the flat global placement algorithm used for this particular speedup technique. Multilevel placement can be applied to just about any flat global placer, though the techniques for clustering and unclustering must be customized to obtain good results.

A. Analytic Top-Down Placement Overview

The analytic top-down global placement algorithm presented here is based on quadratic placement with geometric partitioning [36]. A quadratic wirelength objective is often used for analytic placement since it can be easily optimized, e.g., with a conjugate gradient (CG) solver:

minimize \Phi(\vec{x}, \vec{y}) = \sum_{i > j} w_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]   (1)

where \vec{x} = [x_1, x_2, ..., x_n] and \vec{y} = [y_1, y_2, ..., y_n] are the coordinates of the movable cells \vec{v} = [v_1, v_2, ..., v_n], and w_{ij} is the weight of the net connecting v_i and v_j. The optimal solution is found by solving one linear system for \vec{x} and one for \vec{y}. Quadratic placement only works on a placement with fixed objects (anchor points); otherwise, it produces a degenerate solution in which all cells lie on top of each other at a single point. Although the solution of (1) provides a placement solution with optimal squared wirelength, the solution will have many overlapping cells. To remove overlaps, we adopt geometric four-way partitioning [36]. Four-way partitioning, or quadrisection, is a function f : V -> {0, 1, 2, 3}, where the index i refers to one of the subregions, or bins, B_0, B_1, B_2, B_3. The assignment of cells to bins needs to satisfy the capacity constraint of each bin. Given the cell locations determined by quadratic optimization, four-way geometric partitioning minimizes the sum of weighted cell movements (using a linear-time algorithm), defined as

\sum_{v \in V} size(v) \cdot d((x_v, y_v), B_{f(v)})   (2)

where v is a cell, (x_v, y_v) is the location of cell v from the quadratic solution, and B_{f(v)} is the bin to which cell v is assigned.
The distance term d((x, y), B_i), with i in {0, 1, 2, 3}, is the Manhattan distance from coordinate (x, y) to the nearest point of bin B_i. The distance is weighted by the size of the cell, size(v). The intuition behind this objective function is to obtain the partitioning solution with the minimum perturbation to the preceding quadratic optimization solution. Quadrisection is applied recursively, so that at level k there are 4^k placement subregions, or bins. For each bin, the process of quadratic optimization and subsequent geometric partitioning is repeated until each subregion contains a trivial number of objects. At each placement level, one can also apply local refinement techniques such as repartitioning [37]. Repartitioning consists of applying a quadrisection algorithm to each 2 x 2 subproblem instance in a sequential manner. The fundamental reason that repartitioning can improve placement wirelength is that it can fix poor assignments made by the minimum-movement quadrisection step. Since it can see the assignments made at the prior level, it is able to locally reverse any poor assignments based on the repartitioning objective function.

B. Clustering for Multilevel Placement

As placement instances climb into the multiple millions of cells, clustering becomes a powerful tool for speeding up the global placer. Clustering effectively reduces the problem size fed to the placer by viewing each cluster of cells as a single cell. The quality of a clustering-based or

multilevel placer is critically dependent on the ability to perform intelligent clustering. In terms of the interactions between clustering and placement, the prior work can be classified into two categories: transient and persistent. Transient clustering usually involves clustering and unclustering as part of the internal placement algorithm. For example, multilevel min-cut partitioning [22], [31] falls into this category. The clustering is used for partitioning at a given level, but then an entirely new clustering is generated for the subsequent level. Hence, the clustering is transient, since it is constantly recomputed based on the current state of the placer. In contrast, persistent clustering generates a cluster hierarchy at the beginning of placement in order to reduce the size of the problem for the entire placement [9]. The clustered objects can be dissolved at or near the end of placement, with a final "clean-up" operation. For persistent clustering, the clustering algorithm itself is actually independent of the core placer. Rather, it is a preprocessing step that imposes a more compact netlist structure on the placer, e.g., [29]. To embed clustering within our placer, we propose a semipersistent clustering strategy. One problem with persistent clustering is that clustered objects may be too large relative to the decision granularity (for example, the size of the bin to which the cluster is assigned during partitioning), which degrades the final placement solution quality. The goal of semipersistent clustering is to address this deficiency. Semipersistent clustering takes advantage of the hierarchical nature of clustering, so that clustered objects are dissolved slowly during the placement flow. At the early stages of the placement algorithm, a global optimization process is performed on a highly clustered netlist, while local optimization/refinement can be executed on an almost flattened netlist at later stages.
There are many algorithms and objectives for clustering (see [38] for a survey). For example, a common technique is to match pairs of similar objects and apply matching passes recursively [22]. While extremely fast, the pairs that get merged toward the end of a pass may be clustered with a less desirable neighbor. To avoid this behavior, one could perform partial passes in which one merges only some small fraction of the cells before updating the list of potential matches. In the most extreme case, one can use a partial pass of a single match. In other words, at each pass, only the single best clustering over all possible clusters, according to the given objective function, is performed. This is what we call best-choice clustering [23], as shown in Fig. 3. By using a priority queue to identify the best cluster, one obtains a subquadratic implementation. The priority queue management naturally provides an ideal clustering sequence, and it is always guaranteed that the two objects with the best clustering score will be clustered. The degree of clustering may be controlled by computing a target number of objects: best-choice clustering is simply repeated until the overall number of objects becomes the target number. Fewer target objects imply more extensive clustering (and a larger speedup).

Fig. 3. Best-choice clustering algorithm.

During the clustering score calculation, the weight w_e of a hyperedge e is defined as 1/|e|, i.e., inversely proportional to the number of objects incident to the hyperedge. Given two objects u and v, the clustering score d(u, v) between u and v is defined as

d(u, v) = \sum_{e \in E \,|\, u, v \in e} w_e / (a(u) + a(v))   (3)

where e is a hyperedge connecting objects u and v, w_e is the corresponding edge weight, and a(u) and a(v) are the areas of u and v, respectively. The clustering score of two objects is directly proportional to the total sum of the edge weights connecting them, and inversely proportional to the sum of their areas.
This clustering score function can handle hyperedges directly, without transforming them into a clique model. Also, the area-based denominator of the score function helps to produce more balanced clustering results. Suppose N_u is the set of neighboring objects of a given object u. We define the closest object to u, denoted c(u), as the neighbor with the highest clustering score to u, i.e., c(u) = v such that d(u, v) = max{d(u, z) | z ∈ N_u}. The best-choice algorithm is composed of two phases. In phase I, for each object u in the netlist, the closest object v and its associated clustering score d are calculated. Then, the tuple (u, v, d) is inserted into the priority queue with d as the comparison key. For each object u, only one tuple with the closest object v is inserted. This vertex-oriented priority queue allows for more efficient data structure

Vol. 95, No. 3, March 2007 Proceedings of the IEEE 579

managements than edge-based methods. Phase I is simply a priority queue (PQ) initialization step. In the second phase, the top tuple (u, v, d) in PQ is picked (Step 2), and the pair of objects (u, v) is clustered, creating a new object u' (Step 3). The netlist is updated (Step 4), the closest object v' to the new object u' and its associated clustering score d' are calculated, and a new tuple (u', v', d') is inserted into PQ (Steps 5-6). Since clustering changes the netlist connectivity, some of the previously calculated clustering scores might become invalid. Thus, the clustering scores of the neighbors of the new object u' (equivalently, all neighbors of u and v) need to be recalculated (Step 7), and PQ is adjusted accordingly.

The following example illustrates clustering score calculation and updating. Assume an input netlist with six objects {A, B, C, D, E, F} and eight hyperedges {A, B}, {A, C}, {A, D}, {A, E}, {A, F}, another {A, C}, {B, C}, and {A, C, F}, as in Fig. 4(a). Let the size of each object be 1. By calculating the clustering score of A to its neighbors, we find that d(A, B) = 1/4, d(A, C) = 2/3, d(A, D) = 1/4, d(A, E) = 1/4, and d(A, F) = 5/12. d(A, C) has the highest score, and C is declared the closest object to A. Since d(A, C) is the highest score in the priority queue, A will be clustered with C and the circuit netlist will be updated as shown in Fig. 4(b). With the new object AC introduced, the corresponding clustering scores become d(AC, F) = 1/3, d(AC, E) = 1/6, d(AC, D) = 1/6, and d(AC, B) = 1/3.

Fig. 4. Clustering a pair of objects A and C.

C. Area-Based Selective Unclustering

In this semipersistent clustering scenario, the clustering hierarchy is preserved during most of global placement. However, if the size of a clustered object is large relative to the decision granularity, the geometric partitioning result on this object can affect not only the quality of the global placement solution, but also the subsequent legalization, due to the limited amount of available free space. To address this issue, we employ an adaptive area-based unclustering strategy. For each bin, the size of each clustered object is compared to the available free space. If the size is bigger than a predetermined percentage of the available free space, the clustered object is dissolved. Our empirical analysis shows that with an appropriate threshold value (5%), most clusters can be preserved during the global placement flow with insignificant loss of wirelength. Area-based selective unclustering is another knob that provides a tradeoff between runtime and quality of the placement solution. More aggressive unclustering (a lower threshold value) produces better wirelengths at the cost of higher CPU time.

D. Putting the Placer Together

Finally, the clustering can be integrated with analytic top-down placement to derive a new hierarchical global placement algorithm, summarized in Fig. 5. With a given initial netlist, a coarsened netlist is generated via best-choice clustering, which is used as a seed for the subsequent global placement. Steps 2-5 are the basic analytic top-down global placement algorithm described in Section II-A. After quadratic optimization and quadrisection are performed for each bin, area-based selective unclustering is performed to dissolve large clustered objects (Step 6). At the end of each placement level, a repartitioning refinement is executed for local improvement (Step 8). Steps 2-9 constitute the main global placement algorithm. If there exist clustered objects after global placement, they are dissolved unconditionally (Step 10) before the final legalization and detailed placement are executed (Step 11). The proposed algorithm relies on three strategic components: best-choice clustering, analytic top-down global placement, and area-based selective unclustering.

E.
Results and Summary

Table 1 shows the performance of hierarchical placement over flat placement on real industrial circuits. The table shows the size of the circuits in terms of the number of

Fig. 5. Hierarchical analytic top-down placement algorithm.

Table 1 Comparisons of Hierarchical Analytic Top-Down Placement Against Flat Placement in Wirelengths and Runtimes

objects and nets, the utilization of the designs, and the wirelength improvement and speedup over flat placement. Let the clustering ratio be the ratio of the number of cells to the target number of clusters. With a clustering ratio of 2, hierarchical placement is on average twice as fast as flat placement while obtaining a slight 0.92% improvement in wirelength. With a more aggressive clustering ratio of 10, hierarchical placement is about five times faster than flat placement, with a slight 3% degradation in wirelength. Different clustering ratios can be used to trade off speed and quality. Overall, we demonstrate that careful clustering and unclustering strategies can yield a hierarchical placement that is significantly faster than flat placement with comparable solution quality.

III. TECHNIQUES FOR FAST TIMING-DRIVEN BUFFER INSERTION

For timing-critical nets, buffer insertion must be deployed frequently to improve delay, whether for handling nets with large fanout or long wires, or for isolating non-critical sinks from critical ones. For example, Fig. 6(a) shows a 3-pin net with poor timing in which the small squares are potential buffer insertion locations. Proper buffer insertion, as shown in Fig. 6(b), improves the timing to the most critical sink by 200 ps. The bottom sink is not critical, so only a decoupling buffer is required for that subpath. The buffering algorithms in PDS are based on the classic dynamic programming paradigm [24], because the algorithm is provably optimal for a given tree topology (such as [39], [40]), though it will frequently insert many additional buffers to obtain a negligible improvement in performance. Thus, the algorithm must also manage the tradeoff between buffering resources and delay [41].
Doing so changes the algorithm's complexity from polynomial to pseudopolynomial and in practice adds an order of magnitude to the runtime. The result is an extremely effective algorithm for timing-driven buffer trees, though the algorithm's inefficiency is problematic. Thus, it is essential to make this core optimization as fast as possible. Hence, this section explores tricks for tweaking the classic algorithm to obtain significant performance improvements without losing solution quality. These techniques can be easily integrated with the classic buffer insertion framework while also considering slew, noise, and capacitance constraints [42], [43]. Used in conjunction, these techniques can lead to more than a factor of ten performance improvement versus traditional dynamic programming.

A. Overview of the Classic Buffering Algorithm

For a given Steiner tree with a set of buffer locations (namely, the internal nodes), buffer insertion inserts buffers at some subset of legal locations such that the required arrival time (RAT) at the source is maximized. In the dynamic programming framework, candidate solutions are generated and propagated from the sinks toward the

Fig. 6. An example of how buffer insertion can improve timing to critical sinks. (a) A net without buffers inserted. (b) Proper buffer insertion improves timing.

source. Each candidate solution is associated with an internal node in the tree and is characterized by a 3-tuple (q, c, w). The value q represents the required arrival time; c is the downstream load capacitance; and w is the cumulative cost of the buffer insertion decisions. Initially, a single candidate (q, c, w) is assigned to each sink, where q is the sink RAT, c is the sink load capacitance, and w = 0. When candidate solutions are propagated from a node to its parent, all three terms are updated accordingly. At an internal node, a new candidate is generated by inserting a buffer. At each Steiner node, the two sets of solutions from the children are merged. Finally, at the source, the solutions with maximum q are selected. The candidate solutions at each node are organized as an array of linked lists. The solutions in each list of the array have the same buffer cost value w = 0, 1, 2, .... During the algorithm, inferior solutions are pruned. A solution is defined as inferior (or redundant) if there exists another solution that is no worse in slack, capacitance, and buffer cost. More precisely, for two candidate solutions (q1, c1, w1) and (q2, c2, w2), the second dominates the first if q2 ≥ q1, c2 ≤ c1, and w2 ≤ w1. In that case, we say the first solution is redundant and may be pruned. After pruning, every list with the same cost is sorted in terms of q and c. A buffer library is a set of buffers and inverters, each of which is associated with a driving resistance, input capacitance, intrinsic delay, and buffer cost. During optimization, we wish to control the total buffer resources so that the design is not over-buffered for marginal timing improvement. While total buffer area can be used, to first order the number of buffers provides a reasonably good approximation of buffer resource utilization. Indeed, we use the number of buffers since it allows a much more efficient baseline van Ginneken implementation.
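The dominance test just described is easy to state in code. The following is an illustrative sketch, not the PDS data structure (which keeps one sorted list per cost value): a candidate (q, c, w) is removed when some other candidate is no worse in all three of slack, capacitance, and cost.

```python
def dominates(a, b):
    # a = (q, c, w) dominates b when it has no-worse required arrival
    # time, no-larger load capacitance, and no-larger buffer cost.
    return a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]

def prune_redundant(cands):
    # Quadratic sketch; production code exploits the sorting by q and c
    # within each equal-cost list to prune far more cheaply.
    return [b for b in cands
            if not any(dominates(a, b) for a in cands if a is not b)]

cands = [(10.0, 5.0, 1), (8.0, 4.0, 1), (12.0, 6.0, 2), (7.0, 6.0, 2)]
print(prune_redundant(cands))  # (7.0, 6.0, 2) is dominated by (12.0, 6.0, 2)
```

The three surviving candidates are pairwise incomparable: each wins on at least one of the three axes, which is exactly why the candidate lists can grow large.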
Note that the techniques presented in this paper can be applied to any buffer resource model, such as total buffer area or power. At the end of the algorithm, a set of solutions with different cost-RAT tradeoffs is obtained. Each solution gives the maximum RAT achievable under the corresponding cost bound. In practice, we choose neither the solution with maximum RAT at the source nor the one with minimum total buffer cost. Usually, we would like to pick a solution in the middle, such that the solution with one more buffer brings only marginal timing gain. In PDS, we use the "10 ps rule" (though the value can of course be modified depending on the frequency target). For the final solutions sorted by the source RAT value, we start from the solution with maximum RAT and compare it with the second solution (usually it has one buffer less). If the difference in RAT is more than 10 ps, we pick the first solution. Otherwise, we drop it (since with less than 10 ps of timing improvement, an extra buffer is not worthwhile) and continue by comparing the second and third solutions. Of course, instead of 10 ps, any time threshold can be used for different nets.

B. Preslack Pruning

During the algorithm, a candidate solution is pruned only if there is another solution that is superior in terms of capacitance, slack, and cost. This pruning is based on the information at the current node being processed. However, all solutions at this node must be propagated further upstream toward the source. This means the load seen at this node must be driven by some minimal amount of upstream wire or gate resistance. By anticipating the upstream resistance ahead of time, one can prune potentially inferior solutions earlier rather than later, which reduces the total number of candidates generated. More specifically, assume that each candidate must be driven by an upstream resistance of at least R_min.
The pruning based on anticipated upstream resistance is called prebuffer slack pruning.

Prebuffer Slack Pruning (PSP): For two non-redundant solutions (q1, c1, w) and (q2, c2, w), where q1 < q2 and c1 < c2, if (q2 - q1)/(c2 - c1) ≤ R_min, then (q2, c2, w) is pruned.

The PSP technique was first proposed in [44]. Using an appropriate value of R_min guarantees that optimality is not lost [44], [45]. But what if we are willing to sacrifice optimality for a faster solution by using a resistance R larger than R_min? In practice, we observe that a value somewhat larger than R_min does not hurt solution quality. We performed buffer insertion experiments on 1000 high-capacitance industrial nets while varying the value of R used for preslack pruning. The percent slack and CPU time relative to no preslack pruning are shown in Fig. 7. Observe that the slack degrades slowly as a function of resistance, while the CPU time decrease is fairly sharp. For example, R = 120 corresponds to R_min, the minimum possible upstream resistance, for which preslack pruning is still optimal. However, one can get a 50% speedup for less than 5% slack degradation with the larger value R = 600. These results indicate that PSP can bring a huge speedup in classic buffering for a fairly small degradation in solution quality.

C. Squeeze Pruning

The basic data structure of van Ginneken style algorithms is a sorted list of non-dominated candidate solutions. Both the pruning in the van Ginneken style algorithm and prebuffer slack pruning are performed by comparing two neighboring candidate solutions at a time. However, more potentially inferior solutions can be pruned by comparing three neighboring candidate solutions simultaneously. For three solutions in the sorted list, the middle one may be pruned according to squeeze pruning, defined as follows.
Squeeze Pruning: For every three candidate solutions (q1, c1, w), (q2, c2, w), (q3, c3, w), where q1 < q2 < q3 and

Fig. 7. The speedup and solution sacrifice of aggressive preslack pruning for 1000 nets as a function of R.

c1 < c2 < c3, if (q2 - q1)/(c2 - c1) < (q3 - q2)/(c3 - c2), then (q2, c2, w) is pruned.

For a two-pin net, consider the case where the algorithm arrives at a buffer location and there are three sorted candidate solutions with the same cost, corresponding to the first three candidate solutions in Fig. 8(a). Following the rationale of prebuffer slack pruning, the q-c slope between two neighboring candidate solutions indicates the potential for the candidate solution with smaller c to prune the other one: a small slope implies a high potential. For example, (q1, c1, w) has a high potential to prune (q2, c2, w) if (q2 - q1)/(c2 - c1) is small. If the slope between the first and second candidate solutions is smaller than the slope between the second and third candidate solutions, then the middle candidate solution is always dominated by either the first or the third candidate solution. Squeeze pruning preserves optimality for a two-pin net. After squeeze pruning, the solution curve in the (q, c) plane is concave, as shown in Fig. 8(b). For a multisink net, squeeze pruning does not guarantee optimality, since each candidate solution may merge with different candidate solutions from the other branch, and the middle candidate solution in Fig. 8(a) may offer smaller capacitance to candidate solutions in the other branch. Squeeze pruning may thus prune a post-merging candidate solution that originally had less total capacitance. However, despite the loss of guaranteed optimality, squeeze pruning causes no degradation in solution quality most of the time and is overall a fairly safe pruning technique.

D. Library Lookup

The size of the buffer library is an important factor in determining runtime. Modern designs may have hundreds of buffers and inverters to choose from.
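Both pruning rules reduce to simple slope tests over an equal-cost list sorted by increasing q and c. The sketch below is illustrative only (hypothetical names; r is the anticipated upstream resistance used for aggressive PSP), not the PDS implementation:

```python
def preslack_prune(cands, r):
    # PSP: walking (q, c) pairs sorted by increasing q and c, drop
    # (q2, c2) when (q2 - q1)/(c2 - c1) <= r, since after seeing at
    # least resistance r upstream it can never beat (q1, c1).
    kept = [cands[0]]
    for q2, c2 in cands[1:]:
        q1, c1 = kept[-1]
        if (q2 - q1) / (c2 - c1) > r:
            kept.append((q2, c2))
    return kept

def squeeze_prune(cands):
    # Squeeze pruning: drop the middle of three consecutive points when
    # slope(1,2) < slope(2,3); the surviving q-c curve is concave.
    kept = list(cands)
    i = 1
    while i + 1 < len(kept):
        (q1, c1), (q2, c2), (q3, c3) = kept[i - 1], kept[i], kept[i + 1]
        if (q2 - q1) / (c2 - c1) < (q3 - q2) / (c3 - c2):
            del kept[i]
            i = max(i - 1, 1)   # re-check the new triple around i
        else:
            i += 1
    return kept

curve = [(1.0, 1.0), (1.2, 2.0), (2.0, 3.0)]   # slopes 0.2 then 0.8
print(squeeze_prune(curve))        # middle point squeezed out
print(preslack_prune(curve, 0.4))  # (1.2, 2.0) pruned; (2.0, 3.0) survives
```

On the toy curve, the middle point lies below the chord between its neighbors, so either rule removes it; squeeze pruning needs no resistance estimate at all.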
The theoretical complexity of van Ginneken style buffer insertion is quadratic in the library size, though in practice it appears to be linear. To avoid the slowdown from large libraries, we take advantage of buffer library pruning [46] to select a small yet effective set of buffers from all those that may be used. We now discuss a more effective technique, library lookup. During van Ginneken style buffer insertion, every buffer in the library is examined for insertion. If there are n candidate solutions at an internal node before buffer insertion and the library consists of m buffers, then mn tentative solutions are evaluated. For example, in Fig. 9(a), all eight buffers are considered for all n candidate solutions. However, many of these candidate solutions are clearly not worth considering. We seek to avoid generating poor candidate solutions in the first place, rather than adding m buffered candidate solutions for each

Fig. 8. Squeeze pruning example. (a) The solution curve in the (q, c) plane before squeeze pruning. (b) The solution curve after squeeze pruning.

Table 2 Simulation Results for Full Library Consisting of 24 Buffers. Baseline Are the Results of the Algorithm of Lillis et al. [47]. PSP Shows the Results of the Aggressive Prebuffer Slack Pruning Technique. SqP Stands for Our Squeeze Pruning Technique. LL Is the Library Lookup Technique

Fig. 9. Library lookup example. B1 to B4 are non-inverting buffers. I1 to I4 are inverting buffers. (a) van Ginneken style buffer insertion. (b) Library lookup.

unbuffered candidate solution. Instead, we consider each candidate solution in turn. For each candidate solution with capacitance c_i, we look up the best non-inverting buffer and the best inverting buffer, i.e., those that yield the best delay, from two tables precomputed before optimization. In Fig. 9(b), the capacitance c_i results in selecting buffer B3 and inverter I2 from the non-inverting and inverting buffer tables, respectively. The two tables may be precomputed before buffer insertion begins. The 2n tentative new buffered candidate solutions can be divided into two groups: one group includes the n candidate solutions with an inverting buffer just inserted, and the other includes the n candidate solutions with a non-inverting buffer just inserted. We choose only the candidate solution that yields the maximum slack from each group, so finally only two candidate solutions are inserted into the original candidate solution lists. Since the number of tentative new buffered solutions is reduced from mn to 2n, a speedup is achieved. Also, since only two new candidate solutions instead of m are inserted, the total number of candidate solutions is reduced. This is similar to the case where the buffer library size is only two, except that the buffer type may change depending on the downstream load.

E. Results and Summary

Table 2 shows the impact of the three speedup techniques, preslack pruning (PSP), squeeze pruning (SqP), and library lookup (LL), versus the classic algorithm (baseline).
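The precomputed tables amount to an argmin over the library per load value. A minimal sketch follows; the names are illustrative, a simple linear driver-delay model R_b·c + D_b is assumed, and in practice one such table is kept for non-inverting buffers and one for inverters:

```python
def best_for_load(load, library):
    # Return the buffer giving the smallest driver delay for this load
    # under the assumed linear model delay = R * load + D.
    return min(library, key=lambda b: b["R"] * load + b["D"])

def build_table(library, load_grid):
    # Precompute the winner for a grid of discretized load values, so
    # the inner loop of buffer insertion becomes a table lookup:
    # 2n tentative solutions per node instead of mn.
    return {c: best_for_load(c, library)["name"] for c in load_grid}

noninverting = [
    {"name": "B1", "R": 100.0, "D": 5.0},   # small: low intrinsic delay
    {"name": "B4", "R": 20.0,  "D": 40.0},  # large: strong driver
]
table = build_table(noninverting, [0.1, 0.5, 1.0, 2.0])
print(table)  # light loads pick B1, heavy loads pick B4
```

As expected, the weak buffer wins below the crossover load and the strong one above it, which is why a lookup keyed on capacitance suffices.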
The results are averaged over 5000 high-capacitance nets from an ASIC chip. The second column shows the total slack improvement (for all 5000 nets) after buffer insertion, and the third column gives the total CPU time. Overall, the three techniques resulted in a 20X speedup, with just 3% degradation in solution quality. Buffer insertion is a core optimization for fixing timing-critical paths. When optimizing tens of thousands of nets, some optimality can be sacrificed in order to achieve sufficient runtime. Note that at the end of physical synthesis, one could try reapplying buffer insertion without these speedups (while also using more accurate delay models) to the handful of remaining critical nets. This is still much more efficient than applying full-blown, high-accuracy buffer insertion to the entire design. This work in essence summarizes our philosophy of fast physical synthesis: do the optimization as well and as fast as possible, even if a little optimality is sacrificed. At the end, if the design is close to timing closure, slower and more accurate techniques can always be employed to further refine the design.

IV. FAST ELECTRICAL CORRECTION

The previous section discussed fast buffering for critical path optimization. Our focus now turns toward using buffers and gate sizing for electrical correction. As discussed in the first section, electrical correction is becoming an increasingly costly phase of physical synthesis. High wire resistance and sharp required slew rates (for either noise or performance) mean that potentially millions of buffers must be inserted and millions of gates must be repowered simply to have an electrically correct design. Critical path optimization techniques rely on the correct operation of the timing analyzer; however, any timer, even a sophisticated one, only works correctly if the design it is given is in a reasonable electrical state.
For example, if capacitive loads are outside the range for which a gate model has been characterized, the timer will give results that do not reflect the true performance of the gate. Further, if one can quickly make the timing result look decent, much less work is left for the subsequent, slower critical path optimizations. This section focuses on how to quickly perform electrical correction, i.e., fix capacitance and slew violations [20]. Further, it is crucial that this phase requires minimal

area overhead, thereby reducing unnecessary power consumption and silicon real estate. The need for reducing area usage is obvious for area-constrained designs. However, even in designs where total area is not at a premium, local regions may be congested. Further, in delay-constrained designs, the area savings can be used by subsequent optimizations to improve the performance of critical regions.

A. Types of Electrical Violations

Timing analyzers utilize precharacterized models for gate delays and slews. Each gate is characterized with a maximum capacitive load that it can drive and a maximum input slew rate, and the operation of the timer is valid within these ranges. If these conditions are violated, timers usually extrapolate to obtain "best guess" values. However, values calculated in this manner may be inaccurate. This leads to the limits that define electrical violations. There are two "rules" that a design has to pass for it to be electrically clean, as follows.

Slew Limits: These rules define the maximum slews permissible on all nets of the design. If the slew (defined here as the 10%-90% rise or fall time of a signal; other definitions can be used as well) at the input of a logic gate is too large, the gate may not switch at the target speed, or may not switch at all, leading to incorrect operation.

Capacitance Limits: These define the maximum effective capacitance that a gate or an input pin can drive. A large capacitance on the output of a gate directly affects its switching speed and power dissipation. Additionally, gates are typically characterized for a limited range of output capacitance, and delay calculation during design can be incorrect if the output capacitance is greater than the maximum value. Violations of these rules (referred to as slew violations and capacitance violations), taken together, are called electrical violations.
These limits are principally determined during gate characterization, but designers may choose to tighten the constraints further. High-performance designs such as microprocessors typically have much tighter slew limits than ASICs.

B. Causes of Violations

Fig. 10 shows the main causes of slew violations, and how they may be fixed. Consider a net having source gate A and sink gate B. The capacitive load seen by gate A is the sum of the interconnect capacitance of the net and the input capacitance of gate B. Assume that a signal with slew s1 is applied at the input of gate A. Due to the load that it has to drive, the slew s2 at the output of gate A may be larger than s1. Thus, one cause of degradation is the source gate not being capable of driving the load at its output. Next, even if the slew s2 at the output of A is within the specified limits, it can degrade as the signal traverses the net to the sink. Thus, at the sink, the signal may have an even larger slew s3. This is the second contribution to slew degradation. There are two main methods of fixing slew violations, as shown in Fig. 10. First, the source gate of the net can be sized up, so that the new gate can drive the load present. While this may fix violations on the net in question, the obvious disadvantage is that the problem has been moved to the inputs of the source gate, whose input nets now see larger sink capacitances. However, this may or may not create violations on the input nets. Second, keeping the source at its original size, buffers can be inserted on the net in question. These isolate the load capacitance of the sink and repower the signal on the net, so that slews are within the specified limits. Unlike resizing, this method does not affect the electrical state of any other nets, but the area overhead can be much higher. Additionally, the time required to determine where best to insert buffers is much greater than the time required to resize a gate.
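The two rule checks of Section IV-A are straightforward to express. A minimal sketch with hypothetical limits and net data (not the PDS checker) follows; since the two violation types are independent, both checks are always made:

```python
def electrical_violations(sink_slews, driver_load, slew_limit, cap_limit):
    # Classify a net: report which of the two electrical rules it breaks.
    found = set()
    if any(s > slew_limit for s in sink_slews):
        found.add("slew")          # some sink sees too slow a transition
    if driver_load > cap_limit:
        found.add("cap")           # the driver sees too large a load
    return found

# A net can violate either rule alone, both, or neither.
print(electrical_violations([0.12, 0.35], 1.2, slew_limit=0.30, cap_limit=2.0))
print(electrical_violations([0.12], 2.5, slew_limit=0.30, cap_limit=2.0))
```

A checker like this is what decides, per net, whether the correction flow described next needs to act at all.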
The causes of capacitance violations are similar to those of slew violations: sink and interconnect capacitance both

Fig. 10. Causes of slew violations, and different methods of fixing them. (a) Slew degradation due to gate and interconnect. (b) Fixing slew violation by sizing source. (c) Fixing slew violation by buffering.

contribute to the existence of a violation. The fixes, too, are similar, using resizing and buffering. However, it is important to note that it is possible to have capacitance violations on a net that has no slew violations, and vice versa. Therefore, both capacitance and slew violations have to be considered individually. The simplest way to perform electrical correction is via a sequential approach: first try resizing gates to fix violations, while being careful not to oversize them; for those nets that cannot be fixed with resizing, invoke a buffer insertion algorithm. This may require a second pass of resizing in order to properly size the newly inserted buffers and inverters. The most important drawback of this approach is that the sizing and buffering used to fix violations are applied sequentially, with no communication or, indeed, knowledge of each other's capabilities. Thus, a pass of resizing or buffering tries to fix the violations that it sees, and assumes that the other will be able to handle the violations it cannot fix. For instance, if resizing is applied to a net with a slew violation on a sink, it may decide, for a variety of reasons, that buffering is the best solution. However, in the next pass, when the net is passed to the buffer insertion routine, there may be conditions that prohibit the insertion of buffers, such as blockages. Subsequent passes of resizing and buffering, with different settings, are then needed to overcome this situation, and there is no guarantee that any of these passes will fix the existing violation.

C. An Integrated Approach

Alternatively, we propose a framework that tightly integrates the selection of the two optimizations, allowing the correct optimization to be applied in a single pass over the design. This integrated approach selectively applies the resizing and buffering optimizations on a net-by-net basis.
Nets are selected in topological order, from outputs to inputs, and on each net the following operations are carried out. If there are no violations on the net, the source (driving) gate is sized down as much as possible without introducing new violations. If slew violations exist on the net, the source gate is sized up as necessary to fix them. If the previous step (resizing to fix violations) does not succeed, the net is buffered. The rationale for this approach is as follows. First, nets are processed in output-to-input order; any side effect of resizing a gate only impacts its input nets, which are yet to be processed. Sizing a gate up to remove a violation on its output has a detrimental effect on its input nets; this is handled by processing nets in the correct order. Second, sizing gates down when possible has two benefits: area is recovered when gates are larger than necessary, and reducing the load on input nets potentially removes, or at least reduces the severity of, violations that may exist on them. The area salvaged in this step is better used for improving delay on critical paths of the circuit. Of course, this step can be skipped if the design has already been optimized for delay. Finally, if resizing cannot fix a violation, buffering is used to fix the net. Since buffering is the last resort, this optimization can be as aggressive as required, which is used to our advantage as shown later. This order (resizing followed by buffering) is also advantageous from a runtime standpoint, since buffering a net is much slower than simply sizing the source gate. The approach to gate sizing is straightforward: given an input slew rate and output load, we iterate through all available sizes and select the smallest gate size that can deliver the required output slew. Buffering is based on the algorithm described in the next section; it selects the minimum-area solution such that electrical constraints are satisfied.
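The sizing step can be sketched as follows. The table of sizes and the linear output-slew model s_out = R·c + K are illustrative assumptions (the model mirrors the buffer slew characterization used later in this section); returning None signals that even the largest size fails and the net must fall back to buffering:

```python
def size_for_slew(load, sizes, slew_limit):
    # Iterate sizes from smallest (cheapest) to largest and return the
    # first one whose output slew meets the limit for this load.
    for g in sorted(sizes, key=lambda g: g["area"]):
        if g["R"] * load + g["K"] <= slew_limit:
            return g["name"]
    return None  # resizing failed: hand the net to buffer insertion

# Hypothetical size table: bigger gates have lower slew resistance R.
sizes = [{"name": "X1", "area": 1, "R": 200.0, "K": 10.0},
         {"name": "X2", "area": 2, "R": 100.0, "K": 10.0},
         {"name": "X4", "area": 4, "R": 50.0,  "K": 10.0}]
print(size_for_slew(0.5, sizes, 120.0))  # X1 suffices for a light load
print(size_for_slew(5.0, sizes, 120.0))  # None: buffering is needed
```

Scanning from the smallest size up implements both rules at once: clean nets get the cheapest legal driver, and violating nets are upsized only as far as necessary.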
For runtime considerations, a coarse buffer library is often used for buffer insertion. The lack of granularity in the buffer library leaves room to resize the inserted buffers afterwards. Of course, a more fine-grained library can be used, at the cost of extra runtime. To decide whether a gate meets its required slew target, we adopt the model of Kashyap et al. [48] because of its simplicity. It is essentially the slew equivalent of the Elmore delay model, yet it does not suffer as severely from inaccuracies caused by resistive shielding. The slew model can be explained using a generic example: a path p from node v_i (upstream) to v_j (downstream) in a buffered tree, with a buffer (or the driver) b_u at v_i and no buffer between v_i and v_j. The slew rate s(v_j) at v_j depends on both the output slew s_{b_u,out}(v_i) at buffer b_u and the slew degradation s_w(p) along path p (the wire slew), and is given by [48]

s(v_j) = sqrt( s_{b_u,out}(v_i)^2 + s_w(p)^2 )    (4)

The slew degradation s_w(p) can be computed with Bakoglu's metric [49] as

s_w(p) = ln(9) * D(p)    (5)

where D(p) is the Elmore delay from v_i to v_j. The basic framework presented above is flexible, and lends itself to multiple refinements, as follows. Once a net is buffered, the integrated framework allows for a quick sizing of the newly added buffers; the buffering algorithm can therefore be used with a small library of buffers. Existing inverter trees can be ripped up and reinserted as required, keeping in mind signal polarity constraints on the sinks. If buffering does not fix a net, the cause of the failure can be analyzed on the fly, and different algorithms, e.g., for blockage avoidance, can be used. Finally, if area is

at a premium, both resizing and buffering can be applied to every net, and the solution with the lowest cost can be selected.

D. Electrical Correction Summary

The integrated framework allows PDS to perform electrical correction efficiently. However, in our initial implementation, we found that 80%-90% of the runtime was spent in the van Ginneken style buffer insertion algorithm, even with the speedups discussed above. For electrical correction, using a buffer insertion algorithm that optimizes for delay is wasteful, since the purpose of this stage is simply to produce an electrically correct design. This motivates a new buffer insertion formulation specifically for electrical correction, discussed in the next section.

V. FAST TIMERLESS BUFFERING

The efficiency of electrical correction directly depends on the efficiency of the buffering algorithm. While Section III shows how one can speed up performance-driven buffering, it still suffers from the fact that three constraints must be handled at once: area, slew, and delay. In electrical correction, one can afford to ignore the last objective, delay. The assumption is that if a tree buffered by electrical correction subsequently becomes part of a critical path, it can always be ripped up and rebuffered by the critical path optimizations while taking into account the most up-to-date timing analysis. In general, we find that only a relatively small percentage of nets (e.g., 5%) need to be rebuffered. Thus, this section proposes a simpler buffering formulation that ignores delay constraints in order to achieve a more runtime- and area-efficient result. The key observation that motivates this approach is that traditional buffer insertion requires pruning based on three components: capacitance, slack (or delay), and area (or power). Because a candidate has to be inferior in all three categories to be pruned, the list of possible candidates can grow quite large.
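The effect of dropping the slack dimension can be seen on a toy candidate list. This is an illustrative sketch only: under three-way (q, c, w) dominance all three candidates below are incomparable and survive, while timerless (c, w) pruning collapses the list.

```python
def prune(cands, dominated):
    # Keep candidates not dominated by any other candidate.
    return [s for s in cands
            if not any(dominated(s, t) for t in cands if t is not s)]

def dom3(s, t):
    # Classic buffering: t must be no worse in slack q, load c, cost w.
    return t[0] >= s[0] and t[1] <= s[1] and t[2] <= s[2]

def dom2(s, t):
    # Timerless buffering ignores slack: only load and cost matter.
    return t[1] <= s[1] and t[2] <= s[2]

cands = [(10.0, 5.0, 3), (9.0, 4.0, 3), (8.0, 6.0, 1)]  # (q, c, w)
print(len(prune(cands, dom3)))  # 3: all survive three-way pruning
print(len(prune(cands, dom2)))  # 2: (10.0, 5.0, 3) is pruned
```

With one fewer axis on which a candidate can "win", far fewer incomparable candidates exist, which is the source of the efficiency claimed next.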
However, to perform electrical correction, the optimal delay solution is not required; instead, one wishes to fix electrical violations with minimum area. By using only two instead of three categories for pruning, one can obtain a much more efficient solution (one that is actually linear time in the case of a single buffer type).

A. Problem Formulation
For electrical correction, we seek the minimum area (or cost) buffering solution such that slew constraints are satisfied. Since one does not need to know the required arrival time at the sinks, it can be performed independently of timing analysis, hence the term timerless buffering. While this new formulation is actually NP-complete, some highly efficient and practical algorithms can be utilized.

The input to the timerless buffering problem includes a routing tree T = (V, E), where V = {s_0} ∪ V_s ∪ V_n and E ⊆ V × V. Vertex s_0 is the source vertex, V_s is the set of sink vertices, and V_n is the set of internal vertices. Each sink vertex s ∈ V_s is associated with a sink capacitance c_s. Each edge e ∈ E is associated with a lumped resistance R_e and capacitance c_e. A buffer library B contains different types of buffers. Each type of buffer b has a cost w_b, which can be measured by area or any other metric, depending on the optimization objective. Without loss of generality, we assume that the driver at source s_0 is also in B. A function f: V_n → 2^B specifies the types of buffers allowed at each internal vertex.

The output slew of a buffer, such as b_u at v_i, depends on the input slew at this buffer and the load capacitance seen from the output of the buffer. For a fixed input slew, the output slew of buffer b at vertex v is given by

s_{b,out}(v) = R_b * c(v) + K_b    (6)

where c(v) is the downstream capacitance at v, and R_b and K_b are empirical fitting parameters. This is similar to empirically derived K-factor equations [50]. We call R_b the slew resistance and K_b the intrinsic slew of buffer b.

A buffer assignment is a mapping γ: V_n → B ∪ {b̄}, where b̄ denotes that no buffer is inserted. The cost of a solution γ is w(γ) = Σ_{b∈γ} w_b. With the above notation, the basic timerless buffering problem can be formulated as follows.

Timerless Buffering Problem: Given a Steiner tree T = (V, E) and a buffer library B, compute a buffer assignment γ that minimizes the total cost w(γ) subject to the input slew at each buffer or sink being no greater than a given constant.

B. A Timerless Buffering Algorithm
In the dynamic programming framework, a set of candidate solutions is propagated from the sinks toward the source along the given tree. Each solution is characterized by a three-tuple (c, w, s), where c denotes the downstream capacitance at the current node, w denotes the cost of the solution, and s is the accumulated slew degradation s_w defined in (5). At a sink node, the corresponding solution has c equal to the sink capacitance, w = 0, and s = 0.

The solution propagation is accomplished by the following operations. Consider propagating solutions from a node v to its parent node u through edge e = (u, v). A solution γ_v at v becomes a solution γ_u at u, computed as c(γ_u) = c(γ_v) + c_e, w(γ_u) = w(γ_v), and s(γ_u) = s(γ_v) + ln(9) * D_e, where D_e = R_e * (c_e/2 + c(γ_v)). In addition to keeping the unbuffered solution γ_u, a buffer b_i can be inserted at u to generate a buffered solution γ_{u,buf}, computed as c(γ_{u,buf}) = c(b_i), w(γ_{u,buf}) = w(γ_v) + w_{b_i}, and s(γ_{u,buf}) = 0.

When two sets of solutions are propagated through the left and right child branches to reach a branching node, they are merged. Denote the left-branch and right-branch solution sets by Γ_l and Γ_r, respectively. For each solution γ_l ∈ Γ_l and each solution γ_r ∈ Γ_r, the corresponding merged solution γ' is obtained as c(γ') = c(γ_l) + c(γ_r), w(γ') = w(γ_l) + w(γ_r), and s(γ') = max{s(γ_l), s(γ_r)}. To ensure that the worst case in the two branches still satisfies the slew constraint, we take the maximum slew degradation for the merged solution.

For any two solutions γ_1, γ_2 at the same node, γ_1 dominates γ_2 if c(γ_1) ≤ c(γ_2), w(γ_1) ≤ w(γ_2), and s(γ_1) ≤ s(γ_2). Whenever a solution becomes dominated, it is pruned from the solution set without further propagation. A solution γ can also be pruned when it is infeasible, i.e., when either its accumulated slew degradation s(γ) or the slew rate of any downstream buffer in γ exceeds the slew constraint.

When a buffer b_i is inserted into a solution γ, s(γ) is set to zero and c(γ) is set to c(b_i). This means that inserting one buffer may yield only one new solution, namely, the one with the smallest w. In minimum-cost timing buffering, by contrast, a buffer insertion may result in many nondominated (q, c, w) tuples with the same c value, where q denotes the required arrival time (RAT). Consequently, in timerless buffering, at each buffer position along a single branch, at most |B| new solutions can be generated through buffer insertion, since c and s are the same after inserting each buffer type. In contrast, buffer insertion in the same situation may introduce many new solutions in timing buffering. This sheds light on why timerless buffering can be computed much more efficiently. Another important fact is that the slew constraint is, in some sense, close to a length constraint.
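The propagation, buffering, and pruning operations above can be sketched for a single source-to-sink branch. The (c, w, s) tuples follow the text; the function names and the toy buffer parameters are illustrative assumptions, not the PDS implementation:

```python
import math
from collections import namedtuple

# A candidate solution (c, w, s): downstream capacitance, cost, and
# accumulated slew degradation. Buffer parameters follow eq. (6):
# output slew = R * c(v) + K.
Cand = namedtuple("Cand", "cap cost slew")
Buffer = namedtuple("Buffer", "cost R K in_cap")

LN9 = math.log(9)

def add_wire(cands, Re, Ce, slew_limit):
    # Propagate candidates up a wire: c += Ce, s += ln(9) * Elmore delay.
    out = []
    for c in cands:
        s = c.slew + LN9 * Re * (Ce / 2.0 + c.cap)
        if s <= slew_limit:              # infeasible solutions are dropped
            out.append(Cand(c.cap + Ce, c.cost, s))
    return prune(out)

def add_buffers(cands, library, slew_limit):
    # Inserting buffer b resets c and s, so only the cheapest feasible
    # candidate needs buffering: one new solution per buffer type.
    # (Input-slew feasibility is already maintained by add_wire.)
    out = list(cands)
    for b in library:
        feas = [c for c in cands if b.R * c.cap + b.K <= slew_limit]
        if feas:
            cheapest = min(feas, key=lambda c: c.cost)
            out.append(Cand(b.in_cap, cheapest.cost + b.cost, 0.0))
    return prune(out)

def dominates(a, b):
    return a.cap <= b.cap and a.cost <= b.cost and a.slew <= b.slew and a != b

def prune(cands):
    # Remove duplicates, then any candidate dominated in all three fields.
    uniq = list(dict.fromkeys(cands))
    return [c for c in uniq if not any(dominates(d, c) for d in uniq)]
```

Starting from a sink candidate Cand(c_sink, 0, 0), alternating add_wire and add_buffers up to the source and picking the cheapest surviving candidate gives the minimum-cost assignment along the branch; a merge step (summing c and w, taking the maximum s) extends the sketch to trees.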
In timerless buffering, solutions can quickly become infeasible if no buffer is added to them, so many solutions that are propagated only through wires are removed early. An extreme case demonstrates this point: in standard timing buffering, given a loose timing constraint, the solution with no buffer inserted can always survive until being pruned at the driver. This cannot happen in timerless buffering, where such solutions soon become infeasible as long as the slew constraint is not too loose. Due to these special characteristics of the timerless buffering problem, a linear time optimal algorithm for buffering with a single buffer type is possible; in timing buffering, it is not known how to design a polynomial time algorithm for this case. From these facts, the basic differences between these two related buffering problems are clear.

C. Results and Summary
Table 3 compares timerless buffering to timing-driven buffering for 1000 high-capacitance nets from an ASIC design, for slew constraints ranging from 0.4 to 2.0 ns. A library of 48 buffers was used. The experiment shows that timerless buffering does result in a consistent degradation in slack, which is not surprising since it does not utilize timing information. Because timerless buffering minimizes area in its objective function, it is more efficient in buffering area and the number of buffers used. The area savings tend to increase as the slew constraint is relaxed. Finally, the CPU time advantage is clear, as speedups of 25x to over 100x are observed. The timing-driven buffering used here does utilize pre-slack pruning and squeeze pruning, but not library lookup; obviously, the latter technique would reduce the advantage somewhat.

Since electrical correction can result in millions of buffers being inserted, one needs to do this as fast as possible. Even with the speedups in Section III, a delay-driven technique is not suitable for this task.
Instead, using a timerless formulation that seeks to minimize area proves significantly faster and actually uses less area. Ultimately, one needs a large bag of buffering solutions depending on where one is in the physical synthesis flow. For early electrical correction, a faster timerless algorithm is appropriate. For critical path optimization, a van Ginneken style algorithm is needed. However, one often needs to pay attention to the blockages or the placement and routing congestion that may exist in the design. The next section presents a framework for dealing with any of these layout characteristics.

Table 3. Comparison of Timerless Buffering With Timing-Driven Buffering

VI. LAYOUT-AWARE FAST AND FLEXIBLE BUFFER TREES
Given a Steiner tree, we can insert buffers for critical path optimization using timing-driven buffering, or for electrical correction using timerless buffering. The quality of the results strongly depends on the Steiner tree used, so we use a buffer-aware tree construction as described in [39]. However, this construction ignores the blockages and congestion present in the layout, which can potentially cause several design headaches.

A. Types of Layout Issues
For example, Fig. 11(a) illustrates the "alley" problem, in which space is limited between two large fixed blocks. The space between the blocks is highly desirable, since routes that cross the blockages have their only potential insertion space in the alley. Fig. 11(b) shows the buffer "pile-up" phenomenon. Several nets may want buffers inserted in the black congested region, yet since there is no space for buffers there, the buffers are inserted as close to the boundary as possible. As more nets are optimized, these buffers pile up and spiral out further from their ideal locations. This could be alleviated by allowing only buffers from critical path optimization (not electrical correction) to use these scarce resources.

As technology continues to scale, the optimal distance between consecutive buffers continues to decrease. In hierarchical design, this means allocating space within macro blocks for buffering of global nets. An example is shown in Fig. 12(a). The space for buffers is potentially limited, so non-critical nets should be routed around the blocks while critical ones can use the holes. Long non-critical nets still require buffers to fix slew and/or capacitance violations. In addition, these nets could be critical, but have a wide range of possible buffering solutions that may bring them into the non-critical group.
In the figure, the top net is non-critical and requires three buffers, while the bottom net is critical and needs only two by exploiting holes punched in the block. Even without holes in a block, designs may have pockets of low density in which inserting buffers is preferred, as shown in Fig. 12(b). In the figure, the Steiner route is located in the low-density part of the chip, which lets the buffers inserted along the route also use low-density regions. Fig. 12(c) shows an example where one may be willing to insert buffers in high-density regions if a net is critical. The 2-buffer route above the block yields faster delays than the 4-buffer route below the block, which is better suited for non-critical nets. Finally, Fig. 12(d) shows routing congestion between two blocks; the preferred buffered route avoids this congestion without sacrificing timing.

Fig. 11. Buffer insertion can potentially: (a) fill up constrained alleys and (b) cause buffer pile-ups.

Fig. 12. Some environment-based constraints include: (a) holes in large blocks; (b) navigating large blocks and dense regions; (c) distinguishing between critical and non-critical preferred routes; and (d) avoiding routing congestion.

There are some buffering approaches that attack a subset of these types of problems by simultaneously integrating the layout environment, building a Steiner tree, and buffering (e.g., [51], [52]), but doing so much work at once inherently makes these algorithms too inefficient for this application. Instead, we propose the following flow:
Step 1: construct a fast timing-driven Steiner tree (e.g., [39]) that is ignorant of the environment.
Step 2: reroute the Steiner tree to preserve its topology while navigating environmental constraints.
Step 3: insert buffers via the algorithms in Section III or V.
This section focuses on solving the problem in Step 2.

B. Rerouting Algorithm Overview
To reroute the tree, the design area is divided into tiles, as in global routing, and the placement and routing density characteristics are stored for each tile. The algorithm takes the existing Steiner tree and breaks it into disjoint 2-paths, i.e., paths which start and end with either the source, a sink, or a Steiner point, such that every internal node has degree two. For example, the nets shown in Fig. 13(a) and (b) both decompose into three 2-paths. Finally, each 2-path is rerouted in turn to minimize cost,

starting from the sinks and ending at the source. The new Steiner tree is assembled from the new 2-path routes. Essentially, the algorithm performs maze routing for each subsection of the tree. The two key components of achieving a good result are plate expansion, which allows the Steiner points to migrate, and deriving the right maze routing cost function.

If a Steiner point is in a congested region, it needs to migrate from its original location. One could consider allowing it to move anywhere in the layout, but since the original Steiner layout was presumably "good," we restrict it to move only within a specified "plate" region. This is one key to keeping the algorithm efficient. The plate needs to be large enough to enable the Steiner point to migrate to a less congested tile. During maze rerouting, one considers routing to any tile in the plate instead of just the original tile. Fig. 13(a) shows a routing tree after Step 1. The striped tile is the Steiner point, and the shaded region shows a 5x5 plate centered at the original Steiner point. Fig. 13(b) shows a Steiner tree that might result after rerouting. The Steiner point has moved to a different location within the plate; where it ends up depends on the cost function being optimized. The dotted region shows the potential search space for the rerouting of the 2-path from the Steiner point to the source. In this case, the bounding box containing the two endpoints was expanded by one tile.

Fig. 13. Example of a three-pin net: (a) before and (b) after rerouting. The shaded square region is the plate and the dotted region is the solution search space for the final 2-path.

C. Maze Routing Cost Function for Electrical Correction
Each tile is assigned a cost that should reflect the potential of inserting a buffer in and/or routing through the tile. Let e(t) ≤ 1 be the environmental cost of using tile t, where e(t) = 0 if the tile is totally void of any resource utilization, while e(t) = 1 represents a fully utilized tile. As an example, for placement congestion, let d(t) be the placement density of tile t (cell area divided by total area available) and let r(t) be its routability (used tracks divided by total tracks available). Then one could use

e(t) = α * d(t)^2 + (1 − α) * r(t)^2    (7)

where 0 ≤ α ≤ 1 trades off between routing and placement cost. For fixing electrical violations, one wants the net to avoid high-cost tiles while still attempting to minimize wirelength. For this case, consider

cost(t) = 1 + e(t).    (8)

This cost function implies that a fully utilized tile has twice the cost of a tile that uses no resources. The constant of one can be viewed as a "delay component." Let the cost of a path be the sum of the costs of all tiles in the path, and initially assign all sinks zero cost. We wish to minimize the cost of the entire tree being constructed. For a tile t that corresponds to a Steiner point with subtree children L and R, the cost of the tree rooted at t is cost(t) = cost(L) + cost(R).

D. Maze Routing Cost Function for Critical Path Optimization
For critical nets, the cost impact of the environment is relatively immaterial. We seek the absolute best possible slack, but still need the route to avoid regions where buffers cannot be inserted at all. When a net is optimally buffered (assuming no obstacles), its delay is a linear function of its length [53]. Of course, this solution must be realizable. To minimize delay, we simply minimize the number of tiles to the most critical sink. Thus, the cost for a tile is just cost(t) = 1 (there is no e(t) term). When merging branches, one wants to choose the branch with the worst slack, so the merged cost is cost(t) = max(cost(L), cost(R)). To initialize the slack, a notion of which sink is critical is needed. Since our cost function basically counts tiles as delay, the required arrival time (RAT) must be converted to tiles.
Let DpT be the minimum delay per tile achievable on an optimally buffered line. For a sink s, cost(s) is initialized to -RAT(s)/DpT. The more critical a sink, the higher its initial cost. The objective is to minimize cost at the source.

Fig. 14(a) shows one of several possible solutions for rerouting the net in Fig. 13 using this cost function, where s_2 is considered two tiles more critical than s_1. Note that it achieves a shortest path to s_2. Contrast that with the electrical correction cost function shown in Fig. 14(b), in which the "blob" represents an area of high cost. In this case, the route avoids the congested area even though it means the route to the critical sink is much longer.
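The tile costs of Sections C and D are simple enough to state directly in code. A minimal sketch, treating costs as plain floats; the function names are ours, and the tradeoff weight alpha is the parameter of (7):

```python
def env_cost(d_t, r_t, alpha=0.5):
    # e(t) of eq. (7): squared blend of placement density d(t) and
    # routing utilization r(t), each in [0, 1].
    return alpha * d_t ** 2 + (1.0 - alpha) * r_t ** 2

def ec_tile_cost(d_t, r_t, alpha=0.5):
    # Electrical correction, eq. (8): cost(t) = 1 + e(t). The constant 1
    # is a delay component, so a full tile costs twice an empty one.
    return 1.0 + env_cost(d_t, r_t, alpha)

def ec_merge(cost_L, cost_R):
    # Electrical correction merges subtrees additively.
    return cost_L + cost_R

def crit_merge(cost_L, cost_R):
    # Critical nets: each tile costs 1, and merging keeps the branch
    # with the worst (highest) cost, i.e., the more critical side.
    return max(cost_L, cost_R)

def crit_sink_cost(rat, delay_per_tile):
    # cost(s) = -RAT(s)/DpT: RAT converted to tiles; the more critical
    # the sink (smaller RAT), the higher its initial cost.
    return -rat / delay_per_tile
```

The same pair of merge rules reappears, blended by the parameter K, in the general cost function of Section E.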

E. General Cost Function
The previous cost functions represent extremes of behavior; however, one can trade off between the two. Let 0 ≤ K ≤ 1 be the tradeoff parameter, where K = 1 corresponds to electrical correction and K = 0 corresponds to a critical net. The cost function for tile t is then

cost(t) = 1 + K * e(t).    (9)

For critical nets, merging branches is a maximization function, while it is an additive function for non-critical nets. These ideas can be combined with K to yield

cost(t) = max(cost(L), cost(R)) + K * min(cost(L), cost(R)).    (10)

Finally, the sink initialization formula becomes

cost(s) = (K − 1) * RAT(s)/DpT.    (11)

Thus, K trades off the cost function, the merging operation, and the sink initialization. In practice, we use K = 1 for electrical correction and subsequently smaller values down to K = 0.1 for critical path optimization.

F. Slew Threshold Constraint
As described, the maze routing cost functions do not guarantee that slew constraints will be satisfied. Let T be the maximum number of tiles that can be driven by a buffer before the slew constraint is violated. If the route goes over more than T consecutive blocked tiles, there will be an unavoidable slew violation when buffering. Hence, during maze routing we track the number of consecutive blocked tiles and forbid it from exceeding T by not performing node expansion once this threshold is reached. This guarantees that the resulting Steiner tree will have sufficient area for buffers, so that slew violations can be fixed by the subsequent dynamic programming.

Fig. 14. Examples of the (a) critical and (b) non-critical net cost functions. The shaded area represents a region of high cost.

G. Example and Summary
The effect of rerouting can be shown by the example in Fig. 15, which displays the placement density map for a given 7-pin net of an industrial design. The source is marked with a white x, while sinks are marked with dark squares. The white dots are potential buffer insertion locations, and the diamonds are the inserted buffers. The route on the left is the solution with K = 1.0, while the one on the right is the solution for K = 0.1. Observe that the left route totally avoids the large blockage, which ultimately leads to a 4134 ps slack improvement over the unbuffered solution. However, when K = 0.1, the route successfully finds the prime real estate (the holes inside the block) and places buffers in them where it deems appropriate. This improves the slack by 4646 ps. A simple parameter setting of the cost function thus yields a different Steiner route that can recognize layout constraints depending on the particular phase of physical synthesis.

Fig. 15. Illustration of the different routes obtained with the general maze routing cost function for a layout containing a large block with punched-out holes. (a) A routed net with K = 1.0. (b) The same net with K = 0.1.

Optimizations that ignore the layout can cause severe headaches for timing closure and routability. The maze rerouting technique proposed in this section is general enough to handle any kind of layout configuration, whether blockages, regions packed with dense cells, or routing congestion. One does not need to deploy it throughout physical synthesis, though. Instead, one could wait for the "mess" and then clean it up. For example, PDS has a phase that identifies all buffers in routing-congested regions, rips up those buffers, then reroutes them using this maze routing strategy. This clean-up-the-mess strategy enables more efficient overall optimization than trying to preemptively avoid the mess at every step. The next section explains how a different kind of legalization algorithm is

more effective at cleaning up messes made by synthesis operations.

VII. DIFFUSION-BASED PLACEMENT TECHNIQUES FOR LEGALIZATION
During electrical correction and critical path optimization, some gates may be resized while new ones are inserted into the design. PDS does not assign a location right away, but rather assigns a preferred location that may overlap existing cells. Periodically, legalization needs to run to snap these cells from overlapping to legal locations. If one waits too long between legalization invocations, cells may end up quite far from their preferred locations, which may severely hurt timing. This section discusses a new legalization paradigm called diffusion, first described in [28]. Diffusion tries to avoid this behavior by keeping the relative ordering of the cells intact.

Of course, there are other methods that can also achieve legalization without moving any one cell too far. Brenner et al. [54] describe a network flow algorithm that superimposes a flow network on top of grid bins and then flows cells from overly dense bins to bins that are under capacity. More recently, Luo et al. superimpose a Delaunay triangulation on top of the cells and use this structure to enforce relative order while achieving local density targets. Techniques for local cell movement, swapping, and shifting to improve placement quality after legalization can be found in [55], [56].

During optimization, local regions can become overfull, at which point synthesis, buffering, and repowering optimizations may become handcuffed if they are forbidden to add to the area in an already full bin. The main advantage of diffusion is that it allows the optimizations to proceed anyway, knowing that cells will not be moved too far from their intended locations. Further, diffusion can be implemented and run in just a few minutes, even on designs with millions of gates.
Diffusion is a well-understood physical process that moves elements from a state with non-zero potential energy to a state of equilibrium. The process can be modeled by breaking the movements down into several small finite time steps, then moving each element the distance it would be expected to move during that time step. Our legalization approach follows this model; it moves each cell a small amount in a given time step according to its local density gradient. The more time steps the process runs, the closer the placement gets to equilibrium.

Assume that a placement is close to legal if all that is required to legalize it is to snap cells to rows or perhaps perform minor cell sliding in order to fit the cells. Also, assume the chip layout is divided into small, equally sized bins which can each fit around 5-15 cells. Let d_max be the maximum allowed density of a bin, where commonly d_max = 1. The placement is considered close to legal if the area density of every bin is less than or equal to d_max. For all bins with density greater than d_max, cells must be migrated out of those bins into less dense ones. The goal of legalization is to reduce the density of each bin to no more than d_max while avoiding moving cells far from their original locations, and also to preserve the ordering induced by the original placement. Once each bin satisfies its density requirement d_{j,k} ≤ d_max, a legal placement solution can generally be achieved easily (since each bin is guaranteed sufficient space), e.g., through local slide and spiral optimization.

A. The Diffusion Process
Diffusion is driven by the concentration gradient, i.e., the slope and steepness of the concentration difference at a given point. The increase in concentration in a cross section of unit area over time is simply the difference between the material flow into the cross section and the material flow out of it.
Diffusion reaches equilibrium when the material concentration is evenly distributed. Mathematically, the relationship of material concentration with time and space can be described using the following partial differential equation:

∂d_{x,y}(t)/∂t = ∇² d_{x,y}(t)    (12)

where d_{x,y}(t) is the material concentration at position (x, y) at time t. Equation (12) states that the speed of density change is linear with respect to the second-order gradient over the density space. In the context of placement, cells move more quickly when their local density neighborhood has a steeper gradient. When the region for diffusion is fixed (as in placement), the boundary conditions are defined as ∇d_{x_b,y_b}(t) = 0 for coordinates (x_b, y_b) on the chip boundary. We also define coordinates over fixed blocks in the same way in order to prevent cells from diffusing on top of fixed blocks; this forces cells to diffuse around the blocks.

In diffusion, a cell migrates from an initial location to its final equilibrium location via a non-direct route. This route can be captured by a velocity function that gives the velocity of a cell at every location in the circuit for a given time t. The velocity at a certain position and time is determined by the local density gradient and the density itself. Intuitively, a sharp density gradient causes cells to move faster. For every potential (x, y) location, define a 2-D velocity field v_{x,y} = (v^H_{x,y}, v^V_{x,y}) of diffusion at time t as follows:

v^H_{x,y}(t) = -(∂d_{x,y}(t)/∂x) / d_{x,y}(t)
v^V_{x,y}(t) = -(∂d_{x,y}(t)/∂y) / d_{x,y}(t).    (13)

Given this velocity field and a starting location (x(0), y(0)) for a particular element, one can find its new location (x(t), y(t)) at time t by integrating the velocity field:

x(t) = x(0) + ∫₀ᵗ v^H_{x(t'),y(t')}(t') dt'
y(t) = y(0) + ∫₀ᵗ v^V_{x(t'),y(t')}(t') dt'.    (14)

Equations (12)-(14) are sufficient to simulate the diffusion process: given any particular element, one can now find its new location at any point in time t. To apply this paradigm to placement, one needs to migrate from this continuous space to a discrete one, since cells have various rectangular sizes and the placement image itself is discrete. The next section presents a technique to simulate diffusion specifically for placement.

B. Diffusion-Based Placement
One can discretize the continuous coordinates by dividing the placement area into equal-sized bins indexed by (j, k). Assume the coordinate system is scaled so that the width and height of each bin is one; then location (x, y) lies inside bin (j, k) = (⌊x⌋, ⌊y⌋). We can also discretize continuous time t as nΔt, where Δt is the size of the discrete time step. Instead of the continuous density d_{x,y}, we can now describe diffusion in terms of the density d_{j,k} of bin (j, k). The initial density d_{j,k}(0) of each bin (j, k) is defined as d_{j,k}(0) = Σᵢ Âᵢ, where Âᵢ is the overlapping area of cell i and bin (j, k). For simplicity, assume that if a fixed block overlaps a bin, it overlaps the bin completely. In these cases, the bin density is defined to be one, though boundary conditions prevent cells from diffusing on top of fixed blocks.

Assume that the density d_{j,k}(n) has already been computed for time step n. One now needs to find how the density changes and how the cells move for the next time step n + 1. We use the Forward Time Centered Space (FTCS) [57] scheme to discretize (12).
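The discretization that this section develops, eqs. (15)-(17), translates almost line-for-line into code. The sketch below builds initial bin densities from cell rectangles, performs one FTCS step, and moves a cell by its bin's velocity. Assumptions: unit-square bins with cell coordinates already in bin units, per-bin velocities (the full method interpolates over the four nearest bins), and an illustrative step size dt:

```python
def bin_densities(cells, J, K):
    # d_{j,k}(0): sum of the overlap area between each cell rectangle
    # (x1, y1, x2, y2) and each unit-square bin (j, k).
    d = [[0.0] * K for _ in range(J)]
    for (x1, y1, x2, y2) in cells:
        for j in range(max(int(x1), 0), min(int(x2) + 1, J)):
            for k in range(max(int(y1), 0), min(int(y2) + 1, K)):
                ox = max(0.0, min(x2, j + 1) - max(x1, j))
                oy = max(0.0, min(y2, k + 1) - max(y1, k))
                d[j][k] += ox * oy
    return d

def diffusion_step(d, dt=0.2):
    # One FTCS density update, eq. (15), plus bin velocities, eq. (16).
    J, K = len(d), len(d[0])
    at = lambda j, k: d[min(max(j, 0), J - 1)][min(max(k, 0), K - 1)]
    d_new = [[0.0] * K for _ in range(J)]
    vh = [[0.0] * K for _ in range(J)]
    vv = [[0.0] * K for _ in range(J)]
    for j in range(J):
        for k in range(K):
            lap = (at(j + 1, k) + at(j - 1, k) - 2 * d[j][k]
                   + at(j, k + 1) + at(j, k - 1) - 2 * d[j][k])
            d_new[j][k] = d[j][k] + dt / 2.0 * lap
            if d[j][k] > 0:
                # velocity = -gradient / density
                vh[j][k] = -(at(j + 1, k) - at(j - 1, k)) / (2 * d[j][k])
                vv[j][k] = -(at(j, k + 1) - at(j, k - 1)) / (2 * d[j][k])
            if j in (0, J - 1):
                vh[j][k] = 0.0   # zero velocity on the chip boundary
            if k in (0, K - 1):
                vv[j][k] = 0.0
    return d_new, vh, vv

def move_cell(x, y, vh, vv, dt=0.2):
    # Eq. (17): advance a cell by the velocity of its bin.
    j, k = int(x), int(y)
    return x + vh[j][k] * dt, y + vv[j][k] * dt
```

Note that the clamped-neighbor lookup realizes the zero-gradient boundary condition, which also keeps the total density (cell area) conserved across steps.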
The new bin density is given by

d_{j,k}(n+1) = d_{j,k}(n) + (Δt/2)[d_{j+1,k}(n) + d_{j−1,k}(n) − 2d_{j,k}(n)]
             + (Δt/2)[d_{j,k+1}(n) + d_{j,k−1}(n) − 2d_{j,k}(n)].    (15)

The new density of a bin at time n + 1 depends only on its own density and the densities of its four neighboring bins. Note that one does not actually use the cell locations at time n + 1 to compute the density.

Just as (12) can be discretized to compute placement bin density, (13) can be discretized to compute the velocity of the cells inside the bins. For now, assume that each cell in a bin is assigned the same velocity, the velocity for the bin, given by

v^H_{j,k}(n) = −[d_{j+1,k}(n) − d_{j−1,k}(n)] / [2 d_{j,k}(n)]
v^V_{j,k}(n) = −[d_{j,k+1}(n) − d_{j,k−1}(n)] / [2 d_{j,k}(n)].    (16)

The horizontal (vertical) velocity is proportional to the difference in density of the two neighboring horizontal (vertical) bins. To make sure that fixed cells and bins outside the boundary do not move, we enforce v^V = 0 at a horizontal boundary and v^H = 0 at a vertical boundary.

Assigning each cell in a bin the same velocity fails to distinguish between the relative locations of cells within a bin. Further, two cells that are right next to each other but in different bins could be assigned very different velocities, which could change their relative ordering. Since the goal of placement migration is to preserve the integrity of the original placement, this behavior cannot be permitted. To remedy it, we apply velocity interpolation to generate a horizontal (vertical) velocity v^H_{x,y} (v^V_{x,y}) for a given (x, y). The interpolation looks at the four closest bins for each cell and interpolates from the velocities assigned to each of those bins, generating a unique velocity vector for a cell at location (x, y).

Finally, since the velocity for each cell can be determined at time step n = t/Δt, one can compute its new placement via a discretized form of (14). Suppose at time step n a cell has location (x(n), y(n)).
Its location at the next time step is given by

x(n+1) = x(n) + v^H_{x(n),y(n)} * Δt
y(n+1) = y(n) + v^V_{x(n),y(n)} * Δt.    (17)

An example is shown in Fig. 16, in which a cell takes nine discrete time steps. Observe how the cell never overlaps a blockage, and how the magnitude of its movements becomes smaller toward the tail end of its path.

C. Making it Work
Since the diffusion process reaches equilibrium when each bin has the same density, we can expect the final

22 density after diffusion to be the same as the average density d j;k =N. This can cause unnecessary spreading, even if every bin s density is well below d max. This additional spreading will no doubt degrade the placement quality of results. Essentially, what we would like is to run diffusion for the regions which require it, perhaps for legalization or even to remove routing congestion while leaving the rest of the design (which may be in very good shape) alone. The idea of local diffusion is to only run diffusion on cells in a window around bins that violate the target density constraint. Local diffusion also has the advantages of less work to do each iteration and faster convergence. Although we use (15) to compute bin densities during diffusion, the computed densities are not exactly the same as the real placement densities. The mathematics of the diffusion process [(15), (16), and (17)] assume continuously distributed equal size particle distribution. However, the real standard cell distribution does not always satisfy this condition. This happens because cells are not equally distributed inside a bin and because cells have different sizes. Periodically, one should update the density based on the real cell placement when the error exceeds a certain threshold, then restart the diffusion algorithm from the new placement map. illegal placement. Finally, the bottom figure shows the result of diffusion based legalization, in which the continuity of the colored regions is relatively well preserved. This example illustrates that diffusion is able to perform a smooth spreading, which is less disruptive to the state of the design. To see how effective diffusion-based legalization can be in a physical synthesis engine, we ran PDS physical synthesis optimization on seven ASIC testcases in which we did not legalize at all during the run. This results in a large amount of overlaps caused by physical synthesis. 
We ran a greedy and flow-based legalizer for comparison and measure the best results obtained by those approaches [28]. Compared to the traditional approaches, diffusion averages about 4% improvement in the total wirelength of the design. Further, the timing of the worst slack path is 48% better on average and the overall number of negative paths is 36% better. The improvement can be observed for all seven designs. The ability of diffusion to minimize timing degradation, to smoothly spread out the placement, and to attack local hotspots of either placement or routing congestion makes it a powerful technique for physical synthesis. For starters, one can afford to run legalization less often since diffusion is less likely to significantly disrupt the state of the design. D. Diffusion Summary Fig. 17 shows an example of diffusion-based legalizationinaregionsurroundedbyotherplacedcellsandfixed blocks. The top-left figure shows an initial illegal placement in which the colored regions represent areas of cell overlap. The top-right figure shows what happens when traditional legalization is invoked. Observe how the integrity of the regions is no longer preserved as the colored cells mix. This shows how some cells can move quite far away from their neighboring cells from the top Fig. 16. An example cell movement from diffusion. VIII. CONCLUSION A. Impact of the Stages of Physical Synthesis This paper discussed various techniques to achieve fast physical synthesis which may be applied in all the phases of physical synthesis. Recall the four main phases that we are considering in this paper are: 1) initial placement and optimization; 2) timing-driven placement and optimization; 3) timing-driven detailed placement; 4) optimization techniques. One need not apply all the techniques in performing design closure, and frequently designers mix and match the pieces depending upon their needs. For example, the first phase is especially useful during the floorplanning process. 
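As an aside, the diffusion mechanics of Sections VII-C and VII-D can be made concrete with a small sketch. This is a minimal, assumed discretization: smooth() stands in for the discretized diffusion step (15) with reflective boundaries, and velocity() uses the raw negative density gradient (the paper's (16) additionally scales the gradient by the inverse density); the helper names are illustrative, not from the PDS implementation.

```python
# Minimal sketch of diffusion-based spreading on a 2-D bin map.
# smooth() is a stand-in for the discretized diffusion step (15);
# velocity() approximates (16) with the raw negative density gradient.

def smooth(d, rate=0.1):
    """One diffusion step with reflective boundaries (mass-conserving)."""
    n, m = len(d), len(d[0])
    new = [row[:] for row in d]
    for i in range(n):
        for j in range(m):
            up, dn = d[max(0, i - 1)][j], d[min(n - 1, i + 1)][j]
            lt, rt = d[i][max(0, j - 1)], d[i][min(m - 1, j + 1)]
            new[i][j] = d[i][j] + rate * (up + dn + lt + rt - 4.0 * d[i][j])
    return new

def velocity(d, i, j):
    """Central-difference estimate of the negative density gradient."""
    n, m = len(d), len(d[0])
    vx = -0.5 * (d[i][min(m - 1, j + 1)] - d[i][max(0, j - 1)])
    vy = -0.5 * (d[min(n - 1, i + 1)][j] - d[max(0, i - 1)][j])
    return vx, vy

def diffuse(d, cells, dt=0.1, steps=9):
    """Advance each cell [x, y] per (17): p(n+1) = p(n) + v(p(n)) * dt,
    re-smoothing the density map after every time stamp."""
    for _ in range(steps):
        for c in cells:
            j = max(0, min(int(c[0]), len(d[0]) - 1))
            i = max(0, min(int(c[1]), len(d) - 1))
            vx, vy = velocity(d, i, j)
            c[0] += vx * dt
            c[1] += vy * dt
        d = smooth(d)
    return d, cells

# A single overfull bin in the middle: its density spreads out, and a
# cell just left of the hot spot drifts further left in shrinking steps,
# as in Fig. 16.
density = [[0.0] * 3 for _ in range(3)]
density[1][1] = 9.0
final_d, final_cells = diffuse(density, [[0.0, 1.0]])
```

Because the velocity falls off with the density gradient, the cell's steps shrink as the map flattens, matching the tapering path described for Fig. 16.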
The designer may wish to find the locations of large blocks and also restrict the movement of key logic. Through placement and optimization, the designer can reasonably evaluate the quality of the floorplan. If the designer is happy with this result, he or she may skip all the way to the last phase to push down the timing on any remaining critical paths.

In general, the timing after performing the first step will be far from achieving closure; e.g., the cycle time may be double what is required by the design specifications. Performing timing-driven placement and optimization generally helps significantly and results in many fewer negative paths. The third stage generally does not help timing but may improve wiring by anywhere from 2% to 5%, and this can make a huge difference in achieving a routable design. Finally, unless the design is for some reason "easy," the last stage of optimization is critical for actually achieving timing closure. Designers exploit this stage the most during their iterations as they tweak the design. If only minor changes are required, going back to global placement would be far too disruptive and could potentially put the design in a completely different state. The ability to iterate and perform in-place synthesis is critical in garnering the last bit of performance out of the design. However, if the timing of the design is in really bad shape, optimization alone will not be able to close on timing. The designer must go back and iterate on the floorplan and global placement steps.

Fig. 17. Diffusion-based legalization example.

B. Future Directions

Physical synthesis is a runtime-intensive, complex system that requires the integration and cooperation of several types of algorithms and functions. Exacerbating the turnaround time problem is that design sizes will likely soon move from the millions to tens of millions of placeable objects. There are numerous research directions in the timing closure space that we believe are worth pursuing to achieve both faster runtime and higher quality of results. In general, achieving better quality can also be a great way to achieve a faster system, as the back-end optimization could have far fewer negative paths to work on. Some promising research directions include the following.

1) Better net weighting for timing-driven placement. For example, consider two critical paths A and B, both

of which are equally critical, but A spends 80% of its delay traversing fixed blocks and 20% through moveable logic, while B spends 20% and 80% in fixed and moveable logic, respectively. In this case, A does not have much room for error, as placement can only improve the 20% of the logic that is moveable, while B has considerably more opportunity for placement to straighten out the 80% of logic that it can affect. Thus, net weighting should give more priority to nets in path A than in path B. There are numerous other scenarios that can be studied and modeled to improve net weighting.

2) Removing a global placement. In the flow described, placement is run twice. If clever net weighting and crude placement estimation are used, it may be possible to significantly improve runtime by skipping a placement step altogether and still retain solution quality.

3) Latch pipeline placement. As designs require multiple cycles to get from one side of the chip to the other, placement needs to recognize that latches must be placed in such a way as to guarantee that one can get from one latch to another within the given cycle time. For example, assume latch A drives latch B, which drives latch C, and A is fixed on the left side and C is fixed on the right. If B is too close to A, then the path from B to C becomes critical. If one applies a higher net weight to the connection from B to C, then B may be moved too close to C, and then the A-to-B path becomes critical. One has to teach placement to find an appropriate balance, and it is unlikely that net weighting alone can achieve this kind of result.

4) "Do no harm" detailed placement. Detailed placement is a powerful technique for improving wirelength but typically does not improve timing. In fact, it is risky to run it late in the fourth stage of the flow because it may worsen paths that were already carefully optimized. The idea of "do no harm" detailed placement [58] is to recognize moves that degrade the timing and forbid them, while only accepting moves that improve wirelength and timing.

5) Force-directed placement. As discussed earlier, force-directed placement is emerging as a promising technique both in terms of quality ([7], mPL [8], mFAR [9]) and speed [10]. This technique also has the advantage of stability, in that small changes to net weights likely will not create entirely different global placements. Its spreading ability (like that of diffusion) makes it appealing for handling incremental netlist changes.

6) Parallelism. As designs truly become large, they can potentially be partitioned into smaller physical pieces that do not require an inordinate amount of cross-partition communication. One can then apply physical synthesis on each piece relatively independently. While this approach seems simple enough, it is fraught with choices, any of which could lead to a significantly degraded solution. One must be careful with the partition pin assignment, buffering strategy, and timing contracts between partitions.

7) Complex transforms. Transforms that perform multiple operations simultaneously could potentially have a big impact on timing. For example, consider a cell B on the left side connected to cells A and C on the right side. Clearly B wants to be near A and C, but if the nets connected to B have already been buffered, those buffers act as anchors that keep B from moving to the right. One needs to rip up the buffer trees, then consider moving B, then put the buffer trees back in to evaluate whether this was worthwhile. Another example is simultaneous buffering and cloning.

This list is just a sampling of possible research directions. As design technology scales to 65 nm and below, the problem of timing closure will continue to evolve into the even more complex problem of design closure.
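The acceptance rule behind "do no harm" detailed placement (research direction 4 above) can be sketched as follows. This is an illustrative sketch, not the Hippocrates implementation of [58]: the delta-evaluation callbacks are hypothetical placeholders standing in for a real wirelength estimator and incremental timing engine.

```python
# Sketch of the "do no harm" move-acceptance rule: commit a candidate
# move only if it improves wirelength AND does not degrade timing.
# wirelength_delta / slack_delta are placeholder callbacks, assumed
# to return the change a move would cause.

def do_no_harm_filter(moves, wirelength_delta, slack_delta):
    """Return the moves that improve wirelength without hurting timing."""
    accepted = []
    for m in moves:
        dwl = wirelength_delta(m)   # negative => wirelength improves
        dsl = slack_delta(m)        # negative => worst slack degrades
        if dwl < 0 and dsl >= 0:
            accepted.append(m)
    return accepted

# Toy usage with precomputed (wirelength, slack) deltas per move.
deltas = {
    "swap_a_b": (-5.0,  0.0),  # better wirelength, timing-neutral
    "shift_c":  (-3.0, -0.2),  # better wirelength but degrades slack
    "rotate_d": ( 2.0,  0.1),  # worse wirelength
}
kept = do_no_harm_filter(list(deltas),
                         lambda m: deltas[m][0],
                         lambda m: deltas[m][1])
```

Only the first move survives: it helps wirelength and leaves the worst slack untouched, while the other two either hurt timing or hurt wirelength.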
Design closure requires that accurate modeling of the clock tree network and routing be incorporated earlier and earlier up the physical synthesis pipeline, to take into account their effects on timing and signal integrity. The need to meet a global power constraint, e.g., by incorporating multithreshold logic gates and voltage islands, also becomes more critical. One must pay attention to how physical design choices impact manufacturability. Requiring physical synthesis to meet and incorporate these additional constraints only further exacerbates the runtime issue. Therefore, research that discovers more efficient techniques for core physical synthesis optimizations, such as placement, buffering, legalization, repowering, incremental timing, routing, and clock tree synthesis, will continue to be of high value.

Acknowledgment

The PDS physical synthesis system has had many contributors over the years. The authors sincerely thank everyone who has helped, both in driving the work presented here and through overall contributions to IBM's PDS tool. These contributors include Lakshmi Reddy, Ruchir Puri, David Kung, Leon Stok, Charles Bivona, Louise Trevillian, Michael Kazda, Pooja Kotecha, Nate Heiter, Erik Kusko, Mike Dotson, Carl Hagen, Zahi Kurzum, Gopal Gandham, Stephen Quay, Tuhin Mahmud, Jiang Hu, Milos Hrkic, Kristian Zoerhoff, William Dougherty, Brian Wilson, Bryon Wirtz, Tony Drumm, Elaine D'Souza, Shyam Ramji, Alex Suess, Jose Neves, Veena Puresan, Arjen Mets, Andrew Sullivan, Jim Curtain, David Geiger, Tsz-mei Ko, and Pete Osler.

REFERENCES

[1] L. Trevillyan, D. Kung, R. Puri, L. N. Reddy, and M. A. Kazda, "An integrated environment for technology closure of deep-submicron IC designs," IEEE Des. Test Comput., vol. 21, no. 1, Jan.-Feb.
[2] P. G. Villarrubia, "Physical design tools for hierarchy," in Proc. ACM Int. Symp. Physical Design.
[3] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, "Repeater scaling and its impact on CAD," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, Apr.
[4] J. Cong, Z. D. Kong, and T. Pan, "Buffer block planning for interconnect planning and prediction," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, Dec.
[5] C. J. Alpert, J. Hu, S. S. Sapatnekar, and P. G. Villarrubia, "A practical methodology for early buffer and wire resource allocation," in Proc. Design Automation Conf.
[6] G.-J. Nam, C. J. Alpert, P. G. Villarrubia, B. Winter, and M. Yildiz, "The ISPD2005 placement contest and benchmark suite," in Proc. ACM Int. Symp. Physical Design, 2005.
[7] A. B. Kahng and Q. Wang, "Implementation and extensibility of an analytic placer," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 5, May.
[8] T. Chan, J. Cong, T. Kong, J. Shinnerl, and K. Sze, "An enhanced multilevel algorithm for circuit placement," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2003.
[9] B. Hu and M. M. Sadowska, "Fine granularity clustering-based placement," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, Apr.
[10] N. Viswanathan and C.-N. Chu, "FastPlace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model," in Proc. ACM Int. Symp. Physical Design, 2004.
[11] B. Halpin, C. Y. R. Chen, and N. Sehgal, "Timing driven placement using physical net constraints," in Proc. IEEE/ACM Design Automation Conf., 2001.
[12] R.-S. Tsay and J.
Koehl, "An analytic net weighting approach for performance optimization in circuit placement," in Proc. IEEE/ACM Design Automation Conf., 1991.
[13] X. Yang, B.-K. Choi, and M. Sarrafzadeh, "Timing-driven placement using design hierarchy guided constraint generation," in Proc. IEEE/ACM ICCAD, 2002.
[14] K. Rajagopal, T. Shaked, Y. Parasuram, T. Cao, A. Chowdhary, and B. Halpin, "Timing driven force directed placement with physical net constraints," in Proc. Int. Symp. Physical Design, Apr. 2003.
[15] H. Ren, D. Z. Pan, and D. Kung, "Sensitivity guided net weighting for placement driven synthesis," in Proc. Int. Symp. Physical Design, Apr. 2004.
[16] T. Kong, "A novel net weighting algorithm for timing-driven placement," in Proc. Int. Conf. Computer Aided Design, 2002.
[17] D. Brand, R. F. Damiano, L. P. P. P. van Ginneken, and A. D. Drumm, "In the driver's seat of BooleDozer," in Proc. ICCD, 1994.
[18] L. Stok, D. S. Kung, D. Brand, A. D. Drumm, L. N. Reddy, N. Hieter, D. J. Geiger, H. H. Chao, P. J. Osler, and A. J. Sullivan, "BooleDozer: Logic synthesis for ASICs," IBM J. Res. Dev., vol. 40, no. 4.
[19] W. Donath, P. Kudva, L. Stok, P. Villarrubia, L. Reddy, A. Sullivan, and K. Chakraborty, "Transformational placement and synthesis," in Proc. Design, Automation and Test in Europe, Mar.
[20] S. K. Karandikar, C. J. Alpert, M. C. Yildiz, P. G. Villarrubia, S. T. Quay, and T. Mahmud, "Fast electrical correction using resizing and buffering," in Proc. Asia and South Pacific Design Automation Conf.
[21] P. J. Osler, "Placement driven synthesis case studies on two sets of two chips: Hierarchical and flat," in Proc. ACM Int. Symp. Physical Design, 2004.
[22] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," in Proc. ACM/IEEE Design Automation Conf., 1997.
[23] G.-J. Nam, S. Reda, C. Alpert, P. Villarrubia, and A. Kahng, "A fast hierarchical quadratic placement algorithm," IEEE Trans.
CAD of ICs and Systems, vol. 25, no. 4, Apr.
[24] L. P. P. P. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay," in Proc. IEEE Int. Symp. Circuits and Systems, May 1990.
[25] Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi, "Making fast buffer insertion even faster via approximation techniques," in Proc. Asia and South Pacific Design Automation Conf., 2005.
[26] S. Hu, C. J. Alpert, J. Hu, S. K. Karandikar, Z. Li, W. Shi, and C. N. Sze, "Fast algorithms for slew constrained minimum cost buffering," in Proc. ACM/IEEE Design Automation Conf., 2006.
[27] C. J. Alpert, M. Hrkic, J. Hu, and S. T. Quay, "Fast and flexible buffer trees that navigate the physical layout environment," in Proc. ACM/IEEE Design Automation Conf., 2004.
[28] H. Ren, D. Z. Pan, C. J. Alpert, and P. Villarrubia, "Diffusion-based placement migration," in Proc. Design Automation Conf., 2005.
[29] W.-J. Sun and C. Sechen, "Efficient and effective placement for very large circuits," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 14, no. 5, May.
[30] C. J. Alpert, J.-H. Huang, and A. B. Kahng, "Multilevel circuit partitioning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 8, Aug.
[31] A. E. Caldwell, A. B. Kahng, and I. L. Markov, "Can recursive bisection alone produce routable placements?" in Proc. Design Automation Conf., 2000.
[32] A. Agnihotri, M. C. Yildiz, A. Khatkhate, A. Mathur, S. Ono, and P. H. Madden, "Fractional cut: Improved recursive bisection placement," in Proc. Int. Conf. Computer Aided Design, 2003.
[33] M. Wang, X. Yang, and M. Sarrafzadeh, "Dragon2000: Standard-cell placement tool for large industry circuits," in Proc. Int. Conf. Computer-Aided Design, 2000.
[34] H. Eisenmann and F. M. Johannes, "Generic global placement and floorplanning," in Proc. ACM/IEEE Design Automation Conf., 1998.
[35] P. Spindler and F. M.
Johannes, "Fast and robust quadratic placement combined with an exact linear net model," presented at the IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA.
[36] J. Vygen, "Algorithms for large-scale flat placement," in Proc. ACM/IEEE Design Automation Conf., 1997.
[37] D.-H. Huang and A. B. Kahng, "Partitioning based standard cell global placement with an exact objective," in Proc. ACM Int. Symp. Physical Design, 1997.
[38] C. J. Alpert and A. B. Kahng, "Recent developments in netlist partitioning: A survey," Integr. VLSI J., vol. 19, pp. 1-81.
[39] C. J. Alpert, G. Gandham, M. Hrkic, J. Hu, A. B. Kahng, J. Lillis, B. Liu, S. T. Quay, S. S. Sapatnekar, and A. J. Sullivan, "Buffered Steiner trees for difficult instances," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 1, pp. 3-14, Jan.
[40] J. Cong, A. Kahng, and K. Leung, "Efficient algorithm for the minimum shortest path Steiner arborescence problem with application to VLSI physical design," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 1, Jan.
[41] J. Lillis, C. K. Cheng, and T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, Mar.
[42] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion for noise and delay optimization," in Proc. ACM/IEEE Design Automation Conf., 1998.
[43] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion with accurate gate and interconnect delay computation," in Proc. ACM/IEEE Design Automation Conf., 1999.
[44] W. Shi and Z. Li, "An O(n log n) time algorithm for optimal buffer insertion," in Proc. IEEE/ACM Design Automation Conf., 2003.
[45] W. Shi, Z. Li, and C. J. Alpert, "Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost," in Proc. Asia and South Pacific Design Automation Conf., 2004.
[46] C. J. Alpert, R. G. Gandham, J. L. Neves, and S. T.
Quay, "Buffer library selection," in Proc. ICCD, 2000.
[47] J. Lillis, C. K. Cheng, and T.-T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE Trans. Solid-State Circuits, vol. 31, no. 3, Mar.
[48] C. Kashyap, C. Alpert, F. Liu, and A. Devgan, "Closed form expressions for extending step delay and slew metrics to ramp inputs," in Proc. Int. Symp. Physical Design (ISPD), 2003.
[49] H. Bakoglu, Circuits, Interconnects, and Packaging for VLSI. Reading, MA: Addison-Wesley.
[50] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1993.
[51] M. Hrkic and J. Lillis, "S-tree: A technique for buffered routing tree synthesis," in Proc. ACM/IEEE Design Automation Conf., 2002.
[52] X. Tang, R. Tian, H. Xiang, and D. F. Wong, "A new algorithm for routing tree construction with buffer insertion and wire sizing under obstacle constraints," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2001.
[53] C. J. Alpert, J. Hu, S. S. Sapatnekar, and C. N. Sze, "Accurate estimation of global buffer delay within a floorplan," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2004.

[54] U. Brenner, A. Pauli, and J. Vygen, "Almost optimum placement legalization by minimum cost flow and dynamic programming," in Proc. Int. Symp. Physical Design, 2004.
[55] S. W. Hur and J. Lillis, "Mongrel: Hybrid techniques for standard cell placement," in Proc. Int. Conf. Computer-Aided Design, 2000.
[56] A. B. Kahng, P. Tucker, and A. Zelikovsky, "Optimization of linear placements for wirelength minimization with free sites," in Proc. Asia and South Pacific Design Automation Conf., 1999.
[57] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C++. Cambridge, U.K.: Cambridge Univ. Press.
[58] H. Ren, D. Pan, C. Alpert, G.-J. Nam, and P. G. Villarrubia, "Hippocrates: First-do-no-harm detailed placement," presented at the Asia and South Pacific Design Automation Conf., Yokohama, Japan.

ABOUT THE AUTHORS

Charles J. Alpert (Fellow, IEEE) received the B.S. degree in math and computational sciences and the B.A. degree in history from Stanford University, Stanford, CA, in 1991, and the Ph.D. degree in computer science from the University of California, Los Angeles (UCLA). He currently works as a Research Staff Member at the IBM Austin Research Laboratory, Austin, TX, where he serves as the technical lead for the design tools group. He has over 80 conference and journal publications. His research centers upon innovation in physical synthesis optimization. Dr. Alpert has thrice received the Best Paper Award from the ACM/IEEE Design Automation Conference. He has served as the general chair and the technical program chair for the Tau Workshop on Timing Issues in the Specification and Synthesis of Digital Systems and the International Symposium on Physical Design. He also serves as an Associate Editor of the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN. For his work in mentoring SRC-funded research, he received the Mahboob Khan Mentor Award.

Gi-Joon Nam (Member, IEEE) received the B.S.
degree in computer engineering from Seoul National University, Seoul, Korea, and the M.S. and Ph.D. degrees in computer science and engineering from the University of Michigan, Ann Arbor. Since 2001, he has been with the International Business Machines Corporation Austin Research Laboratory, Austin, TX, where he works primarily in the physical design space, particularly placement and timing closure flow. His general interests include computer-aided design algorithms, combinatorial optimization, very large scale integration system designs, and computer architecture. Dr. Nam has been serving on the technical program committees of the International Symposium on Physical Design (ISPD), the International Conference on Computer Design (ICCD), the Asia and South Pacific Design Automation Conference (ASPDAC), and the International System-on-Chip Conference (SOCC). He was also the organizer of the ISPD 2005/2006 placement contests.

Shrirang K. Karandikar received the B.E. degree from the University of Pune, Pune, India, in 1994, the M.S. degree from Clarkson University, Potsdam, NY, in 1996, and the Ph.D. degree from the University of Minnesota, Minneapolis. He worked with Intel's Logic and Validation Technology group from 1997 to 1999, and is currently a Research Staff Member at the IBM Austin Research Laboratory. His current interests are in the areas of logic synthesis and physical design of VLSI systems.

Stephen T. Quay received two B.S. degrees, in electrical engineering and computer science, from Washington University, St. Louis, MO. He is currently a Senior Engineer with the IBM Systems and Technology Group, Austin, TX. Since 1983, he has worked in many areas of chip layout and analysis for IBM. He currently develops design automation applications for interconnect performance optimization.

Zhuo Li (Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Xi'an Jiaotong University, Xi'an, China, and the Ph.D.
degree in computer engineering from Texas A&M University, College Station, in 1998, 2001, and 2005, respectively. From 2005 to 2006, he was with Pextra Corporation, College Station, as a Cofounder and Senior Technical Staff member working on VLSI extraction tool development. He is currently with the IBM Austin Research Laboratory, Austin, TX. His research interests include physical synthesis optimization, parasitic extraction, circuit modeling and simulation, timing analysis, and delay testing. Dr. Li was a recipient of an Applied Materials Fellowship. He received a Best Paper Award at the Asia and South Pacific Design Automation Conference.

Haoxing Ren (Member, IEEE) received the B.S. and M.Eng. degrees in electrical engineering from Shanghai Jiao Tong University, China, in 1996 and 1999, respectively, the M.S. degree in computer engineering from Rensselaer Polytechnic Institute, Troy, NY, in 2000, and the Ph.D. degree in computer engineering from the University of Texas, Austin. He worked at the IBM Systems and Technology Group from 2000. Currently he is a Research Staff Member at the IBM T. J. Watson Research Center. His research interests include logic synthesis and physical design of VLSI systems.

C. N. Sze (Member, IEEE) received the B.Eng. and M.Phil. degrees from the Department of Computer Science and Engineering, the Chinese University of Hong Kong, in 1999 and 2001, respectively, and the Ph.D. degree in computer engineering from the Department of Electrical Engineering, Texas A&M University, College Station. Since then, he has been with the IBM Austin Research Laboratory, Austin, TX, where he focuses on integrated placement and timing optimization for ASIC and microprocessor designs. He was a recipient of a DAC Graduate Scholarship. His research interests include design and analysis of algorithms, computer-aided design techniques for very large scale integration, physical design, and performance-driven interconnect synthesis. He is also known by the names Chin-Ngai Sze and Cliff Sze.

Mehmet C. Yildiz (Member, IEEE) received the B.S. degree in computer engineering from Marmara University, Turkey, in 1995, the M.S. degree in computer science from Yeditepe University, Turkey, in 1998, and the Ph.D. degree from Binghamton University, Binghamton, NY. He worked at the IBM Austin Research Laboratory as a Postdoc for two years. He currently works as an Advisory Software Engineer in IBM EDA, Austin, TX. He has more than ten conference and journal papers. His work is currently focused on clock tree routing. Dr. Yildiz serves as an Advisory Board Member of SIGDA, responsible for the Web server.

Paul G. Villarrubia received the B.S. degree in electrical engineering from Louisiana State University, Baton Rouge, in 1981, and the M.S. degree from the University of Texas, Austin. He is currently a Senior Technical Staff Member at IBM, Austin, where he leads the development of placement and timing closure tools. He has worked at IBM in the areas of physical design of microprocessors, physical design tools development, and tools development for ASIC timing closure. His interests include placement, synthesis, buffering, signal integrity, and extraction.
He has 21 patents and 20 publications. Mr. Villarrubia has one DAC Best Paper Award. He was a member of the 2005 ICCAD technical program committee and an invited speaker at the 2002 and 2004 ISPD conferences.


More information

ECE260B CSE241A Winter Placement

ECE260B CSE241A Winter Placement ECE260B CSE241A Winter 2005 Placement Website: / courses/ ece260b- w05 ECE260B CSE241A Placement.1 Slides courtesy of Prof. Andrew B. Slides courtesy of Prof. Andrew B. Kahng VLSI Design Flow and Physical

More information

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced?

!! What is virtual memory and when is it useful? !! What is demand paging? !! When should pages in memory be replaced? Chapter 10: Virtual Memory Questions? CSCI [4 6] 730 Operating Systems Virtual Memory!! What is virtual memory and when is it useful?!! What is demand paging?!! When should pages in memory be replaced?!!

More information

The Partitioning Problem

The Partitioning Problem The Partitioning Problem 1. Iterative Improvement The partitioning problem is the problem of breaking a circuit into two subcircuits. Like many problems in VLSI design automation, we will solve this problem

More information

Planning for Local Net Congestion in Global Routing

Planning for Local Net Congestion in Global Routing Planning for Local Net Congestion in Global Routing Hamid Shojaei, Azadeh Davoodi, and Jeffrey Linderoth* Department of Electrical and Computer Engineering *Department of Industrial and Systems Engineering

More information

ITOP: Integrating Timing Optimization within Placement

ITOP: Integrating Timing Optimization within Placement ITOP: Integrating Timing Optimization within Placement Natarajan Viswanathan, Gi-Joon Nam, Jarrod A. Roy, Zhuo Li, Charles J. Alpert, Shyam Ramji, Chris Chu IBM Austin Research Laboratory, 11501 Burnet

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Digital VLSI Design. Lecture 7: Placement

Digital VLSI Design. Lecture 7: Placement Digital VLSI Design Lecture 7: Placement Semester A, 2016-17 Lecturer: Dr. Adam Teman 29 December 2016 Disclaimer: This course was prepared, in its entirety, by Adam Teman. Many materials were copied from

More information

Preclass Warmup. ESE535: Electronic Design Automation. Motivation (1) Today. Bisection Width. Motivation (2)

Preclass Warmup. ESE535: Electronic Design Automation. Motivation (1) Today. Bisection Width. Motivation (2) ESE535: Electronic Design Automation Preclass Warmup What cut size were you able to achieve? Day 4: January 28, 25 Partitioning (Intro, KLFM) 2 Partitioning why important Today Can be used as tool at many

More information

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape

3D systems-on-chip. A clever partitioning of circuits to improve area, cost, power and performance. The 3D technology landscape Edition April 2017 Semiconductor technology & processing 3D systems-on-chip A clever partitioning of circuits to improve area, cost, power and performance. In recent years, the technology of 3D integration

More information

A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM

A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 09, 2016 ISSN (online): 2321-0613 A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM Yogit

More information

Switched Network Latency Problems Solved

Switched Network Latency Problems Solved 1 Switched Network Latency Problems Solved A Lightfleet Whitepaper by the Lightfleet Technical Staff Overview The biggest limiter to network performance is the control plane the array of processors and

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

Placement Algorithm for FPGA Circuits

Placement Algorithm for FPGA Circuits Placement Algorithm for FPGA Circuits ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,

More information

Linking Layout to Logic Synthesis: A Unification-Based Approach

Linking Layout to Logic Synthesis: A Unification-Based Approach Linking Layout to Logic Synthesis: A Unification-Based Approach Massoud Pedram Department of EE-Systems University of Southern California Los Angeles, CA February 1998 Outline Introduction Technology and

More information

CHAPTER 1 INTRODUCTION. equipment. Almost every digital appliance, like computer, camera, music player or

CHAPTER 1 INTRODUCTION. equipment. Almost every digital appliance, like computer, camera, music player or 1 CHAPTER 1 INTRODUCTION 1.1. Overview In the modern time, integrated circuit (chip) is widely applied in the electronic equipment. Almost every digital appliance, like computer, camera, music player or

More information

ECO-system: Embracing the Change in Placement

ECO-system: Embracing the Change in Placement Motivation ECO-system: Embracing the Change in Placement Jarrod A. Roy and Igor L. Markov University of Michigan at Ann Arbor Cong and Sarrafzadeh: state-of-the-art incremental placement techniques unfocused

More information

A Hierarchical Bin-Based Legalizer for Standard-Cell Designs with Minimal Disturbance

A Hierarchical Bin-Based Legalizer for Standard-Cell Designs with Minimal Disturbance A Hierarchical Bin-Based Legalizer for Standard- Designs with Minimal Disturbance Yu-Min Lee, Tsung-You Wu, and Po-Yi Chiang Department of Electrical Engineering National Chiao Tung University ASPDAC,

More information

On GPU Bus Power Reduction with 3D IC Technologies

On GPU Bus Power Reduction with 3D IC Technologies On GPU Bus Power Reduction with 3D Technologies Young-Joon Lee and Sung Kyu Lim School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA yjlee@gatech.edu, limsk@ece.gatech.edu Abstract The

More information

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation

Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation Trace Signal Selection to Enhance Timing and Logic Visibility in Post-Silicon Validation Hamid Shojaei, and Azadeh Davoodi University of Wisconsin 1415 Engineering Drive, Madison WI 53706 Email: {shojaei,

More information

Design Compiler Graphical Create a Better Starting Point for Faster Physical Implementation

Design Compiler Graphical Create a Better Starting Point for Faster Physical Implementation Datasheet Create a Better Starting Point for Faster Physical Implementation Overview Continuing the trend of delivering innovative synthesis technology, Design Compiler Graphical streamlines the flow for

More information

Place and Route for FPGAs

Place and Route for FPGAs Place and Route for FPGAs 1 FPGA CAD Flow Circuit description (VHDL, schematic,...) Synthesize to logic blocks Place logic blocks in FPGA Physical design Route connections between logic blocks FPGA programming

More information

Time Algorithm for Optimal Buffer Insertion with b Buffer Types *

Time Algorithm for Optimal Buffer Insertion with b Buffer Types * An Time Algorithm for Optimal Buffer Insertion with b Buffer Types * Zhuo Dept. of Electrical Engineering Texas University College Station, Texas 77843, USA. zhuoli@ee.tamu.edu Weiping Shi Dept. of Electrical

More information

CAD Algorithms. Circuit Partitioning

CAD Algorithms. Circuit Partitioning CAD Algorithms Partitioning Mohammad Tehranipoor ECE Department 13 October 2008 1 Circuit Partitioning Partitioning: The process of decomposing a circuit/system into smaller subcircuits/subsystems, which

More information

Chapter 28: Buffering in the Layout Environment

Chapter 28: Buffering in the Layout Environment Chapter 28: Buffering in the Layout Environment Jiang Hu, and C. N. Sze 1 Introduction Chapters 26 and 27 presented buffering algorithms where the buffering problem was isolated from the general problem

More information

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION MODULAR PARTITIONING FOR INCREMENTAL COMPILATION Mehrdad Eslami Dehkordi, Stephen D. Brown Dept. of Electrical and Computer Engineering University of Toronto, Toronto, Canada email: {eslami,brown}@eecg.utoronto.ca

More information

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE

DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE WHITEPAPER DELL EMC DATA DOMAIN SISL SCALING ARCHITECTURE A Detailed Review ABSTRACT While tape has been the dominant storage medium for data protection for decades because of its low cost, it is steadily

More information

UNIVERSITY OF CALIFORNIA College of Engineering Department of Electrical Engineering and Computer Sciences Lab #2: Layout and Simulation

UNIVERSITY OF CALIFORNIA College of Engineering Department of Electrical Engineering and Computer Sciences Lab #2: Layout and Simulation UNIVERSITY OF CALIFORNIA College of Engineering Department of Electrical Engineering and Computer Sciences Lab #2: Layout and Simulation NTU IC541CA 1 Assumed Knowledge This lab assumes use of the Electric

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

CAD Algorithms. Placement and Floorplanning

CAD Algorithms. Placement and Floorplanning CAD Algorithms Placement Mohammad Tehranipoor ECE Department 4 November 2008 1 Placement and Floorplanning Layout maps the structural representation of circuit into a physical representation Physical representation:

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Two Efficient Algorithms for VLSI Floorplanning. Chris Holmes Peter Sassone

Two Efficient Algorithms for VLSI Floorplanning. Chris Holmes Peter Sassone Two Efficient Algorithms for VLSI Floorplanning Chris Holmes Peter Sassone ECE 8823A July 26, 2002 1 Table of Contents 1. Introduction 2. Traditional Annealing 3. Enhanced Annealing 4. Contiguous Placement

More information

Chapter 5 Global Routing

Chapter 5 Global Routing Chapter 5 Global Routing 5. Introduction 5.2 Terminology and Definitions 5.3 Optimization Goals 5. Representations of Routing Regions 5.5 The Global Routing Flow 5.6 Single-Net Routing 5.6. Rectilinear

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Drexel University Electrical and Computer Engineering Department. ECEC 672 EDA for VLSI II. Statistical Static Timing Analysis Project

Drexel University Electrical and Computer Engineering Department. ECEC 672 EDA for VLSI II. Statistical Static Timing Analysis Project Drexel University Electrical and Computer Engineering Department ECEC 672 EDA for VLSI II Statistical Static Timing Analysis Project Andrew Sauber, Julian Kemmerer Implementation Outline Our implementation

More information

Fast Dual-V dd Buffering Based on Interconnect Prediction and Sampling

Fast Dual-V dd Buffering Based on Interconnect Prediction and Sampling Based on Interconnect Prediction and Sampling Yu Hu King Ho Tam Tom Tong Jing Lei He Electrical Engineering Department University of California at Los Angeles System Level Interconnect Prediction (SLIP),

More information

Effects of FPGA Architecture on FPGA Routing

Effects of FPGA Architecture on FPGA Routing Effects of FPGA Architecture on FPGA Routing Stephen Trimberger Xilinx, Inc. 2100 Logic Drive San Jose, CA 95124 USA steve@xilinx.com Abstract Although many traditional Mask Programmed Gate Array (MPGA)

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments 8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments QII51017-9.0.0 Introduction The Quartus II incremental compilation feature allows you to partition a design, compile partitions

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

A Path Based Algorithm for Timing Driven. Logic Replication in FPGA

A Path Based Algorithm for Timing Driven. Logic Replication in FPGA A Path Based Algorithm for Timing Driven Logic Replication in FPGA By Giancarlo Beraudo B.S., Politecnico di Torino, Torino, 2001 THESIS Submitted as partial fulfillment of the requirements for the degree

More information

A Hybrid Recursive Multi-Way Number Partitioning Algorithm

A Hybrid Recursive Multi-Way Number Partitioning Algorithm Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Hybrid Recursive Multi-Way Number Partitioning Algorithm Richard E. Korf Computer Science Department University

More information

Algorithms in Systems Engineering IE172. Midterm Review. Dr. Ted Ralphs

Algorithms in Systems Engineering IE172. Midterm Review. Dr. Ted Ralphs Algorithms in Systems Engineering IE172 Midterm Review Dr. Ted Ralphs IE172 Midterm Review 1 Textbook Sections Covered on Midterm Chapters 1-5 IE172 Review: Algorithms and Programming 2 Introduction to

More information

Lattice Semiconductor Design Floorplanning

Lattice Semiconductor Design Floorplanning September 2012 Introduction Technical Note TN1010 Lattice Semiconductor s isplever software, together with Lattice Semiconductor s catalog of programmable devices, provides options to help meet design

More information

GDSII to OASIS Converter Performance and Analysis

GDSII to OASIS Converter Performance and Analysis GDSII to OASIS Converter Performance and Analysis 1 Introduction Nageswara Rao G 8 November 2004 For more than three decades GDSII has been the de-facto standard format for layout design data. But for

More information

Pseudopin Assignment with Crosstalk Noise Control

Pseudopin Assignment with Crosstalk Noise Control 598 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 20, NO. 5, MAY 2001 Pseudopin Assignment with Crosstalk Noise Control Chin-Chih Chang and Jason Cong, Fellow, IEEE

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

A Performance Driven Placement System. Using an Integrated Timing Analysis Engine. Master of Science. Shaun K. Peter

A Performance Driven Placement System. Using an Integrated Timing Analysis Engine. Master of Science. Shaun K. Peter A Performance Driven Placement System Using an Integrated Timing Analysis Engine A Thesis Submitted to the Graduate School of University of Cincinnati in partial fulfillment of the requirements for the

More information

APlace: A High Quality, Large-Scale Analytical Placer

APlace: A High Quality, Large-Scale Analytical Placer 7 APlace: A High Quality, Large-Scale Analytical Placer Andrew B. Kahng 1, Sherief Reda 2 and Qinke Wang 1 1 Univeristy of California, San Diego 2 Brown University, Division of Engineering abk, qinke @cs.ucsd.edu,

More information

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Xin-Wei Shih, Tzu-Hsuan Hsu, Hsu-Chieh Lee, Yao-Wen Chang, Kai-Yuan Chao 2013.01.24 1 Outline 2 Clock Network Synthesis Clock network

More information

Iterative-Constructive Standard Cell Placer for High Speed and Low Power

Iterative-Constructive Standard Cell Placer for High Speed and Low Power Iterative-Constructive Standard Cell Placer for High Speed and Low Power Sungjae Kim and Eugene Shragowitz Department of Computer Science and Engineering University of Minnesota, Minneapolis, MN 55455

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Double Patterning-Aware Detailed Routing with Mask Usage Balancing

Double Patterning-Aware Detailed Routing with Mask Usage Balancing Double Patterning-Aware Detailed Routing with Mask Usage Balancing Seong-I Lei Department of Computer Science National Tsing Hua University HsinChu, Taiwan Email: d9762804@oz.nthu.edu.tw Chris Chu Department

More information

Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning

Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis, MN 55455 Technical Report

More information

AMS Behavioral Modeling

AMS Behavioral Modeling CHAPTER 3 AMS Behavioral Modeling Ronald S. Vogelsong, Ph.D. Overview Analog designers have for many decades developed their design using a Bottom-Up design flow. First, they would gain the necessary understanding

More information

Floorplan Management: Incremental Placement for Gate Sizing and Buffer Insertion

Floorplan Management: Incremental Placement for Gate Sizing and Buffer Insertion Floorplan Management: Incremental Placement for Gate Sizing and Buffer Insertion Chen Li, Cheng-Kok Koh School of ECE, Purdue University West Lafayette, IN 47907, USA {li35, chengkok}@ecn.purdue.edu Patrick

More information

Divisibility Rules and Their Explanations

Divisibility Rules and Their Explanations Divisibility Rules and Their Explanations Increase Your Number Sense These divisibility rules apply to determining the divisibility of a positive integer (1, 2, 3, ) by another positive integer or 0 (although

More information

Automated Extraction of Physical Hierarchies for Performance Improvement on Programmable Logic Devices

Automated Extraction of Physical Hierarchies for Performance Improvement on Programmable Logic Devices Automated Extraction of Physical Hierarchies for Performance Improvement on Programmable Logic Devices Deshanand P. Singh Altera Corporation dsingh@altera.com Terry P. Borer Altera Corporation tborer@altera.com

More information

Section IV: Timing Closure Techniques. June 2002 DAC02 Physical Chip Implementation 1

Section IV: Timing Closure Techniques. June 2002 DAC02 Physical Chip Implementation 1 Section IV: Timing Closure Techniques June 22 DAC2 Physical Chip Implementation IBM Contributions to this presentation include: T.J. Watson Research Center Austin Research Lab ASIC Design Centers EDA Organization

More information

On Using Machine Learning for Logic BIST

On Using Machine Learning for Logic BIST On Using Machine Learning for Logic BIST Christophe FAGOT Patrick GIRARD Christian LANDRAULT Laboratoire d Informatique de Robotique et de Microélectronique de Montpellier, UMR 5506 UNIVERSITE MONTPELLIER

More information

Problem Formulation. Specialized algorithms are required for clock (and power nets) due to strict specifications for routing such nets.

Problem Formulation. Specialized algorithms are required for clock (and power nets) due to strict specifications for routing such nets. Clock Routing Problem Formulation Specialized algorithms are required for clock (and power nets) due to strict specifications for routing such nets. Better to develop specialized routers for these nets.

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Physical Design of Digital Integrated Circuits (EN0291 S40) Sherief Reda Division of Engineering, Brown University Fall 2006

Physical Design of Digital Integrated Circuits (EN0291 S40) Sherief Reda Division of Engineering, Brown University Fall 2006 Physical Design of Digital Integrated Circuits (EN0291 S40) Sherief Reda Division of Engineering, Brown University Fall 2006 1 Lecture 10: Repeater (Buffer) Insertion Introduction to Buffering Buffer Insertion

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Routing. Robust Channel Router. Figures taken from S. Gerez, Algorithms for VLSI Design Automation, Wiley, 1998

Routing. Robust Channel Router. Figures taken from S. Gerez, Algorithms for VLSI Design Automation, Wiley, 1998 Routing Robust Channel Router Figures taken from S. Gerez, Algorithms for VLSI Design Automation, Wiley, 1998 Channel Routing Algorithms Previous algorithms we considered only work when one of the types

More information

SPM Users Guide. This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more.

SPM Users Guide. This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more. SPM Users Guide Model Compression via ISLE and RuleLearner This guide elaborates on powerful ways to combine the TreeNet and GPS engines to achieve model compression and more. Title: Model Compression

More information

Bumptrees for Efficient Function, Constraint, and Classification Learning

Bumptrees for Efficient Function, Constraint, and Classification Learning umptrees for Efficient Function, Constraint, and Classification Learning Stephen M. Omohundro International Computer Science Institute 1947 Center Street, Suite 600 erkeley, California 94704 Abstract A

More information

Genetic Algorithm for Circuit Partitioning

Genetic Algorithm for Circuit Partitioning Genetic Algorithm for Circuit Partitioning ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,

More information

Can Recursive Bisection Alone Produce Routable Placements?

Can Recursive Bisection Alone Produce Routable Placements? Supported by Cadence Can Recursive Bisection Alone Produce Routable Placements? Andrew E. Caldwell Andrew B. Kahng Igor L. Markov http://vlsicad.cs.ucla.edu Outline l Routability and the placement context

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Visual Design Flows for Faster Debug and Time to Market FlowTracer White Paper

Visual Design Flows for Faster Debug and Time to Market FlowTracer White Paper Visual Design Flows for Faster Debug and Time to Market FlowTracer White Paper 2560 Mission College Blvd., Suite 130 Santa Clara, CA 95054 (408) 492-0940 Introduction As System-on-Chip (SoC) designs have

More information

Silicon Virtual Prototyping: The New Cockpit for Nanometer Chip Design

Silicon Virtual Prototyping: The New Cockpit for Nanometer Chip Design Silicon Virtual Prototyping: The New Cockpit for Nanometer Chip Design Wei-Jin Dai, Dennis Huang, Chin-Chih Chang, Michel Courtoy Cadence Design Systems, Inc. Abstract A design methodology for the implementation

More information

Is Power State Table Golden?

Is Power State Table Golden? Is Power State Table Golden? Harsha Vardhan #1, Ankush Bagotra #2, Neha Bajaj #3 # Synopsys India Pvt. Ltd Bangalore, India 1 dhv@synopsys.com 2 ankushb@synopsys.com 3 nehab@synopsys.com Abstract: Independent

More information

Circuit Placement: 2000-Caldwell,Kahng,Markov; 2002-Kennings,Markov; 2006-Kennings,Vorwerk

Circuit Placement: 2000-Caldwell,Kahng,Markov; 2002-Kennings,Markov; 2006-Kennings,Vorwerk Circuit Placement: 2000-Caldwell,Kahng,Markov; 2002-Kennings,Markov; 2006-Kennings,Vorwerk Andrew A. Kennings, Univ. of Waterloo, Canada, http://gibbon.uwaterloo.ca/ akenning/ Igor L. Markov, Univ. of

More information