Overcoming Wireload Model Uncertainty During Physical Design


Padmini Gopalakrishnan, Altan Odabasioglu, Lawrence Pileggi, Salil Raje
Monterey Design Systems, 894 Ross Drive, Suite, Sunnyvale, CA

ABSTRACT

The advent of deep sub-micron technologies has created a number of problems for existing design methodologies. Most prominent among them is the problem of timing closure, whereby design time is dramatically increased due to iterations between gate-level synthesis and physical design. It is well known that the heart of this problem lies in the use of wireload models based on wirelength statistics from legacy designs. Some technology projections in [3] have suggested that wireload models will remain effective to block sizes on the order of 5k gates. This suggests that synthesis will not have to be changed much, since this is approximately the maximum size for which logic synthesis is effective. However, our analyses on production designs show that the problem is not quite so straightforward, and that the efficacy of synthesis using wireload models depends upon technology data as well as specific characteristics of the design. We analyze these effects and dependencies in detail in this paper, and draw some conclusions about the amount of physical information that is required for synthesis to be effective. Finally, we discuss the implications for hierarchical design flows, and propose a solution via physical prototyping.

1 INTRODUCTION

Until deep sub-micron (DSM) issues began to surface, design methodologies for synthesis and logic optimization were decoupled from placement and routing. Prior to physical design, wireload models based on statistical information from design legacy [4] were used to provide gate load models during logic optimization.
For pre-DSM technologies the error associated with wireload estimates of interconnect capacitance had very little impact on the actual delays, since the device load capacitances dominated the total net capacitance. However, as interconnect capacitance became more dominant at and below .25 microns, designers were forced to iterate, feeding back interconnect information from place and route to redo gate-level logic optimization [2]. Unfortunately, this loop has no guarantee of convergence, since the re-optimized netlist could result in a different place and route solution, with new values for the interconnect capacitances and resistances.

As we will demonstrate with several examples from industrial designs, wireload models are always inaccurate in a relative sense, even under the best of circumstances [5]. Whether or not they are acceptable in an absolute sense depends on the ratio of interconnect to device capacitance, and on the criticality of the paths on which they lie. As process technologies scale, the impact of interconnect on delay becomes more prominent, and in some cases (such as for global nets) so dominant that it must be carefully considered as part of the micro-architectural decision process.

Clearly the trend is not just to estimate interconnect effects more accurately, but to do so effectively as early in the design flow as possible. These are somewhat conflicting requirements. The accuracy of interconnect estimation depends on the resolution of physical (e.g. placement) information, which improves at late stages in the design flow. This raises the question of a suitable middle ground: a point in the design flow at which early interconnect estimation provides acceptable accuracy in spite of the physical uncertainties. Studies such as [3] suggested that this point corresponds to a block size of 5k gates. Our results in this paper, however, show through a series of experiments that the conclusions are not quite so simple. We find that wireload model efficacy is strongly dependent on technology parameters and on specific characteristics of the netlist topology and the floorplan. Understanding these dependencies has significant implications for understanding and practising block-based and hierarchical design. Following our detailed analyses of the limitations of wireload models, we discuss some of these implications and propose some solutions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISPD', April -4, 2, Sonoma, California, USA. Copyright 2 ACM //4.$5..

2 WIRELOAD MODELS

The relative impact of interconnect and device capacitances on delay determines whether or not wireload estimation is acceptable. To illustrate this point we performed the following experiments on production designs. Given the detailed placement of a finalized gate-level netlist, we divided the chip into rectangular partitions of equal dimensions and area. Each partition represents a block of gates; thus any given block size represents a level of granularity in the placement. One can think of the nets within a partition as local nets and the nets going across partitions as global nets. Figure 1 shows this setup of a detailed placement partitioned into blocks, including an example of a local net.

Figure 1. Placement partitioned into blocks. (Label in figure: local net.)

We then analyzed the delays over different partition sizes, using various approximations for the local interconnect capacitances. The objective was to analyze the impact on the stage delays of estimating local interconnect via wireload models, given various levels of physical design resolution.
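The partitioning step described above can be sketched in a few lines. This is only an illustration under assumed data structures: the cell coordinates, net names, partition dimensions and the helper names `partition_index` and `classify_nets` are hypothetical and do not come from the paper's tool flow. A net is treated as local when all of its pins fall in the same partition.

```python
from collections import defaultdict

def partition_index(x, y, part_w, part_h):
    """Grid coordinates of the partition that contains point (x, y)."""
    return (int(x // part_w), int(y // part_h))

def classify_nets(cells, nets, part_w, part_h):
    """Split nets into local nets (all pins in one partition) and global nets.

    cells: {cell_name: (x, y)} detailed placement coordinates (hypothetical).
    nets:  {net_name: [cell_name, ...]} cells connected by each net.
    Returns ({partition: [local net names]}, [global net names]).
    """
    local = defaultdict(list)
    global_nets = []
    for name, pins in nets.items():
        parts = {partition_index(*cells[c], part_w, part_h) for c in pins}
        if len(parts) == 1:
            local[parts.pop()].append(name)
        else:
            global_nets.append(name)
    return dict(local), global_nets

# Tiny made-up placement: two cells share a partition, the third does not.
cells = {"u1": (10.0, 12.0), "u2": (40.0, 30.0), "u3": (130.0, 20.0)}
nets = {"n1": ["u1", "u2"], "n2": ["u2", "u3"]}
local, global_nets = classify_nets(cells, nets, part_w=100.0, part_h=100.0)
print(local, global_nets)
```

Re-running `classify_nets` with different `part_w` and `part_h` values mimics the sweep over partition sizes: larger partitions turn more nets into local nets.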

2.1 The Impact of Interconnect on Delay

To obtain some understanding of the impact of local net capacitance on delay, our first experiment was to consider local delays with and without interconnect. The delay profiles for each partition size are generated as follows. Every cell retains its detailed placement coordinates. The topology of a net is modeled by a Steiner tree approximation [6]. We applied a crude layer assignment algorithm to assign higher metal layers to longer nets and lower metal layers to shorter nets. For each net, we compute the worst case delay from an input pin of the net driver to one of its fanouts. We consider the following cases.

1. Assume that the entire load on the driver is due to pin capacitances, and that interconnect has no effect: this is delay d1.
2. Assume that the load on the driver is due to pin capacitances, as well as interconnect capacitance and resistance from the Steiner model: this is delay d2.

We then plotted the distribution of the ratio (d1/d2) over all the nets in the design that are completely within a partition (local nets). This ratio is always between 0 and 1. Smaller values of the ratio imply that interconnect has a significant impact on the worst case delay of the net; values close to 1 imply that device capacitances dominate. Since we consider nets that are completely within a partition, the bounding box of any such net must lie within the bounding box of the partition that encloses it.

Profiles for a .8 micron process

Here we show these distributions for an industrial design in a .8 micron process. The design has approximately 44k gates. In Figures 2-4 we show profiles for all 2 pin local nets over three different partition sizes. Note from the figures that for larger partition sizes there are more local nets. As one would expect, the wirelength distributions for these local nets show greater deviation from the mean for the larger partition sizes.
As a result, at larger partition sizes we can observe a larger number of nets where the ratio (d1/d2) is much less than 1. However, it is important to note that even at the smallest partition size that we considered, where each partition contains only 6 gates, we see some local nets with a ratio as small as .65. If such a net lies on a critical path, the wireload model error associated with it can easily result in a failure to achieve timing closure.

Figure 2. Partition size is roughly 34 x 28 sq. microns, corresponding to approximately 7k gates per partition. (x-axis: (stage-delay w/o interconnect)/(stage-delay with interconnect))

Figure 3. Partition size is roughly 22 x 8 sq. microns, corresponding to approximately 3 gates per partition. (x-axis: (stage-delay w/o interconnect)/(stage-delay with interconnect))

Figure 4. Partition size is roughly 6 x 5 sq. microns, corresponding to approximately 2 gates per partition. (x-axis: (stage-delay w/o interconnect)/(stage-delay with interconnect))

To better understand why interconnect dominates some nets more than others, we look at the following parameters for all nets that are local to a partition.

1. The ratio of the net length to the half perimeter of the rectangle that forms a partition, which we will refer to as r1. This ratio is a measure of the relative length of a net.
2. The ratio (d1/d2) described earlier in this section, which we will refer to as r2. As mentioned above, this is a measure of how dominant interconnect is for a net.

Nets with a low value of r1 and a high value of r2 are short nets for which interconnect does not significantly impact delay. Nets with a low value of r1 and a low value of r2 are short nets for which interconnect is dominant because the driver is weak. Nets with a high value of r1 and a high value of r2 are long nets, but generally strong drivers lessen the effect of the interconnect. Nets with a high value of r1 and a low value of r2 are the ones that fall into the category of interconnect dominated.
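The two ratios and the four net categories above can be illustrated with a small sketch. The paper does not specify its delay calculator, so the code below swaps in a simple lumped-RC (Elmore-style) stage delay with an intrinsic gate delay; that model, the threshold values `r1_split` and `r2_split`, and every numeric parameter are assumptions chosen only to make the categories visible.

```python
def stage_delays(t_int, r_drv, c_pins, length_um, r_per_um, c_per_um):
    """Return (d1, d2): stage delay without and with interconnect.

    Assumed model (not the paper's): intrinsic delay plus Elmore delay of a
    driver resistance feeding a single wire segment and the fanout pin load.
    Units are hypothetical but consistent: kOhm * fF = ps.
    """
    r_wire = r_per_um * length_um
    c_wire = c_per_um * length_um
    d1 = t_int + r_drv * c_pins
    d2 = t_int + r_drv * (c_wire + c_pins) + r_wire * (c_wire / 2.0 + c_pins)
    return d1, d2

def classify_net(net_length_um, part_w, part_h, d1, d2,
                 r1_split=0.5, r2_split=0.8):
    """Compute r1, r2 and place the net in one of the four categories."""
    r1 = net_length_um / (part_w + part_h)  # length relative to half perimeter
    r2 = d1 / d2                            # 1.0 means interconnect is invisible
    is_long = r1 >= r1_split
    wire_matters = r2 < r2_split
    if is_long:
        label = "long, interconnect-dominated" if wire_matters else "long, strong driver"
    else:
        label = "short, weak driver" if wire_matters else "short, device-dominated"
    return r1, r2, label

# Hypothetical wire parasitics and a 300 x 300 micron partition.
WIRE = {"r_per_um": 1e-4, "c_per_um": 0.2}
for length, r_drv in [(30.0, 0.5), (30.0, 4.0), (400.0, 0.5)]:
    d1, d2 = stage_delays(20.0, r_drv, 5.0, length, **WIRE)
    r1, r2, label = classify_net(length, 300.0, 300.0, d1, d2)
    print(f"{length:6.1f} um  r_drv={r_drv}  r1={r1:.2f}  r2={r2:.2f}  {label}")
```

Under these assumed numbers the same 30 micron net moves from the device-dominated category to the weak-driver category when only the driver resistance changes, which is the effect the text attributes to weak drivers.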
In Figures 5-7 below we show scatter plots of the local 2 pin nets profiled in Figures 2-4, with r2 on the x-axis and r1 on the y-axis. Note that the length of a net is the same, irrespective of the size of the partitions. However, its length as a fraction of partition size decreases as partition size increases. The value of r2 is a constant for a given net. The profiles show what we would expect: in general, the interconnect has a greater impact for longer nets. Some of the extreme cases that were observed, especially for the smallest partition size, are primarily attributable to weak drivers.

Figure 5. Partition size is roughly 34 x 28 sq. microns, corresponding to approximately 7k gates per partition. (x-axis: ratio r2; y-axis: ratio r1)

Figure 6. Partition size is roughly 22 x 8 sq. microns, corresponding to approximately 3 gates per partition. (x-axis: ratio r2; y-axis: ratio r1)

Figure 7. Partition size is roughly 6 x 5 sq. microns, corresponding to approximately 2 gates per partition. (x-axis: ratio r2; y-axis: ratio r1)

Since we use exact detailed placement coordinates for cells, but only a routing model for nets, the profiles shown in Section 2.1 present a best case picture from a routing perspective. Namely, routing obstacles and congestion, which can cause nets to be even longer, were not considered. There could potentially be further variation, since the detailed routes include exact layer assignments, meandering due to congestion, vias and jogs in the routes, and more precise capacitance and resistance values.

Profiles for a .25 micron technology

We now profile nets for an industrial design in a .25 micron technology that contains approximately 48k gates. As for the data in Figures 2-4, the delay calculation uses a Steiner tree to model net topology, and does a rough layer assignment based on net length. A profile for 2 pin local nets at a partition size of 3 x 3 sq. microns is shown in Figure 8. The level of placement granularity that it corresponds to is roughly the same (actually slightly coarser) as that of the .8 micron design shown in Figure 3. Comparing the two profiles, we can see that a larger percentage of the nets profiled here have a ratio (d1/d2) close to 1. Thus, for this design, errors in wireload estimates impact stage delays to a smaller degree.

Figure 8. Partition size is roughly 3 x 3 sq. microns, corresponding to approximately 3k gates per partition. (x-axis: (stage-delay w/o interconnect)/(stage-delay with interconnect))

2.2 A Perfect Wireload Model

In Section 2.1, we showed the error that would be incurred by completely ignoring the impact of the local interconnect for an industrial design. Next we consider the error incurred by using the best wireload model. In general, wireload models are assembled from statistical data over a population of designs. In this experiment we consider wireload statistics generated from exact data for our design under test. That is, we derive the wireload statistics from the actual detailed placement coordinates, which, while impractical for the general design problem, clearly represents a best case for the wireload modeling error. Even with this best case model, we can show that at some level of block size the error incurred is too large.

Starting with the detailed placement of the design under investigation, we divide the chip area into partitions as described in Section 2.1. For each partition size we generate a wireload model that estimates the interconnect length of a local net as a function of its pin-count. The wireload model is generated as follows. For a given partition size, we determine which nets are local. Then we generate a wirelength distribution for these local nets for each pin-count (i.e. we have one distribution for 2 pin nets, one distribution for 3 pin nets, and so on). As before, a net is modelled by a Steiner tree. We compute the worst case delay for each local net from a driver input to a fanout by substituting its actual length with the length estimated by the wireload model. In these initial experiments we used the mean, or average, wirelength as our wireload model predictor. The wireload model uses only these statistics from the detailed placement, and there is no motion across partitions or any kind of change in the netlist after the statistics are compiled. We consider only local nets, which are those nets fully contained within a partition. Therefore, for a given partition size, this wireload model represents the most accurate average prediction of wirelength that is possible as a function of only the pin-count of a net.

In Figures 9-11 we show scatter plots of 2 pin local nets at different partition sizes, similar to those shown without wireload models in Figures 2-4. The plots show the actual delay of a net along the x-axis and the wireload model predicted delay along the y-axis. The design example here is the same as in Section 2.1. From these plots we can see that there is a large difference between the actual and predicted delays at large partition sizes. The variation in the delays of 2 pin local nets is quite significant at large partition sizes, as shown in Figure 9. The correlation becomes better as partition size decreases, with the points clustering closer to the straight line x = y. As can be seen from Figure 11, the variation in the actual delays of these nets is much smaller too.

Figure 9. Partition size is roughly 34 x 28 sq. microns, corresponding to approximately 7k gates per partition. (x-axis: actual delay; y-axis: estimated delay)

Figure 10. Partition size is roughly 22 x 8 sq. microns, corresponding to approximately 3 gates per partition. (x-axis: actual delay; y-axis: estimated delay)

Figure 11. Partition size is roughly 6 x 5 sq. microns, corresponding to approximately 2 gates per partition. (x-axis: actual delay; y-axis: estimated delay)

To quantify the errors in estimation, we generate a distribution of the ratio of the estimated delay of a net to its actual delay, which we will refer to as r3. We show these distributions for 2 pin local nets at different partition sizes in Figures 12-14.

Figure 12. Partition size is roughly 34 x 28 sq. microns, corresponding to approximately 7k gates per partition.

Figure 13. Partition size is roughly 22 x 8 sq. microns, corresponding to approximately 3 gates per partition.

Figure 14. Partition size is roughly 6 x 5 sq. microns, corresponding to approximately 2 gates per partition.

Figures 12-14 clearly show the distribution getting narrower at smaller partition sizes and, as expected, the wireload estimate becoming more accurate. We can also see that there are partition sizes at which the error in estimation is very large. In other words, at these levels of placement granularity, the wireload model breaks down. Optimizations that are based on these estimates would be significantly in error. Further, we have shown earlier in this section that this is the best possible wireload model that could be found, so a wireload model based on design legacy statistics would in all likelihood be much worse. Moreover, given that the wireload model has significant error even with coarse placement information, it will have much greater error when used in gate-level synthesis, which is completely devoid of placement information.

Adding More Pessimism?

The obvious next question to ask is: what if we use a more pessimistic wireload model? For example, what if we use the mean plus one standard deviation of the distribution instead of just the mean? Figures 15-17 again show the distributions of the ratio of estimated delay to actual delay for 2 pin local nets at different partition sizes. Comparing these distributions with those from Figures 12-14, we can clearly observe that the means of the distributions shift to a greater value as a result of the increased pessimism. But figuring out how much to shift these estimates, without overdesigning, is a difficult problem. Moreover, too much of a shift can adversely impact the fast-path problem in terms of hold margins, which is becoming increasingly difficult with faster operating frequencies and shallower logic depths.

Figure 15. Partition size is roughly 34 x 28 sq. microns, corresponding to approximately 7k gates per partition.

Figure 16. Partition size is roughly 22 x 8 sq. microns, corresponding to approximately 3 gates per partition.

Figure 17. Partition size is roughly 6 x 5 sq. microns, corresponding to approximately 2 gates per partition.

3 DSM TECHNOLOGY IMPLICATIONS

From the data in the previous section it is apparent that predicting the impact of interconnect has become a challenge since we entered the DSM range for technologies. As expected, we see that errors in interconnect estimation have the greatest impact for large block sizes. Further, we also showed that even a non-causal wireload model that is based on the actual placement breaks down at large block sizes; hence using wireload models for gate-level synthesis is not meaningful. But what do we expect with further scaling of CMOS technologies? In general, we would expect things to get worse, but why, and by how much?

3.1 Increasing Interconnect Dominance

For pre-DSM technologies, shrinking device sizes were evidenced by improvement in switching speeds. This was primarily due to the increase in drive currents with reductions in channel length. As channel lengths reduce to less than .25 micron, however, the drive current remains more or less constant because of velocity saturation. Decreases in gate delays are, therefore, due mainly to reductions in gate oxide thickness. One would thus expect to see a slower rate of increase in device speeds with continued scaling [3].

At the same time, interconnect delays are increasing with scaling for two reasons. First, interconnect capacitance dominates the total net capacitance due to: a) increased routing densities that have led to shrinking wiring pitches; and b) aspect ratios that attempt to keep the resistance of these narrower wires constant. Both have resulted in an increase in the capacitance per unit length, particularly due to inter-layer coupling capacitance []. Second, since chip sizes are also growing, global wires are longer than ever before. As wire widths decrease with scaling to accommodate a greater density of routing, the interconnect resistance effects for these long wires start to become evident.
Via resistances also increase as processes scale, making long interconnect delays very dependent on detailed routing, layer assignment and the number of layer changes in the routes.

To study the trends of increasing interconnect dominance we consider a logic stage consisting of a NAND gate driving a net with a fixed length, layer assignment and capacitive load on its fanouts. We compute the worst stage delay to a fanout point with and without interconnect loading included. This is done for different driver sizes in process technologies at .25 and .8 microns respectively. The length of the net is approximately 38 microns; hence any contribution of interconnect delay to the stage delay is mostly due to capacitive rather than resistive effects. We measure the dominance of the interconnect by the ratio r of stage delay without interconnect to stage delay with interconnect. A smaller value of r indicates that interconnect delay dominates to a greater extent. Delays are computed assuming that this stage is driven by a close-by buffer which is driven by an input transition of . ns. The results of these measurements are compiled in Table 1 and Table 2.

Table 1. Dependence on driver sizes in .8 micron
Columns: Drive strength of driver | worst delay without interconnect | worst delay with interconnect | worst slope at fanout | ratio r
[Rows list driver strengths starting at .5x; the numeric entries did not survive extraction.]

Table 2. Dependence on driver sizes in .25 micron
Columns: Drive strength of driver | worst delay without interconnect | worst delay with interconnect | worst slope at fanout | ratio r
[Rows list driver strengths starting at .5x; the numeric entries did not survive extraction.]

We can see that for any given driver size the value of r is smaller in the .8 micron process, which measures the difference in interconnect dominance. As driver sizes increase for both technologies, we can see that interconnect delays are gradually swamped out, as shown by the asymptotic increase in the value of r. In Table 1 and Table 2 we also show the worst slopes to a fanout point.
It is easy to see that the driver with a drive strength of 2x gives the minimum delay for this stage and also has a reasonable slope at its output. Since we have assumed that the driver was driven by a close-by buffer, we can assume that upstream gates are shielded from any effect of sizing the NAND gate. This driver size therefore represents an optimal choice for this stage. It is important to note that the optimal point has a relatively low value of r. Thus, picking a driver size that would allow us to neglect the effect of interconnect for this stage is clearly sub-optimal from the point of view of performance, even for this local net for which only capacitive effects are evident. For global wires that are dominated by metal resistance as well, accounting for interconnect will be even more important.

Criticality of Layer Assignment

One implication of the increasing dominance of interconnect is that layer assignment can have a dramatic impact on the delay of a net. The extent of this varies from one technology to the next; some processes have somewhat balanced capacitances per layer, whereas others do not. We have computed stage delays with and without interconnect for the stage described in Section 3.1 by varying only the layer assignment of the interconnect. The length of the net considered is about 66.6 microns, hence both resistive and capacitive effects show up in the delay. Table 3 shows the capacitance per unit length in pF per micron (including both the lateral and fringe capacitances), and the resistance in ohms per square, for each layer considered.

Table 3. Interconnect Capacitances and Resistances
Columns: Metal layer | Capacitance (.8 um tech.) | Resistance (.8 um tech.) | Capacitance (.25 um tech.) | Resistance (.25 um tech.)
[Six metal-layer rows; the layer numbers and numeric entries did not survive extraction.]

We can see from Figure 18 that there are variations in stage delays as a function of routing layer assignment only.
This makes the problem of accurate interconnect estimation more complex, since the routing layer is difficult to predict prior to global routing.
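A rough sketch of the layer-assignment experiment is given below: the ratio r of stage delay without interconnect to stage delay with interconnect is tabulated per routing layer and per driver strength. None of the numbers come from Table 3, whose entries are not available here. The per-layer capacitance, sheet resistance and wire width, the unit driver resistance `R_UNIT`, the intrinsic delay and the lumped-RC delay model are all hypothetical placeholders.

```python
# Hypothetical per-layer parasitics: capacitance in fF per micron, sheet
# resistance in ohms per square, drawn wire width in microns.
LAYERS = {
    "metal1": {"c_per_um": 0.20, "r_sheet": 0.08, "width_um": 0.3},
    "metal3": {"c_per_um": 0.15, "r_sheet": 0.04, "width_um": 0.5},
}
R_UNIT = 2.0        # assumed resistance of a 1x driver, kOhm
T_INT = 20.0        # assumed intrinsic gate delay, ps
C_PINS = 5.0        # assumed fanout pin capacitance, fF
LENGTH_UM = 66.6    # net length used in the text

def ratio_r(strength, layer):
    """Ratio of stage delay without interconnect to delay with interconnect."""
    r_drv = R_UNIT / strength                      # stronger driver, lower R
    squares = LENGTH_UM / layer["width_um"]        # wire length in squares
    r_wire = layer["r_sheet"] * squares / 1000.0   # ohms converted to kOhm
    c_wire = layer["c_per_um"] * LENGTH_UM
    d_without = T_INT + r_drv * C_PINS
    d_with = T_INT + r_drv * (c_wire + C_PINS) + r_wire * (c_wire / 2.0 + C_PINS)
    return d_without / d_with

table = {(name, s): ratio_r(s, layer)
         for name, layer in LAYERS.items()
         for s in (0.5, 1.0, 2.0, 4.0)}
for (name, s), r in sorted(table.items()):
    print(f"{name}  {s:>3}x  r = {r:.3f}")
```

With these placeholder values the ratio differs between layers at every driver strength, which is the qualitative behaviour the text describes. The real spread depends on the actual process data.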

Figure 18. Results showing the impact of layer assignment for a .8 micron process. (x-axis: routing layer; y-axis: ratio r; curves for increasing driver strength)

Figure 19. Results showing the impact of layer assignment for a .25 micron process. (x-axis: routing layer; y-axis: ratio r; curves for increasing driver strength)

4 IMPACT ON DESIGN FLOWS

In the previous sections we have analyzed the impact of increasing interconnect dominance in DSM technologies, and taken a closer look at the limitations of wireload models. The ultimate question to answer is: what impact do these trends and issues have on current and future design methodologies? What must be changed in the way we do gate-level synthesis for DSM designs, and block level assembly for hierarchical designs?

4.1 Appropriate Block Sizes for Synthesis

In [3] it was predicted that an approximate block size of 5k gates, which is about the size of a logic block that a designer might want to deal with, would be of acceptable size for wireload models to be effective, now and into the foreseeable future. Based on our analyses above, however, we believe that other technology and design factors must be considered, and that only the granularity of the physical information can ultimately determine the efficacy of the wireload models.

Technology and Design Dependence

The influence of interconnect on delay is dependent on a number of factors, including the process technology, as shown in Section 3. One example was the increasing influence of interconnect layer assignment on overall performance. There are also effects which are a combination of technology issues and design dependence. For example, intra-layer capacitance is becoming more dominant for smaller feature sizes, which makes the impact of interconnect more dependent on neighboring line switching and routing congestion. Routes are forced to meander in congested regions, thereby increasing the overall net capacitance.
The impact of neighboring line switching can considerably increase the effective intra-layer capacitance, which is becoming a more dominant component of the total capacitance. Since congestion impacts wireload model predictability, the pre-layout timing prediction for a block is also impacted by the overall netlist connectivity. Some netlists have an inherently higher connectivity than others, forcing certain blocks of logic to be placed together and sometimes resulting in congestion hotspots. Datapath dominated designs are a good example of designs with this strong dependency.

To illustrate this particular form of design dependency we performed the following experiment on the 48k gate, .25 um datapath design described earlier. We reordered the IO pins on the block slightly from the ordering used above (we simply interchanged the bit orderings for two 64-bit busses entering the block), then compared the placed and routed results for both cases. Figure 20 shows a scatter plot of the delays of 2 pin nets for both placements. The variation is quantified in Figure 21, which shows a profile of the ratio of the delay of a stage in one placement to that in the other over all nets. The mean of this profile is approximately .95, and there are a significant number of nets for which this ratio is substantially different from 1.

Figure 20. Scatter plot showing the impact of IO dependencies. (axes: stage delay with floorplan 1; stage delay with floorplan 2)

Figure 21. Distribution showing the impact of IO dependencies. (x-axis: (stage-delay with floorplan 1)/(stage-delay with floorplan 2); y-axis: number of 2 pin nets)

With such a substantial dependence on the chosen technology and the design specificity, stating some absolute block size as appropriate for synthesis seems questionable. One could perhaps only calculate an upper bound on such a block size, and for our results shown here for .8 micron technologies, such a bound would be significantly smaller than 5k gates.
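The floorplan comparison above reduces to a per-net ratio profile. The sketch below shows one way such a profile could be summarized; the function name `delay_ratio_profile`, the 10% tolerance and the delay values are hypothetical and are not the paper's data.

```python
from statistics import mean

def delay_ratio_profile(delays_a, delays_b, tolerance=0.1):
    """Summarize per-net stage-delay ratios between two placements.

    delays_a, delays_b: {net_name: stage delay} for the same netlist placed
    under two floorplans. Returns the mean ratio and the sorted names of
    nets whose ratio differs from 1 by more than `tolerance`.
    """
    ratios = {net: delays_a[net] / delays_b[net]
              for net in delays_a if net in delays_b}
    outliers = sorted(n for n, r in ratios.items() if abs(r - 1.0) > tolerance)
    return mean(list(ratios.values())), outliers

# Made-up stage delays (ps) for three 2 pin nets under two pin orderings.
floorplan_1 = {"n1": 100.0, "n2": 80.0, "n3": 120.0}
floorplan_2 = {"n1": 100.0, "n2": 100.0, "n3": 100.0}
mean_ratio, outliers = delay_ratio_profile(floorplan_1, floorplan_2)
print(mean_ratio, outliers)
```

A mean close to 1 can hide large per-net swings, which is why the outlier list is reported alongside it.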
5 GETTING MORE PHYSICAL

In order to account for physical effects during synthesis, some form of early estimation of net capacitances is clearly necessary. We showed previously that a wireload estimator based on statistics from legacy designs breaks down at some level of placement granularity, even for small designs. From these results we would expect that floorplanning provides insufficient physical detail for wireload prediction. This suggests the need for a new block synthesis methodology.

These block-design methodology implications also have an impact on hierarchical design styles and capabilities. When blocks are designed separately and then assembled together at the chip level, their netlists and constraints may be in different stages of completion at different times in the design process. The challenge in hierarchical design is to be able to efficiently implement individual blocks while taking into account the global view of the chip. Recall that changing the pin orderings for a small datapath had a significant impact on the performance of the datapath block. Should the pin assignment for blocks be done top-down or bottom-up?

5.1 Approaches to Physical Synthesis

We first consider proposed solutions for block level synthesis. Recent approaches to synthesis begin with some estimate of physical interconnect effects for a first pass of synthesis, followed by some interactive loop between physical design and synthesis to achieve timing closure. While such approaches can alleviate the wiring dominance problem, clearly we should be searching for new opportunities to incorporate the ultimate physical realities as early as possible in the synthesis flow.

Another possibility would be to use drivers that are strong enough to make any errors in interconnect estimation inconsequential. This assumption was implicit in the 5k gate block size result in [3], where a typical driving transistor was assumed to have a W/L ratio of 2. While this approach can make wireload models and predictability more effective, there is a price paid in terms of overdesigning, as illustrated by the results in Table 1. Since power is becoming an extremely precious commodity in IC design, this style of synthesis might be unacceptable.

Our best hope, therefore, may be to determine the point in the physical design or floorplanning flow where we can achieve sufficient confidence in the accuracy of interconnect estimation, but prior to the actual completion of the physical design, so that gate sizing can still be controlled and modified. Only placement data can guarantee some level of resolution where the error in interconnect estimation is acceptable for DSM designs. The coarsest level of placement detail that provides acceptable estimation will be a function of the design style and the process technology, as discussed earlier.
Once this level has been reached, synthesis can be done with confidence in the accuracy and optimality of the result.

Hierarchical design flows

Overcoming the wireload modeling inaccuracy for synthesis and physical creation of the blocks is only half of the problem. An equally difficult task, especially due to the increasing dominance of the global interconnect, is the assembly of these blocks as part of a hierarchical design flow. In current methodologies, individual blocks in the hierarchy are designed independently using conservative constraints on area and timing, then assembled at the chip level using an abstract timing model for each block. There are several problems with this approach. First, the floorplan level will not, in general, provide sufficient physical detail for estimating timing behavior. Chip-level constraints are arrived at initially without any knowledge of whether individual blocks are feasible or not. The chip-level context is not very accurately known before individual blocks are implemented. Furthermore, the implementation of a block depends on factors such as global routes and pin assignments, which are known only at the chip level. Adjustments in the block timing budgets during chip assembly are what lead to costly design iterations with no guarantee of convergence. This bottom-up methodology also makes it difficult to implement ECO changes.

Instead of using an abstract model, another approach is to instantiate individual blocks flat at the chip level after their physical implementation is completed. While this enables accurate estimates of timing, congestion and area, it leads to problems with capacity, since detailed information about each block must now be handled at the chip level. Once again, a large number of iterations would be required, since the implementation of each block is independent of a global design view.
Clearly, rather than performing bottom-up design, an ideal flow would look at chip-level and block-level issues concurrently. Any practical implementation of this would also require quick and accurate estimates of how changes in any one context affect the other. In the following section, we will discuss such a methodology, called physical prototyping.

6 PHYSICAL PROTOTYPING

Typically, verifying that a design satisfies timing constraints is done at the floorplanning stage following the first gate-level synthesis with wireload models. Since this stage does not always provide sufficient modeling accuracy, we propose to refine the coarse placement further until the wireload modeling error becomes acceptable.

6.1. How much physical detail is enough?

Assuming that a coarse placement is available, we quantify the level of resolution as follows. The placement area is divided into regions of roughly equal area such that the exact standard cell placement location is known to within the precision specified by the size of that region. This is analogous to the location uncertainty for the cell locations described by the region partitioning in Figure.

To determine the size of the regions for which such a coarse placement would provide sufficient modeling precision, we can consider an extension of the experiments in Section 2.2. In this experiment we partitioned the design into regions of roughly equal area, as shown in Figure. A wireload model of choice was used to compute the delays of all local nets (i.e., nets which are completely within a block). The region sizes for which the mean and deviation of the wirelength profiles are acceptable would be the level at which this particular wireload model can be used for synthesis. Obviously, the more closely the wireload model correlates with the actual placement statistics, the less physical modeling detail is required.
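As an illustration of the experiment described above, the following minimal Python sketch partitions a placement into square regions, collects the local nets, and reports the mean and deviation of their wirelengths for each candidate region size. It is not the procedure used for the results in this paper. Half-perimeter wirelength stands in for the wirelength estimate, the ratio of deviation to mean stands in for the acceptability criterion, and the names Pin, hpwl, region_of, local_net_stats and coarsest_acceptable, the example nets and the threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Pin:
    """A net terminal at a placed (x, y) location, in arbitrary length units."""
    x: float
    y: float


def hpwl(pins):
    """Half-perimeter wirelength of one net (assumed stand-in for routed length)."""
    xs = [p.x for p in pins]
    ys = [p.y for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def region_of(pin, region_size):
    """Index of the square region of side region_size that contains the pin."""
    return (int(pin.x // region_size), int(pin.y // region_size))


def local_net_stats(nets, region_size):
    """Return (count, mean, deviation) of wirelength over local nets,
    i.e. nets whose pins all fall inside a single region."""
    lengths = [
        hpwl(pins)
        for pins in nets.values()
        if len({region_of(p, region_size) for p in pins}) == 1
    ]
    if not lengths:
        return 0, 0.0, 0.0
    return len(lengths), mean(lengths), pstdev(lengths)


def coarsest_acceptable(nets, region_sizes, max_rel_dev):
    """Largest region size whose local nets have deviation/mean <= max_rel_dev.
    Returns None if no candidate size qualifies (hypothetical criterion)."""
    for size in sorted(region_sizes, reverse=True):
        count, mu, sigma = local_net_stats(nets, size)
        if count and mu > 0 and sigma / mu <= max_rel_dev:
            return size
    return None


# Hypothetical placed nets: three short nets and one long net.
EXAMPLE_NETS = {
    "a": [Pin(1.0, 1.0), Pin(2.0, 3.0)],
    "b": [Pin(1.0, 2.0), Pin(4.0, 2.0)],
    "c": [Pin(11.0, 11.0), Pin(13.0, 12.0)],
    "d": [Pin(2.0, 2.0), Pin(18.0, 16.0)],
}

if __name__ == "__main__":
    for size in (5, 10, 20):
        print(size, local_net_stats(EXAMPLE_NETS, size))
    print(coarsest_acceptable(EXAMPLE_NETS, [5, 10, 20], max_rel_dev=0.5))
```

In this toy data the long net becomes local only at the largest region size, where it widens the wirelength spread. This mirrors the observation that a single wireload model becomes less reliable as the region it must cover grows.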
Estimating the wirelengths within these bounded regions can be done in a variety of ways, including via analytical models [7][8], graph properties of the netlist [9], or empirical observations [10][11][12][13]. The deviation in net delays for local nets is computed for each block size, and the region size at which this deviation is acceptable is the level at which wireload models can be used.

6.2. Prototyping Designs

Once we have obtained an acceptable level of placement detail for wireload estimation, we can construct a physical prototype of the final design. At this level of coarse placement, gate sizing and remapping will work with a correct knowledge of path criticality and stage delays. Furthermore, we also have an estimate of interconnect length such that required routing resources can be approximated. Performing a congestion analysis with this level of physical detail will determine whether or not routability constraints can be satisfied. The corresponding availability of accurate delay estimates also enables approximate clock tree synthesis, which in turn improves the accuracy of congestion estimation.

Since the error in wireload estimation is reasonably bounded, a timing analysis of this prototype will correlate closely to a timing analysis of the final physical design. If the design constraints are not satisfied, the designer can make the necessary RTL or behavioral level changes to the netlist, modify the floorplan and constraints, and then repeat the prototyping process. Power and IR drop analysis of the design can also be done at this point and, if necessary, changes to the power grid can be incorporated.

6.3. Designing blocks

For the coarse placement that is used to construct a physical prototype, the placement area is divided into regions and standard cells are distributed among these regions.
After converging on a level of physical detail for the prototype that satisfies top-level constraints, the physical implementation of the blocks is carried out via concurrent synthesis and placement for each of them individually, while maintaining the global view from the physical prototype.

6.4. Hierarchical Design

Ideally, during hierarchical design we would maintain both a global context for the entire chip and a local context for each block in the hierarchy. The initial chip context is obtained from something equivalent to an initial block-level floorplan using very abstract timing and area estimates for individual blocks. First, a physical prototype of the entire chip is generated using these models. Here, top-level routing and optimizations like buffer insertion, pin assignment and top-level clock tree synthesis are done. These are then used to generate constraints for the individual blocks.

[Figure 22. Hierarchical design showing block and chip contexts.]

[Figure 23. Physical Prototyping - comparison of cycle times with actual physical design. Vertical axis: Cycle Time (ns); horizontal axis: Designs; series: after physical prototyping, after physical design; percentage labels: 2.4%, 2.35%, 6.59%, 7.2%, 5.44%, .76%.]

Given these block-level constraints, a physical prototype can be obtained for an individual block. This uses information pushed down from the global context, such as the drivers at the block's inputs, the loads on its outputs and the topology of top-level routes. Once the physical prototype of a block is generated, it is used to refine the block-level context and constraints. It is also used to refine the chip-level context, since information is now available about the loads on the input pins of the block, drivers on the output pins and the resources available for routing over the block.

Physical prototyping enables the accurate estimation of what the final placed and routed solution of a block will look like, and thus is a powerful tool that can be used in chip-level optimization. Given a design, it can also be used to generate an optimal partition of the netlist into hierarchical blocks and provide timing and area estimates for them.
7 SOME RESULTS

To demonstrate that the prototype does present a realistic picture of the final design, we present results of some industry designs using the physical design system described above. Figure 23 shows comparisons between the cycle time estimated at the physical prototyping level and the cycle time at the end of placement and routing. The margin of error in most cases is within %. The runtime required for physical prototyping is between % to 2% of the runtime required to obtain a detailed placement; as described earlier, it depends on technology and design characteristics. If runtimes for global and detailed routing are included as well, the fraction of time required for physical prototyping will be even smaller.

8 CONCLUSIONS

In summary, the level of placement resolution at which wireload models can be used depends on the extent to which interconnect delays dominate stage delays. For pre-DSM technologies, this corresponded to the floorplan level. However, our analyses clearly show that this is no longer sufficient for the post-DSM era. We further conclude that specifying a block size for which synthesis with wireload models is effective is an over-simplification, since such a block size depends on combinations of process technology, design style, floorplan and the wireload estimator that is used. Our results clearly show that a 5k or any other fixed block size assumption for the acceptable level at which wireload models are applicable is not realistic even for today's .8 micron designs. Based on these analyses, we proposed the generation of a physical prototype to accurately estimate timing from the coarse placement information. We described the implications of this methodology on block and hierarchical design flows.

9 REFERENCES

[1] Semiconductor Industry Association, National Technology Roadmap for Semiconductors, 1999.
[2] S. Hojat and P. Villarrubia, An Integrated Placement and Synthesis Approach for Timing Closure of PowerPC Microprocessors, Intl. Conference on Computer Design, October 1997.
[3] D. Sylvester and K. Keutzer, Getting to the Bottom of Deep Submicron, Intl. Conference on Computer-Aided Design, November 1998.
[4] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, Addison Wesley, 2nd Edition, 1993.
[5] D. MacMillen, DSM: It's the Heights and not the Depths that are Dangerous, IEEE/ACM Workshop on Timing in the Specification and Analysis of Digital Systems (TAU), March 1999.
[6] F. K. Hwang, D. S. Richards and P. Winter, The Steiner Tree Problem, Elsevier Science Publishers, 1992.
[7] W. E. Donath, Placement and average interconnection lengths of computer logic, IEEE Trans. on Circuits and Systems, 26(4), April 1979.
[8] A. E. Caldwell, A. B. Kahng, S. Mantik, I. L. Markov and A. Zelikovsky, On Wirelength Estimations for Row-Based Placement, IEEE Trans. on CAD 8(9), 1999.
[9] T. Hamada, C.-K. Cheng and P. M. Chau, A wire length estimation technique utilizing neighborhood density equations, Proc. ACM/IEEE Design Automation Conf., 1992.
[10] D. Stroobandt and J. Van Campenhout, Accurate Interconnection Length Estimations for Predictions Early in the Design Cycle, VLSI Design, Special Issue on Physical Design in Deep Submicron, v(!), 1999.
[11] M. Pedram and B. Preas, Interconnection length estimation for optimized standard cell layouts, Intl. Conf. on Computer-Aided Design, 1989.
[12] C. Sechen, Average interconnection length estimation for random and optimized placements, Intl. Conf. on Computer-Aided Design, 1987.
[13] S. Bodapati and F. N. Najm, Pre-Layout Estimation of Individual Wire Lengths, Proceedings ACM International Workshop on System-Level Interconnect Prediction (SLIP), 2.
