
A Balanced-Mesh Clock Routing Technique Using Circuit Partitioning

Hidenori Sato, Akira Onozawa, Hiroaki Matsuda
NTT LSI Laboratories
3-1, Morinosato Wakamiya, Atsugi-shi, Kanagawa Pref., 243-01, Japan

Abstract

A clock routing technique using balanced-mesh routing is proposed, which incorporates the advantages of both the well-known balanced-tree and fixed-mesh routing methods. The circuit is partitioned into sub-blocks called Mesh-Routing Regions (MR's), in which clock skew is suppressed below a constant by mesh routing. Then the net from the clock source to each MR is routed as a balanced tree. In applying the technique to the design of an MPEG2-encoder LSI, a skew of 210 ps was achieved.

1 Introduction

In a synchronous VLSI design, circuit speed is limited by the critical path delay and clock skew. The critical path delay is the maximum path delay through the combinational circuits between two synchronous elements, i.e., flip-flops (FF's). The clock skew is the maximum difference in the delay times from the clock source to the FF's. It is said that the clock skew must be less than 5% of the critical path delay time to build high-performance electronic systems, which is a very tight design constraint. Until recently the delay in transistors was the dominant factor affecting the performance. However, with deep submicron technology, the interconnect delay makes up a large part of the overall delay [1]. Thus, the clock skew consideration in layout design is crucial.

Several clock-routing techniques have been proposed in recent years. These approaches can be classified into those based on the Balanced-Tree Method (BTM), where the clock net is routed as a tree so that the delay times of the clock signal are balanced [2, 3, 4, 5, 6], and those based on the Fixed-Mesh Method (FMM), where the clock net is routed as a fixed mesh driven by a large buffer [7].

Many important works have been presented regarding BTM. Bakoglu et al. [2] proposed an H-tree structure.
The H-tree can reduce the skew, but the placement and size of FF's are subject to certain restrictions to keep the H-tree symmetric. Jackson et al. [3] proposed the Method of Means and Medians, which recursively partitions a circuit into two subsets and then connects the subsets considering their centers of mass. This method reduces skew even if FF's are not placed symmetrically. Minami et al. [5] proposed the Path Delay Balancing Method, which merges two sub-clock trees in a bottom-up manner at a point where the skew is minimized. Tsay [4] also proposed a method of this type, and Edahiro [6] improved it to minimize the length of the clock net while keeping the skew zero. On the whole, BTM can achieve very low (possibly zero) skew, but it may increase the number of routing tracks and the delay due to detours that cannot be predicted before routing. In other words, BTM may increase area and delay time by making the skew unnecessarily small. This is especially crucial in the design of chips having many FF's, e.g., MPEG2 LSI's.

FMM has been applied to the design of a DEC Alpha chip [7] [footnote 1]. The entire chip is covered by a big mesh of interconnect metal that drives all the FF's. Although it could achieve clock skew of less than 300 ps for a 0.75-µm technology, the power dissipated by the clock was almost 40% of the total power dissipation of the chip, because FMM tends to overestimate the skew, leading to an increased number of interconnects and the need for a large buffer. However, a fixed mesh is easy to route and at most one routing track is required in each channel, which means that, unlike in BTM, the area increase due to clock routing is predictable.

Taking the advantages of both BTM and FMM into account, we developed a practical clock routing method called the Balanced-Mesh Method (BMM), in which the circuit is partitioned into sub-blocks and the clock net in each sub-block is routed as a mesh (see Fig. 1). Each mesh is driven by a relatively small clock buffer placed at its center row, and these buffers are routed from the clock source using a balanced tree or a minimum-delay tree.

[Figure 1: The Balanced-Mesh Method. Labels: balanced or minimum-delay tree; mesh routing for clock net; layout block; cell rows; clock buffer; clock source of mesh routing; clock source of the chip.]

Footnote 1: A mesh is called a grid in the original paper.

ED&TC 96 0-89791-821/96 $5.00 1996 IEEE

The circuit is partitioned so that each sub-block's skew and the clock-signal delay time can be bounded under given allowances, based on the relationship among the clock skew, delay time, and FF density in a chip. This relationship is determined beforehand by circuit simulation. Since the simulation is performed under the worst condition, the clock skew and the delay time can be made lower than the given allowances in actual design. Since in general the area covered by the meshes and the buffers driving them are smaller in BMM than in FMM, it can reduce the power dissipation of the clock signal significantly. Furthermore, it provides the advantages of FMM, i.e., easy routing with at most one routing track in each channel. Another important point is that, in general, the delay with mesh routing is smaller than with tree-based approaches [8, 9].

The BMM was applied to several circuits including an MPEG2 LSI. The results show that the BMM can be used to design circuits with more than 100-MHz clock frequency with almost no area overhead. Section 2 discusses the effect of partitioning on the clock skew and delay time and describes the layout flow using BMM. Section 3 overviews the partitioning algorithm, and Section 4 shows the experimental results and the method's effect.

2 Balanced-Mesh Method

This section presents an empirical model for the clock skew and delay time of mesh routing and the layout flow based on that model.

[Figure 2: Simulation Model for Mesh Routing. Label: input capacitance of FF's.]
2.1 Characterization of Mesh Routing

A mesh for a clock net is a combination of a loop, horizontal interconnections in every other channel, and a vertical center interconnection. The clock source is placed at the center of the loop (see Fig. 2). A buffer is connected to this source. We first simulated the effect of mesh routing with HSPICE [10] under the following conditions: 1) the aspect ratio of the loop is 1, 2) the input capacitances of FF's are localized to realize the worst case, and 3) the clock interconnects are twice as thick as the others. The interconnections were modeled by RC ladders. We examined the clock skew $S$ and delay time $d$ in relation to the number of FF's $N$ and the mesh area $A$, because it is reasonable to assume that the clock-signal delay time depends on the total clock net length, which depends on $A$, and the amount of input capacitance, which is proportional to $N$.

The simulation results are summarized in Fig. 3. Figure 3 (a) shows the constant-skew curves in terms of $N$ and $A$. The hatched area shows the region where the skew is less than a constant $S_1$. Figure 3 (b) shows the constant-delay curves in terms of $N$ and $A$. Here the delay time is the average of the delay times from the clock source to the FF's. As a result, $S$ and $d$ are approximately represented as follows:

$$S = \alpha\,(N A^{2}) + \mathrm{Const.}, \tag{1}$$
$$d = \beta N + \gamma A + \mathrm{Const.} \tag{2}$$

[Figure 3: (a) Constant-skew curves ($S_1 < S_2 < S_3$) and (b) constant-delay curves ($d_1 < d_2 < d_3$) in terms of $N$ and $A$.]

Here, $\alpha$, $\beta$, and $\gamma$ are positive constants determined from the simulations. These constants depend only on the technologies used and the drivability of the clock buffer, not on the circuit type. That is, by carefully partitioning the circuit into sub-blocks having at most $N$ FF's and area $A$ that satisfy eqs. (1) and (2) for given $S$ and $d$, we can suppress the skew and delay in each partitioned block. A sub-block region that ensures the given skew is called a Mesh-Routing Region (MR).

Given curves (1) and (2), constraints for $N$ and $A$ that ensure given $S$ and $d$ are determined as follows. As shown in Fig. 4, $N$ and $A$ must be below both the constant-skew curve for $S$ and the constant-delay curve for $d$. In addition, we need a lower bound for delay times to equalize them and so reduce the skew among MR's. $N$ and $A$ realizing an MR must therefore be bounded by these two delay curves and a skew curve. We further shrink this region into the hatched rectangle shown in Fig. 4 to have a safe margin to the skew and delay boundaries. This is done as follows. We draw a line indicating the FF density, i.e., the ratio $N/A$, of the chip. Then points (a) and (b) are determined as the intersections of this line with the constant-skew and constant-delay curves. The region for an MR is the rectangle determined by these two points, where MR's will have a density close to that of the chip and can have a safe margin to the delay and skew boundaries. In practice, this rectangle is also large enough to maintain the freedom of the partitioning.
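As an illustration of the rectangle construction just described, the following Python sketch locates the two corner points numerically. It is not the authors' program: the function names, the bisection search, and the toy models with made-up coefficients are assumptions introduced here. The skew and delay models are passed in by the caller, so the sketch does not depend on the exact fitted form of the skew curve. It also assumes one reading of Fig. 4, namely that the lower corner comes from the delay lower bound and the upper corner from whichever of the skew bound or the delay upper bound is reached first along the density line.

```python
from typing import Callable, Tuple

Model = Callable[[float, float], float]  # maps (N, A) to a skew or delay value


def solve_increasing(f: Callable[[float], float], target: float,
                     lo: float, hi: float, iters: int = 80) -> float:
    """Bisection: return x in [lo, hi] where the non-decreasing f reaches target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mr_rectangle(skew: Model, delay: Model, density: float,
                 s_max: float, d_min: float, d_max: float,
                 a_limit: float) -> Tuple[float, float, float, float]:
    """Return (N_min, N_max, A_min, A_max) along the density line N = density * A."""
    def skew_on_line(a: float) -> float:
        return skew(density * a, a)

    def delay_on_line(a: float) -> float:
        return delay(density * a, a)

    # Lower corner: the delay lower bound is reached here.
    a_min = solve_increasing(delay_on_line, d_min, 0.0, a_limit)
    # Upper corner: the tighter of the skew bound and the delay upper bound.
    a_skew = solve_increasing(skew_on_line, s_max, 0.0, a_limit)
    a_delay = solve_increasing(delay_on_line, d_max, 0.0, a_limit)
    a_max = min(a_skew, a_delay)
    return density * a_min, density * a_max, a_min, a_max


# Toy models with made-up coefficients (hypothetical, for illustration only).
def toy_delay(n: float, a: float) -> float:
    return 2.0 * n + 3.0 * a + 1.0   # linear in N and A

def toy_skew(n: float, a: float) -> float:
    return n * a                     # placeholder, not the fitted skew model


rect = mr_rectangle(toy_skew, toy_delay, density=2.0,
                    s_max=18.0, d_min=8.0, d_max=29.0, a_limit=100.0)
print(rect)
```

With these toy inputs the skew bound is reached before the delay upper bound, so the skew curve sets the upper corner of the rectangle. Both models must be non-decreasing along the density line for the bisection to be valid.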
Formally, an MR is defined as follows:

Definition 2.1 Sub-block SB(i) having $N(i)$ FF's and area $A(i)$ is an MR if it satisfies

$$N_{\min} \le N(i) \le N_{\max}, \tag{3}$$
$$A_{\min} \le A(i) \le A_{\max}, \tag{4}$$

where $N_{\min}$, $N_{\max}$, $A_{\min}$ and $A_{\max}$ are the constants that represent the lower and upper limits of $N(i)$ and $A(i)$, respectively, determined as shown in Fig. 4. We call (3) and (4) the MR constraints. We can widen these ranges further by providing a number of sizes of clock buffers.

[Figure 4: MR Constraints. Labels: density of chip; delay upper bound; delay lower bound; range of delay time; points (a) and (b).]

2.2 Layout Flow using BMM

The layout flow using BMM is as follows:

(Step1.) MR Partitioning,
(Step2.) FloorPlan,
(Step3.) Placement,
(Step4.) Global Routing of Clock Net,
(Step5.) Global Routing of Other Signal Nets,
(Step6.) Detailed Routing.

Before layout, we first determine eqs. (1) and (2) from simulation results. This is done only once for a technology. MR constraints are determined according to the required performance of a particular design. In (Step1.), the circuit is partitioned into MR's so that they satisfy the MR constraints, and then a clock buffer, the size of which depends on the values of $N$ and $A$, is selected in each MR (see Fig. 5). We perform this step before the placement, because the placement quality is not degraded very much by partitioning, since an MR is fairly large and can contain a few logic blocks. Further, it is difficult to find room for clock buffers after the placement due to their sizes. Next, the floorplan of MR's is performed considering the number of nets crossing the MR's. This is followed by the cell placement without taking the clock net into consideration. Cells are placed within the MR they belong to. The clock buffer is positioned either at the left or right edge of the center row of the corresponding MR to adjoin the power lines, because the clock buffer dissipates a lot of power.

[Figure 5: Image of MR Partitioning. Labels: clock net; cell; MR1; MR2; inter-MR clock net; intra-MR clock net; clock buffer.]

In (Step4.), the routing is classified into two types: intra-MR and inter-MR. Intra-MR routing is the mesh routing in each MR. Inter-MR routing is the minimum-delay-time routing from the clock source to all MR's, because even the minimum-delay-time routing gives small clock skew (see Table 2) when the circuit is not so large. The balanced tree can also be used, since the number of MR's is small and very little area is wasted due to detours. The global routing of other signal nets is performed after this. The last step is the channel routing.

We developed an MR-partitioning program and a clock-global-routing program. We describe the MR-partitioning algorithm in detail in the next section.

3 The MR-Partitioning Algorithm

3.1 Problem Formulation

Let $G(V, E)$ be a hypergraph with a vertex set $V$ and an edge set $E$, where $v \in V$ and $e \in E$ correspond to a cell and a net of the given circuit, respectively. Hereafter, we use the graph notation and the circuit interchangeably. The set $V$ is partitioned into $N_p$ subsets. ($N_p$ will be defined later in this section.) Let $G_i \equiv G_i(V_i, E_i)$ be a graph of a partition of graph $G$, where $\bigcup_{i=1}^{N_p} V_i = V$ and $V_i \cap V_j = \emptyset$ $(i \ne j;\ i, j = 1, \ldots, N_p)$.

The MR-partitioning problem:

Minimize: $\sum_{i,j} C(i, j)$
Subject to: MR constraints (3) and (4),

where $C(i, j)$ is the number of nets connecting $G_i$ and $G_j$ $(i, j = 1, \ldots, N_p)$. $N_p$ is:

$$N_p = \text{either } \lfloor N_{all} / N_{target} \rfloor \text{ or } \lceil N_{all} / N_{target} \rceil, \tag{5}$$

where $N_{all}$ and $N_{target}$ are the total number of FF's and the expected number of FF's in an MR, whose default value is the mean of $N_{\min}$ and $N_{\max}$, respectively. As shown in eq.
(5), $N_p$ can be calculated in one of two ways. The one satisfying the following feasibility constraints is used:

$$N_p \cdot N_{\min} \le N_{all} \le N_p \cdot N_{\max}, \tag{6}$$
$$N_p \cdot A_{\min} \le A \le N_p \cdot A_{\max}. \tag{7}$$

Here, $A$ is the area of the entire circuit. If neither choice in eq. (5) satisfies eqs. (6) and (7), then the partitioning is infeasible, which is very rare in practice, however. If the partitioning is feasible, $G_i$ can have approximately $N_{target}$ FF's.

3.2 Mincut-based Bipartitioning Technique

The circuit is partitioned into MR's by iterating the bipartitioning technique, as in the conventional mincut algorithm [11, 12]. To satisfy the MR constraints at the final level of partitioning, we impose constraints on the intermediate levels of partitioning. Let $G^P$ be a sub-circuit at an intermediate level that will be further partitioned into $N_p^P$ ($\le N_p$) MR's. Let the number of FF's and the area of $G^P$ be $N^P$ and $A^P$, respectively. Suppose we partition $G^P$ into two sub-circuits, $G^A$ and $G^B$, that have $N_p^A = \lceil N_p^P / 2 \rceil$ MR's and $N_p^B = \lfloor N_p^P / 2 \rfloor$ MR's, respectively. We denote the number of FF's and the area of $G^A$ and $G^B$ by $N^A$, $A^A$, $N^B$ and $A^B$, respectively. We impose the following constraints on $G^A$ and $G^B$:

$$N^A = \lceil N^P \cdot N_p^A / N_p^P \rceil, \tag{8}$$
$$N^B = N^P - N^A, \tag{9}$$

min 1 min 1 N N max max 1 N B N max B max 1 N N min N B N min ; (10) : (11)

It can be shown that a partitioning satisfying the above constraints is always possible and that MR's satisfying the MR constraints can eventually be obtained if eqs. (6) and (7) hold. The MR partitioning algorithm is summarized below.
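Before the algorithm listing, the choice of the number of MR's in eqs. (5) to (7) and the halving of the MR count at each bipartitioning level can be sketched in Python as follows. This is only an illustration, not the authors' C program: the function names are hypothetical, trying the smaller candidate first is an assumption (the text only says that the feasible one is used), and the bounds in the usage lines are made-up values. The 669 FF's and 8.4 mm² are the S13207 figures from Tables 1 and 2.

```python
import math
from typing import Optional, Tuple


def choose_num_partitions(n_all: int, area: float,
                          n_min: float, n_max: float,
                          a_min: float, a_max: float,
                          n_target: Optional[float] = None) -> Optional[int]:
    """Pick N_p from the floor/ceiling candidates of eq. (5), or None if infeasible."""
    if n_target is None:
        n_target = (n_min + n_max) / 2.0      # default: mean of N_min and N_max
    ratio = n_all / n_target
    candidates = sorted({math.floor(ratio), math.ceil(ratio)})
    for n_p in candidates:                    # assumption: smaller candidate first
        if n_p < 1:
            continue
        ff_ok = n_p * n_min <= n_all <= n_p * n_max      # eq. (6)
        area_ok = n_p * a_min <= area <= n_p * a_max     # eq. (7)
        if ff_ok and area_ok:
            return n_p
    return None


def split_mr_count(n_p_parent: int) -> Tuple[int, int]:
    """MR counts assigned to the two halves when a sub-circuit is bipartitioned."""
    return math.ceil(n_p_parent / 2), math.floor(n_p_parent / 2)


# Hypothetical MR bounds; only 669 FF's and 8.4 mm^2 come from the tables.
n_p = choose_num_partitions(669, 8.4, n_min=100, n_max=200, a_min=1.5, a_max=3.0)
infeasible = choose_num_partitions(669, 100.0, n_min=100, n_max=200,
                                   a_min=1.5, a_max=3.0)
halves = split_mr_count(3)
print(n_p, infeasible, halves)
```

The second call shows the infeasible case, where neither candidate satisfies the area bound and the function returns None instead of a partition count.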

Algorithm 3.1: MR Partitioning

begin
  initial clustering;
  while N_p^P > 1 do
  begin
    set partitioning constraints (8), (9), (10) and (11);
    initial partitioning;
    while number of clusters in G^A and G^B is more than 2 do
    begin
      swap clusters [11];
      hierarchical clustering;
    end
    decompose clusters to initial cluster level;
    choose next sub-circuit if N_p^P > 1;
  end
end

[Figure 6, panels (a) Improvement on Initial Clusters and (b) Hierarchical Clustering for Initial Clusters. Labels: G^A; G^B; net; initial cluster; cluster; cut-line.]

In initial clustering, the cells are clustered according to the conventional definition of connectivity. Each cluster is forced to include at most one FF. With this clustering, cells having strong connectivity can be put in one MR and the complexity of the partitioning can be reduced. In initial partitioning, the clusters are partitioned into $G^A$ and $G^B$ under constraints (8), (9), (10) and (11), maintaining the logical structure as much as possible. The partitioning is improved in the swap-clusters step, which is based on the well-known mincut algorithm [11]. Note that the moving of clusters is restricted by constraints (8), (9), (10) and (11). After the improvement, the circuit is further clustered in the hierarchical-clustering step with an increased cluster size, as in [12], and the partitioning is iterated until the number of clusters in $G^A$ and $G^B$ becomes one. As shown in Fig. 6, this algorithm reduces the risk of being trapped in local optima.

4 Experimental Results

The MR-partitioning program and the clock-global-routing program were implemented in C on a Sun SPARCstation 2. BMM was tested on three ISCAS benchmark circuits, two industrial circuits, data1 and data2 (see Table 1), and an MPEG2 LSI. First, we show the results for the ISCAS data.
We assumed a 0.5-µm CMOS technology and set MR constraints (3) and (4) to produce a skew below 180 ps, which enables a design of 100 MHz or faster. The clock interconnect width was twice the normal width, so the resistance was half the normal interconnect resistance. Table 2 shows the skew from each clock buffer to each FF (Intra-MR), the skew from the clock source to the clock buffers (Inter-MR), and the skew from the clock source to each FF (Overall). The maximum clock delay times from the source to the FF's (Phase Delay) are also shown. They were calculated by HSPICE [10]. We achieved skew below 180 ps for all the data by BMM.

[Figure 6: An example of the improvement obtained with the algorithm. Panels: (c) Swapping Clusters According to Gain; (d) More Hierarchical Clustering. A hatched initial cluster is a cluster including an FF. In (a), the number of nets that cross the cut-line (cuts) is 6, while that number is 3 in (c) after the hierarchical clustering of (a) and swapping clusters.]

Next, we examined the dependency of skew and phase delay on placement. All tests used the same MR partitioning results and floorplan. Only the placement was different. As shown in Table 3, the differences in both skew and phase delay time were at most about 30 ps even when the placement was different.

Table 4 compares results for the proposed routing (BMM) and (pseudo) Steiner tree routing (Steiner). For each type of routing, we changed the width of the interconnect. Dbl indicates a double-width interconnect and Min indicates a minimum-width interconnect. With the proposed routing method, skew was below the allowance. However, skew was more than 1 ns with the conventional routing method. This indicates that BMM with wide interconnects is very effective in reducing the clock skew. Table 4 also lists the interconnect capacitances and the phase delay times of the clock net. The phase delay time of BMM is smaller than that of the Steiner tree routing with the same width, although the capacitance of BMM is larger than that of the Steiner tree routing.
In addition, the area is also comparable. Since the delay times and the area of BTM must be larger than with the Steiner tree routing due to the detours [footnote 2], we can say that BMM is more efficient than BTM in terms of both delay and area.

Table 1: Experimental Data

Circuit   # modules (except FF's)   # FF's   # MR's
S9234     5825                      228      4
S13207    7951                      669      4
S15850    9772                      597      4
data1     2893                      463      3
data2     3477                      940      6

Table 2: Skew and Phase Delay with BMM

          Clock Skew (ps)                  Phase Delay   Area
Circuit   Intra-MR   Inter-MR   Overall   (ns)          (mm^2)
S9234     40         13         51        0.89          4.5
S13207    120        23         110       1.10          8.4
S15850    130        25         130       1.09          8.8

Furthermore, we examined the robustness of MR partitioning for data1 and data2 under the same MR constraints as for the ISCAS data. The result is shown in Table 5. In the table, auto means the results were obtained by the MR-partitioning program, and logic means the partitioning was determined according to the logical structure. All MR's satisfy the MR constraints. The results show that both partitioning methods give almost the same area and total net length. In addition, the results indicate that BMM gives skew below the allowance and almost the same phase delay time if each MR satisfies the MR constraints.

We applied BMM to the design of a 0.5-µm 122-MHz MPEG2-encoder LSI whose die size is 14 × 14 mm^2 and that contains 1.6 M transistors. We set constraints to produce a skew below 350 ps. The chip was partitioned into 14 MR's, and they were routed by an approximately balanced tree from a clock-root buffer placed at the center. It is noteworthy that this root buffer is much smaller than the one used in the DEC Alpha design [7], since it only drives 14 buffers. The achieved clock skew was 210 ps and the total interconnect capacitance of the clock nets was 64 pF. To compare this result with FMM, we calculated the interconnect capacitance of the mesh that covers this chip. It contains the horizontal interconnects for every other channel and vertical interconnects routed at an interval of 1 mm. The result showed that the interconnect capacitance could be 150 pF.
This indicates that BMM is much more efficient than FMM in terms of reducing the power dissipation. If we take the buffer size into account in the above calculation, then there would be a larger difference in the capacitance (i.e., power), because FMM uses a much bigger buffer than BMM does.

Footnote 2: For instance, Huang et al. [13] showed that the total net length of a zero-skew tree can be more than 1.5 times larger than that of a Steiner tree.

Table 3: Skew and Phase Delay of S13207 for Different Placements

           Clock Skew (ps)                  Phase Delay   Area
Test No.   Intra-MR   Inter-MR   Overall   (ns)          (mm^2)
1          120        23         110       1.10          8.4
2          150        22         140       1.13          8.6
3          120        22         120       1.10          8.3

5 Conclusions

We have proposed a clock routing technique called the Balanced-Mesh Method (BMM), which incorporates the advantages of both the well-known Balanced-Tree Method (BTM) and the Fixed-Mesh Method (FMM). We developed an MR-partitioning program and a clock-global-routing program to implement it. The experimental results for a couple of ISCAS benchmark circuits show that BMM can achieve a small skew and phase delay regardless of the placement. BMM was applied to the design of an MPEG2 LSI with 1.6 M transistors. The clock skew was 210 ps, which enabled a clock frequency of 122 MHz. In addition, it was experimentally shown that BMM is better than BTM in reducing phase delay, and provides much lower power dissipation than FMM.

In BMM, MR constraints are determined from simulation results. In future work, we will analyze the theoretical basis of mesh routing and determine MR constraints theoretically. Further, we will extend BMM to multiple clocks.

Acknowledgements

The authors would like to thank Toru Adachi and Ryota Kasai for helpful discussions, and Hiroshi Miyashita for HSPICE simulation.

Table 4: Skew, Phase Delay and Capacitance of S13207 for Different Routing Methods. (BMM & Dbl is the proposed method, and Steiner & Min is the conventional routing method.)

Routing            Clock Skew (ps)                  Phase Delay   Capacitance   Area
Route     Width    Intra-MR   Inter-MR   Overall   (ns)          (pF)          (mm^2)
BMM       Dbl      120        23         110       1.10          25.2          8.4
Steiner   Dbl      330        43         340       1.25          23.3          8.2
BMM       Min      300        31         320       1.19          16.1          7.8
Steiner   Min      1170       64         1170      1.88          15.4          7.7

Table 5: Skew, Phase Delay, Area and Total Net Length for Different MR Partitioning Methods

        Partitioning   Clock Skew (ps)                  Phase Delay   Area     Total Net
Data    Method         Intra-MR   Inter-MR   Overall   (ns)          (mm^2)   Length (mm)
data1   auto           32         9          41        0.72          4.6      1608
data1   logic          32         10         28        0.75          4.6      1628
data2   auto           68         20         70        0.75          6.5      1882
data2   logic          68         24         90        0.77          6.7      1787

References

[1] H. B. Bakoglu: "Circuits, Interconnections, and Packaging for VLSI", Addison-Wesley Publishing Company (1990).
[2] H. B. Bakoglu, J. T. Walker and J. D. Meindl: "A Symmetric Clock-Distribution Tree and Optimized High-Speed Interconnections for Reduced Clock Skew in ULSI and WSI Circuits", Proc. of IEEE International Conference on Computer Design, pp. 118-122 (1986).
[3] M. A. B. Jackson, A. Srinivasan and E. S. Kuh: "Clock Routing for High-Performance ICs", Proc. of ACM/IEEE Design Automation Conference, pp. 322-327 (1990).
[4] R.-S. Tsay: "Exact Zero Skew", Proc. of International Conference on Computer Aided Design, pp. 336-339 (1991).
[5] F. Minami and M. Takano: "Clock Tree Synthesis Based on RC Delay Balancing", Proc. of IEEE Custom Integrated Circuits Conference, pp. 28.3.1-28.3.4 (1992).
[6] M. Edahiro: "A Clustering-Based Optimization Algorithm in Zero-Skew Routings", Proc. of ACM/IEEE Design Automation Conference, pp. 612-616 (1993).
[7] D. Dobberpuhl et al.: "A 200-MHz 64-b Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, pp. 1555-1567 (1992).
[8] P. K. Chan and K. Karplus: "Computing Signal Delay in General RC Networks by Tree/Link Partitioning", IEEE Transactions on Computer-Aided Design, pp. 898-902 (1990).
[9] B. A. McCoy and G. Robins: "Non-Tree Routing", Proc. of IEEE European Design and Test Conference, pp. 430-434 (1994).
[10] "HSPICE USER'S MANUAL (HSPICE Version H92)", META-SOFTWARE (1992).
[11] C. M. Fiduccia and R. M. Mattheyses: "A Linear-Time Heuristic for Improving Network Partitions", Proc. of ACM/IEEE Design Automation Conference, pp. 175-181 (1982).
[12] M. Edahiro and T. Yoshimura: "New Placement and Global Routing Algorithms for Standard Cell Layouts", Proc. of ACM/IEEE Design Automation Conference, pp. 642-645 (1990).
[13] D. J.-H. Huang, A. B. Kahng and C.-W. A. Tsao: "On the Bounded-Skew Clock and Steiner Routing Problems", Proc. of International Conference on Computer Aided Design, pp. 508-513 (1995).