Analyzing Timing Uncertainty in Mesh-based Clock Architectures


Subodh M. Reddy, Fujitsu Laboratories of America, Inc., California, USA
Gustavo R. Wilke*, UFRGS, Porto Alegre, Brazil
Rajeev Murgai, Fujitsu Laboratories of America, Inc., California, USA

* This work was done when the author was an intern at Fujitsu Labs. of America.

/DATE EDAA

Abstract

Mesh architectures are used to distribute critical global signals on a chip, such as clock and power/ground. Redundancy created by mesh loops smooths out undesirable variations between signal nodes spatially distributed over the chip. However, one problem with mesh architectures is the difficulty of accurately analyzing large instances. Furthermore, variations in process and temperature, supply noise and crosstalk noise cause uncertainty in the delay from the clock source to the flip-flops. In this paper, we study the problem of analyzing timing uncertainty in mesh-based clock architectures. We propose solutions for both pure mesh and (mesh + global-tree) architectures. The solutions can handle large design and mesh instances. The maximum error in uncertainty values reported by our solutions is 1-3ps with respect to the golden Monte Carlo simulations, which is at most 0.5% of the nominal clock latency of about 600ps.

1 Introduction

Mesh or grid architectures are popular for distributing critical global signals on a chip, such as clock and power/ground. The mesh architecture uses the inherent redundancy created by loops to smooth out undesirable variations between signal nodes spatially distributed over the chip. These variations can be due to non-uniform switching activity in the design, within-die process variations, or asymmetric distribution of circuit elements (such as flip-flops). For power/ground, a mesh can help reduce voltage variations at different nodes in the network due to non-uniform switching activities. For the clock signal, a mesh (Figure 1) has been shown to achieve very low skew in microprocessor designs, e.g., Digital Alpha [2]; IBM G5 S/390 [6], Power4 and PowerPC [12]; SUN Sparc V9 [13]. A mesh also has excellent jitter mitigation properties.

Figure 1: A mesh-based clock architecture

However, one major problem that has limited the applicability of mesh architectures is the difficulty in analyzing them with sufficient accuracy. The main reasons are the huge number of circuit nodes needed to accurately model a fine mesh in a large design and the large number of metal loops present in the mesh structure. As a result, circuit simulators such as SPICE require an inordinate amount of either memory or run-time. In fact, HSPICE and HSIM (Synopsys) failed to analyze even coarse meshes for an industrial design [4].

An added degree of complication is brought forth by variations in parameters that affect clock latency [18, 14, 3, 10]. Examples of such parameters are process (channel length, oxide thickness, interconnect width and thickness, etc.), supply voltage, temperature and crosstalk noise. Variations in these parameters cause variations or uncertainty in the delay from the clock root to the flip-flops, both die-to-die and clock cycle-to-clock cycle [17, 8]. With technology scaling, the magnitude of parameter variations and the sensitivity of clock latency to variations are increasing.

The focus of this paper is to analyze the timing uncertainty of mesh-based clock architectures in the presence of parameter variations. We believe this is the first work that addresses this problem. We propose solutions for both pure mesh and (mesh + global-tree) architectures. The solutions can handle large design and mesh instances. We show that uncertainty values reported by our solutions are within 1-3ps of those obtained from the golden Monte Carlo simulations (e.g., 35ps vs. 33ps), where the nominal clock latency is about 600ps. Another major benefit of our scheme is that it is easily amenable to distributed or grid computing.

The paper is organized as follows. Section 2 gives preliminaries. Section 3 describes previous work on clock mesh analysis. An overview of our methodology for uncertainty analysis of clock meshes under parameter variations is presented in Section 4. The details of our methodology are presented along with experimental results in Section 5. We conclude and give directions for future work in Section 6.

2 Preliminaries

2.1 Mesh-based Clock Architecture

Figure 1 shows a typical mesh architecture used for distributing the clock signal from the PLL or root buffer to sequential elements such as flip-flops (FFs) and latches on the chip. It has three main components: 1) a (uniform) mesh, 2) a global buffered tree that drives the mesh, and 3) local interconnect, which connects the clock inputs of FFs directly to the nearest point on the mesh. The mesh is a uniform rectangular grid of wires spanning the entire chip area, driven by the mesh buffers and propagating the clock to the FFs. An m×n mesh or grid has m rows (horizontal wires) and n columns (vertical wires). The size of the mesh is m×n. For a given chip size, the greater the mesh size, the finer-grained the mesh. A mesh node (or grid node) is a point where a row is connected to a column. As shown in Figure 1, the global (H-)tree delivers the clock signal to the mesh nodes via buffers called mesh buffers. We assume a uniform array of k×l mesh buffers. In Figure 1, k = m = 4 and l = n = 4. The mesh wire between two adjacent mesh nodes is called a mesh segment, and represents one grid unit.

Figure 2: Single-π model for interconnect

Figure 3: 3-π model for interconnect

Clock Network Model

Each buffer (mesh buffer and tree buffer) is modeled using the BSIM3 transistor models for NMOS and PMOS. Since the mesh is largely composed of wires, it is important to have an accurate wire model. To model wires shorter than 100μm, a single-π model, which has two capacitors, a resistor and an inductor, is used (Figure 2). For longer wires, a 3-π model is used, as shown in Figure 3. Our study on Fujitsu's 0.11μm technology showed that this scheme is delay-accurate within 0.5% of 4-π and 5-π models [16]. It helps reduce the number of nodes in the SPICE model. The same rule is used to model wires that connect FFs to the mesh and wires on the global tree. The clock pin of a FF is modeled as an equivalent capacitance.

2.2 Clock Timing and Uncertainty

In any clock distribution scheme, one of the most important concerns is to accurately compute the clock arrival time a (also called clock delay or latency) at the clock input pin of each FF. Assume we have a path P in a design whose start and end gates are the flip-flops FF_s and FF_e. Let the clock arrival times at these FFs be a_s and a_e respectively. The maximum delay d_max allowed on P is a function of (a_e − a_s), the difference in clock arrival times at the two FFs:

$d_{max} \le a_e - a_s + \tau - t_{setup}$    (1)

where τ is the clock cycle and t_setup is the set-up time for FF_e. The quantity a_e − a_s is known as the skew between FF_s and FF_e. By comparing the arrival times among all FFs, we can compute the worst relevant clock skew in the design. This is the maximum negative difference in arrival times at two FFs that are connected by a data path. For a fixed clock cycle, the worst skew limits the maximum delay in the data path. Thus, it has a direct impact on the design turnaround time. Alternatively, for a given design, the skew impacts the maximum clock frequency for which the design will function correctly.
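As an illustration of Equation (1), the following minimal Python sketch computes the delay budget of each data path and the worst relevant skew. The flip-flop names, arrival times, path delays, clock cycle and set-up time are hypothetical values chosen for the example, not data from this paper.

```python
from dataclasses import dataclass


@dataclass
class Path:
    start_ff: str
    end_ff: str
    delay: float  # data-path delay in ps (hypothetical)


def max_allowed_delay(a_s, a_e, tau, t_setup):
    """Right-hand side of Equation (1): a_e - a_s + tau - t_setup (ps)."""
    return a_e - a_s + tau - t_setup


def worst_relevant_skew(arrival, paths):
    """Most negative a_e - a_s over FF pairs connected by a data path."""
    return min(arrival[p.end_ff] - arrival[p.start_ff] for p in paths)


def setup_violations(arrival, paths, tau, t_setup):
    """Paths whose delay exceeds the budget given by Equation (1)."""
    return [
        p for p in paths
        if p.delay > max_allowed_delay(
            arrival[p.start_ff], arrival[p.end_ff], tau, t_setup)
    ]


# Hypothetical clock arrival times (ps) and data paths.
arrival = {"FF1": 600.0, "FF2": 590.0, "FF3": 605.0}
paths = [Path("FF1", "FF2", 950.0), Path("FF2", "FF3", 950.0)]
tau, t_setup = 1000.0, 50.0

print(worst_relevant_skew(arrival, paths))
print([(p.start_ff, p.end_ff) for p in setup_violations(arrival, paths, tau, t_setup)])
```

In this toy case the path from FF1 to FF2 sees a skew of −10ps, so its budget shrinks to 940ps and a 950ps data path violates set-up, while the same delay on the path from FF2 to FF3 (skew +15ps) is acceptable.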
In practice, at a given flip-flop on a chip, two consecutive clock rising (or falling) edges may not be τ time units apart. Moreover, for the same flip-flop on two different chips, the clock latencies from the clock source may be different. Clock timing uncertainty denotes the deviation of the timing of the clock edge from its expected value. Uncertainty affects a_s and a_e in (1) and hence d_max or τ, as discussed above. Uncertainty in clock timing can be due to several factors.

1. Supply (V) noise: This is caused by different sets of gates switching in different clock cycles. Since gate delay depends on the value of the supply voltage, any change in the supply voltage of a clock buffer changes the clock arrival time at the FF.

2. Temperature (T) variation: This variation arises due to different switching activities on the chip (both spatial and temporal) and because power and temperature are strongly coupled to each other, especially for leakage-dominant technologies. A block with higher switching activity dissipates higher dynamic power, leading to higher local temperatures. That, in turn, increases the leakage power dissipation, further increasing the total power. A gate operating at a higher temperature exhibits higher delay due to reduced carrier mobility.

3. Process variations (within die and die-to-die) P: Examples of process variations include intrinsic variations such as random dopant fluctuations in a MOSFET channel and extrinsic variations such as channel length and oxide thickness variations. In a chemical mechanical planarization (CMP) process, interconnect width, thickness, spacing and height may vary significantly from the intended values. These variations cause gate and wire delays to deviate from their desired values. It is difficult to predict the precise magnitude of variations and hence the exact values of wire and gate delays after manufacturing.

4. Crosstalk noise X: The delay of a clock wire v can change if there is an aggressor a that is physically close to v and is switching. Since the aggressor's switching behavior can change from one cycle to the next, it can lead to timing variation on the victim. Clock is one of the most important signals in the design. Vdd/Vss shielding is typically done on both sides of the clock to eliminate such crosstalk impact. Shielding, however, does not prevent crosstalk from the top and bottom layers, when a wide bus is going over the clock line.

5. PLL jitter: The clock generated from the PLL has an inherent jitter.

Some of these parameters (such as process) have random unknown variation components, but once the chip is manufactured, the values are fixed. Other parameter variations are deterministic: they depend on the state of the design and the last and current signal values, and have to be computed for each cycle. Examples are supply and crosstalk noise. Their exact computation typically requires prohibitive CPU and memory resources and may be infeasible in practice. Nevertheless, both kinds of parameter variations cause uncertainty in the timing of the clock edge at a flip-flop from its expected value.

Let D denote the latency (path delay) from the clock root to a flip-flop. In general, D is a function of the supply voltage V_i at each clock buffer B_i on the path, the temperature T_i at each clock buffer and wire, the set of process parameters P, and crosstalk noise X. In short, we write D(V, T, P, X), where V denotes the vector of all buffer voltages {V_i}, and similarly for T, P and X. In the presence of parameter variations, D is a distribution with mean μ and standard deviation σ. We define the uncertainty in D, denoted U(D), as kσ. In this paper, we use k = 3.
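The definition U(D) = kσ can be made concrete with a short Python sketch that estimates the uncertainty from a set of latency samples. The Gaussian latency distribution (mean 600ps, σ = 10ps), the sample count and the use of the sample (n − 1) standard deviation are assumptions made for illustration only.

```python
import random
import statistics


def uncertainty(latencies, k=3.0):
    """U(D) = k * sigma, with sigma estimated as the sample standard deviation."""
    return k * statistics.stdev(latencies)


# Hypothetical latency samples (ps) for one flip-flop.
rng = random.Random(0)
samples = [rng.gauss(600.0, 10.0) for _ in range(400)]

u = uncertainty(samples)
print(f"mean = {statistics.fmean(samples):.1f} ps, U(D) = {u:.1f} ps")
```

With a true σ of 10ps the estimate lands near 30ps, and the spread of the estimate shrinks as more samples are drawn.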
Problem Statement: Given a mesh-based clock network and VTPX parameter variations for each component of the clock network (i.e., clock buffers and wires), determine the timing uncertainty U(D_i) in the clock latency D_i from the clock root to each flip-flop FF_i.

3 Previous Work

If the clock network is a tree, uncertainty analysis can be carried out using gate-level statistical static timing analysis [9, 15, 1]. However, such an approach is not directly applicable to a mesh-based clock network due to the metal loops (cycles) present in the mesh. We are not aware of any work on clock mesh uncertainty analysis. The only known solution is that if the mesh model fits in memory, we can run Monte Carlo simulations (MCSs) [7] assuming some distribution for parameter variations and obtain a delay distribution at each FF, from which timing uncertainties at FFs can be derived. However, this is possible only for small design and mesh instances.

Not much has been published on the problem of clock mesh latency analysis. [12, 5] present a scheme to break the clock mesh into a tree and apply a smoothing algorithm to redistribute the mesh loads. The tree is analyzed for latency. However, no accuracy results are shown. In [2], the clock mesh is verified in two steps. First, an AWE-based reduction [11] is performed on the mesh to simplify the mesh elements. Then, the simplified circuits are simulated using SPICE. The accuracy and efficiency of this method depend on the accuracy and stability of the moment matching technique. Recently, a sliding window scheme (SWS) was proposed for latency analysis of clock meshes [4]. Since uncertainty analysis derives its basic idea from SWS, we describe it next.

3.1 Sliding Window Scheme for Mesh Latency

In SWS, the mesh is modeled with two different resolutions: a detailed circuit model is used for the mesh elements geometrically close to the nodes whose latency we are measuring, and a simplified model is used for the mesh elements far from the nodes being measured. The simplification is with respect to the local FF connections.

Given a mesh of size m×n, define a rectangular window W of size r×s, where r < m and s < n. Expand W by some border to obtain W′ (Figure 4, in which the border is 1 grid unit). If the lower left corner of W′ is fixed to a point on the mesh, W′ covers some fixed region of the mesh (Figure 4).

Figure 4: The sliding window scheme. Annotations: border around W; complete mesh; preserve circuit detail inside W; lump capacitance and ignore resistance outside W (except on mesh segments); slide W.

The connection of a FF within W′ to the nearest mesh segment is modeled accurately by an appropriate π model, as described in the clock network model of Section 2 (single-π or 3-π, depending on the length of the connection). The clock input pin of the FF is modeled as a capacitance. FFs that lie outside W′ and their connections to the mesh are modeled approximately. The wire connecting such a FF to the mesh is replaced by an equivalent single capacitance; the wire resistance is ignored. Given a mesh node a outside W′, the region covered by a is the unit rectangle shown in Figure 4. Let C_a be the sum of the clock input pin capacitances of all the FFs in this region along with the capacitances of the wires connecting them to the mesh. Then, C_a is lumped as a single capacitance at a. The mesh segments outside W′ are still modeled with appropriate π models. The SPICE file corresponding to this model for the window location is generated and simulated. The clock latencies at all FFs in the inner window W are measured. (Latencies of the FFs in the border around W are ignored. These are measured when the FFs fall in the non-border area of other windows.)

Next, the window W is slid horizontally or vertically so as not to overlap with the previous locations. Once again, a SPICE model is created and run. The entire mesh simulation is broken down into multiple independent window-based simulations. In fact, $\lceil (m-1)/(r-1) \rceil \times \lceil (n-1)/(s-1) \rceil$ SPICE simulations are needed to cover the entire mesh and all the FFs in the design.

SWS is a divide-and-conquer partitioning technique. Approximating the region outside the window reduces the number of nodes in the circuit model. Approximating each FF saves either 7 nodes (if the wire is longer than 100μm) or 3 nodes (otherwise). In a typical design, where there are hundreds of thousands of FFs, the reduction in the SPICE model size can be huge. It was shown in [4] that HSPICE could not finish on a 65x65 mesh with 100K FFs. It needed more than 2GB of memory, whereas SWS could complete in less than 1.5 hours within 1GB of memory using four machines. The latencies computed by SWS, using a border of 1 grid unit, are almost always within 1% of the latencies computed from SPICE simulation of the complete mesh. It was also shown that using no border (i.e., a border of 0 grid units) does not yield accurate results; errors of up to 30% were seen. Increasing the border beyond 1 grid unit does not improve the accuracy much, but the runtime increases significantly. In short, a border of 1 grid unit was empirically found to be optimum. Also, window size was shown to have very little impact on accuracy. However, a smaller window size means a smaller model and hence better chances for large designs to fit in memory. But a smaller window also implies more simulations.

4 Clock Mesh Uncertainty Analysis

4.1 Modeling Sources of Uncertainty

Figure 5: Statistical simulation model for a buffer driving a wire

We model various sources of uncertainty as follows. Refer to Figure 5, where inverting buffer1 drives inverting buffer2 through a wire.

1. Supply Noise V: Supply noise is modeled by supplying independent power supplies to each clock buffer, and allowing them to vary randomly according to a noise model.
The amount of variation is controlled by a user input parameter, supply tolerance.

2. Temperature Variation T: Rising temperature causes CMOS circuits to operate more slowly, and wiring resistances to increase. Temperature variation of transistors is modeled by specifying an underlying temperature for the entire chip and then applying random local temperature variations on each clock buffer and interconnect. The variation to apply is given by a user input parameter, max deltemp.

3. Process Variation P: As shown in Figure 5, process variation of transistors is modeled using only channel length (l_p and l_n for PMOS and NMOS transistors respectively) and threshold voltage (delvt_n and delvt_p). Other variations, such as oxide thickness and dopant concentration, have the overall effect of varying the threshold voltage and hence are indirectly included in our model. The variations of threshold voltage and channel length are passed into each instance of the buffer sub-circuit models. Process variation of wiring is modeled by applying random process factors pf_c and pf_r to the wiring capacitance and resistance respectively in the wire models.

4. Crosstalk Noise X: Crosstalk noise is modeled by attaching external noise sources to the wire model (Figure 5) and by applying random inputs at these sources based on some probability distribution. A crosstalk factor must also be defined whenever a wire is instantiated. The crosstalk factor is a property of each design, and is supplied by the user through the parameter xtfactor.

5. PLL jitter: We assume a maximum PLL jitter of 3σ_PLL.

4.2 Computing Uncertainty: Basic Idea

The basic idea is simple: we use SWS for analyzing the timing uncertainty of a mesh. We attach variation parameters to each buffer and wire on the clock network, as illustrated in Figure 5.
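To illustrate how per-component variation parameters might be drawn in one Monte Carlo run, here is a minimal Python sketch. Everything numeric in it is an assumption: the 3σ tolerances, the nominal supply and temperature, the Gaussian shape of each distribution, and the modeling of crosstalk as a Bernoulli "aggressor switches" event are placeholders, and the function names are hypothetical rather than part of the authors' flow.

```python
import random
import statistics

# Hypothetical 3-sigma tolerances (not the paper's values).
THREE_SIGMA = {
    "vdd_frac": 0.10,   # supply, as a fraction of nominal Vdd
    "temp_c": 20.0,     # local temperature, degrees C
    "delvt_v": 0.020,   # threshold-voltage shift, volts
    "pf_r": 0.20,       # wire resistance process factor
    "pf_c": 0.10,       # wire capacitance process factor
}
XT_SWITCH_PROB = 0.5    # assumed probability that an aggressor switches


def draw(rng, nominal, three_sigma):
    """Gaussian sample whose 3-sigma spread equals the given tolerance."""
    return rng.gauss(nominal, three_sigma / 3.0)


def sample_buffer(rng, vdd_nominal=1.2, temp_nominal=85.0):
    """Independent V, T and P parameters for one clock buffer instance."""
    return {
        "vdd": draw(rng, vdd_nominal, THREE_SIGMA["vdd_frac"] * vdd_nominal),
        "temp": draw(rng, temp_nominal, THREE_SIGMA["temp_c"]),
        "delvt_n": draw(rng, 0.0, THREE_SIGMA["delvt_v"]),
        "delvt_p": draw(rng, 0.0, THREE_SIGMA["delvt_v"]),
    }


def sample_wire(rng, temp_nominal=85.0):
    """Independent T, P and X parameters for one wire instance."""
    return {
        "temp": draw(rng, temp_nominal, THREE_SIGMA["temp_c"]),
        "pf_r": draw(rng, 1.0, THREE_SIGMA["pf_r"]),
        "pf_c": draw(rng, 1.0, THREE_SIGMA["pf_c"]),
        "aggressor_switches": rng.random() < XT_SWITCH_PROB,
    }


rng = random.Random(1)
vdds = [sample_buffer(rng)["vdd"] for _ in range(5000)]
print(round(statistics.fmean(vdds), 3), round(3 * statistics.stdev(vdds), 3))
print(sample_wire(rng))
```

Each buffer and wire gets its own draw, which matches the independence assumption stated for the variation sources; spatially correlated supply or temperature would need a different sampler.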
For each window W′ of SWS, a SPICE model of the mesh is created (just as in [4]) and Monte Carlo simulations (MCSs) are carried out. In each run of the MCSs, the values of the VTPX parameters for each component of the clock network are determined from their respective distributions, and the latency D_i of each flip-flop FF_i that lies in the core of W′ (i.e., in the inner window W) is computed. After all runs are completed, a distribution of the delay D_i is available for each such FF_i. The uncertainty U(D_i) = 3σ(D_i) is then computed from this distribution. Finally, the U(D_i) values are collected from all windows W′ to yield uncertainties at all the FFs in the design.

In this paper, we do not use large design and mesh instances. The feasibility of SWS for those has already been shown [4]. Our focus is to determine if SWS can be used for accurate uncertainty analysis of both pure mesh and (mesh + global tree) architectures, and if so, to derive a practical and usable methodology. The next section presents detailed results of our study.

5 Results

5.1 Experimental Set-up & Definitions

All our experiments were conducted in Fujitsu's 0.11μm technology. The 3σ variations for various parameters are shown in Table 1.

Table 1: 3σ variations for different parameters

Parameter                          3σ variation
NMOS/PMOS channel length           08μ
NMOS/PMOS threshold voltage        20mV
interconnect resistance            20%
interconnect capacitance           0%
temperature                        20C
Vdd                                10%
crosstalk switching probability    0.5

Table 2: Reduction of uncertainty by mesh (columns: input uncertainty in ps; maximum output uncertainty in ps for the 8x8 mesh and for the 16x16 mesh)

In the following, we compare FF uncertainties obtained from a methodology M against those from a golden reference methodology G. For instance, M may correspond to the SWS-based uncertainty analysis, and G to running MCSs on the flat single model of the mesh-based clock network. To evaluate the quality of uncertainty results, we use two metrics: 1) the error in the maximum uncertainty, E-UMAX, and 2) the maximum uncertainty error at a FF, MAXE-FF. E-UMAX is obtained by first computing UMAX, the maximum over uncertainties at all the flip-flop clock pins (i.e., UMAX = max over FF_i of U(D_i)), using M and then comparing it with the UMAX computed by G. MAXE-FF is calculated by first computing the percentage error in uncertainty at each FF under M with respect to the golden uncertainty value at that FF and then picking the maximum percentage error over all the FFs. Note that MAXE-FF ≥ E-UMAX.

Since we use Monte Carlo simulations to compute timing uncertainties, the accuracy of the results depends on the number of simulations. More simulations usually means higher accuracy.
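The two metrics defined above can be sketched in a few lines of Python. The per-flip-flop uncertainty values below are hypothetical, and taking the absolute value of each difference is an assumption, since the text only says the values are compared.

```python
def umax(u):
    """UMAX: maximum uncertainty (ps) over all flip-flop clock pins."""
    return max(u.values())


def e_umax(u_m, u_g):
    """Error in UMAX of methodology M against golden G: (ps, percent)."""
    diff = abs(umax(u_m) - umax(u_g))
    return diff, 100.0 * diff / umax(u_g)


def maxe_ff(u_m, u_g):
    """Largest per-flip-flop percentage error of M with respect to G."""
    return max(100.0 * abs(u_m[ff] - u_g[ff]) / u_g[ff] for ff in u_g)


# Hypothetical 3-sigma uncertainties in ps, keyed by flip-flop name.
golden = {"FF1": 30.0, "FF2": 20.0}
method = {"FF1": 33.0, "FF2": 25.0}

print(e_umax(method, golden))
print(maxe_ff(method, golden))
```

Here UMAX is off by 3ps (10%), while the worst single flip-flop, FF2, is off by 25%, which shows why the two metrics are reported separately.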
Since it was not feasible for us to run a large number of simulations due to limited CPU resources, we did an experiment to determine the number of simulations that yields uncertainty values within 10% accuracy, where the golden result used 800 simulations. Running 400 simulations resulted in a MAXE-FF of about 5.5% with respect to the golden result, whereas with 100 simulations we obtained a MAXE-FF of about 16%. So we use 400 simulations in all our MCS runs (unless stated otherwise). We present results for two architectures: pure mesh with no global tree, and complete clock network with mesh and global tree.

5.2 Pure Mesh

First, we study the effectiveness of a clock mesh in mitigating uncertainty. Then, we investigate the accuracy of the SWS-based uncertainty analysis methodology. In both experiments, only the mesh along with the mesh buffers was modeled and simulated. The global tree was not explicitly included in the model.

Effectiveness in Uncertainty Mitigation

Although the global tree was not explicitly included in the model, different values for maximum skew and uncertainty were used on the inputs of the mesh buffers. These model the skew and uncertainty due to the global tree driving the mesh. The mesh buffer inputs were assumed to be independent Gaussian distributions with mean clock arrival times satisfying the maximum skew and standard deviation σ related to the uncertainty. The interconnect resistance and capacitance variations shown in Table 1 were applied to each wire in the mesh. A chip of size 500μm x 500μm was used, with 1000 flip-flops placed randomly with a uniform distribution. Two different mesh sizes were tried: 8x8 and 16x16. The flip-flops are connected to the closest mesh node. Different experiments were run for maximum input skews of 0ps and 5ps, and for 3σ uncertainties of 3ps, 15ps, 30ps and 150ps (with Gaussian distributions) at the mesh buffer inputs. Monte Carlo simulations were performed in each case.
The values of the maximum 3σ output uncertainty over all mesh nodes for an input skew of 0ps and different input uncertainties are presented in Table 2 for both 8x8 and 16x16 meshes. From the column 8x8 mesh, we see that the 8x8 mesh is able to reduce the uncertainty at the mesh nodes by a factor of 7 to 8 when compared to the uncertainty at the mesh buffer inputs. From the column 16x16 mesh, it is clear that increasing the mesh size from 8x8 to 16x16 reduces the uncertainty by a further factor of 2. Thus, we can draw the following two conclusions.

1. A mesh is very effective in reducing timing uncertainty.

2. For the same chip size, a finer-grain mesh is more effective in reducing uncertainty than a coarse-grain mesh.

The results for a maximum input skew of 5ps are similar to those presented above and are omitted.

SWS

In this section, we investigate if SWS-based MCSs can be used for mesh uncertainty analysis. We used a chip size of 5mm x 5mm, three different mesh sizes of 10x10, 18x18 and 26x26, and 1000 FFs distributed randomly over the chip. The clock root is directly connected to all the mesh buffer inputs. We compare the SWS-based methodology against a golden reference methodology, in which Monte Carlo simulations are run on the entire mesh model. We intentionally chose a small problem size so that the golden model could fit in memory and run in reasonable CPU time. The window size in SWS was fixed at one-fourth of the mesh size. Figure 6 shows E-UMAX (both in ps and percentage) as a function of the border length for different mesh sizes. The golden UMAX is the lowest horizontal line in all the graphs. It can be seen that for all mesh sizes, E-UMAX is very high (more than 50%) for a border of 0 (i.e., no border), but decreases rapidly as the border is expanded. For the 10x10 mesh and a border of 1 grid unit, the error is almost 0%; for the 18x18 mesh and a border of 2 grid units, the error is around 7%; and for the 26x26 mesh and a border of 3, the error is around 15%.
In all cases, SWS was able to achieve E-UMAX of less than 0.1ps. Interestingly, this behavior is markedly different from that of SWS latency, where increasing the border from 0 to 1 grid unit reduced the latency error significantly, but no further improvement was obtained by increasing the border beyond 1 unit. We conclude that the SWS-based methodology is effective for accurately analyzing the timing uncertainty of clock meshes. The error in uncertainty goes down rapidly as the window border is increased. The border required to achieve a given accuracy in uncertainty vis-a-vis the golden reference is a monotonic function of the mesh size.

5.3 Complete Clock Network

Having established that SWS is accurate for analyzing the timing uncertainty of a pure clock mesh, we now investigate if the SWS-based uncertainty analysis can handle the clock network of Figure 1, which includes, in addition to the mesh, a global tree that drives the mesh through mesh buffers.

Figure 6: Impact of border on SWS accuracy for mesh uncertainty analysis

Table 3: UMAX & E-UMAX for tree-mesh decoupling (columns: mesh size; Golden UMAX in ps; M-Uncorrelated UMAX in ps and E-UMAX in %; M-Correlated UMAX in ps and E-UMAX in %; rows: 8x8 and 16x16)

Table 4: MAXE-FF for tree-mesh decoupling experiment (columns: mesh size; for each of M-Uncorrelated and M-Correlated, U: M (G) in ps and MAXE-FF in %). Golden values G: for 8x8, 30.6 (M-Uncorrelated) and 29.53 (M-Correlated); for 16x16, 28.72 (M-Uncorrelated) and 19.88 (M-Correlated). MAXE-FF for 16x16 under M-Correlated: 5.46%.

Figure 7: Global tree uncertainty analysis

One straightforward way of analyzing the uncertainty of the complete clock network with SWS is to include the entire tree for each location of the window in the SWS-based Monte Carlo simulations. Though accurate, this scheme is time consuming, memory intensive and wasteful, since it re-analyzes the same tree for each window location. If we can decouple the tree uncertainty analysis from the mesh analysis and carry out the two separately, the complete clock network uncertainty analysis can be sped up, using less memory as well.

Decoupling Tree Analysis and Mesh Analysis

To ascertain the validity of decoupling for analyzing timing uncertainty, we carried out the following comparison. The golden methodology G consisted of running MCSs on the entire monolithic clock network model (with global tree and mesh together), and measuring the uncertainty at each FF. The methodology M corresponded to decoupling the tree and mesh analyses. It consisted of running MCSs on the global tree, deriving the mean and standard deviation of the clock arrival time at the input of each mesh buffer, and using them as inputs to the mesh uncertainty analysis. One single simulation model was created for the mesh. The mean and standard deviation of the latency at the input of a mesh buffer are the same as those derived from the global tree analysis. Moreover, the latency variables for the mesh buffer inputs are assumed to be independent Gaussian variables. The mesh uncertainty analysis computes the uncertainty at every FF.
The comparison of M and G results is shown in Tables 3 and 4 for two mesh sizes (8x8 and 16x16) for a 5mm x 5mm chip having 1000 FFs placed with a uniform random distribution. Table 3 shows UMAX, the maximum uncertainty over all FFs, for the golden methodology G (column Golden) and the methodology M (column M-Uncorrelated). It can be seen that E-UMAX is huge, both in ps (20ps and 25ps) and in percentage (57% and 77%), for the two mesh sizes respectively. Table 4, column M-Uncorrelated, shows results for the flip-flop with the maximum error in uncertainty (i.e., MAXE-FF). For the 8x8 mesh case, 11.48ps is the uncertainty U (with the decoupled methodology) of the FF with maximum error, whereas its golden uncertainty is 30.6ps, resulting in a percentage error of 62.5%. MAXE-FF for the 16x16 mesh is even larger: 82.5%.

Figure 8: E-UMAX for complete clock network using SWS (panels: 8x8 mesh with window = 2 and 5; 16x16 mesh with window = 4, 8 and 12; y-axis: % error)

The reason for such huge errors is that the latency variables at the mesh buffer inputs are not all independent: they are correlated with each other. The correlation between the latency variables at two mesh buffers depends on the tree edges shared between the paths from the clock tree root to the two buffers. This is shown in Figure 7, where the paths from the root A to mesh buffers X and Y share edges AB and BC. Each of these edges contributes the same delay to the two paths. This is not considered in the independent variable assumption.

One way to incorporate common path correlations at the mesh buffer inputs is as follows. The tree uncertainty analysis generates a delay distribution for each stage of the global clock tree (e.g., mean μ_AB and standard deviation σ_AB of the delay of edge AB in Figure 7).

Figure 9: MAXE-FF for complete clock network using SWS (panels: 8x8 mesh with window = 2 and 5; 16x16 mesh with window = 4, 8 and 12; y-axis: % error)

For each run of the mesh MCS, generate a delay sample for each tree stage from its delay distribution. Generate the latency of each path in the clock tree by adding the delays of the stages on the path. Use these path latency values as inputs in a particular MC run of the mesh analysis. Thus, each edge in the tree contributes the same delay to all the paths it belongs to. The results using this approach are shown in the column M-Correlated in Tables 3 and 4. It can be seen that the E-UMAX values for the two mesh sizes are 6% and 2.5%, whereas the MAXE-FF values are 10% and 5.5%. The absolute difference in uncertainties is at most 3ps. This implies that when decoupling the tree and mesh analyses, common path correlations in the tree must be taken into account. The decoupling methodology then yields accurate results vis-a-vis the golden monolithic methodology. One problem with our approach is that it ignores delay correlations between two successive stages on a single path. The accuracy of the decoupling approach can be further improved by incorporating the stage delay correlations.

Decoupling with SWS

In this experiment, we used the same set-up and the same golden methodology G as in the preceding decoupling experiment. However, the methodology M used the decoupled tree and mesh analyses with correlations, using SWS for the mesh with different window sizes. Figures 8 and 9 show the E-UMAX and MAXE-FF values respectively. With a window dimension of about half the mesh dimension (e.g., an 8x8 window for a 16x16 mesh), E-UMAX is <7% and MAXE-FF is <12%. From Table 3, column Golden UMAX, the maximum FF uncertainty values were in the 30-35ps range. A 12% error in uncertainty translates to about 4ps, which is really small, given nominal clock latencies of 570ps for the 8x8 mesh and 647ps for the 16x16 mesh.
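The common-path sampling described above (one delay sample per tree stage per Monte Carlo run, summed along each root-to-buffer path) can be contrasted with the independent-Gaussian assumption in a small Python sketch. The tree follows the shape of Figure 7 only loosely: the shared edges AB and BC come from the text, while the leaf edges CX and CY, all stage means and standard deviations, and the helper names are hypothetical.

```python
import math
import random
import statistics

# Hypothetical stage delay distributions: name -> (mean ps, sigma ps).
STAGES = {"AB": (100.0, 5.0), "BC": (100.0, 5.0),
          "CX": (100.0, 5.0), "CY": (100.0, 5.0)}
# Paths from root A to mesh buffers X and Y share stages AB and BC.
PATHS = {"X": ["AB", "BC", "CX"], "Y": ["AB", "BC", "CY"]}


def sample_correlated(rng):
    """One MC run: draw each stage once, then sum stage delays per path."""
    d = {s: rng.gauss(mu, sd) for s, (mu, sd) in STAGES.items()}
    return {buf: sum(d[s] for s in stages) for buf, stages in PATHS.items()}


def sample_independent(rng):
    """One MC run treating each buffer latency as an independent Gaussian."""
    out = {}
    for buf, stages in PATHS.items():
        mu = sum(STAGES[s][0] for s in stages)
        sd = math.sqrt(sum(STAGES[s][1] ** 2 for s in stages))
        out[buf] = rng.gauss(mu, sd)
    return out


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


rng = random.Random(0)
corr_runs = [sample_correlated(rng) for _ in range(4000)]
ind_runs = [sample_independent(rng) for _ in range(4000)]

for name, runs in (("correlated", corr_runs), ("independent", ind_runs)):
    xs = [r["X"] for r in runs]
    ys = [r["Y"] for r in runs]
    print(name, round(statistics.stdev(xs), 2), round(correlation(xs, ys), 2))
```

Both samplers give each buffer the same marginal mean and standard deviation, but only the stage-based sampler reproduces the correlation between X and Y (close to 2/3 in this toy tree, since two of three equal-variance stages are shared). The independent sampler yields a correlation near zero, which is the effect the M-Uncorrelated results expose.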
As for the impact of the border, the percentage errors generally go down with increasing border. However, in some cases, the error goes up. One possible explanation is that a 1-2% change in the percentage error with border is only ps, which falls within the accuracy limit of SPICE.

6 Conclusions

We addressed the problem of computing timing uncertainty of mesh-based clock architectures in the presence of parameter variations. We believe ours is the first work to address this problem. First, we showed that clock meshes are effective in reducing timing uncertainty, with finer meshes being more effective than coarser meshes. We then developed an efficient and accurate solution based on the sliding window scheme (SWS), which was proposed recently for computing clock latency in mesh-based architectures. However, there are significant differences in the behavior of the SWS-based latency and uncertainty schemes, e.g., the optimum border length. We applied our solution to pure mesh and (mesh + global-tree) architectures. For (mesh + global-tree), we showed that the decoupled methodology must take into account common path correlations in the tree. By doing so, this methodology yielded a maximum error of 1-3ps in uncertainty values with respect to the golden MCS-based values on the monolithic complete clock network model, which is at most 0.5% of the nominal clock latency (around 600ps). Since our methodology is based on SWS, it is capable of analyzing the uncertainty of large meshes and design instances, and is easily amenable to distributed or grid computing. Future work is in the following directions. 1) Running several hundred MCSs on a large design with a fine mesh can be time consuming if hundreds of compute servers are not available. We plan to work on making our methodology faster. 2) For the complete clock network, the decoupling method should handle correlations between consecutive stages on a path. 3) In this work, we modeled variation sources for each wire and buffer independently.
However, supply voltage and temperature of components located close to each other are usually correlated. We will extend our model to handle these correlations.

References

[1] A. Agarwal, V. Zolotov, and D. T. Blaauw. Statistical Clock Skew Analysis Considering Intra-die Process Variations. In IEEE Trans. on CAD, August.
[2] D. W. Bailey and B. J. Benschneider. Clocking Design and Analysis for a 600-MHz Alpha Microprocessor. In IEEE JSSC, Vol. 33, No. 11, November.
[3] K. A. Bowman, S. G. Duvall, and J. D. Meindl. Impact of Die-to-Die and Within-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration. In IEEE JSSC, February.
[4] H. Chen, C. Yeh, G. Wilke, S. Reddy, H. Nguyen, W. Walker, and R. Murgai. A Sliding Window Scheme for Accurate Clock Mesh Analysis. In ICCAD, November.
[5] P. J. Camporese et al. X-Y Grid Tree Tuning Method. U.S. Patent No. 6,205,571 B1, March.
[6] G. Northrop et al. A 600-MHz G5 S/390 Microprocessor. In ISSCC Tech. Dig., pages 88-89, February.
[7] R. Hitchcock. Timing Verification and the Timing Analysis Program. In DAC, June.
[8] Y. Liu, S. R. Nassif, L. T. Pileggi, and A. J. Strojwas. Impact of Interconnect Variations on the Clock Skew of a Gigahertz Microprocessor. In DAC, June.
[9] M. Berkelaar. Statistical Delay Calculation, A Linear Time Method. In TAU, pages 15-24, December.
[10] M. Orshansky, L. Milor, P. Chen, K. Keutzer, and C. Hu. Impact of Systematic Spatial Intra-chip Gate Length Variability on Performance of High-speed Digital Circuits. In ICCAD, pages 62-67, November.
[11] L. T. Pillage and R. A. Rohrer. Asymptotic Waveform Evaluation for Timing Analysis. In IEEE Transactions on Computer-Aided Design, April.
[12] P. J. Restle et al. A Clock Distribution Network for Microprocessor. In IEEE JSSC, Vol. 36, No. 5, May.
[13] R. Heald et al. Implementation of a 3rd-Generation SPARC V9 64b Microprocessor. In ISSCC Dig. Tech. Papers, February.
[14] S. B. Samaan. The Impact of Device Parameter Variations on the Frequency and Performance of VLSI Chips. In ICCAD, November.
[15] C. Visweswariah, K. Ravindran, K. Kalafala, S. G. Walker, and S. Narayan. First-Order Incremental Block-Based Statistical Timing Analysis. In DAC, June.
[16] G. Wilke and R. Murgai. Accuracy of Interconnect Pi Models. Fujitsu Laboratories of America Internal Document, August.
[17] S. Zanella, A. Nardi, A. Neviani, M. Quarantelli, S. Saxena, and C. Guardiani. Analysis of the Impact of Process Variations on Clock Skew. In IEEE Trans. on Semiconductor Manufacturing, November.
[18] P. S. Zuchowski, P. A. Habitz, J. D. Hayes, and J. H. Oppold. Process and Environmental Variation Impacts on ASIC Timing. In ICCAD, November 2004.
