Throughput and Power Tradeoffs Associated with Synchronous and Asynchronous Communication Channel Design. Edward T. Malley 2002


Advisor: Prof. Pileggi, Electrical & Computer Engineering

Master's Thesis - Edward T. Malley

1. Introduction

With transistor switching speeds becoming faster while propagation delays along interconnect become slower, the performance bottleneck of integrated circuits and systems is no longer the delay within the logic blocks. Instead, communication among blocks has become a primary design challenge. It is no longer sufficient to simply wire two blocks together and expect them to communicate properly and efficiently. At the very least, repeater insertion is necessary, and most recently it has even become necessary to allocate multiple cycles to a signal's journey across the chip. This added importance of global communication makes it necessary to consider its impact in the earliest stages of the design process, including architectural planning.

In the past, during the architectural planning of a microprocessor, the circuit designer's primary goal was to design the fastest circuit possible. Now, however, new design obstacles are beginning to emerge. The first of these obstacles is power consumption [12]. While low-power design allowed lower packaging costs, it was previously less critical, since the system was simply plugged into an outlet and consumed as much power as it required. Today, with wireless and handheld devices becoming more popular, chips must be designed with power dissipation requirements in mind. Eventually, battery life may become one of the consumer's primary concerns. This raises the question, "How and where should current designs be improved in order to mitigate this power consumption problem?"

Figure 1 shows the results of a power distribution study performed at IBM [1], based on some of the recent p and z series microprocessors. Clearly, the primary problem with power distribution lies in the clock, and with feature sizes decreasing and scale of integration increasing, this problem will continue to worsen. Thus, it can be concluded that improvements in clock distribution techniques, especially local clock distribution, have the potential to lead to major power savings overall.

Figure 1: Microprocessor power distribution study [chart categories: clk (LCB/latch), arrays, logic, clk (global dist); the clock is marked as the dominant consumer of power]

The other emerging obstacle is the effect of process variations on chip performance. In previous technology generations, when transistors were larger and clock periods longer, a few picoseconds in either direction had little effect on a processor's functionality. However, process variations are not scaling with feature size or clock speed [3], and with clock periods entering the sub-nanosecond range, these few picoseconds can be the difference between a very good fabrication yield and a very poor one.

Figure 2 illustrates this point. The histogram, obtained using the circuit simulator discussed in [2], shows the delay spread for a 2cm wire stretched from one end of a large chip to the other (this design example is discussed in more detail later in this thesis). Using 180nm technology, fifteen buffers were required to send a signal across this channel. While a few corner cases fall drastically outside the range of the primary spread, one can see that process variations can cause up to an 80ps difference in performance. If this logic is required to operate in the gigahertz range, 80ps makes up a significant portion of the clock period (8% of the period at 1GHz).

Figure 2: Effects of manufacturing variations on system [histogram of rise delays; horizontal axis: Delay (ps)]

1.1 Is Asynchronous the Answer?

Some have suggested that a potential solution to both of these problems is to employ an asynchronous design methodology. Since the power consumed by the clock, especially the local clock, has been identified as the major contributor to overall chip power consumption, using an asynchronous design (i.e., a design with no clock) would seem to be a logical solution to this problem.

An asynchronous design can also mitigate the process variation problem. Typically, in a synchronous design, some sort of H-tree is used to distribute a clock globally throughout the chip, as shown in Fig. 3. In the case of a long communication channel, the problem is that the clock must travel a completely different path to the latch inputs than the data must travel. If the data is on a critical path, this difference could result in a timing failure. In an asynchronous design, however, all data and control bits travel a similar path. As a result, any process variations they experience should be more localized, and hence correlated, thereby reducing the potential timing variations and errors.

Figure 3: Clock-data skew problem [figure annotation: "How global are manufacturing variations?"]

1.2 If Asynchronous is the Answer, What is the Question?

Knowing this, one must ask, "Is asynchronous design really better?" Looking at this limited example from the architectural level, it would seem that it is. Simply switch to an asynchronous design methodology, allowing you to remove the clock, and this results in a 60% power savings. However, when the problem is examined from an implementation standpoint, a number of problems arise. The following sections describe some of them in detail.

It should be noted that switching from a completely synchronous to a completely asynchronous design methodology is impractical. There are many reasons for this, including the fact that most CAD tools are specialized for synchronous designs, and the simple fact that most logic blocks are designed to operate synchronously. So, instead of a complete switch, this thesis considers the case of a globally asynchronous locally synchronous (GALS) design [11]. In other words, all global clock distribution is removed from the chip. While all functional units remain synchronous, asynchronous communication channels exist between them. Specifically, we will examine the complete implementation of a 2cm long communication channel from both a synchronous and an asynchronous standpoint in 180nm, 130nm, and 100nm processes.

2. Background

To investigate the pros and cons of asynchronous vs. synchronous channel design, we will compare the results of optimized designs for both cases under several technology nodes. It is important that both designs be compared on as fair a basis as possible. To do so, we rely on two components of background work. First, a comprehensive optimization methodology for n-bit bus lines is used to construct the interconnect designs for maximum throughput in all cases. This methodology was developed for synchronous busses, but as we will demonstrate, it can be applied to asynchronous circuits with equal efficacy. Second, we employ the GasP architecture [6] to provide an optimal communication channel along a long signal bus path.

2.1 Bus Optimization Methodology

Designing a communication channel between two blocks should be nothing new to most designers; however, it is no longer a task that can be taken lightly. In order to yield the best possible performance, a number of bus characteristics need to be considered concurrently, making this a somewhat complex optimization problem. The bus optimizer described in [9] outlines a complete flow for the design of IC communication channels. It states that the interconnect structure for any bus can be uniquely described using six parameters. Referring to Figs. 4 and 5, they are:

1. Wsi - Width of the signal wires
2. Wsh - Width of the shielding wires
3. Sp - Wire spacing
4. Sgate - Gate (repeater) size
5. Lseg - Length of wire segment
6. N - Shielding period
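As an illustration only, the six parameters listed above could be held in a small record, with a helper that reports which of them remain free once some are held constant (for instance, N = 1 for a fully shielded bus). The class name, field names, and helper below are hypothetical and are not taken from the optimizer in [9].

```python
from dataclasses import dataclass, fields


@dataclass
class BusParameters:
    """The six bus parameters from the list above (field names are hypothetical)."""
    w_si: float    # Wsi: width of the signal wires
    w_sh: float    # Wsh: width of the shielding wires
    sp: float      # Sp: wire spacing
    s_gate: float  # Sgate: gate (repeater) size
    l_seg: float   # Lseg: length of a wire segment
    n: int         # N: shielding period


def free_parameters(fixed):
    """Return the parameter names still open to optimization.

    `fixed` maps parameter names to the constant values they are pinned to.
    """
    names = [f.name for f in fields(BusParameters)]
    unknown = set(fixed) - set(names)
    if unknown:
        raise ValueError(f"unknown bus parameter(s): {sorted(unknown)}")
    return [name for name in names if name not in fixed]


# Fully shielded bus: N is pinned to 1, leaving five parameters to search.
print(free_parameters({"n": 1}))
```

Pinning a parameter simply removes one dimension from the search space, which is why a fully shielded bus leaves a five-dimensional problem.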

Figure 4: Uniform bus

Figure 5: Bus cross-section

Starting with a random set of values and the interconnect data associated with the process, the optimization methodology begins with a detailed RLC extraction [7][8] using [10], thereby producing the interconnect models for use in SPICE. The user decides which performance measure (or measures) to optimize. Each performance measure is a function of the six bus variables mentioned above. Any of these six variables can be set to a constant or bounded if necessary. For example, if the user wants to shield all of the wires, N is set to 1 and not considered for optimization. The optimizer thus works on a surface of up to six dimensions.

Each iteration of the optimization run includes a SPICE simulation. With the results from each simulation, the optimizer attempts to find a better set of bus characteristics than the previous one. This is accomplished using a sequential quadratic programming algorithm. Once a new set of values is determined, it is used as the initial values for the next iteration. The loop continues until the bus ceases to improve. A block diagram of the optimization loop can be seen in Fig. 6.

It is possible to optimize the bus for a number of objectives, including area, throughput, delay, and power. An important aspect of this bus optimization methodology is that instead of optimizing for delay, the algorithm can optimize for throughput (defined as frequency multiplied by bit width), which is a quantity more directly related to performance. It would appear that throughput and delay are strongly related to each other. For example, if one would like to increase throughput, simply make the bus faster. It is possible, however, to look at throughput optimization in a different way.

Figure 6: Bus optimization loop [block labels: Interconnect Technology Data, CMOS Technology Data, Optimized Bus Fabrics, Floor Plan & System Level Simulation]

Optimization of delay will result in large repeaters and wide wires. A possible tradeoff is to design a bus with more bits and thinner wires, resulting in longer delay but a higher overall throughput within the same channel area. This additional delay, however, increases the overall latency of the bus. As a result, additional latch repeaters are needed: latches are placed at appropriate points throughout the channel in order to meet clocking requirements. While this does add area and latency to the bus, the amount is typically fairly negligible.

2.2 Asynchronous Communication Channel

Combining the above bus optimizer with a simple clock driver made it possible to build the synchronous examples. In order to build the asynchronous examples, in addition to employing the above techniques, it was also necessary to choose an appropriate asynchronous communication strategy. Many methods have been proposed, and the method used in [6] was chosen for our design comparison.
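The throughput definition from Section 2.1 (frequency multiplied by bit width) can be illustrated with a small calculation. The sketch below compares a few wide, fast wires against more, thinner, slower wires in the same channel width. All dimensions and frequencies are made-up values chosen for illustration, shielding wires are ignored, and the function names are hypothetical.

```python
def wires_in_channel(channel_width_nm, wire_width_nm, spacing_nm):
    """Number of signal wires that fit in the channel (shields ignored)."""
    pitch_nm = wire_width_nm + spacing_nm
    return channel_width_nm // pitch_nm


def throughput_mbps(frequency_mhz, bit_width):
    """Throughput as defined in the text: frequency multiplied by bit width."""
    return frequency_mhz * bit_width


CHANNEL_NM = 12000  # hypothetical channel width

# Delay-optimized style: wide wires, high frequency (hypothetical numbers).
wide_bits = wires_in_channel(CHANNEL_NM, wire_width_nm=600, spacing_nm=600)
wide = throughput_mbps(frequency_mhz=1000, bit_width=wide_bits)

# Throughput-optimized style: thin wires, lower frequency (hypothetical numbers).
thin_bits = wires_in_channel(CHANNEL_NM, wire_width_nm=300, spacing_nm=300)
thin = throughput_mbps(frequency_mhz=700, bit_width=thin_bits)

print(wide_bits, wide)  # 10 wires, 10000 Mb/s
print(thin_bits, thin)  # 20 wires, 14000 Mb/s
```

With these illustrative numbers the slower bus still carries more data per second, because doubling the bit count outweighs the 30% drop in frequency.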

Figure 7: GasP circuit [diagram labels include self-reset, keeper, state conductor, SC1, node0, Inv0, data in, out, data latch]

Figure 7 shows a circuit diagram of the asynchronous communication blocks that were used in our experiments. Their functionality is fairly simple. First, let's split the picture into the control blocks and the datapath blocks. The datapath blocks consist of the cross-coupled inverters, the passgate, and the driving inverter, while the remainder of the diagram is part of the control block. Each control block contains what is referred to as a state conductor (SC). The SC is responsible for keeping track of whether or not the stage it is associated with contains data. A logical 0 on the SC means the stage is full, while a logical 1 means it is empty. Obviously, if the current stage contains data and the subsequent stage does not, the data from the current stage should be passed to the next one. Similarly, if the current and subsequent stages both contain data, the current stage needs to wait until the next one has passed its data on before it can continue.

This can be better explained with an example (here, an "open" transistor is one that is conducting). Assume, in Fig. 7, that SC1 contains a 1, meaning it is empty, and that NFET1 is open. SC0 is then driven to a 0, meaning that the previous stage is now full. In this case, it is possible to pass data to the next stage. So, the 0 on SC0 causes inverter0 to open NFET0. With both NFETs open, node0 is pulled down to a 0. This causes a number of events to occur. First, it opens the passgate in the datapath block, allowing data to pass into the next stage. Second, PFET0 and NFET2 open, changing the states of SC0 and SC1. Finally, after two inverter delays, PFET1 opens, pulling node0 back up to a 1, at which point the control block returns to its steady state and can wait for new data. So, by including this extra logic and removing the clock drivers, the synchronous logic could easily be migrated into its asynchronous counterpart.

3. Experimental Design Setup

In order to fully compare and contrast the performance and power consumption of the asynchronous and synchronous communication channels, a series of experiments was performed for designs created in 180nm, 130nm, and 100nm technologies. Note that any specific references to transistor sizes or delay values are for the 180nm case unless stated otherwise.

3.1 Channel Specifications

Consider the design problem of two functional blocks on opposite sides of a chip that need to communicate with each other. As an example, we chose a channel length of 2cm for our experiments to represent something close to the maximum size of a production IC die today. To keep the simulation runs manageable, a four-bit data bus was stretched across this channel. The bus was fully shielded in order to accurately examine power and throughput tradeoffs without consideration of noise constraints and the dominant influence of data-dependent switching on delay. A switching factor of 15% was applied to each bit. This is a realistic switching factor for most applications, and it is frequent enough that gating off the clock for the synchronous case is impractical. Examples of the input switching patterns are shown below in Fig. 8.
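The thesis does not say how the input patterns were generated. As a minimal sketch, assuming that a 15% switching factor means each bit toggles independently with probability 0.15 in every cycle, a four-bit stimulus could be produced as follows. The function names, the seed, and the all-zero starting state are illustrative choices.

```python
import random


def switching_pattern(num_cycles, num_bits=4, switching_factor=0.15, seed=0):
    """Return one tuple of bit values per cycle.

    Assumption: every bit toggles independently with probability
    `switching_factor` in each cycle, starting from all zeros.
    """
    rng = random.Random(seed)
    state = [0] * num_bits
    pattern = []
    for _ in range(num_cycles):
        state = [bit ^ 1 if rng.random() < switching_factor else bit
                 for bit in state]
        pattern.append(tuple(state))
    return pattern


def measured_switching_factor(pattern):
    """Fraction of (bit, cycle) slots in which the bit changed value."""
    previous = (0,) * len(pattern[0])
    toggles = 0
    for vector in pattern:
        toggles += sum(a != b for a, b in zip(previous, vector))
        previous = vector
    return toggles / (len(pattern) * len(pattern[0]))


pattern = switching_pattern(20000)
print(pattern[:5])
print(round(measured_switching_factor(pattern), 3))
```

Over a long run the measured toggle rate settles close to the requested 15%, which is a quick sanity check on any stimulus generated this way.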

Figure 8: Input switching pattern

3.2 Synchronous Setup

The aforementioned bus optimization program [9] was run using the parameters above and assuming a clock frequency of 1GHz. In the synchronous case, each simulation was set up as shown in Fig. 9. The number of latch repeaters in the channel was varied in order to achieve different throughputs across the simulations. The most complex component in the synchronous simulations is the latch repeater itself. Fig. 10 shows a transistor-level diagram of its design. It is simply a basic D flip-flop in a master/slave configuration. Through experimentation it was found that a setup time of approximately 120ps was necessary to guarantee proper functionality.

3.3 Asynchronous Setup

The asynchronous design is slightly more complicated. The data lines are first optimized the same way they were in the synchronous case. One can see from Fig. 11, though, that at least one extra bit line is present. This line carries the state conductor bit discussed in the previous section. Initially, this extra line is given the same parameters as the data lines. However, it can also be seen that the buffers on the control line are not simply inverters, as they are on the data lines. This is because signals must travel in both directions on the control line: the control blocks will always pass a logical 0 to the right and a logical 1 to the left.
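The role of the state conductor bit can be sketched with a purely behavioral model: each stage holds an SC value (0 = full, 1 = empty, as described in Section 2.2), and a word advances only when its own stage is full and the next stage is empty. This abstraction ignores all circuit timing and is not the GasP circuit itself; the function and variable names are hypothetical.

```python
FULL, EMPTY = 0, 1  # state-conductor encoding from the text: 0 = full, 1 = empty


def step(sc, data):
    """Advance every word whose next stage is empty by one position.

    Stages are scanned right to left so that each word moves at most one
    stage per call. Returns True if any word moved.
    """
    moved = False
    for i in range(len(sc) - 2, -1, -1):
        if sc[i] == FULL and sc[i + 1] == EMPTY:
            data[i + 1], data[i] = data[i], None
            sc[i + 1], sc[i] = FULL, EMPTY
            moved = True
    return moved


# Four stages; the two leftmost stages start out holding data.
sc = [FULL, FULL, EMPTY, EMPTY]
data = ["a", "b", None, None]

while step(sc, data):
    print(sc, data)
```

In this run the two words slide right until they reach the end of the channel, where they stall because no empty stage remains ahead of them. This is the "wait until the next stage has passed its data on" behavior described earlier.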

Figure 9: Synchronous simulation setup

Figure 10: Synchronous latch repeater

Since an inverter cannot pass signals in both directions, a bi-directional buffer was designed. A transistor-level diagram of the buffer is shown in Fig. 12. Each side of the buffer contains an inverter, a keeper, and pull-up/pull-down assist logic. If a 0 is being propagated to the right, inverter0 will trip both the keeper (n0) and the actual pull-down NFET (n1) on the right side of the buffer. As the D_RIGHT node is pulled down, inverter1 drives a 1 onto node0, which subsequently drives a 0 onto the gate of n1, closing it and allowing the keeper to take over on its own. A similar series of events occurs when a 1 is being propagated to the left.

The use of this buffer results in some useful behavior. The delay of the control signal must be greater than the delay of the data lines; otherwise, the next stage will attempt to take data before it is ready. Since the bi-directional buffer is noticeably slower than a single inverter, the control signal will always arrive after the data.

Figure 11: Asynchronous simulation setup

However, it was found through experimentation that the difference in delay between the control line and the data lines was unacceptably large. For example, in the case where eight latch repeaters were used, it took ps for one bit to traverse the data line (this is between two consecutive latches, not the full 2cm), but ps for the control signal to arrive. To make all of the data run 110ps slower because of the control line is unacceptable. As a result, the bus optimizer was run again just for the control line. This time, the buffer distance, latch repeater distance, and buffer size were all fixed in order to make the placement of logic on the control line match up with the data lines. This left the optimizer with only the wire width to alter in order to speed up the line. Thus, this run was an optimization of area subject to a delay constraint. The second optimization run resulted in a 40% improvement in wire propagation delay, allowing the control signal to arrive at 340ps instead. This results in a higher throughput for the asynchronous case. Note, however, that since the control line must always be slower than the data lines, even under worst-case conditions and process variations, some delay margin must be built into this control line optimization problem.

Figure 12: Bi-directional buffer design

3.4 Latch Modifications

This buffer interfaces with the control portion of the asynchronous latch found in Fig. 13. The latch's behavior was described in Section 2.2, so it will not be discussed here. Notice one small difference, though. The simple NFET-only passgate in the original example has been replaced with a passgate containing both an NFET and a PFET. This change was made because it was becoming difficult to pass a logic 1 into the latch (an NFET alone passes a degraded logic 1).

Figure 13: Asynchronous latch design

3.5 Asynchronous Design Methodology

Based on the methods described above, an asynchronous design methodology can be derived. The steps are as follows:

1. Optimize data lines, as done in synchronous case
2. Fix repeater distance according to results obtained from step 1
3. Separately optimize control line based on data line delays
   o cannot allow control to be faster than slowest data line
   o results in wider control line, since distances remain fixed

3.6 General Observations

Before examining the data collected in these simulations, one important point must be made. For the same number of latches, the synchronous case yields twice the throughput of the asynchronous case. This is because the SC bit must first propagate a 0 all the way to the right, indicating that the current stage is full, and then propagate a 1 all the way back to the left, indicating that the stage is now empty. While this seems to be a major problem with the asynchronous design, it is actually easily fixed. The only requirement is that an odd number of repeater stages exist between each pair of latches. If this is the case, it is possible to replace the bi-directional buffer in the middle of the channel with a simplified version of the asynchronous control portion of the latch. Now, once the data gets halfway through the channel, it will start to send a signal back in the other direction, resulting in a matched throughput for both the asynchronous and synchronous simulations. Unfortunately, due to a limited number of repeater stages, this was only possible in one case of the 180nm runs (assuming that the stages were symmetric). However, the added repeater stages required to propagate the data over a 2cm distance for the 130nm and 100nm cases ultimately provided a greater opportunity for optimization.

4. Experimental Data

We compared our optimized designs on the basis of the power dissipation required for equivalent throughput in the asynchronous and synchronous cases. The maximum possible throughput, normalized by channel width, was very similar for both cases over all technology nodes, but the power dissipation requirements varied considerably.

4.1 Channel Specs According to Process

Table 1 contains the bus characteristics for each of the process technologies that we considered. Note that there are two 130nm technology nodes. This is because the 180nm [5] and first 130nm [13] processes were for existing ASIC technologies. We wanted to compare these results with those for a typical high-end, high-performance microprocessor process, as derived from [4]. This provided us with an additional opportunity to consider extrapolation to a 100nm technology, which was available only for a high-end process node at this time [4].

As can be seen from the table, when migrating to a smaller feature size technology, the bus is characterized by a shorter segment length (i.e., the distance between repeaters), as expected. As the minimum feature sizes become smaller, and the bus optimization for maximum throughput continues to select the maximum number of wires in the channel solution, the metal resistance increases, and the demands on repeater insertion increase correspondingly. The side effects are that larger repeaters are required (larger relative to minimum feature size, not in an absolute sense), and that they must be inserted more frequently in order to maintain the signal across the full two centimeters. In these experiments, the 180nm simulations required 15 repeaters, the ASIC 130nm required 31, the high-end 130nm required 19, and the 100nm required 25. Figures 14, 15, 16 and 17 compare power dissipation vs. throughput for the synchronous and asynchronous cases over all of these process nodes. Note that each point represents a throughput-optimized design.

Process     Wire Width   Wire Spacing   Gate Size        Seg. Length (Lseg)   Signal Spd.
180nm       0.3 um       0.3 um         44/22 um         ?                    1.36e7 m/s
130nm       0.2 um       0.21 um        34/17 um         690 um               1.03e7 m/s
130nm (p)   ?            ?              35.5/17.75 um    1.10 mm              1.36e7 m/s
100nm       0.2 um       0.2 um         29/14.5 um       ?                    1.35e7 m/s

Table 1: Bus characteristics

4.2 Power Comparisons

The 180nm results are summarized in Fig. 14. First, the datapath power dissipation was compared. One would expect that since the asynchronous latches were smaller, less power would be consumed. However, in this case, we were not able to match throughput without using twice as many latches. Because the asynchronous control line must first propagate a logic 0 to the right in Fig. 11, indicating that the current stage is full, then propagate a logic 1 back to the left, indicating that the subsequent stage has accepted that data, it would nominally require two round trips to send each vector of data. By adding intermediate control logic at the midway point of the path, as described above, the data can be sent and acknowledged in the time period required for one round trip. As a result, however, as the required throughput is increased, the power consumed by the datapath in the asynchronous case becomes slightly greater than that of the synchronous case.

Next, the power dissipated by the clock in the synchronous case was compared to the power dissipated by the control line in the asynchronous case. These results did not favor the asynchronous case as much as some might have predicted. Significantly more power was burned in the asynchronous case than in the synchronous case. This can be attributed to the synchronous clock model used in these experiments. Typically, to clock the latches in this channel, buffers would be connected to the clock grid in appropriate places and routed to the latch inputs. While those buffers were modeled in this design, the interconnect associated with the clock grid itself was not. In contrast, the complete 2cm control line needed to be constructed for the asynchronous case to function. We will comment more on the implications of the global clock power savings below.

Figure 14: Datapath and control power consumption (180nm) [panels: Datapath Power (180nm), Clock/Control Power (180nm); horizontal axis: Throughput (Gb/s)]

Similar experiments were run for the 130nm node. The results of these experiments are summarized in Fig. 15. The major difference is that with more buffers in the channel, it was possible to apply the fix mentioned earlier and match the throughput between the asynchronous and synchronous design methodologies. Now, the results are closer to what one might expect.

The power consumed by the datapath of the synchronous example is now slightly higher than that of the asynchronous example. The clock followed the same trend as in the 180nm case, but with one small difference. It can be seen that the final point on the curve does not follow the linear trend seen in the 180nm example. This has to do with the size of the bi-directional buffers. Once implemented, the bi-directional buffers were actually larger than the asynchronous control block, and certainly larger than the simplified version of it used to match the throughput of the synchronous case. When going from three buffers between latches to only one, all bi-directional buffers are removed from the control line, and only control blocks or simplified control blocks remain on it. This causes a more drastic change in energy consumption than in the previous cases, so the power dissipation does not increase as much as one might expect.

Figure 15: Datapath and control power consumption (130nm) [panels: Datapath Power (130nm), Clock/Control Power (130nm); horizontal axis: Throughput (Gb/s); curves: async, async_fast, sync]

Notice that there are three curves in the graphs shown in Fig. 15. The curve labeled async_fast is the data collected when the throughputs of the synchronous and asynchronous cases were matched, while the async curve did not attempt to match the throughputs. The two have been placed on the graph together to show that when the throughputs are the same in the two asynchronous designs, the power dissipation is virtually identical. As a result, choosing a design becomes a tradeoff between the area consumed (async requires more latches, so more area is used) and the depth of the pipeline (more latches can hold more data).

Next we considered 130nm and 100nm technologies that are representative of a high-end processor technology. In each case, the simplified asynchronous control blocks have been placed between the latches in order to match the throughput of the synchronous and asynchronous designs. The results of these experiments are displayed in Figs. 16 and 17. Clearly, the processor technology follows the same trends seen for the ASIC technologies. It is apparent for these examples, however, that there is a superlinear increase in the local clock power dissipation for the synchronous cases. Thus, if throughput were to increase, the clock may begin to burn more power than the asynchronous control blocks. However, no more latches can be placed in any of the channels in order to increase throughput and possibly observe this behavior for this design example. As a result, for these technologies, and using this circuit design example, it is not possible for the asynchronous solution to burn less power than the synchronous solution.

Figure 16: Datapath and control power consumption (130nm, p) [panels: Datapath Power (130nm, p), Clock/Control Power (130nm, p); horizontal axis: Throughput (Gb/s)]

Figure 17: Datapath and control power consumption (100nm) [panels: Datapath Power (100nm), Clock/Control Power (100nm); horizontal axis: Throughput (Gb/s)]

Two final points should be made here. First, these graphs do not reflect the fact that the global spine of the clock tree can be removed when using a globally asynchronous design. Therefore, without considering the entire system design, it cannot be determined whether the additional power consumed by each of the asynchronous communication channels would be more or less than the total power saved by not having a global clock. Second, the overall latency of the bus was not discussed. This is primarily because the latency changed little from one simulation to the next (it did change between processes, though). The only differences stemmed from the fact that the latches were somewhat slower than the buffers, but the difference was negligible in all cases.

4.3 Area Estimation

The asynchronous design solution does appear to have an advantage over the synchronous solution in terms of transistor width. (While an area comparison would be better, no layout for this design exists, so transistor width is used as an estimate of how much area each design will consume.) Fig. 18 contains transistor width statistics for each design. In each case, the maximum throughput solution is being explored. It can be seen that while the asynchronous solution starts out worse (due to the overhead involved in building the control line), as the width of the bus increases, the slope of the asynchronous design is less than the slope of the synchronous design. Eventually, in every case, the total transistor width for the asynchronous design becomes lower than that of the synchronous design. So, it can be deduced that a sufficiently wide asynchronous bus will use less silicon than the equivalent synchronous bus.

Figure 18: Transistor width analysis [panels: Transistor Width (180nm), (130nm), (130nm, p), (100nm); horizontal axis: Bus Width (bits); curves: sync, async]

5. Design Considerations and Conclusions

The results from Section 4 raise several interesting questions that we consider below.

5.1 Bus width?

Since this was only a 4-bit bus, it was possible to use only one control line for the entire structure. Clearly it would be impractical to attempt to use only one control line for wider busses, for two main reasons. First, the increased load placed on the control portions of the latches would make their area increase to an unacceptable size, resulting in added capacitance and, therefore, added power consumption. Second, the long wires used to distribute the control signals will be subject to process variations and skew, adding uncertainty to their arrival times.

5.2 Async setup?

It is important to design for a safe "setup" time for the asynchronous latches. When reoptimizing the control lines, it was arbitrarily decided that the control signal should be approximately 10% slower than the slowest data line. While this seemed to be enough for the worst-case simulations that we considered, more detailed models are required to ensure proper operation and high-yielding silicon in general.

5.3 Global communication power?

In order to determine just how useful (or useless) the application of an asynchronous communication channel would be, it would be worthwhile to determine what percentage of the chip's power consumption is dedicated to global communication. Based on IBM's study, the removal of the global clock should save about 10% of the overall power, but the asynchronous control line burns significantly more power than its synchronous counterpart. One must consider the entire architecture in order to make this assessment.

5.4 Shielding Effects

As previously mentioned, all four bits in these experiments were fully shielded. This allowed us to examine the power consumption without having to handle additional constraints on noise. However, this also masked the potentially strong connection between delay and the input switching pattern. In other words, if adjacent bits switch concurrently, this can substantially increase or decrease the signal delay. Therefore, an unshielded bus design must be constructed so that the delay of the control line is always greater than the worst-case delay of the data lines. This would impact both the asynchronous and synchronous design throughputs.
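The control-line constraint from Sections 5.2 and 5.4 can be written as a simple check. The sketch below assumes that "10% slower" means the control delay is at least 1.10 times the slowest data-line delay, and it uses integer picoseconds to avoid rounding surprises. The function names and all delay values are hypothetical and are not measurements from this work.

```python
def required_control_delay_ps(data_delays_ps, margin_percent=10):
    """Smallest integer control delay that satisfies the margin."""
    slowest = max(data_delays_ps)
    # Ceiling division keeps the result on the safe side of the margin.
    return -(-slowest * (100 + margin_percent) // 100)


def control_line_is_safe(control_delay_ps, data_delays_ps, margin_percent=10):
    """True if control arrives at least `margin_percent` after the slowest data bit."""
    slowest = max(data_delays_ps)
    return control_delay_ps * 100 >= slowest * (100 + margin_percent)


# Hypothetical worst-case delays for a four-bit bus, in picoseconds.
data_delays = [290, 300, 295, 298]

print(required_control_delay_ps(data_delays))   # 330
print(control_line_is_safe(340, data_delays))   # True
print(control_line_is_safe(320, data_delays))   # False
```

For an unshielded bus, the list of data delays would need to include the worst switching pattern on neighboring wires, since that pattern sets the slowest data bit.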

5.4 Future Work

In addition to addressing these questions, future work includes completing the layout for some of these optimized designs. This would allow us to extract more realistic parasitics from the layout and perform more accurate simulations of the communication channels. Finally, we would like to fabricate and test the ICs in order to see the full effect of manufacturing variations.

5.5 Conclusions

While speed has been a primary design goal for many ICs in the past, it is apparent that we must now consider power consumption and manufacturing variations as first-order effects as well. Superficially, asynchronous design seems like a good solution to both of these problems. By removing the clock, which has been identified as the major power consumer for many ICs (especially microprocessors), and by localizing process variations, one would expect both problems to improve. However, our circuit-level implementations demonstrated that this is not the case. While an asynchronous design methodology does a good job of masking process variations, the power savings that one would expect to see are not as encouraging.

Since it is impractical to jump from a purely synchronous to a purely asynchronous design, globally asynchronous locally synchronous (GALS) design is an obvious compromise. But based on our experiments, even for a GALS system, the 10% power savings resulting from the removal of the global clock is likely to be overshadowed by the increase in power dissipation caused by the addition of the asynchronous control line. Our results further indicate that the benefits of asynchronous design would become more apparent for a more local design problem, where smaller devices could be used. However, the benefits of locally asynchronous design styles are not apparent at this time.
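The power tradeoff raised in Section 5.3 and in the conclusions can be made concrete with a small arithmetic sketch. Only the 10% clock-removal saving is taken from the text. The fraction of chip power spent on global communication and the relative overhead of the asynchronous control line are hypothetical inputs that would have to be measured for a real architecture, and the function names are introduced here for illustration.

```python
def net_power_change(clock_saving=0.10, comm_fraction=0.0, control_overhead=0.0):
    """Fractional change in total chip power after going asynchronous.

    clock_saving     : fraction of chip power saved by removing the global clock
    comm_fraction    : fraction of chip power spent on global communication (assumed)
    control_overhead : extra communication power from the asynchronous control
                       line, as a fraction of communication power (assumed)

    A negative result means a net saving; a positive result means a net increase.
    """
    return -clock_saving + comm_fraction * control_overhead


def break_even_overhead(clock_saving=0.10, comm_fraction=0.2):
    """Control-line overhead at which the clock saving is exactly cancelled."""
    if comm_fraction <= 0:
        raise ValueError("comm_fraction must be positive")
    return clock_saving / comm_fraction


# Hypothetical case: 20% of chip power goes to global communication.
limit = break_even_overhead(0.10, 0.2)
change = net_power_change(0.10, 0.2, 1.0)
```

In this hypothetical case the control line may add at most half again as much communication power before the clock saving is lost, which illustrates why the share of chip power devoted to global communication has to be known before the tradeoff can be judged.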

6. References

[1] B. Curran, P. Camporese, S. Carey, Y. Chan, R. Clemen, R. Crea, D. Hoffman, T. Koprowski, M. Mayo, T. McPherson, G. Northrop, L. Sigal, H. Smith, F. Tanzi, P. Williams, "A 1.1 GHz First 64b Generation Z900 Microprocessor," IEEE International Solid-State Circuits Conference, February 2001, pp.
[2] E. Acar, F. Dartu, L.T. Pileggi, "TETA: Transistor-Level Waveform Evaluation for Timing Analysis," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, Vol. 21, No. 5, pp. , May 2002
[3] public.itrs.net/files/2001ITRS/desi, "International Technology Roadmap for Semiconductors," 2001
[4]
[5]
[6] Ivan Sutherland, Scott Fairbanks, "GasP: A Minimal FIFO Control," Proc. ASYNC, pp. , 2001
[7] K. Nabors and J. White, "FastCap: A Multipole Accelerated 3D Capacitance Extraction Program," IEEE Trans. CAD, 10, pp. , November 1991
[8] M. Kamon, M. Tsuk, and J. White, "FastHenry: A Multipole Accelerated 3D Inductance Extraction Program," IEEE Trans. Microwave Theory and Techniques, 42, pp. , September 1994
[9] Tao Lin, Lawrence T. Pileggi, "Throughput-Driven IC Communication Fabric Synthesis," Submitted to ICCAD, 2002
[10] Tao Lin, Michael W. Beattie, Lawrence T. Pileggi, "On the Efficacy of Simplified 2D On-Chip Inductance Models," Proc. DAC, June 2002
[11] Thomas Mieneke, et al., "Globally Asynchronous, Locally Synchronous Architecture for Large, High Performance ASICs," Proc. ISCAS, pp. , 1999
[12] Y. Tiwari, et al., "Reducing Power in High-Performance Microprocessors," Proc. DAC, pp. , 1998
[13] "HCMOS9_GP Design Rules Manual: 0.13 Micron CMOS Process," Rev. C, Nov
