Throughput and Power Tradeoffs Associated with Synchronous and Asynchronous Communication Channel Design. Edward T. Malley 2002


Advisor: Prof. Pileggi, Electrical & Computer Engineering

Master's Thesis - Edward T. Malley

1. Introduction

With transistor switching speeds becoming faster while the propagation delays along interconnect are becoming slower, the integrated circuit and system performance bottleneck is no longer the delay within the logic blocks. Instead, communication among blocks has become a primary design challenge. It is no longer sufficient to simply wire two blocks together and expect them to communicate properly and efficiently. At the very least, repeater insertion is necessary, and most recently it has even become necessary to allocate multiple cycles to a signal's journey across the chip. This added importance placed on global communication makes it necessary to consider its impact in the earliest stages of the design process, including architectural planning.

In the past, during the architectural planning of a microprocessor, the circuit designer's primary goal was to design the fastest circuit possible. Now, however, some new design obstacles are beginning to emerge. The first of these obstacles is power consumption [12]. While designing low power circuits allowed lower packaging costs, power was previously less critical since the system was simply plugged into an outlet and consumed as much power as it required. Today, with wireless and handheld devices becoming more popular, chips must be designed with power dissipation requirements in mind. Eventually, how long a device's battery lasts may become one of the consumer's primary concerns.

This raises the question, "How and where should current designs be improved in order to mitigate this power consumption problem?" Figure 1 shows the results of a power distribution study performed at IBM [1], based on some of the recent p and z series microprocessors. Clearly, the primary problem with power distribution lies in the clock, and with feature sizes decreasing and scale of integration increasing, this problem will continue to worsen. Thus, it can be concluded that improvements in clock distribution techniques, especially local clock distribution, have the potential to lead to major power savings overall.

Figure 1: Microprocessor power distribution study (categories: clk (LCB/latch), arrays, logic, clk (global dist); the clock is marked as the dominant consumer of power)

The other emerging obstacle is the effect of process variations on chip performance. In previous technology generations, when the transistors were larger and clock periods longer, a few picoseconds in either direction had little effect on a processor's functionality. However, process variations are not scaling relative to feature size or clock speed [3], and with clock periods entering the sub-1ns range, these few picoseconds can be the difference between a very good fabrication yield and a very poor one. Figure 2 illustrates this point. The histogram, obtained using the circuit simulator discussed in [2], shows the delay spread for a 2cm wire that has been stretched from one end of a large chip to the other (more on this design example later in this thesis). Using 180nm technology, fifteen buffers were required to send a signal across this channel. While a few corner cases fall drastically outside the range of the primary spread, one can see that process variations can cause up to an 80ps difference in performance. If this logic is required to operate in the gigahertz range, 80ps makes up a significant portion of the clock period.

Figure 2: Effects of manufacturing variations on system (histogram of rise delays; horizontal axis: delay in ps)

1.1 Is Asynchronous the answer?

Some have suggested that a potential solution to both of these problems is to employ an asynchronous design methodology. Since the power consumed by the clock, especially the local clock, has been identified as the major contributor to overall chip power consumption, using an asynchronous design (i.e., a design with no clock) would seem to be a logical solution to this problem.

An asynchronous design can also serve to improve the process variation problem. Typically, in a synchronous design, some sort of H-tree is used to globally distribute a clock throughout the chip, as shown in Figure 3. In the case of a long communication channel, the problem lies in the fact that the clock must travel a completely different path to the latch inputs than the data must travel. If the data is on a critical path, this difference could result in a timing failure. However, in the case of an asynchronous design, all data and control bits will travel a similar path. As a result, any process variations they experience should be more localized, hence correlated, thereby reducing the potential timing variations and errors.

Figure 3: Clock-Data skew problem (figure annotation: "How global are manufacturing variations?")

1.2 If Asynchronous is the Answer, What is the Question?

Knowing this, one must ask, "Is asynchronous design really better?" Looking at this limited example from the architectural level, it would seem that it is. Simply switching to an asynchronous design methodology allows the clock to be removed, which would result in a 60% power savings. However, when the problem is examined from an implementation standpoint, a number of problems arise. The following sections attempt to describe some of them in detail.

It should be noted that switching from a completely synchronous to a completely asynchronous design methodology is impractical. There are many reasons for this, including the fact that most CAD tools are specialized for use in synchronous designs, and the simple fact that most logic blocks are designed to operate synchronously. So, instead of a complete switch, this paper will consider the case of a globally asynchronous locally synchronous (GALS) design [11]. In other words, all global clock distribution has been removed from the chip. While all functional units remain synchronous, asynchronous communication channels exist between them. Specifically, we will examine the complete implementation of a 2cm long communication channel from both a synchronous and an asynchronous standpoint in 180nm, 130nm, and 100nm processes.

2. Background

To investigate the pros and cons of asynchronous vs. synchronous channel design, we will compare the results of optimized designs for both cases under several technology nodes. It is important that we compare both designs on as fair a basis as possible. To do so we will rely on two components of background work. First, a comprehensive optimization methodology for n-bit bus lines will be used to construct the interconnect designs for maximum throughput in all cases. This methodology was developed for synchronous busses, but as we will demonstrate it can be applied to asynchronous circuits with equal efficacy. Second, we will employ the GasP architecture [6] to provide for an optimal communication channel along a long signal bus path.

2.1 Bus Optimization Methodology

Designing a communication channel between two blocks should be nothing new to most designers; however, it is no longer a task that can be taken lightly. In order to yield the best possible performance, a number of bus characteristics need to be considered concurrently, thereby making this a somewhat complex optimization problem. The bus optimizer described in [9] outlines a complete flow for the design of IC communication channels. It states that the interconnect structure for any bus can be uniquely described using six parameters. Referring to Figs. 4 and 5, they are:

1. Wsi - Width of the signal wires
2. Wsh - Width of the shielding wires
3. Sp - Wire spacing
4. Sgate - Gate (repeater) size
5. Lseg - Length of wire segment
6. N - Shielding period
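As a rough illustration, the six parameters can be collected into a single record of the kind an optimizer would vary. The sketch below is not taken from [9]: the class name `BusParams`, the field names, the units, and the assumed layout (one shield wire after every N signal wires, each wire followed by one spacing Sp) are all assumptions made for illustration, and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusParams:
    """Six parameters describing a uniform bus (names and units are illustrative)."""
    w_si_um: float   # Wsi: width of the signal wires (um)
    w_sh_um: float   # Wsh: width of the shielding wires (um)
    sp_um: float     # Sp: wire spacing (um)
    s_gate: float    # Sgate: gate (repeater) size
    l_seg_um: float  # Lseg: length of a wire segment (um)
    n_shield: int    # N: shielding period (N = 1 shields every signal wire)

    def group_pitch_um(self) -> float:
        """Width of one repeating group: N signal wires plus one shield wire.

        Assumed layout (not from the source): one shield follows every N
        signal wires, and every wire is followed by one spacing Sp.
        """
        signals = self.n_shield * (self.w_si_um + self.sp_um)
        shield = self.w_sh_um + self.sp_um
        return signals + shield

    def signal_wires_in(self, channel_width_um: float) -> int:
        """Signal wires that fit in whole groups within the given channel width."""
        groups = int(channel_width_um // self.group_pitch_um())
        return groups * self.n_shield


# Hypothetical values chosen only to exercise the code.
fully_shielded = BusParams(0.5, 0.5, 0.5, 40.0, 1000.0, 1)
print(fully_shielded.group_pitch_um())
print(fully_shielded.signal_wires_in(8.0))
```

Fixing a parameter, as when N is set to 1 for a fully shielded bus, then amounts to holding one field constant while the remaining fields are varied.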

Figure 4: Uniform bus

Figure 5: Bus cross-section

Starting with a random set of values and the interconnect data associated with the process, the optimization methodology begins with a detailed RLC extraction [7][8] using [10], thereby producing the interconnect models for use in SPICE. The user decides which parameter (or parameters) to optimize. Each optimization parameter is a function of the six bus variables mentioned above. Any of these six characteristics can be set to a constant or bounded if necessary. In other words, if the user wants to shield all of the wires, N is set to 1 and not considered for optimization. The optimizer optimizes the performance measures over a surface of up to six dimensions.

During each iteration of the optimization run, there is a SPICE simulation. With the results from each simulation, the optimizer attempts to find a better set of bus characteristics than the previous one. This is accomplished using a sequential quadratic programming algorithm. Once a new set of values is determined, it is used as the initial values for the next iteration. The loop continues until the bus ceases to improve. A block diagram of the optimization loop can be seen in Fig. 6.

It is possible to optimize the bus based on a number of parameters, including area, throughput, delay, and power. An important aspect of this bus optimization methodology is that instead of optimizing for delay, the algorithm can optimize for throughput (defined as frequency multiplied by bit width), which is a quantity that is more directly related to performance. It would appear that throughput and delay are strongly related to each other. For example, if one would like to increase throughput, simply make the bus faster. It is possible, however, to look at throughput optimization in a different way.

Figure 6: Bus optimization loop (blocks: interconnect technology data, CMOS technology data, optimized bus fabrics, floor plan and system-level simulation)

Optimization of delay will result in large repeaters and wide wires. A possible tradeoff is to design a wider bus with thinner wires, resulting in longer delay but a higher overall throughput within the same channel area. This additional delay, however, results in an increased overall latency for the bus, which in turn places more demands on additional latch repeaters. In other words, latches are placed at appropriate points throughout the channel in order to meet clocking requirements. While this does add area and latency to the bus, the amount is typically fairly negligible.

2.2 Asynchronous Communication Channel

Combining the above bus optimizer with a simple clock driver made it possible to build the synchronous examples. In order to build the asynchronous examples, in addition to employing the above techniques, it was also necessary to choose an appropriate asynchronous communication strategy. Many methods have been proposed, and the method used in [6] was chosen for our design comparison.
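Returning briefly to the throughput-versus-delay tradeoff of Section 2.1, a small worked example may help. The sketch below uses the definition given there (throughput is frequency multiplied by bit width) and a deliberately simplified rule for latch repeaters. The two designs and all of their numbers are hypothetical, and the latch-count rule ignores setup time and latch delay, which is an assumption rather than the thesis's method.

```python
import math


def throughput_gbps(freq_ghz: float, bit_width: int) -> float:
    """Throughput as defined in the text: frequency multiplied by bit width."""
    return freq_ghz * bit_width


def latch_repeaters_needed(end_to_end_delay_ps: float, clock_period_ps: float) -> int:
    """Intermediate latches so that no latch-to-latch hop exceeds one clock period.

    Simplification (assumption): setup time and latch delay are ignored, and
    the wire delay is assumed to divide evenly among the hops.
    """
    hops = math.ceil(end_to_end_delay_ps / clock_period_ps)
    return max(hops - 1, 0)


# Two hypothetical designs occupying the same channel area, clocked at 1 GHz.
CLOCK_PERIOD_PS = 1000.0
designs = {
    "delay-optimized": {"bits": 4, "delay_ps": 1500.0},       # wide wires, large repeaters
    "throughput-optimized": {"bits": 8, "delay_ps": 2400.0},  # thinner wires, more of them
}

for name, d in designs.items():
    tp = throughput_gbps(1.0, d["bits"])
    latches = latch_repeaters_needed(d["delay_ps"], CLOCK_PERIOD_PS)
    print(f"{name}: {tp:.1f} Gb/s, {latches} latch repeater(s)")
```

In this made-up case the slower, wider bus doubles throughput at the cost of one extra latch repeater, which is the kind of exchange the optimizer is free to make when it targets throughput instead of delay.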

Figure 7: GasP circuit

Figure 7 shows a circuit diagram of the asynchronous communication blocks that were used in our experiments. Their functionality is fairly simple. First, let's split the picture into the control blocks and datapath blocks. The datapath blocks consist of the cross coupled inverters, the passgate, and the driving inverter, while the remainder of the diagram is part of the control block. Each control block contains what is referred to as a state conductor (SC). This SC is responsible for keeping track of whether or not the stage it is associated with contains data. A logical 0 on the SC means the stage is full, while a logical 1 means it is empty. Obviously, if the current stage contains data and the subsequent stage does not, the data from the current stage should be passed to the next one. Similarly, if the current and subsequent stages both contain data, the current stage needs to wait until the next one has passed its data on before it can continue.

This can be better explained with an example. Assume, in Fig. 7, that SC1 contains a 1, meaning it is empty, and that NFET1 is open. Now SC0 is driven to a 0, meaning that the previous stage is full. In this case, it is possible to pass data to the next stage. So, the 0 on SC0 causes Inv0 to open NFET0. With both NFETs open, node0 is pulled down to a 0. This causes a number of events to occur. First, it opens the passgate in the datapath block, allowing data to pass into the next stage. Second, PFET0 and NFET2 open, changing the states of SC0 and SC1. Finally, after two inverter delays, PFET1 opens, pulling node0 back up to a 1, at which point the control block returns to its steady state and can wait for new data. So, by including this extra logic and removing the clock drivers, the synchronous logic could easily be migrated into its asynchronous counterpart.

3. Experimental Design Setup

In order to fully compare and contrast the performance and power consumption of the asynchronous and synchronous communication channels, a series of experiments was performed for designs created in 180nm, 130nm, and 100nm technologies. Note that any specific references to transistor sizes or delay values are for the 180nm case unless stated otherwise.

3.1 Channel Specifications

Consider the design problem of two functional blocks on opposite sides of a chip that need to communicate with each other. As an example, we chose a channel length of 2cm for our experiments to represent roughly the maximum size of a production IC die today. To keep the simulation runs manageable, a four-bit data bus was stretched across this channel. The bus was fully shielded in order to accurately examine power and throughput tradeoffs without consideration of noise constraints and the dominant influence of data dependent switching on delay. A switching factor of 15% was applied to each bit. This is a realistic switching factor for most applications, and it is frequent enough to make gating off the clock impractical in the synchronous case. Examples of the input switching patterns are shown below in Fig. 8.
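One way to produce input patterns with a given switching factor is sketched below. It assumes that "switching factor" means the probability that a bit changes value from one cycle to the next, which is an interpretation rather than something the text states; the function names, seeds, and stream length are hypothetical.

```python
import random


def switching_pattern(n_cycles: int, switching_factor: float, seed: int) -> list[int]:
    """Bit stream in which the bit toggles with the given probability each cycle."""
    rng = random.Random(seed)  # seeded so the pattern is reproducible
    bit = 0
    stream = []
    for _ in range(n_cycles):
        if rng.random() < switching_factor:
            bit ^= 1
        stream.append(bit)
    return stream


def measured_switching_factor(stream: list[int]) -> float:
    """Fraction of cycle boundaries at which the bit changes value."""
    toggles = sum(a != b for a, b in zip(stream, stream[1:]))
    return toggles / (len(stream) - 1)


# One independent pattern per bit of the four-bit bus.
bus = [switching_pattern(10_000, 0.15, seed=i) for i in range(4)]
for i, stream in enumerate(bus):
    print(f"bit {i}: measured switching factor {measured_switching_factor(stream):.3f}")
```

Over a long stream the measured factor should settle close to the requested 15%, while short streams can deviate noticeably.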

Figure 8: Input switching pattern

3.2 Synchronous Setup

The aforementioned bus optimization program [9] was run using the parameters above and assuming a clock frequency of 1GHz. In the synchronous case, each simulation was set up as shown in Fig. 9. The number of latch repeaters in the channel was varied in order to achieve different throughputs across the simulations. The most complex component in the synchronous simulations is the latch repeater itself. Fig. 10 shows a transistor level diagram of its design. It is simply a basic D flip flop in a master/slave configuration. Through experimentation it was found that a setup time of approximately 120ps was necessary to guarantee proper functionality.

3.3 Asynchronous Setup

The asynchronous design is slightly more complicated. The data lines are first optimized the same way they were in the synchronous case. One can see from Fig. 11, though, that at least one extra bit line is present. This line carries the state conductor bit discussed in the previous section. Initially, this extra line is given the same parameters as the data lines. However, it can also be seen that the buffers on the control line are not simply inverters, as they are on the data lines. This is because signals need to travel in both directions on the control line: the control blocks will always pass a logical 0 to the right and a logical 1 to the left.
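The full/empty rule carried by the state conductor (Section 2.2) can be illustrated at a purely behavioral level. The sketch below is an abstraction, not a model of the GasP circuit in [6]: it ignores all transistor-level timing, represents each stage as either holding a value or being empty, and the function names are hypothetical.

```python
from typing import Optional

EMPTY, FULL = 1, 0  # state conductor levels as described in the text


def sc_levels(stages: list[Optional[int]]) -> list[int]:
    """State conductor value for each stage: 0 if it holds data, 1 if empty."""
    return [EMPTY if s is None else FULL for s in stages]


def fire_once(stages: list[Optional[int]]) -> list[Optional[int]]:
    """One round of firings, with index 0 as the leftmost stage.

    A stage passes its data to the right only when it is full and its right
    neighbor is empty. Scanning from right to left means a stage emptied in
    this round can be refilled by its left neighbor in the same round.
    """
    nxt = list(stages)
    for i in range(len(nxt) - 2, -1, -1):
        if nxt[i] is not None and nxt[i + 1] is None:
            nxt[i + 1], nxt[i] = nxt[i], None
    return nxt


# The last stage is full and is not being drained, so data backs up behind it.
pipeline = [7, None, 5, 9]
pipeline = fire_once(pipeline)
print(pipeline, sc_levels(pipeline))

# Once the receiver takes the last item, the waiting stages advance.
pipeline[-1] = None
pipeline = fire_once(pipeline)
print(pipeline, sc_levels(pipeline))
```

The stalled state in the first print corresponds to the case in the text where the current and subsequent stages both contain data and the current stage must wait.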

Figure 9: Synchronous simulation setup

Figure 10: Synchronous latch repeater

Since an inverter is not capable of exhibiting this behavior, a bi-directional buffer was designed. A transistor level diagram of the buffer is shown in Fig. 12. Each side of the buffer contains an inverter, a keeper, and pull-up/down assist logic. If a 0 is being propagated to the right, inverter0 will trip both the keeper (n0) and the actual pull-down NFET (n1) on the right side of the buffer. As the D_RIGHT node is pulled down, inverter1 drives a 1 onto node0, which subsequently drives a 0 onto the gate of n1, closing it and allowing the keeper to take over on its own. A similar series of events occurs when a 1 is being propagated to the left.

The use of this buffer results in some useful behavior. The delay of the control signal must be greater than the delay of the data lines. Otherwise, the next stage will attempt to take data before it is ready. Since the bi-directional buffer is noticeably slower than a single inverter, the control signal will always arrive after the data.

Figure 11: Asynchronous simulation setup

However, it was found, through experimentation, that the difference in delay between the control line and the data lines was unacceptably large. For example, in the case where eight latch repeaters were used, it took [...] ps for one bit to traverse the data line (this is between two consecutive latches, not the full 2cm), but [...] ps for the control signal to arrive. To make all of the data run 110ps slower because of the control line is unacceptable. As a result, the bus optimizer was run again just for the control line. This time, the buffer distance, latch repeater distance, and buffer size were all fixed in order to make the placement of logic on the control line match up with the data lines. This left the optimizer with the ability to alter wire width in order to speed up the line. Thus, this run was an optimization of area subject to a delay constraint. The second optimization run resulted in a 40% improvement in wire propagation delay, allowing the control signal to arrive at 340ps instead. This results in a higher throughput for the asynchronous case. But note that since the control line must always be slower than the data lines, even under worst case conditions and process variations, some delay margin must be built in to this control line optimization problem.

Figure 12: Bi-directional buffer design

3.4 Latch modifications

This buffer interfaces with the control portion of the asynchronous latch found in Fig. 13. The latch's behavior was described in Section 2.2, so it will not be discussed here. Notice one small difference, though. The simple nfet-only passgate in the original example has been replaced with a passgate containing both an nfet and a pfet. This change was made because it was becoming difficult to pass a logic 1 into the latch.

Figure 13: Asynchronous latch design

3.5 Asynchronous design methodology

Based on the methods described above, an asynchronous design methodology can be derived. The steps are as follows:

1. Optimize data lines, as done in synchronous case
2. Fix repeater distance according to results obtained from step 1
3. Separately optimize control line based on data line delays
   o cannot allow control to be faster than slowest data line
   o results in wider control line, since distances remain fixed

3.6 General Observations

Before examining the data collected in these simulations, one important point must be made. For the same number of latches, the synchronous case yields twice the throughput of the asynchronous case. This is because the SC bit must first propagate a 0 all the way to the right, indicating that the current stage is full, and then propagate a 1 all the way back to the left, indicating that the stage is now empty. While this seems to be a major problem with the asynchronous design, it is actually easily fixed. The only requirement is that an odd number of repeater stages exist between each pair of latches. If this is the case, it is possible to replace the bi-directional buffer in the middle of the channel with a simplified version of the asynchronous control portion of the latch. Now, once the data gets halfway through the channel, it will start to send a signal back in the other direction, resulting in a matched throughput for both the asynchronous and synchronous simulations. Unfortunately, due to a limited number of repeater stages, this was only possible in one case of the 180nm runs (assuming that the stages were symmetric). However, the added repeater stages required to propagate the data over a 2cm distance for the 130nm and 100nm cases ultimately provided a greater opportunity for optimization.

4. Experimental Data

We compared our optimized designs on the basis of required power dissipation for equivalent throughput of the asynchronous and synchronous cases. The maximum possible normalized (by channel width) throughput was very similar for both cases over all technology nodes, but the power dissipation requirements varied considerably.

4.1 Channel specs according to process

Table 1 contains the bus characteristics for each of the process technologies that we considered. Note that there are two 130nm technology nodes. This is because the 180nm [5] and first 130nm [13] processes were for existing ASIC technologies. We wanted to compare these results with those for a typical high-performance microprocessor process, as derived from [4]. This provided us with an additional opportunity to consider extrapolation to a 100nm technology, which was available only for a high-end process node at this time [4].
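The factor-of-two penalty from Section 3.6, and the effect of the midpoint fix, can be checked with simple arithmetic. The model below is a deliberate simplification and an assumption: the control signal is taken to need the same time to cross a latch-to-latch hop in either direction, and the synchronous design is taken to launch one item per hop delay. The 500ps hop delay is hypothetical.

```python
def async_cycle_time_ps(hop_delay_ps: float, midpoint_control: bool) -> float:
    """Time between successive data items on one latch-to-latch hop.

    Without midpoint control the state conductor carries a 0 across the whole
    hop and then a 1 all the way back. With control logic at the midpoint,
    the return signal starts once the data is halfway, so only half the hop
    is crossed in each direction.
    """
    distance_fraction = 0.5 if midpoint_control else 1.0
    return 2.0 * distance_fraction * hop_delay_ps


def throughput_gbps(cycle_ps: float, bit_width: int) -> float:
    """Bits per nanosecond (Gb/s) for a bus of the given width."""
    return bit_width * 1000.0 / cycle_ps


HOP_DELAY_PS = 500.0  # hypothetical latch-to-latch delay
BITS = 4

sync = throughput_gbps(HOP_DELAY_PS, BITS)
plain = throughput_gbps(async_cycle_time_ps(HOP_DELAY_PS, False), BITS)
fixed = throughput_gbps(async_cycle_time_ps(HOP_DELAY_PS, True), BITS)
print(f"sync {sync:.1f} Gb/s, async {plain:.1f} Gb/s, async with midpoint {fixed:.1f} Gb/s")
```

Under these assumptions the plain asynchronous channel delivers half the synchronous throughput and the midpoint version matches it, in line with the observation in Section 3.6.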

As can be seen from the table, when migrating to a smaller feature size technology, the bus is characterized by a shorter segment length (i.e., the distance between repeaters), as expected. As the minimum feature sizes become smaller, and the bus optimization for maximum throughput continues to select the maximum number of wires in the channel solution, the metal resistance increases, and the demands on repeater insertion increase correspondingly. The side effects are that larger repeaters are required (larger relative to minimum feature size, not in an absolute sense), and that they are inserted more frequently in order to maintain the signal across the full two centimeters. In these experiments, the 180nm simulations required 15 repeaters, the ASIC 130nm required 31, the high-end 130nm required 19, and the 100nm required 25. Figures 14, 15, 16 and 17 compare power dissipation vs. throughput for the synchronous and asynchronous cases over all of these process nodes. Note that each point represents a throughput optimized design.

Process      Wire Width   Wire Spacing   Gate Size        Seg. Length (Lseg)   Signal Spd.
180nm        0.3 um       0.3 um         44/22 um         [...]                1.36e7 m/s
130nm        0.2 um       0.21 um        34/17 um         690 um               1.03e7 m/s
130nm (p)    [...]        [...]          35.5/17.75 um    1.10 mm              1.36e7 m/s
100nm        0.2 um       0.2 um         29/14.5 um       [...]                1.35e7 m/s

Table 1: Bus characteristics

4.2 Power comparisons

The 180nm results are summarized in Fig. 14. First, the datapath power dissipation was compared. One would expect that since the asynchronous latches were smaller, less power would be consumed. However, in this case, we were not able to match throughput without using twice as many latches. Because the asynchronous control line must first propagate a logic 0 to the right in Fig. 11, indicating that the current stage is full, then propagate a logic 1 back to the left, indicating that the subsequent stage has accepted that data, it would nominally require two round trips to send each vector of data. By adding intermediate control logic at the midway point of the path, as described above, the data can be sent and acknowledged in the time period required for one round trip. As a result, however, as the required throughput is increased, the power consumed by the datapath in the asynchronous case becomes slightly greater than that of the synchronous.

Next, the power dissipated by the clock in the synchronous case was compared to the power dissipated by the control line in the asynchronous case. These results did not favor the asynchronous case as much as some might have predicted. Significantly more power was burned in the asynchronous case than in the synchronous case. This can be attributed to the synchronous clock model used in these experiments. Typically, to clock the latches in this channel, buffers would be connected to the clock grid in appropriate places and sent to the latch inputs. While those buffers were modeled in this design, the interconnect associated with the clock grid itself was not. In contrast, the complete 2cm control line needed to be constructed for the asynchronous case to function. We will comment more on the implications of the global clock power savings below.

Figure 14: Datapath and control power consumption (180nm) (panels: Datapath Power (180nm), Clock/Control Power (180nm); horizontal axis: Throughput (Gb/s))

Similar experiments were run for the 130nm node. The results of these experiments are summarized in Fig. 15. The major difference is that with more buffers in the channel, it was possible to apply the fix mentioned earlier and match the throughput between the asynchronous and synchronous design methodologies. Now, the results are closer to what one might expect.

The power consumed by the datapath of the synchronous example is now slightly higher than that of the asynchronous example. The clock followed the same trend as in the 180nm case, but with one small difference. It can be seen that the final point on the curve does not follow the linear trend seen in the 180nm example. This has to do with the size of the bi-directional buffers. Once implemented, the bi-directional buffers were actually larger than the asynchronous control block, and certainly larger than the simplified version of it used to match the throughput of the synchronous case. When going from three buffers between latches to only one, all bi-directional buffers are removed from the control line, and only control blocks or simplified control blocks exist on the control line. This causes a more drastic change in energy consumption than in the previous cases, so the power dissipation does not increase as much as one might expect.

Figure 15: Datapath and control power consumption (130nm) (panels: Datapath Power (130nm), Clock/Control Power (130nm); horizontal axis: Throughput (Gb/s); curves: async, async_fast, sync)

Notice that there are three curves in the graphs shown in Fig. 15. The curve labeled async_fast is the data collected when the throughputs of the synchronous and asynchronous cases were matched, while the async curve did not attempt to match the throughputs. The two have been placed on the graph together to show that when the throughputs are the same in the two asynchronous designs, the power dissipation is virtually identical. As a result, choosing a design becomes a tradeoff between the area consumed (async required more latches, so more area is used) and the depth of the pipeline (more latches can hold more data).

Next we considered 130nm and 100nm technologies that are representative of a high-end processor technology. In each case, the simplified asynchronous control blocks have been placed between the latches in order to match the throughput of the synchronous and asynchronous designs. The results of these experiments are displayed in Figs. 16 and 17. Clearly, the processor technology follows the same trends seen for the ASIC technologies. It is apparent for these examples, however, that there is a superlinear increase in the local clock power dissipation for the synchronous cases. Thus, if throughput were to increase, the clock may begin to burn more power than the asynchronous control blocks. However, no more latches can be placed in any of the channels in order to increase throughput and possibly observe this behavior for this design example. As a result, for these technologies, and using this circuit design example, it is not possible for the asynchronous solution to burn less power than the synchronous solution.

Figure 16: Datapath and control power consumption (130nm, p) (panels: Datapath Power (130nm, p), Clock/Control Power (130nm, p); horizontal axis: Throughput (Gb/s))

Figure 17: Datapath and control power consumption (100nm) (panels: Datapath Power (100nm), Clock/Control Power (100nm); horizontal axis: Throughput (Gb/s))

Two final points should be made here. First, these graphs do not reflect the fact that the global spine of the clock tree can be removed when using a globally asynchronous design. Therefore, without considering the entire system design, it cannot be determined whether the additional power consumed by each of the asynchronous communication channels would be more or less than the total power saved by not having a global clock. Second, the overall latency of the bus was not discussed. This is primarily because the latency changed little from one simulation to the next (it did change between processes, though). The only differences stemmed from the fact that the latches were somewhat slower than the buffers, but the difference was negligible in all cases.

4.5 Area Estimation

The asynchronous design solution does appear to have an advantage over the synchronous solution in terms of transistor width (while an area comparison would be better, no layout for this design exists, so transistor width will be used as an estimate of how much area each design will consume). Fig. 18 contains transistor width statistics for each design. In each case, the maximum throughput solution is being explored. It can be seen that while the asynchronous solution starts out worse (this is due to the overhead involved in building the control line), as the width of the bus increases, the slope of the asynchronous design is less than the slope of the synchronous design. Eventually, in every case, the total transistor width for the asynchronous design becomes lower than that of the synchronous design. So, it can be deduced that a wider asynchronous bus will use less silicon than the synchronous bus.

Figure 18: Transistor width analysis (panels: Transistor Width (180nm), Transistor Width (130nm), Transistor Width (130nm, p), Transistor Width (100nm); horizontal axis: Bus Width (bits); curves: sync, async)

5. Design Considerations and Conclusions

The results from Section 4 raise several interesting questions that we consider below.

5.1 Bus width?

Since this was only a 4 bit bus, it was possible to use only one control line for the entire structure. Clearly it would be impractical to attempt to use only one control line for wider busses, for two main reasons. First, the increased load placed on the control portions of the latches would make their area increase to an unacceptable size, resulting in added capacitance and, therefore, added power consumption. Second, the long wires used to distribute the control signals will be subject to process variations and skew, adding uncertainty to their arrival times.

5.2 Async setup?

It is important to design for a safe "setup" time for the asynchronous latches. When reoptimizing the control lines, it was arbitrarily decided that the control signal should be approximately 10% slower than the slowest data line. While this seemed to be enough for the worst case simulations that we considered, more detailed models are required to ensure proper operation and high yielding silicon in general.

5.3 Global communication power?

In order to determine just how useful (or useless) the application of an asynchronous communication channel would be, it would be worthwhile to determine what percentage of the chip's power consumption is dedicated to global communication. Based on IBM's study, the removal of the global clock should save about 10% of the overall power, but the asynchronous control line burns significantly more power than its synchronous counterpart. One must consider the entire architecture in order to make this assessment.

5.4 Shielding Effects

As previously mentioned, all four bits in these experiments were fully shielded. This allowed us to examine the power consumption without having to handle additional constraints on noise. However, this also masked the potentially strong connection between delay and the input switching pattern. In other words, if adjacent bits switch concurrently, this can substantially increase or decrease the signal delay. Therefore, an unshielded bus design must be constructed so that the delay of the control line is always greater than the worst case delay of the data lines. This would impact both the asynchronous and synchronous design throughputs.
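The control-line constraint discussed in Sections 5.2 and 5.4 reduces to a simple check: the control delay must exceed the worst case data delay by some margin. The sketch below uses the roughly 10% margin mentioned in Section 5.2 as its default. The function names and the delay values are hypothetical, and a real flow would take its worst case delays from simulation across switching patterns and process corners.

```python
def required_control_delay_ps(data_delays_ps: list[float], margin: float = 0.10) -> float:
    """Minimum control-line delay: the slowest data delay plus a fractional margin."""
    return max(data_delays_ps) * (1.0 + margin)


def control_is_safe(control_delay_ps: float,
                    data_delays_ps: list[float],
                    margin: float = 0.10) -> bool:
    """True if the control signal arrives late enough relative to every data line."""
    return control_delay_ps >= required_control_delay_ps(data_delays_ps, margin)


# Hypothetical worst case delays (ps) for the four data lines of one hop.
data_delays = [300.0, 310.0, 295.0, 305.0]

print(f"required control delay: {required_control_delay_ps(data_delays):.1f} ps")
print("350 ps control line safe:", control_is_safe(350.0, data_delays))
print("320 ps control line safe:", control_is_safe(320.0, data_delays))
```

For an unshielded bus, the list of data delays would need to include the slowest case produced by adjacent bits switching concurrently, since that case sets the bound.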

5.5 Future Work

In addition to these questions, future work is to design and complete the layout for some of these optimized designs. This would allow us to extract more realistic parasitics from the layout and perform more accurate simulations of the communication channels. Finally, we would like to fabricate and test the ICs in order to see the full effect of manufacturing variations.

5.6 Conclusions

In conclusion, while speed has been a primary design goal for many ICs in the past, it is apparent that presently we must consider power consumption and manufacturing variations as first order effects as well. Superficially, asynchronous design seems like a good solution to both of these problems. By removing the clock, which has been identified as the major power consumer for many ICs (especially microprocessors), and localizing process variations, one would expect both problems to improve. However, our actual circuit-level implementations demonstrated that this is not the case. While the use of an asynchronous design methodology does a good job of masking process variations, the power savings that one would expect to see are not as encouraging.

Since it is impractical to jump from a purely synchronous to a purely asynchronous design, globally asynchronous locally synchronous (GALS) is an obvious compromise. But based on our experiments, even for a GALS system, the 10% power savings resulting from the removal of the global clock is likely to be overshadowed by the increase in power dissipation caused by the addition of the asynchronous control line. Our results further indicate that the benefits of asynchronous design would become more apparent for a more local design problem, where smaller devices could be used. However, the benefits of locally asynchronous design styles are not apparent at this time.
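The GALS trade-off described above can be made concrete with a small break-even calculation. This is only a sketch under stated assumptions: the chip power, the per-channel control-line powers, and the function names are hypothetical placeholders, and the only figure taken from the text is the roughly 10% saving attributed to removing the global clock.

```python
def net_power_saving(chip_power_w, clock_saving_fraction, n_channels,
                     sync_ctrl_power_w, async_ctrl_power_w):
    """Net saving (watts) from going GALS; negative means power increased.

    The saving is the fraction of chip power recovered by removing the
    global clock; the cost is the extra control-line power per channel.
    """
    saved = clock_saving_fraction * chip_power_w
    added = n_channels * (async_ctrl_power_w - sync_ctrl_power_w)
    return saved - added


def break_even_channels(chip_power_w, clock_saving_fraction,
                        sync_ctrl_power_w, async_ctrl_power_w):
    """Number of channels at which the clock saving is fully consumed."""
    extra_per_channel = async_ctrl_power_w - sync_ctrl_power_w
    if extra_per_channel <= 0:
        return float("inf")  # asynchronous control costs nothing extra
    return clock_saving_fraction * chip_power_w / extra_per_channel


if __name__ == "__main__":
    # Hypothetical example: 10 W chip, 10% clock saving, control-line
    # power of 1 mW (synchronous) versus 5 mW (asynchronous) per channel.
    print(break_even_channels(10.0, 0.10, 0.001, 0.005))
    print(net_power_saving(10.0, 0.10, 400, 0.001, 0.005))
```

With measured per-channel control-line powers in place of the placeholders, the same calculation indicates how many global channels an architecture can contain before the clock-removal saving is overshadowed.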

6. References

[1] B. Curran, P. Camporese, S. Carey, Y. Chan, R. Clemen, R. Crea, D. Hoffman, T. Koprowski, M. Mayo, T. McPherson, G. Northrop, L. Sigal, H. Smith, F. Tanzi, P. Williams, "A 1.1 GHz First 64B Generation Z900 Microprocessor," IEEE International Solid-State Circuits Conference, February 2001, pp
[2] E. Acar, F. Dartu, L.T. Pileggi, "TETA: Transistor-Level Waveform Evaluation for Timing Analysis," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, Vol. 21, No. 5, pp , May 2002
[3] "International Technology Roadmap for Semiconductors," 2001, public.itrs.net
[4]
[5]
[6] Ivan Sutherland, Scott Fairbanks, "GasP: A Minimal FIFO Control," Proc. ASYNC, pp , 2001
[7] K. Nabors and J. White, "FastCap: A Multipole Accelerated 3D Capacitance Extraction Program," IEEE Trans. CAD, 10, pp , November 1991
[8] M. Kamon, M. Tsuk, and J. White, "FastHenry: A Multipole Accelerated 3D Inductance Extraction Program," IEEE Trans. Microwave Theory and Techniques, 42, pp , September 1994
[9] Tao Lin, Lawrence T. Pileggi, "Throughput-Driven IC Communication Fabric Synthesis," Submitted to ICCAD, 2002
[10] Tao Lin, Michael W. Beattie, Lawrence T. Pileggi, "On the Efficacy of Simplified 2D On-Chip Inductance Models," Proc. DAC, June 2002
[11] Thomas Meincke, et al., "Globally Asynchronous, Locally Synchronous Architecture for Large, High Performance ASICs," Proc. ISCAS, pp , 1999
[12] V. Tiwari, et al., "Reducing Power in High-Performance Microprocessors," Proc. DAC, pp , 1998
[13] "HCMOS9_GP Design Rules Manual: 0.13 Micron CMOS Process," Rev. C, Nov


More information

A Novel Pseudo 4 Phase Dual Rail Asynchronous Protocol with Self Reset Logic & Multiple Reset

A Novel Pseudo 4 Phase Dual Rail Asynchronous Protocol with Self Reset Logic & Multiple Reset A Novel Pseudo 4 Phase Dual Rail Asynchronous Protocol with Self Reset Logic & Multiple Reset M.Santhi, Arun Kumar S, G S Praveen Kalish, Siddharth Sarangan, G Lakshminarayanan Dept of ECE, National Institute

More information

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment

Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Xin-Wei Shih, Tzu-Hsuan Hsu, Hsu-Chieh Lee, Yao-Wen Chang, Kai-Yuan Chao 2013.01.24 1 Outline 2 Clock Network Synthesis Clock network

More information

Interfacing RLDRAM II with Stratix II, Stratix,& Stratix GX Devices

Interfacing RLDRAM II with Stratix II, Stratix,& Stratix GX Devices Interfacing RLDRAM II with Stratix II, Stratix,& Stratix GX Devices November 2005, ver. 3.1 Application Note 325 Introduction Reduced latency DRAM II (RLDRAM II) is a DRAM-based point-to-point memory device

More information

Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools

Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools Shamik Das, Anantha Chandrakasan, and Rafael Reif Microsystems Technology Laboratories Massachusetts Institute of Technology

More information

Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor

Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 5, MAY 1998 707 Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor James A. Farrell and Timothy C. Fischer Abstract The logic and circuits

More information

Network on Chip Architecture: An Overview

Network on Chip Architecture: An Overview Network on Chip Architecture: An Overview Md Shahriar Shamim & Naseef Mansoor 12/5/2014 1 Overview Introduction Multi core chip Challenges Network on Chip Architecture Regular Topology Irregular Topology

More information

Lecture 1: CS/ECE 3810 Introduction

Lecture 1: CS/ECE 3810 Introduction Lecture 1: CS/ECE 3810 Introduction Today s topics: Why computer organization is important Logistics Modern trends 1 Why Computer Organization 2 Image credits: uber, extremetech, anandtech Why Computer

More information

TEMPLATE BASED ASYNCHRONOUS DESIGN

TEMPLATE BASED ASYNCHRONOUS DESIGN TEMPLATE BASED ASYNCHRONOUS DESIGN By Recep Ozgur Ozdag A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the

More information

Automated Extraction of Physical Hierarchies for Performance Improvement on Programmable Logic Devices

Automated Extraction of Physical Hierarchies for Performance Improvement on Programmable Logic Devices Automated Extraction of Physical Hierarchies for Performance Improvement on Programmable Logic Devices Deshanand P. Singh Altera Corporation dsingh@altera.com Terry P. Borer Altera Corporation tborer@altera.com

More information

EE5780 Advanced VLSI CAD

EE5780 Advanced VLSI CAD EE5780 Advanced VLSI CAD Lecture 1 Introduction Zhuo Feng 1.1 Prof. Zhuo Feng Office: EERC 513 Phone: 487-3116 Email: zhuofeng@mtu.edu Class Website http://www.ece.mtu.edu/~zhuofeng/ee5780fall2013.html

More information

HOME :: FPGA ENCYCLOPEDIA :: ARCHIVES :: MEDIA KIT :: SUBSCRIBE

HOME :: FPGA ENCYCLOPEDIA :: ARCHIVES :: MEDIA KIT :: SUBSCRIBE Page 1 of 8 HOME :: FPGA ENCYCLOPEDIA :: ARCHIVES :: MEDIA KIT :: SUBSCRIBE FPGA I/O When To Go Serial by Brock J. LaMeres, Agilent Technologies Ads by Google Physical Synthesis Tools Learn How to Solve

More information

Chronos Latency - Pole Position Performance

Chronos Latency - Pole Position Performance WHITE PAPER Chronos Latency - Pole Position Performance By G. Rinaldi and M. T. Moreira, Chronos Tech 1 Introduction Modern SoC performance is often limited by the capability to exchange information at

More information

Power Estimation of UVA CS754 CMP Architecture

Power Estimation of UVA CS754 CMP Architecture Introduction Power Estimation of UVA CS754 CMP Architecture Mateja Putic mateja@virginia.edu Early power analysis has become an essential part of determining the feasibility of microprocessor design. As

More information

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen CSE 548 Computer Architecture Clock Rate vs IPC V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger Presented by: Ning Chen Transistor Changes Development of silicon fabrication technology caused transistor

More information

Design of Asynchronous Interconnect Network for SoC

Design of Asynchronous Interconnect Network for SoC Final Report for ECE 6770 Project Design of Asynchronous Interconnect Network for SoC Hosuk Han 1 han@ece.utah.edu Junbok You jyou@ece.utah.edu May 12, 2007 1 Team leader Contents 1 Introduction 1 2 Project

More information