Quasi Delay-Insensitive High Speed Two-Phase Protocol Asynchronous Wrapper for Network on Chips


Guan XG, Tong XY, Yang YT. Quasi delay-insensitive high speed two-phase protocol asynchronous wrapper for network on chips. Journal of Computer Science and Technology 25(5), Sept. 2010.

Xu-Guang Guan, Student Member, IEEE, Xing-Yuan Tong, and Yin-Tang Yang
Institute of Microelectronics, Xidian University, Xi'an, China
E-mail: guanxuguang 5@126.com; mayxt@126.com; yangyt@xidian.edu.cn
Received July 13, 2009; revised June 17.

Abstract  To overcome the low speed and high power consumption of asynchronous wrappers in conventional networks on chip, this paper proposes a quasi delay-insensitive high-speed two-phase asynchronous wrapper. Metastability during data sampling is avoided by detecting the write/read signal and using it to stop the clock. The empty/full state of the registers is determined by detecting the pulse signal of the two-phase asynchronous register, which then controls the wrapper to sample input/output data. The sender and receiver wrappers consist of C elements and threshold gates, which ensure quasi delay-insensitivity and enhance robustness. Simulations under different technology corners are based on the SMIC 0.18 µm standard CMOS process. The sender and receiver wrappers allow synchronous modules to work at 3.08 GHz and 2.98 GHz respectively, with average dynamic power consumption in the milliwatt range. High throughput, low power, scalability and robustness make the wrapper a viable option for high-speed low-power network-on-chip interconnection.

Keywords  asynchronous wrapper, quasi delay-insensitive, network on chip (NoC), two-phase protocol, threshold gate

1 Introduction

As semiconductor technology shrinks, more IP cores are integrated onto a single chip to implement more complicated and efficient on-chip system solutions.
However, as on-chip systems become faster and larger, conventional single-clock operation faces many challenges, such as poor module reusability, growing clock-tree power consumption, large clock-tree area, clock skew and EMI. These problems greatly increase the complexity of designing very deep submicron integrated circuits, so clock-related problems have become the crucial issues to solve first in ultra-large-scale integrated circuits. To reduce clock power consumption and increase communication performance, extensive research has been conducted into network-on-chip (NoC) systems [1-3]. The NoC approach particularly suits communication-dominant on-chip systems. Asynchronous NoCs have been proposed to eliminate the clock for global communication [4-5], providing better power efficiency and higher modularity than synchronous NoCs.

Asynchronous circuits have two advantages over synchronous circuits for low-power design. First, the lack of a clock network is a substantial advantage: high-speed clock networks have been known to account for as much as 70% of the total power consumption of a system [6]. Second, asynchronous circuits have the equivalent of perfect clock gating: they start working when data arrive and automatically go to sleep when no data arrive, without any extra control logic. It is therefore necessary to add asynchronous wrappers around synchronous modules so that the modules can communicate through asynchronous routers with high throughput and low power. The asynchronous wrapper thus becomes an important and difficult issue in NoC design. This paper presents a novel quasi delay-insensitive two-phase protocol asynchronous wrapper that is high-speed, low-power and highly scalable, and that can fulfill the requirements of high-performance interconnection in networks on chip.
Regular Paper. Supported by the National Natural Science Foundation of China, the National High-Tech Research and Development 863 Program of China under Grant Nos. 2009AA01Z258 and 2009AA01Z260, and the National Science & Technology Important Project under Grant No. 2009ZX. J. Comput. Sci. & Technol., Sept. 2010, Vol.25, No.5. Springer Science + Business Media, LLC & Science Press, China.

2 Network on Chips and Asynchronous Interconnection Handshake Protocol

A network on chip borrows its working manner from data transmission in computer networks. Data are delivered to the target module by route switching, which replaces the conventional bus-based transmission mode and gives high concurrent transmission capacity and expansibility. Because data transmission in an asynchronous on-chip network depends on handshake signals rather than on a clock, the problems brought by the clock are eliminated and modularity is greatly enhanced. Although on-chip networks come in a variety of topologies, asynchronous wrappers are highly reusable and can be used directly regardless of topology. The 2D mesh topology is widely used in networks on chip, and Fig.1 shows its unit structure, which consists of a synchronous module, a stoppable clock, an asynchronous wrapper, an asynchronous router and an on-chip buffer.

Fig.1. Structure diagram of 2D mesh network on chip unit.

The four-phase protocol is widely used in conventional NoC design [7-9], because it can effectively reuse existing synchronous units and its simplicity suits the design of function modules in asynchronous routers. However, communication between the asynchronous router and the asynchronous wrapper becomes the bottleneck when burst-mode data transmission occurs. In other words, four-phase operation cannot satisfy the increasing demand for large-scale communication in a network on chip.

Two-phase operation can be triggered at both the rising edge and the falling edge of the request signal, so its working speed is much higher than that of four-phase operation [10]. However, the two-phase protocol does not fit function module design because of its high complexity; it is therefore preferable for communication between asynchronous routers. In four-phase operation, as shown in Fig.2(a), the control signals toggle four times to transmit one data item, which costs both time and power, so it is not suitable for high-speed low-power asynchronous on-chip interconnect. Substituting the two-phase protocol for the four-phase protocol greatly increases communication speed, as shown in Fig.2(b). In two-phase operation, each transition of the request and acknowledgement signals represents a data transmission, so the throughput of a two-phase circuit can reach twice that of a four-phase circuit at the same request frequency. The two-phase protocol is thus more suitable for high-speed transmission in networks on chip.

Fig.2. Four-phase and two-phase transmission protocols. (a) Four-phase handshake protocol. (b) Two-phase handshake protocol.

However, conventional two-phase transmission protocols are mostly single-rail, and delay matching must be adopted to avoid glitches at the outputs and guarantee proper operation. This is where dual-rail operation shows its advantages. Dual-rail protocol circuits are also called quasi delay-insensitive circuits: the circuit is insensitive to variations of physical parameters provided that the branches of each fork and merge are equal in length. Such variations include doping density fluctuations and temperature and voltage variations. Dual-rail circuits are therefore highly robust, which suits low-voltage high-speed on-chip applications. The high-speed two-phase protocol and the robust dual-rail protocol can be combined into two-phase dual-rail operation.

Fig.3 shows the two-phase dual-rail transmission protocol. Even if the channel transmits the same data, the codes in different cycles are not always the same, because the code in the present cycle depends on the data in the previous cycle. If the data in the present cycle differs from that of the previous cycle, the data rail S1 changes; if the data stays the same over consecutive cycles, whether 1 or 0, S1 holds and the accompanying rail S2 toggles (see Fig.3). The encoding scheme of the two-phase dual-rail protocol can be seen more clearly in the waveform of Fig.4. Only one bit changes in each cycle; the data is indicated by S1, while S2 is an accompanying bit that indicates that adjacent cycles carry the same data. This is a non-return-to-zero (NRTZ) encoding in which each transition of one line represents a data transmission, so the time utilization is twice that of four-phase operation. Once two-phase dual-rail asynchronous wrappers are used with the synchronous modules, data signals can be converted to two-phase dual-rail data at high speed. High-speed transmission between wrappers and routers then becomes possible, which greatly enhances system performance.

Fig.3. Diagram of two-phase dual-rail transmission protocol.

3 Specific Implementation of Two-Phase Quasi Delay-Insensitive Asynchronous Wrapper

The design of an asynchronous wrapper must meet three major targets: fewer metastable states, lower delay and lower power consumption. The wrapper converts synchronous signals into asynchronous signals of the corresponding transmission protocol and vice versa.
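The conversion target of the wrappers, the two-phase dual-rail code of Figs. 3 and 4, can be sketched in software as follows. This is a minimal behavioral model rather than the authors' circuit: the function names, the initial code (0, 0) and the use of the parity S1 XOR S2 to recognise a new symbol are assumptions made for illustration, chosen to be consistent with the rule that exactly one rail changes per cycle.

```python
from typing import Iterable, List, Tuple

Code = Tuple[int, int]  # (S1, S2): S1 carries the data bit, S2 is the accompanying bit


def encode(bits: Iterable[int], start: Code = (0, 0)) -> List[Code]:
    """Encode a bit stream so that exactly one rail toggles per symbol."""
    s1, s2 = start  # assumed reset state, not taken from the paper
    out: List[Code] = []
    for d in bits:
        if d != s1:
            s1 = d      # data changed: the data rail S1 toggles
        else:
            s2 ^= 1     # data repeated: the accompanying rail S2 toggles
        out.append((s1, s2))
    return out


def decode(codes: Iterable[Code], start: Code = (0, 0)) -> List[int]:
    """Recover the bits: a flip of the parity S1^S2 marks one new symbol."""
    prev_phase = start[0] ^ start[1]
    bits: List[int] = []
    for s1, s2 in codes:
        phase = s1 ^ s2
        if phase != prev_phase:   # one transition on either rail = one datum
            bits.append(s1)
            prev_phase = phase
    return bits


def rails_changed(prev: Code, cur: Code) -> int:
    """Number of rails that differ between two consecutive codes."""
    return (prev[0] ^ cur[0]) + (prev[1] ^ cur[1])
```

For example, `encode([1, 1, 0, 0, 1])` walks through one rail change per symbol, and `decode` returns the original bits. Because every symbol costs a single wire transition and no return-to-zero phase, the model reflects why the text counts one transition per data item.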
The asynchronous wrappers proposed recently in [11-13] are four-phase single-rail wrappers. Although a single-rail design can broadly reuse traditional synchronous units, it has unavoidable limitations such as delay matching, poor EMI immunity, extra control circuits and glitches. The proposed two-phase wrapper is insensitive to delay variations, works properly without any delay-matching effort, and has good EMI immunity.

Fig.4. Waveform of two-phase dual-rail transmission protocol.

3.1 Sender Two-Phase Wrapper

The circuit of the sender two-phase quasi delay-insensitive wrapper is shown in Fig.5. It automatically detects the write signal and converts the output data of the synchronous module to two-phase dual-rail output data, and it works properly under delay variations. For simplicity, only a one-bit conversion is drawn here; a wider conversion is obtained by widening the output buffer. The difficulty in designing the wrapper lies in detecting the full/empty state of the output buffer and controlling the clock module.

Fig.5. Implementation of sender two-phase quasi delay-insensitive wrapper.

The circuit in Fig.5 consists of three parts: the synchronous module, the two-phase asynchronous wrapper and the asynchronous output buffers. The wrapper is responsible for releasing and stopping the clock as well as for data sampling. It stops the stoppable clock module by detecting the write signal: if write goes high, the stretch signal goes high after only two gate delays. The processing element serves as the synchronous module and is responsible for sending data and informing the wrapper to write. The output buffers are responsible for data storage and for the connection to the asynchronous routers. Unlike conventional four-phase registers, two-phase registers need XOR gates at both input and output to detect new data, as shown in the right half of Fig.5.

Two flip-flops sample the synchronous data, and a pulse generator, shown in Fig.6, controls their sampling. The pulse generator consists of an XNOR gate, an XOR gate and a NOR gate. N1 and N2 come from the XOR gates at the register input and output respectively, while the ack signal comes from the next stage of the two-phase register. When the codes at the input and output are unequal, N1 and N2 differ and the output of the XNOR gate goes low. If ack and N2 are at the same level, the output of the XOR gate is low as well. With both of its inputs low, the NOR gate drives the pulse signal high, and this rising edge triggers the D flip-flops to sample the incoming data. After the D flip-flops have sampled the incoming data, the pulse signal goes low.
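The pulse generator of Fig.6 can be read as a small Boolean function. The sketch below is a hedged interpretation, not the published netlist: it assumes that the XNOR gate compares N1 with N2, that the XOR gate compares ack with N2, and that the NOR gate combines the two, which matches the described behavior (a pulse is raised only when the input code differs from the stored code and the next stage has acknowledged the stored code). Gate delays and pulse width are not modeled.

```python
def pulse(n1: int, n2: int, ack: int) -> int:
    """Static value of the pulse signal for given N1, N2 and ack levels."""
    xnor_out = 1 - (n1 ^ n2)   # low when the input and output codes differ
    xor_out = ack ^ n2         # low when the next stage agrees with this stage
    return 1 - (xnor_out | xor_out)  # NOR: high only if both inputs are low


def truth_table():
    """Enumerate all input combinations as (n1, n2, ack, pulse) tuples."""
    return [(n1, n2, ack, pulse(n1, n2, ack))
            for n1 in (0, 1) for n2 in (0, 1) for ack in (0, 1)]
```

Under these assumptions the pulse is high for exactly two of the eight input combinations, the ones with N1 different from N2 and ack equal to N2. Once the flip-flops capture the new code, N2 becomes equal to N1 and the pulse falls, which is the self-terminating behavior described in the text.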
So each change in the data signal causes the D flip-flops to sample, which avoids the null cycles of four-phase wrappers and enhances throughput.

Fig.6. Pulse generation module.

The wrapper circuit in Fig.5 works as follows. The circuit is first reset. After reset, the output of the reverse C element is high, the output of the TH33 threshold gate (signal stretch) is low, and the stoppable clock module runs normally; the wrapper waits for the write signal from the synchronous module. When the synchronous module wants to send data, write goes high, and the output of the reverse C element stays high because stretch is low, so the output of AND gate #1 goes high. All three inputs of the TH33 gate are now high, the threshold condition is met, and the TH33 output goes high. The stoppable clock therefore pauses, which keeps the output data stable. Two cases follow, depending on the input data.

In the first case, the new data differs from the previous data, that is, signal data and D1 are unequal. XOR gate #1 detects this and signal H goes high. Since stretch is also high, the output of AND gate #2 goes high, which enables the tri-state gate and lets the data pass to D1. The pulse generator detects the change in D1 and drives signal pulse high, which puts the TH33 gate into its reset condition, so stretch falls and the clock is released. At the same time, the rising edge of pulse makes D flip-flops #2 and #3 sample D1 and D2. The synchronous data has now been converted to two-phase dual-rail asynchronous data.

In the second case, the new data is the same as the previous data, i.e., signal data and D1 are equal. The output H of XOR gate #1 stays low, so the output of reverse C element #2 stays high. The output of AND gate #3 therefore goes high, which triggers D flip-flop #1 to invert its output.
Similarly, the pulse generator detects the change in D2 and drives signal pulse high, which makes the D flip-flops sample their inputs. This ends the cycle for the second case.

The procedure above can be regarded as the "write slowly, read quickly" situation. What happens when writing is quick and reading is slow? This situation is very similar to blocking in the asynchronous part. If the asynchronous part transmits slowly, the ack signal of the next stage does not reach the present stage right away, so the pulse signal does not go high until ack arrives. Signal stretch therefore stays high until the pulse signal changes; that is, the synchronous module waits until the asynchronous part is free. Flow control is thus governed by the pulse signal.

Another common situation is that the output buffers are full. In this case signal ack is opposite to signal N2, which keeps the TH33 output stretch high and the clock paused. Only when the output buffers are no longer full can the pulse signal rise again and release the TH33 output, so that the stoppable clock resumes.
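The set and reset conditions of the TH33 gate and the C element that the procedure relies on can be illustrated with a generic state-holding gate model. The sketch below assumes the usual threshold-gate behavior with hysteresis (the output sets when at least m of n inputs are high and resets only when all inputs are low); the class and function names are hypothetical, and the actual wiring of the gate inputs in Fig.5, including any inverted inputs or outputs, is not modeled.

```python
class ThresholdGate:
    """THmn gate with hysteresis: set at >= m high inputs, reset at all low."""

    def __init__(self, threshold: int, n_inputs: int, initial: int = 0) -> None:
        if not 1 <= threshold <= n_inputs:
            raise ValueError("threshold must be between 1 and n_inputs")
        self.threshold = threshold
        self.n_inputs = n_inputs
        self.out = initial

    def update(self, *inputs: int) -> int:
        """Apply new input levels and return the (possibly held) output."""
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        high = sum(1 for x in inputs if x)
        if high >= self.threshold:
            self.out = 1          # set condition reached
        elif high == 0:
            self.out = 0          # reset condition: every input removed
        return self.out           # otherwise the previous output is held


def c_element(n_inputs: int = 2) -> ThresholdGate:
    """A C element behaves like a THnn gate: all inputs must agree to switch."""
    return ThresholdGate(n_inputs, n_inputs)
```

With `th33 = ThresholdGate(3, 3)`, the output stays low for `update(1, 1, 0)`, rises on `update(1, 1, 1)`, holds through `update(0, 1, 0)` and falls only on `update(0, 0, 0)`. This holding behavior is what lets a stretch-like signal stay high while the clock is paused and fall only once the reset condition is met.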

3.2 Receiver Two-Phase Wrapper

The implementation of the receiver two-phase quasi delay-insensitive wrapper is shown in Fig.7. The receiver wrapper is simpler than the sender because only D1 represents data. The wrapper automatically detects the read signal and converts two-phase dual-rail data to synchronous data. Only a one-bit conversion is shown here; a wider conversion is obtained by widening the input buffer. The difficulty in designing this wrapper lies in detecting the full/empty state of the input buffer and controlling the stoppable clock module.

Fig.7. Implementation of receiver two-phase quasi delay-insensitive wrapper.

The circuit in Fig.7 consists of three main parts. The processing element serves as the synchronous module and is responsible for receiving data and informing the wrapper to read. If the read signal goes high, the stretch signal goes high after only two gate delays and the stoppable clock stops. The ack signal informs the previous stages that data sampling has finished. To achieve flow control at the receiver, the first two-phase register at the interface between the input buffers and the two-phase wrapper must be modified so that the stretch signal of the wrapper controls the pulse generation module. The improved pulse generation module is shown in Fig.8(a). G1 and G2 are produced by the XOR gates at the input and output of the first-stage two-phase register respectively. When new data arrive, G1 is inverted and the output of the XOR gate in the pulse generation module goes high. The pulse generation module then has to wait for signal stretch before signal pulse can go high. In other words, the working rhythm of the input buffer is controlled by signal stretch, which prevents the data of the previous cycle from being overwritten by new data. This can be seen in Fig.8(b).
When data arrive, G1 is low while G2 is high, so the output of the XOR gate and the inverted output of the TH22 gate are both high. If the two-phase wrapper wants to read data, stretch goes high, all three inputs of the AND gate in the pulse generation module are high, and signal pulse rises. At the same time, the high stretch signal drives the inverted output of the TH22 gate low. After the delays of a TH22 gate and an inverter, signal b therefore goes low, which makes the AND gate output pulse go low and ends the pulse period.

Fig.8. Pulse generation module with flow control and its timing diagram. (a) Pulse generation module with flow control function. (b) Timing diagram of pulse generation module.

The receiver two-phase quasi delay-insensitive wrapper in Fig.7 works as follows. The circuit is first reset. After reset, the outputs of the C element and the TH33 gate are both zero, and the stoppable clock module runs normally. The whole module waits for the arrival of the read signal. If the synchronous module wants to read data, read goes high. The output of the AND gate goes high because the inverted output of the C element is high. The TH33 gate now meets its set condition, so signal stretch goes high and the clock stops. If there are new data at the input port, that is, G1 and G2 are different, the pulse generator produces a pulse, which makes the D flip-flops sample the input data. Signal read stays high while the clock is suspended, so the tri-state gate is enabled and the data can enter the synchronous module through it. Meanwhile, the high stretch and pulse signals bring the inputs of the TH33 gate to the reset condition, so stretch goes low and the stoppable clock module resumes. This completes one data sampling period.

When wider data are converted, the number of gates used is larger than in the conventional single-rail case. However, the dual-rail two-phase scheme has unique advantages over the single-rail scheme: quasi delay-insensitivity makes the circuit immune to delay variations and gives it strong EMI immunity. From the viewpoint of on-chip communication, stability is the first consideration. If errors appear frequently in transmission, retransmission or error correction adds cost and greatly burdens the network. A highly robust transmission scheme is therefore necessary for on-chip communication, and it is generally worthwhile to spend more components to achieve robustness.

Another concern is the probability of hazards.
The analysis can be divided into two parts. The first part is the asynchronous register, in which the pulse generation module is designed with specific discrete gates to avoid hazards. A rising pulse edge is produced only when the new code differs from the current code and the current code is the same as the code of the next stage. There are therefore two possible transitions that generate the rising pulse edge: the arrival of a new code, or the arrival of an acknowledgement of the next stage's code. The falling pulse edge is always generated by the capture of the new code in the D flip-flops, so this circuit is robust against delay variations. One timing relationship must be guaranteed: the minimum pulse width of the D flip-flop must be respected. This requirement is easily met, since the falling edge of the pulse is not generated until the D flip-flops capture a new value and it propagates through the pulse generation logic.

The second part is the control signals of the wrapper. Unlike conventional Boolean gates, C elements and null convention logic gates are state-holding devices with input threshold characteristics. They are therefore insensitive to the order in which incoming signals arrive: the output changes only when all inputs have arrived or all inputs have been removed, so glitches cannot emerge at the outputs. In short, the circuit is hazard-free.

4 Simulations and Analysis

The complete sender and receiver wrapper circuits are implemented in SMIC 0.18 µm standard technology. Fig.9 shows the SPICE simulation waveforms.

Fig.9. Simulation waveform of sender and receiver wrapper. (a) Waveform of sender two-phase wrapper. (b) Waveform of receiver two-phase wrapper.

In Fig.9(a), the data is 1 during the first half of the simulation and 0 during the second half, and in both halves the dual-rail two-phase output follows the corresponding encoding sequence, so the sender functions correctly. In Fig.9(b), the asynchronous two-phase input during the first half follows the sequence that encodes 0, and the sampled data is 0; during the second half the input sequence changes and the sampled data follows it. The receiver therefore functions correctly as well.

To test the performance under different technology conditions and the sensitivity to process variations, simulations were run with three technology models (tt, ss, ff) at a temperature of 27 °C. The results are shown in Table 1 and Table 2. Here delay_forward is defined as the delay from the rise of the read/write signal to the rise of signal stretch, and delay_all as the delay from the rise of the read/write signal to the fall of signal stretch. P_dynamic is the average dynamic power consumption of the circuit, and P_static is its average static power consumption. As can be seen from Table 1 and Table 2, the circuit works properly with good performance under the different technology corners. Process variations have only a limited impact on the circuit, which indicates good robustness.

To further explain the merits of the proposed two-phase quasi delay-insensitive asynchronous wrapper, several wrappers are compared in terms of performance, sensitivity to delay and operation mode, as shown in Table 3. Here Throughput_sender is the throughput of the sender wrapper and Throughput_receiver is the throughput of the receiver wrapper.
As can be seen from Table 3, the proposed wrapper outperforms the majority of conventional single-rail four-phase asynchronous wrappers in throughput, because it can work on both edges of the signal. Although the comparisons are based on different technology processes, the throughput of the proposed method is close to that of the method in [11] and greatly exceeds the throughput of the methods in [12-14]. Since a more advanced process considerably increases circuit speed, the proposed wrapper is expected to perform better in a more advanced technology.

Because the performance of the wrapper in [11] is close to that of the proposed wrapper, the comparison focuses mainly on these two. Firstly, the wrapper of [11] may have timing constraints. The proposed wrapper, in contrast, is based on a stoppable clock scheme, so the synchronous module does not change its state until data sampling is finished. Only when the asynchronous module has successfully sampled the synchronous data does it inform the stoppable clock generator to release the clock signal. There are therefore no timing constraints on the conversion circuits, and the robustness of the wrapper is increased. In exchange, some throughput is lost, since restarting the clock takes time. Secondly, the wrapper in [11] needs K + 2 conversion stages to reach its maximum throughput, whereas the throughput of the proposed wrapper does not depend on the number of conversion stages and is relatively stable. In addition, the wrapper in [11] needs a multiplexer, a de-multiplexer, a finite state machine and a Domino controller to control data sampling by the asynchronous part, so its area could be large. The wrapper of [11] (sender and receiver together) uses 97 gates, while the wrapper of this paper (sender and receiver together) uses 48 gates, only about half as many.
So the wrapper in [11] uses more area to achieve better throughput performance.

Table 1. Performance Test Results of the Sender Wrapper (27 °C, 1.8 V)
Corner   Max Clk Freq. Supported   delay_forward   delay_all   P_dynamic         P_static
ss       2.47 GHz                  ps              ns          mW @ 1.18 GHz     µW
tt       3.08 GHz                  ps              ps          mW @ 1.43 GHz     µW
ff       3.78 GHz                  ps              ps          mW @ 1.71 GHz     µW

Table 2. Performance Test Results of the Receiver Wrapper (27 °C, 1.8 V)
Corner   Max Clk Freq. Supported   delay_forward   delay_all   P_dynamic              P_static
ss       2.39 GHz                  ps              ps          1.43 mW @ 1.18 GHz     2.98 µW
tt       2.98 GHz                  ps              ps          mW @ 1.43 GHz          nW
ff       3.67 GHz                  ps              ps          mW @ 1.71 GHz          nW

Table 3. Performance Comparisons of Different Asynchronous Wrappers
Wrapper          Function                                 Sensitiveness on Delay     Throughput_sender (GEvents/s)   Throughput_receiver (GEvents/s)
Proposed         Synchronous to dual-rail two-phase       Quasi delay-insensitive    (180 nm)                        (180 nm)
Method in [11]   Synchronous to single-rail four-phase    Delay sensitive            2.39 (90 nm)                    1.5 (90 nm)
Method in [12]   Synchronous to single-rail four-phase    Delay sensitive            0.25 (65 nm)                    0.3 (65 nm)
Method in [13]   Synchronous to single-rail four-phase    Delay sensitive            0.52 (65 nm)                    0.71 (65 nm)
Method in [14]   Synchronous to single-rail four-phase    Delay sensitive            0.18 (65 nm)                    0.22 (65 nm)

The performance, structure and overheads of the wrappers in [12-14] are compared next. The wrapper in [13] uses a FIFO (first in, first out) to increase buffering space, but this brings a series of drawbacks. Specifically, the FIFO uses a dual-port RAM as buffering space, so the area overhead could be much larger. Furthermore, the pointers used to detect the full/empty state of the FIFO, so as to avoid overflow or underflow, can limit the performance of the wrapper: comparing the read and write pointers is complex, and designers have to add Gray encoding and decoding modules to convert the pointers. Although the wrapper in [13] changes the encoding scheme to optimize the design, it still cannot avoid the comparison between read and write pointers, and the performance improvement is not obvious. Another problem is that, when the FIFO is empty, the write pointer needs to be resynchronized with a conventional two-DFF synchronizer, since the empty state is detected in the synchronous domain. This delays the effective increment by two clock cycles, which may directly affect performance when the FIFO is small. In contrast, the proposed wrapper needs no FIFO, so the comparison between read and write pointers is removed, and its area overhead is much smaller.
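To make the pointer-conversion overhead mentioned above concrete, the following sketch shows a generic reflected binary Gray code, the kind of encoding commonly used for FIFO pointers that cross clock domains. It is an illustration only: the function names are hypothetical and the code is not taken from the design in [13], whose exact encoding scheme is not described here.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative binary pointer value to reflected Gray code."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Convert a reflected Gray code value back to binary."""
    n = 0
    while g:
        n ^= g      # accumulate the XOR of all right-shifted copies of g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")
```

Consecutive pointer values differ in exactly one bit after `to_gray`, which is why such pointers can be sampled safely in another domain, but both conversions and the pointer comparison are extra logic that a FIFO-free wrapper does not need.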
Moreover, the wrapper in [13] has timing constraints: after a rising clock edge there is a race between the signal and the data because of the increment of the read pointer, so delay matching is needed on the clock input of the Muller gate. The structure of the wrapper in [12] is nearly the same as that of the wrapper in [13], and it has the same drawbacks; the only differences are the token generation/consumption mode and a smaller FIFO depth. The wrapper in [14] also has no FIFO, so its area is smaller, but its performance was limited by asynchronous delays in the stoppable clock and by clock tree insertion constraints.

Finally, the dual-rail operation mode in this paper integrates the request signals into the data, which avoids complex control circuits and removes the delay matching process. The most striking merit, however, is quasi delay-insensitivity, which greatly enhances the robustness of the circuit and makes it more suitable for transmission over long interconnection lines.

5 Conclusions

High-performance on-chip communication has long been an active research field, and the emergence of networks on chip has made it even more active. High-speed point-to-point communication is the key problem to be solved in networks on chip. Asynchronous circuits inherently offer high performance, so communication based on asynchronous request and acknowledgement is gradually becoming popular in asynchronous networks on chip. To further improve the performance of point-to-point asynchronous communication, this paper proposes a quasi delay-insensitive high-speed two-phase asynchronous wrapper for networks on chip. It converts data between synchronous data and two-phase dual-rail asynchronous data. A stoppable clock scheme is used so that metastability does not emerge.
The full/empty state of the registers is determined from the pulse signals of the two-phase register, which in turn control the wrapper to sample input and output data. The sender and receiver wrappers work in a quasi delay-insensitive mode, so robustness is improved. The wrapper is implemented in SMIC 0.18 µm standard CMOS technology and works properly and robustly under different technology corners. It has the merits of high speed, low power, quasi delay-insensitivity and low design complexity, and it is suitable for high-performance, highly robust on-chip networks. Further research will focus on increasing the working speed of the wrapper and improving its portability. We hope this paper can contribute to the development of high-speed, low-power asynchronous interconnect for networks on chip.

References

[1] Dally W J, Towles B. Route packets, not wires: On-chip interconnection networks. In Proc. 38th ACM Conf. Design Automation, Las Vegas, Nevada, Jun. 2001.
[2] Benini L, Micheli G D. Networks on chips: A new SoC paradigm. Computer, 2002, 35(1).

[3] Wang J L, Xue Y B, Wang H X, Li C M, Wang D S. CCNoC: Cache-coherent network on chip for chip multiprocessors. J. Comput. Sci. & Technol., 2010, 25(2): 257-266.
[4] Bainbridge J, Furber S B. CHAIN: A delay-insensitive chip area interconnect. IEEE Micro, 2002, 22(5).
[5] Lines A. Asynchronous interconnect for synchronous SoC design. IEEE Micro, 2004, 24(1): 32-41.
[6] Geer D. Is it time for clockless chips? Computer, 2005, 38(3): 18-21.
[7] Teehan P, Greenstreet M, Lemieux G. A survey and taxonomy of GALS design styles. IEEE Design & Test of Computers, 2007, 24(5).
[8] Sheibanyrad A, Greiner A, Miro-Panades I. Multisynchronous and fully asynchronous NoCs for GALS architectures. IEEE Design & Test of Computers, 2008, 25(6): 572-580.
[9] Krstic M, Grass E, Gurkaynak F K, Vivet P. Globally asynchronous, locally synchronous circuits: Overview and outlook. IEEE Design & Test of Computers, 2007, 24(5).
[10] Dobkin R R, Ginosar R. Two-phase synchronization with subcycle latency. Integration, the VLSI Journal, 2009, 42(3).
[11] Sheibanyrad A, Greiner PA. Two efficient synchronous asynchronous converters well-suited for networks-on-chip in GALS architectures. Integration, the VLSI Journal, 2008, 41(1).
[12] Beigne E, Vivet P. Design of on-chip and off-chip interfaces for a GALS NoC architecture. In Proc. the 12th International Symposium on Advanced Research in Asynchronous Circuits and Systems, Grenoble, Mar. 2006.
[13] Yvain T, Edith B, Pascal V. Design and implementation of a GALS adapter for ANoC based architectures. In Proc. the 15th International Symposium on Asynchronous Circuits and Systems, Chapel Hill, USA, May 17-20, 2009.
[14] Beigne E, Clermidy F, Miermont S, Vivet P. Dynamic voltage and frequency scaling architecture for units integration within a GALS NoC. In Proc. the 2nd IEEE International Symposium on Networks-on-Chip, Newcastle Upon Tyne, UK, Apr. 7-11, 2008.

Xu-Guang Guan received his B.S. degree from the Institute of Physics and Technology, Xidian University. He is currently working toward the M.D.-Ph.D. degree with the School of Microelectronics, Xidian University, China. He is a student member of IEEE. His research interests include asynchronous circuit design, networks on chip and VLSI design.

Xing-Yuan Tong received his B.S. degree from Guilin University of Electronic Technology. He is currently working toward the M.D.-Ph.D. degree with the School of Microelectronics, Xidian University, China. His research interests include VLSI design and A/D converters.

Yin-Tang Yang received his B.S. and M.S. degrees in microelectronics and solid state electronics from Xidian University, Xi'an, China in 1982 and 1984, respectively, and received the Ph.D. degree from Xi'an Jiaotong University. He is currently the vice president and a professor of Xidian University. His research interests include VLSI technology, new semiconductor materials and devices, and microelectronics reliability technology.
