Asynchronous Behavior Related Retiming in Gated-Clock GALS Systems


Sam Farrokhi, Masoud Zamani, Hossein Pedram, Mehdi Sedighi
Amirkabir University of Technology, Department of Computer Eng. & IT
{Sfarrokhi, m-zamani, pedram,

Abstract

Although retiming is a well-known method for optimizing various characteristics of synchronous circuits, it has rarely been applied to the synchronous blocks of a Globally Asynchronous Locally Synchronous (GALS) system. In this paper, the communication protocols of gated-clock based wrappers are analyzed so that the retiming algorithm can be applied to improve performance. Through the introduction of a new algorithm, it is shown that a careful application of retiming concepts not only prevents metastability problems between the synchronous blocks of a GALS system, but also reduces the communication gaps among those blocks, thereby increasing their operating frequency. To demonstrate the effectiveness of the proposed algorithm, a 3-stage pipeline implementation of a 16N digital filter is used as a locally synchronous block. Our experiments show a 23.2% increase in the filter's operating frequency compared to circuits optimized using the original retiming algorithm and common reliable metastability-free methods.

1. Introduction

The GALS design methodology was introduced by Chapiro in 1984 [1]. The idea is to combine the advantages of synchronous and asynchronous design in one system, such as higher performance and lower power consumption. Moreover, avoiding the clock skew problem in large systems with only insignificant logic overhead was among the most attractive properties of the GALS methodology during the 1990s [2]. The main concerns of GALS designers can be classified as follows [2]:

1. Partitioning a system into locally synchronous (LS) modules so as to maximize the advantages of the GALS methodology.
2. Designing and synthesizing an asynchronous interface for the synchronous modules that imposes as little delay overhead as possible.
3. Avoiding the metastability problem during the synchronization of modules.

Some basic synchronization methods in GALS systems are discussed in [3], [4] and [5]. In 1996, pausible clock circuits (PCC) were proposed to manage data transfer in asynchronous communications in GALS systems [6]. Later, in 1997, asynchronous wrappers responsible for all asynchronous communications were introduced [7]. Each LS module is surrounded by an asynchronous wrapper that contains a local clock generator [5][7]. During the communication process, each wrapper stops its local clock until the process completes. Asynchronous wrappers based on clock gating are discussed in [8] [9]. By sharing a clock generator among different locally synchronous modules, simpler wrappers with less power and area overhead can be achieved [9].

1.1 Paper Contributions

In this paper, a new approach is proposed for applying retiming to a gated-clock wrapper based GALS system according to its asynchronous characteristics. As shown in this paper, such retiming can improve the system's working frequency and performance. In GALS systems, and especially in gated-clock based wrappers, there are timing gaps during the asynchronous handshaking of two LS modules. Designers leave these timing gaps unused in order to avoid the metastability problem. The proposed retiming method uses these timing gaps in a safe manner and guarantees that the circuit will not encounter metastability problems. This timing gap analysis can help designers and CAD tools during the partitioning and/or synthesis process. The method repositions some combinational parts of each LS module to the boundary areas of the module while leaving the sequential parts inside the module. This leads to shorter critical paths and increased frequency and/or performance of the system.
1.2 Paper Organization

The paper is organized as follows. Section 2 contains preliminaries about retiming, asynchronous communication choices and the GALS methodology. Section 3 describes gated-clock based GALS timing in detail, followed by the timing analysis in Section 4. Our contribution, retiming in GALS systems, is described in Section 5. A clarifying example is provided in Section 6. Experimental results and conclusions form the last sections.

IEEE EWDTS, Yerevan, September 7-10, 2007

2. Preliminaries

2.1 Asynchronous communication choices

As pointed out in [16], an asynchronous communication protocol can be chosen from a solution space that is the cross product of a number of options, including:

{2-phase, 4-phase} × {bundled-data, dual-rail, 1-of-n, ...} × {push, pull}

The choice of protocol affects the circuit's implementation and characteristics such as area, speed, power and robustness.

2.2 Gated-clock GALS Methodology

In the GALS methodology, a large system is partitioned into several LS modules that communicate asynchronously. A GALS system can be viewed as an asynchronous sea with locally synchronous islands. As mentioned above, we focus on GALS systems that use wrappers based on clock gating. The general architecture of a GALS wrapper circuit based on clock gating is shown in figure 1. In this wrapper circuit, each locally synchronous module has a local clock that is obtained by gating a separate external clock (eclk) signal according to requests that come from a port controller. When the locally synchronous module enters the data communication phase, it informs the related port controller that it needs data communication. Subsequently, the port controller generates a gate request for the clock generator and the external clock is gated. After the handshake completes, the clock is released. Since no handshaking is required between the asynchronous port controllers and the local clock generation circuit, such port controllers are simpler than pausible clock based port controllers.

Figure 1. Gated clock based GALS wrapper

The basic components of such gated-clock based wrappers are:

1. Clock generator: The clock generator circuit for the gated clock method in a GALS wrapper is as simple as ordinary clock gating, as shown in figure 2. There is no asynchronous element in the clock generator circuit. During the activity phase of the LS module, all of the gate signals are high. When the LS module enters the data communication phase, one or more of the gate signals go low and the clock is gated.

Figure 2. Clock gating in gated clock GALS wrapper

2. Port controller: This component is responsible for interfacing the internal synchronous environment with the external asynchronous environment. Each port controller is activated by the Den signal, which is generated by the LS module. When the LS module needs data for the next clock cycle, it activates the Den signal at the next negative edge of the current clock pulse, as in the pausible clock based GALS scheme. After Den is activated, the port controller starts its work. As the first step, the external clock is gated to prevent metastability during asynchronous data exchange.
3. Gate synchronizer: Gating of the eclk signal must not be done later than the next positive clock edge. This can be guaranteed by enforcing timing constraints on the GALS modules during the synthesis process.

4. Input latch: The input port must store the valid data that arrives. The validity of the data is defined by the handshaking signals.

3. GALS Timing in Detail

Assume that 4-bit data has to be transmitted between two LS modules in a gated-clock GALS system via two wrappers of the same type, as shown in figure 3.

Figure 3. Structure of binding two wrappers

As described in [8], the sender follows this sequence to transmit its data to the receiver:

Den1 toggles → g1↑ → Gate1↑ → LCLK1↓ → Rp↑ → [Ap↑]* → Rp↓ → [Ap↓] → g1↓ → Gate1↓ → LCLK1↑

* [x] means that the wrapper waits until transition x is detected.

The receiver follows the dual sequence shown below:

Den2 toggles → g2↑ → Gate2↑ → LCLK2↓ → [Rp↑] → Ap↑ → [Rp↓] → Ap↓ → g2↓ → Gate2↓ → LCLK2↑

As stated in [8], this process starts when the local clock switches to its inactive phase. This avoids the metastability problem; as a further safeguard, some designers connect primary outputs directly to FFs. According to figure 5, the input latch is activated by the positive edge of Ap, and the data is captured by the latch at that time. Another point to note is that LS2 can store the data only after its clock is enabled; the duration from the rise of LCLK2 to the load of LS2's FF is Tlclk2r_ff2.

4. GALS timing analysis

4.1 Sender Counterpart

As described before, two paths should be considered when analyzing the sender counterpart:

1. The control path, which is responsible for producing control signals.
2. The data path, which produces the data for transmission.

According to the timings of the last section, the duration of a data transfer from sender to receiver, when the receiver is ready to receive the data, is:

Tsr = max(Tden1t_g1r + Tg1r_Gate1r + TGate1r_rpr, Tff1_b1 + Tb1_l)
      + Trpr_apr + Tapr_rpf + Trpf_apf
      + max(Tapf_g2f + Tg2f_Gate2f + TGate2f_lclk2r + Tlclk2r_ff2, Tl_b2 + Tb2_ff2)

Wrapper designers try to generate the control signals (such as Den1, g1 and Gate1) as soon as possible after the local clock falls [5]. On the other hand, as stated before, designers let the data path produce its data as early as possible to avoid the metastability problem, for example by connecting primary outputs directly to FFs. In some cases, extra optimization is used to shorten the control path, such as generating Rp from the g signal instead of the Gate signal.

4.2 Receiver Counterpart

When analyzing the receiver counterpart, the data and control paths must again be considered individually. The timing is given below:

max(Tden2t_g2r + Tg2r_Gate2r + TGate2r_rpr,
    max(Tden1t_g1r + Tg1r_Gate1r + TGate1r_lclk1f + TGate1r_rpr, Tff1_b1 + Tb1_l))
+ Trpr_apr + Tapr_rpf + Trpf_apf
+ max(Tapf_g2f + Tg2f_Gate2f + TGate2f_lclk2r + Tlclk2r_ff2, Tl_b2 + Tb2_ff2)

If the receiver is assumed ready to receive data, the wrapper produces the g2 and Gate2 sequence on sensing the positive edge of the Rp signal. As illustrated in figure 5, the data path consists of Tl_b2 and Tb2_ff2. As in the sender counterpart, the data path is left unused because of the metastability problem. If the receiver is taken to be ready during the analysis, the former expression is replaced by the following one:

max(Tden1t_g1r + Tg1r_Gate1r + TGate1r_lclk1f + TGate1r_rpr, Tff1_b1 + Tb1_l)
+ Trpr_apr + Tapr_rpf + Trpf_apf
+ max(Tapf_g2f + Tg2f_Gate2f + TGate2f_lclk2r + Tlclk2r_ff2, Tl_b2 + Tb2_ff2)

Because these two delays have different sources, the rest of the paper does not deal with the timing details of the receiver's readiness. As in the sender counterpart, some optimizations are applicable to the generation of these control signals, such as releasing the clock based on the Rp signal instead of Ap. In this case, the propagation delay of the control path at the receiver counterpart is:

Tapr_rpf + Trpf_g2f + Tg2f_Gate2f + TGate2f_lclk2r + Tlclk2r_ff2
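The sender-to-receiver expression Tsr of Section 4.1 can be evaluated numerically to see which path dominates on each side. The following is a minimal sketch only: the delay values are hypothetical placeholders (not measurements from this paper), the dictionary keys simply mirror the subscripts used in the text, and the "margin" fields are an illustrative way of expressing how long each data path sits idle relative to its control path.

```python
# Hypothetical delays in picoseconds; keys mirror the subscripts in the text,
# values are illustrative assumptions and not taken from the paper.
EXAMPLE_DELAYS_PS = {
    "den1t_g1r": 100, "g1r_Gate1r": 80, "Gate1r_rpr": 90,   # sender control path
    "ff1_b1": 60, "b1_l": 40,                                # sender data path
    "rpr_apr": 120, "apr_rpf": 110, "rpf_apf": 100,          # four-phase handshake
    "apf_g2f": 90, "g2f_Gate2f": 80,
    "Gate2f_lclk2r": 70, "lclk2r_ff2": 60,                   # receiver control path
    "l_b2": 30, "b2_ff2": 20,                                # receiver data path
}


def path_delay(delays, names):
    """Sum the named delay segments of one path."""
    return sum(delays[name] for name in names)


def sender_to_receiver(delays):
    """Evaluate Tsr and the control-minus-data margin on each side."""
    sender_control = path_delay(delays, ["den1t_g1r", "g1r_Gate1r", "Gate1r_rpr"])
    sender_data = path_delay(delays, ["ff1_b1", "b1_l"])
    handshake = path_delay(delays, ["rpr_apr", "apr_rpf", "rpf_apf"])
    receiver_control = path_delay(
        delays, ["apf_g2f", "g2f_Gate2f", "Gate2f_lclk2r", "lclk2r_ff2"]
    )
    receiver_data = path_delay(delays, ["l_b2", "b2_ff2"])
    return {
        "Tsr": max(sender_control, sender_data)
        + handshake
        + max(receiver_control, receiver_data),
        "sender_margin": sender_control - sender_data,
        "receiver_margin": receiver_control - receiver_data,
    }


result = sender_to_receiver(EXAMPLE_DELAYS_PS)
print(result)
```

With these placeholder values the control path dominates on both sides, so both data paths finish early; a positive margin of this kind is what leaves room on the data path.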

5. Retiming in Gated-clock GALS systems

A key point that has not been considered before is the use of the data path gaps. These gaps were left unused to avoid metastability. By using them in a safe manner, each LS module can be relieved of some of the combinational logic that it previously had to accommodate. As mentioned in the last section, there are two timing gaps during asynchronous communication in a gated-clock GALS system.

Sender's gap: One of these timing gaps is the duration between the sender's last FF and the input latch located in the receiver's wrapper. This duration consists of Tff1_b1 and Tb1_l. The latch is ready to store data only after its control line is enabled by Ap, which takes Half_lck1_period + Tffden1_toggle + Tden1t_g1r + Tg1r_rpr + Trpr_apr time units. Therefore, the data must be ready at the latest after

Half_lck1_period + Tffden1_toggle + Tden1t_g1r + Tg1r_rpr + Trpr_apr - Tlatch_setup

time units.

Receiver's gap: The other timing gap is the duration from the wrapper's input latch to the receiver's FF. The control line is ready Tapr_rpf + Trpf_g2f + Tg2f_Gate2f + TGate2f_lclk2r + Tlclk2r_ff2 time units after the rise of Ap. The data passes through Tl_b2 + Tb2_ff2, which is usually a simple wire with no combinational delay, subject to the hold time of the input latch and the hold time of LS2's FF.

Moving some combinational logic into these unused gaps frees an LS module from handling that logic. Consequently, less combinational circuitry remains inside the module, which can shorten the critical path of each counterpart. A shorter critical path allows a higher frequency and hence better performance. The retiming algorithm described in section two repositions the sequential parts of the circuit. Equivalently, if the positions of the sequential elements are regarded as fixed, the algorithm moves combinational parts across them.

Theorem 1. In a retimeable sequential synchronous circuit that satisfies the retiming constraints D1, W1 and W2 and has no combinational path between its primary inputs and its primary outputs, the retiming algorithm can reposition FFs in such a manner that each primary input connects directly to at least one FF.

Proof. Two results follow from the retiming edge weight formula for an edge e from node u to node v:

$$w'(e) = w(e) + r(v) - r(u)$$

Result 1: In a retimeable sequential synchronous circuit, if node V is assigned R=1 while all other nodes are assigned R=0, and this assignment is a legal retiming for the graph, then applying the retiming algorithm places at least one FF on every edge that ends in node V. This procedure is shown in figure 4.

Figure 4. An example for result 1: a) before retiming; b) after retiming

Result 2: In a retimeable sequential synchronous circuit, consider a path of several connected nodes that are all assigned R=1, called a ones path. If this assignment is a legal retiming for the graph, then after applying the retiming algorithm, one FF is moved from the end of the ones path toward its beginning. Because the retiming is assumed to be legal, the nodes located next to the ones path have at least one FF on the connecting edge before the algorithm runs. This procedure is shown in figure 5.

Figure 5. An example for result 2: a) before retiming; b) after retiming

FF repositioning algorithm. To reposition FFs so that each primary input connects directly to an FF, the R vector is assigned to the nodes of the graph according to the following algorithm:

Algorithm RA-PI: R vector assignment (PI)
1. For all nodes in the graph, initialize R(v) to zero.
2. For all nodes that are connected to primary inputs (extra dummy node), set R = 1.
3. For all nodes that have a combinational path starting at the nodes of step 2, set R = 1.

With the R values set by the RA-PI algorithm, and considering result 1 and result 2, executing the retiming algorithm on the graph leads to a circuit in which every primary input is connected to at least one FF. Finally, it must be proved that this R vector is a legal retiming for the graph. As mentioned before, a legal retiming must meet the following two constraints:

$$r(u) - r(v) \le W(u, v) - 1 \quad \text{for all vertices } u, v \in V \text{ such that } D(u, v) > c$$

$$r(u) - r(v) \le w(e) \quad \text{for every edge } u \xrightarrow{e} v \text{ of } G$$
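The RA-PI assignment and the edge weight constraint can be tried out on a small graph. The sketch below is illustrative and not the authors' implementation: it assumes a graph given as a list of `(u, v, w)` edges, dummy nodes named `"PI"` and `"PO"` that always keep R = 0, and it checks only the edge constraint $r(u) - r(v) \le w(e)$, not the timing constraint that involves $D(u, v)$. All function names are hypothetical.

```python
from collections import deque


def retime(edges, r):
    """Apply w'(e) = w(e) + r(v) - r(u) to every edge (u, v, w)."""
    return [(u, v, w + r[v] - r[u]) for (u, v, w) in edges]


def satisfies_edge_constraint(edges, r):
    """Check r(u) - r(v) <= w(e) on every edge, i.e. no negative FF count."""
    return all(r[u] - r[v] <= w for (u, v, w) in edges)


def ra_pi(nodes, edges, pi="PI", po="PO"):
    """Assign the R vector following the three steps of RA-PI."""
    r = {n: 0 for n in nodes}                                   # step 1
    queue = deque(v for (u, v, _) in edges if u == pi and v != po)  # step 2
    while queue:
        n = queue.popleft()
        if r[n] == 1:
            continue
        r[n] = 1
        # Step 3: follow combinational (zero-weight) edges only.
        for (u, v, w) in edges:
            if u == n and w == 0 and v != po:
                queue.append(v)
    return r


# Toy graph: PI -> a -> b -(1 FF)-> c -(1 FF)-> PO
NODES = ["PI", "a", "b", "c", "PO"]
EDGES = [("PI", "a", 0), ("a", "b", 0), ("b", "c", 1), ("c", "PO", 1)]

r = ra_pi(NODES, EDGES)
retimed = retime(EDGES, r)
print(r)
print(retimed, satisfies_edge_constraint(EDGES, r))
```

In this toy graph the FF that sat between b and c ends up on the edge leaving the primary input, which is the behaviour described by result 1 and result 2.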

The edge weight constraint, $r(u) - r(v) \le w(e)$, is studied in four different cases:

1. Two consecutive R=1 nodes. This situation does not affect w(e). Consequently, the graph meets the constraint because it met the constraint before the algorithm was applied.
2. Two consecutive R=0 nodes. This situation does not affect w(e). Consequently, the graph meets the constraint because it met the constraint before the algorithm was applied.
3. An R=1 node connected to an R=0 node. As described in the RA-PI algorithm, the algorithm continues to set R=1 until it reaches an edge where w(e) > 0. The retiming algorithm uses integer values, hence the weight of this edge is w(e) >= 1. In this situation $r(u) - r(v) = 1 \le w(e)$ and therefore the constraint is met.
4. An R=0 node connected to an R=1 node. This situation can occur only at combinational nodes that are directly connected to primary inputs. It does not violate the constraint because $r(u) - r(v) = -1$ and the retiming condition forces $w(e) \ge 0$, hence the inequality holds.

The timing constraint, which involves D(u, v), is related to timing problems and is not considered at this point. Hence, the RA-PI algorithm produces a legal retiming vector that repositions FFs in such a manner that all primary inputs are connected directly to at least one FF.

Theorem 2. In a retimeable sequential synchronous circuit that satisfies the retiming constraints D1, W1 and W2 and has no combinational path between its primary inputs and primary outputs, the retiming algorithm can reposition FFs in such a manner that each primary output connects directly to at least one FF.

A similar proof can be given for connecting primary outputs directly to FFs, using a different R value assignment algorithm. A suitable R value assignment algorithm is shown below:

Algorithm RA-PO: R vector assignment (PO)
1. For all nodes in the graph, initialize R(v) to one.
2. For all nodes that are connected to primary outputs (extra dummy node), set R = 0.
3. For all nodes that have a combinational path ending at the nodes of step 2, set R = 0.

As for theorem 1, it can be concluded that the RA-PO algorithm produces a legal retiming vector that repositions FFs in such a manner that all primary outputs are connected directly to at least one FF.

Definition: A circuit is IO-independent if running both the RA-PI and RA-PO algorithms results in a graph that has at least one FF connected to each of its primary inputs and primary outputs.

To retime an IO-independent LS module according to its asynchronous communication, the following steps are carried out:

ALGORITHM 1 - GALS retiming

Step 1: (making the circuit IO-independent)
1.1 Run the RA-PI algorithm.
1.2 Run the RA-PO algorithm.

Step 2: (filling IO gaps)
2.1 Output gaps: Initialize all R values to zero. Set the R value of a node to one if it has a combinational path to a PO with combinational delay less than the assumed output gap (in the backward direction). Calculate the new edge weights using the retiming formula.
2.2 Input gaps: Initialize all R values to one. Set the R value of a node to zero if it has a combinational path to a PI with combinational delay less than the assumed input gap (in the forward direction). Calculate the new edge weights using the retiming formula.

Step 3: (retiming core logic)
3.1 Extract the core logic by omitting all combinational logic connected to the primary IOs.
3.2 Retime the reduced graph using the original retiming algorithm to achieve the maximum frequency.

Step 4: Reattach the combinational circuits connected to the primary IOs to the optimized core logic.

7. Experimental results

To demonstrate the effectiveness of the proposed algorithm, two cascaded 3-stage pipelined 16N digital filters were used as a locally synchronous block. TSMC 0.18 µm technology and the Synopsys synthesis tool were used to synthesize the logic circuit. First, wrappers with output and input controllers were implemented to determine the sender and receiver gaps as follows:

Sender: Half_lck1_period + Tffden1_toggle + Tden1t_g1r + Tg1r_rpr + Trpr_apr = Half_lck1_period + 412 ps
Receiver: Tapr_rpf + Trpf_g2f + Tg2f_Gate2f + TGate2f_lclk2r + Tlclk2r_ff2 = 492 ps

Then, the circuit was retimed to achieve the lowest possible critical path delay, which was 825 ps. The output of the proposed algorithm was optimized to a critical path delay of 642 ps. This shows a 23.2% improvement for the selected digital filter.
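The measured gaps can be read as delay budgets for the combinational logic that Step 2 of Algorithm 1 moves to the module boundaries. The following is a minimal sketch under stated assumptions: the clock period is taken to be the 642 ps critical path reported above, the latch setup time of 50 ps is a hypothetical value, hold time margins on the receiver side are ignored, and the block delays in the usage lines are made up for illustration.

```python
# Values reported in the measurements above (picoseconds).
SENDER_CONTROL_DELAY_PS = 412
RECEIVER_GAP_PS = 492


def sender_gap_ps(clock_period_ps, latch_setup_ps):
    """Half clock period plus sender control delay, minus latch setup time."""
    return clock_period_ps / 2 + SENDER_CONTROL_DELAY_PS - latch_setup_ps


def fits_in_gap(block_delays_ps, gap_ps):
    """True if a chain of combinational blocks fits inside the given gap."""
    return sum(block_delays_ps) <= gap_ps


# Assumptions: 642 ps clock period, hypothetical 50 ps latch setup time.
sender_budget = sender_gap_ps(clock_period_ps=642, latch_setup_ps=50)

# Hypothetical chains of combinational block delays (picoseconds).
print(sender_budget)
print(fits_in_gap([300, 250], sender_budget))    # candidate for the output side
print(fits_in_gap([300, 250], RECEIVER_GAP_PS))  # same chain on the input side
print(fits_in_gap([200, 250], RECEIVER_GAP_PS))
```

A check of this kind corresponds to the "combinational delay less than the assumed gap" condition in Steps 2.1 and 2.2; a real flow would take the delays from static timing analysis rather than from fixed numbers.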

8. Conclusion

Retiming the LS modules of a gated-clock GALS system according to their asynchronous communication behavior has not been investigated before. Timing analysis showed that there are timing gaps in the asynchronous communication of a gated-clock GALS system that are left unused to avoid the metastability problem. The proposed algorithm exploits the observation that these timing gaps can be used by some combinational circuitry without encountering metastability problems. Repositioning some combinational circuitry toward the boundary areas in this way leads to a simpler core circuit that contains fewer combinational modules and the same number of pipeline stages inside the core LS module. Consequently, the retimed circuit can run at higher frequencies and execute algorithms in a shorter period of time. Although this timing analysis was carried out for gated and pausible clock wrappers, a related analysis of inter-block retiming is possible as well.

10. References

[1] Chapiro, D.M. Globally Asynchronous Locally Synchronous Systems. PhD Thesis, Stanford University,
[2] Gurkaynak, F.K., and Oetiker, S. Is there hope for GALS in the future? 4th ACiD Workshop of the European Commission's Fifth Framework Programme, (Jun 2004).
[3] Gurkaynak, F.K., and Oetiker, S. On the GALS Design Methodology of ETH Zurich. FMGALS Workshop at the 12th International FME Symposium, (Sep 2003).
[4] Seizovic, J. Pipeline synchronization. In Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems, (Nov 1994).
[5] Muttersbach, J., and Villiger, T. Practical Design of Globally-Asynchronous Locally-Synchronous Systems. In Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems, (April 2000).
[6] Yun, K., and Donohue, R.P. Pausible clocking: A first step toward heterogeneous systems. In Proceedings of the International Conf. on Computer Design (ICCD), (Oct 1996).
[7] Bormann, D., and Cheung, P. Asynchronous wrapper for heterogeneous systems. In Proceedings of the International Conf. on Computer Design (ICCD), (Oct 1997).
[8] Amini, E., Najibi, M., and Pedram, H. Globally Asynchronous Locally Synchronous Wrapper Circuits based on Clock Gating. In Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures (ISVLSI), (Mar 2006).
[9] Amini, E., Najibi, M., and Pedram, H. Automatic Generation of Pausible Clock Based GALS Wrapper Circuit. In Proceedings of the 11th International CSI Computer Conference, (Jan 2006).
[10] Leiserson, C., and Saxe, J. Retiming Synchronous Circuitry. Algorithmica, Vol. 6, (1991),
[11] De Micheli, G. Synthesis and Optimization of Digital Circuits. McGraw-Hill,
[12] Baumgartner, J. Min-Area Retiming on Flexible Circuit Structures. In Proceedings of the IEEE International Conference ICCAD, (2001),
[13] Hsu, Y.-L., and Wang, S.-J. Retiming-based logic synthesis for low-power. In Proceedings of the International Symposium on Low Power Electronics and Design, (2002),
[14] Dey, S., Potkonjak, M., and Rothweiler, S. G. Performance Optimization of Sequential Circuits by Eliminating Retiming Bottlenecks. In Proceedings of the IEEE International Conference ICCAD, (1992),
[15] Farrokhi, S., and Sedighi, M. Improving the Retiming Synchronous Circuitry Algorithm to Increase Clock Speed. Journal of Computer Science and Engineering, no 3, (fall 2003)
[16] Sparsø, J., and Furber, S. Principles of Asynchronous Circuit Design - A Systems Perspective. Kluwer Academic Publishers,
