A Novel Pseudo 4 Phase Dual Rail Asynchronous Protocol with Self Reset Logic & Multiple Reset

Similar documents
ISSN Vol.08,Issue.07, July-2016, Pages:

Implementation of ALU Using Asynchronous Design

Design of Low Power Asynchronous Parallel Adder Benedicta Roseline. R 1 Kamatchi. S 2

Implementation of Asynchronous Topology using SAPTL

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 5, Sep-Oct 2014

Parallel, Single-Rail Self-Timed Adder. Formulation for Performing Multi Bit Binary Addition. Without Any Carry Chain Propagation

Design of Parallel Self-Timed Adder

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER

Design of 8 bit Pipelined Adder using Xilinx ISE

POWER consumption has become one of the most important

A Synthesizable RTL Design of Asynchronous FIFO Interfaced with SRAM

Formulation for Performing Multi Bit Binary Addition using Parallel, Single-Rail Self-Timed Adder without Any Carry Chain Propagation

Feedback Techniques for Dual-rail Self-timed Circuits

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

POWER ANALYSIS OF CRITICAL PATH DELAY DESIGN USING DOMINO LOGIC

High Performance Interconnect and NoC Router Design

Introduction to Asynchronous Circuits and Systems

Single-Track Asynchronous Pipeline Templates Using 1-of-N Encoding

16x16 Multiplier Design Using Asynchronous Pipeline Based On Constructed Critical Data Path

Design and Characterization of High Speed Carry Select Adder

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

DYNAMIC ASYNCHRONOUS LOGIC FOR HIGH SPEED CMOS SYSTEMS. I. Introduction

the main limitations of the work is that wiring increases with 1. INTRODUCTION

The design of a simple asynchronous processor

Low Power GALS Interface Implementation with Stretchable Clocking Scheme

Designing NULL Convention Combinational Circuits to Fully Utilize Gate-Level Pipelining for Maximum Throughput

Recursive Approach for Design of a Parallel Self-Timed Adder Using Verilog HDL

Power and Delay Analysis of Critical Path Delay Design Using Domino Logic Multiplier

Recursive Approach to the Design of a Parallel Self-Timed Adder

VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT

ANEW asynchronous pipeline style, called MOUSETRAP,

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema

Compact Clock Skew Scheme for FPGA based Wave- Pipelined Circuits

ISSN Vol.05, Issue.02, February-2017, Pages:

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER

Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number. Chapter 3

Reconfigurable PLL for Digital System

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit

O PT I C Alan N. Willson, Jr. AD-A ppiov' 9!lj" 2' 2 1,3 9. Quarterly Progress Report. (October 1, 1992 through December 31, 1992)

Analysis of Different Multiplication Algorithms & FPGA Implementation

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017

Asynchronous Behavior Related Retiming in Gated-Clock GALS Systems

Verilog for High Performance

TEMPLATE BASED ASYNCHRONOUS DESIGN

Area-Efficient Design of Asynchronous Circuits Based on Balsa Framework for Synchronous FPGAs

An overview of standard cell based digital VLSI design

A Macro Generator for Arithmetic Cores

Introduction to asynchronous circuit design. Motivation

A Energy-Efficient Pipeline Templates for High-Performance Asynchronous Circuits

Optimization of Robust Asynchronous Circuits by Local Input Completeness Relaxation. Computer Science Department Columbia University

Low Power Circuits using Modified Gate Diffusion Input (GDI)

Compound Adder Design Using Carry-Lookahead / Carry Select Adders

A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor

6T- SRAM for Low Power Consumption. Professor, Dept. of ExTC, PRMIT &R, Badnera, Amravati, Maharashtra, India 1

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech)

High Speed Han Carlson Adder Using Modified SQRT CSLA

Design and Implementation of High Performance DDR3 SDRAM controller

THE latest generation of microprocessors uses a combination

TIMA Lab. Research Reports

SIDDHARTH INSTITUTE OF ENGINEERING AND TECHNOLOGY :: PUTTUR (AUTONOMOUS) Siddharth Nagar, Narayanavanam Road QUESTION BANK UNIT I

Design and Implementation of CVNS Based Low Power 64-Bit Adder

POWER REDUCTION IN CONTENT ADDRESSABLE MEMORY

On the Design of High Speed Parallel CRC Circuits using DSP Algorithams

A Hybrid Wave Pipelined Network Router

High Performance Asynchronous Circuit Design Method and Application

High-Performance Full Adders Using an Alternative Logic Structure

Digital Design Methodology (Revisited) Design Methodology: Big Picture

Digital Design Methodology

Reliable Physical Unclonable Function based on Asynchronous Circuits

THE synchronous DRAM (SDRAM) has been widely

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

Design and Development of Vedic Mathematics based BCD Adder

Clocked and Asynchronous FIFO Characterization and Comparison

A Review Paper on Reconfigurable Techniques to Improve Critical Parameters of SRAM

IMPLEMENTATION OF LOW POWER AREA EFFICIENT ALU WITH LOW POWER FULL ADDER USING MICROWIND DSCH3

High Performance Memory Read Using Cross-Coupled Pull-up Circuitry

Design of Asynchronous Interconnect Network for SoC

Content Addressable Memory with Efficient Power Consumption and Throughput

Increasing interconnection network connectivity for reducing operator complexity in asynchronous vision systems

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Architecture of an Asynchronous FPGA for Handshake-Component-Based Design

2. BLOCK DIAGRAM Figure 1 shows the block diagram of an Asynchronous FIFO and the signals associated with it.

IMPLEMENTATION OF TWIN PRECISION TECHNIQUE FOR MULTIPLICATION

Design of Low Power Wide Gates used in Register File and Tag Comparator

An Overview of Standard Cell Based Digital VLSI Design

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

Area And Power Optimized One-Dimensional Median Filter

HIGH PERFORMANCE QUATERNARY ARITHMETIC LOGIC UNIT ON PROGRAMMABLE LOGIC DEVICE

ASIC Implementation of one level 2D DWT and 2D DWT in Hybrid Wave-Pipelining & Pipelining

Globally Asynchronous Locally Synchronous FPGA Architectures

ISSN Vol.05, Issue.12, December-2017, Pages:

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

Columbia Univerity Department of Electrical Engineering Fall, 2004

Low Power using Match-Line Sensing in Content Addressable Memory S. Nachimuthu, S. Ramesh 1 Department of Electrical and Electronics Engineering,

ESE 570 Cadence Lab Assignment 2: Introduction to Spectre, Manual Layout Drawing and Post Layout Simulation (PLS)

The Wave Pipeline Effect on LUT-based FPGA Architectures

Transcription:

A Novel Pseudo 4 Phase Dual Rail Asynchronous Protocol with Self Reset Logic & Multiple Reset M.Santhi, Arun Kumar S, G S Praveen Kalish, Siddharth Sarangan, G Lakshminarayanan Dept of ECE, National Institute of Technology/, Tiruchirappalli, Tamilnadu, India ABSTRACT 1. This paper presents a novel pseudo 4 phase dual rail protocol with self reset logic suited for high speed asynchronous applications. The traditional 4 phase dual rail requires the input to be of alternating valid and empty cycles. However the proposed pseudo 4 phase involves continuous stream of valid data without a separate empty cycle. The empty phase is generated internally so that the next valid data can be processed. Also self reset logic for dual rail protocol has been proposed in which the combinational blocks resets itself whenever its evaluation phase is completed and the data is latched at the pipeline register. The concept of multiple reset i.e. resetting each of the gates in the combinational block between any two pipeline registers simultaneously has been introduced reducing the reset phase and hence increasing the throughput rate. An asynchronous 8 bit pipelined carry propagate adder was implemented in 0.18 μm technology. The reset phase has reduced by 63.25% and 47.63% compared to the design without self and multiple reset for logic depth of three and two respectively. The results show that the reset phase varies inversely with the logic depth for the proposed design. 2. Index terms pseudo 4 phase, self reset, multiple reset, dual rail, asynchronous. INTRODUCTION: Due to the recent trend towards multi-gigahertz systems, there are increasing design challenges to managing clock distribution. Asynchronous design, which replaces global clocking with local handshaking, has the potential to make high speed design more feasible. Since they are asynchronous, these pipelines avoid issues related to the distribution of a high-speed clock: e.g., wasteful clock power and management of clock skew. Moreover, the use of local handshaking, instead of global clocking, imparts the ability of localized flow control to the asynchronous pipeline: the number of data items in the pipeline is allowed to vary over time, depending on the operating speeds of the interfaces and stages. Thus, the proposed pipelines have a natural elasticity. In contrast, except for some recent approaches, traditional synchronous pipelines typically do not provide elasticity. Finally, the inherent flexibility of asynchronous components allows an asynchronous pipeline to interface with varied environments operating at different rates; thus, asynchronous pipeline styles are useful for the design of system-on-a-chip (SoC) and reusable intellectual property (IP), and also have been shown to gracefully accommodate dynamic voltage scaling. Finally, the proposed control circuits are simple, small and fast, thus further minimizing the overhead of fine-grain pipelining. Dual-rail is a commonly-used scheme to implement an asynchronous data path [1], [2]. In dual-rail, two wires (or rails) are used to implement each bit. The wires indicate both the value of the bit, and also its validity. In the 4 phase implementation, encodings of 01 and 10 correspond to valid data values 0 and 1, respectively. The encoding 00 indicates the reset or pacer state with no valid data and 11 is an unused (i.e., illegal) encoding. Encodings on the data path typically alternate between valid values and the reset state. Since the data path itself indicates the validity of each bit, dual-rail is effective in designing asynchronous data paths which are highly robust in the presence of arbitrary delays. Unlike in the 4 phase implementation, where definite codes represent definite values, in 2 phase, a new data is marked by a transition in the corresponding bit. A transition in false/true line indicates a zero/one respectively. 2-Phase and 4-Phase Implementations A Comparison A key limitation of LEDR is the difficulty of building efficient and simple robust function blocks [3]. There are two challenges: the complexity of handling the alternating encodings of odd and even phases and the overhead of enforcing hysteresis so that the output phase does not change until the phase of all inputs have changed. These requirements make LEDR functional blocks area and delay-inefficient, and hard to design without using custom transistor gates. Hence, in practice, Figure.1 Four phase dual rail protocol 21

Fig.2 Block diagram of pseudo 4 phase dual rail with self reset LEDR is rarely used to design function blocks. In contrast, four-phase dual-rail is commonly-used and easy to implement function blocks. But the 4 phase protocol as shown in fig.1 requires alternate valid and empty data so that the block is reset after every computation and is ready to process the next data. The period of valid and empty phase is equal to the largest combinational path delay. This results in reduced throughput rate as the empty data takes time to propagate through the system. DESIGN OVERVIEW: The main contribution of this paper is that it proposes a new dual rail protocol that is optimized for low power and is simpler to build the system. Because of the inherent advantages and disadvantages of 2-phase (LEDR) and 4 phase protocols, we propose the pseudo 4- phase protocol. This protocol has similar features to that of 4 phase in the way that the acknowledge and the data signals used inside the system is 4 phase transitioning, but external to the system the regular empty phase of a phase system is not visible. So both at input and output we have streams of valid data. The empty phase is generated internally using self reset logic that is described later. This results in lower number of transitions at the input and output due to non-return to zero signals as like the 2 phase. Unlike 2 phase, there is no explicit phase changes within the block thereby ensuring lower complexity of combinational blocks. As shown in Figure 2, each stage consists of a combinational and a reset network. Data flows from one stage to another through a latch in a linear pipeline. The time taken for the combinational block to process the data is the evaluation phase. As soon as the evaluation phase is over the combinational block is reset to accept the next data. An acknowledge signal is generated (a low to high Fig.4 Illustration of the handshaking on a 4-phase dual-rail transition) when a valid data is detected by the completion detection circuit. When the acknowledge signal is low, the combinational block is in evaluation phase. To insure proper data flow across stages, data is transferred from the current stage to the next one if the current stage is in the evaluate phase while the next stage is in the reset phase. The enable signal of the latch depends upon the acknowledge signal of the next stage as it denotes the commencement of reset phase in that stage. Hence, the latch separating both stages is enabled when both stages are in evaluate and reset phase respectively. Various forms of Self Reset logic have been proposed in recent years. In globally self-resetting CMOS [4], the reset signal for each stage is generated by a separate timing chain which provides a parallel worst-case delay path. On the other hand, in locally self-resetting CMOS [5], the reset signal for each stage is generated by a mechanism internal to that stage. Most previous implementations of SRCMOS have required great care in device sizing in order to ensure that the reset pulses will occur at the correct times. This is especially difficult to achieve over a wide range of process, voltage and temperature variations. A. Self Reset Logic The major drawback of the previous implementation of self reset logic [6] is the usage of delay buffers that are required to delay the reset signal till a valid data is obtained at the end of the combinational block. While the inputs are traveling along the critical path of the combinational network, the reset signal is similarly traveling along the matching delay. This delay buffer Fig. 3 Pipeline register with acknowledge generation 22

Fig.5 The pseudo 4 phase protocol Fig.6 Model waveform of acknowledge showing the effect of multiple reset on reset and evaluate phase delays the reset phase long enough to allow the outputs of the combinational network to stabilize. In the proposed Dual Rail self reset logic, the delay insensitive Muller C-element [7] is used in the completion detection circuit (as shown in the figure 3) to indicate the validity of all the bits of the data. This waits for all the bits to be valid eliminating the need to model the delay. Once all the bits become valid, acknowledge is generated as shown in the fig.4. Since the input is a continuous stream of valid data, an empty phase is inserted to distinguish between two consecutive valid data. This empty phase is generated using self reset logic. The combinational blocks between the pipeline registers are reset when the data is latched in the register next to it. Since the acknowledge signal of the register immediately after the combinational block denotes the latching of data, the combinational block is reset using it. B. The pseudo 4 phase protocol The pseudo four phase protocol involves the 4 events as shown in fig.5. 1. The reset phase pulls down the acknowledge 2. This lifts the reset and the combinational block is ready to process the next data(evaluate phase) Fig.8 PG cg cell 3 At the end of evaluation phase, a valid data is generated. As soon as the next register is ready to accept the data, the valid data is latched and acknowledge signal is generated. 4 The acknowledge signal resets the combinational block (reset phase) and so empty state is introduced. To get back the valid data as a continuous stream (remove empty states), a latch with acknowledge signal generated using only the completion detector as a clock signal can be used. C. Multiple Reset In 4 phase dual rail systems, the empty state has to be maintained for the period of the maximum combinational path delay. Since reset phase is equal to the evaluate phase in case of the usual 4 phase implementations, it increases with logic depth. This paper proposes a novel idea to overcome this problem. Each of the gates in the combinational block between two registers is reset simultaneously. So the reset phase is effectively the delay of the last gate in the combinational block and the delay of the completion detection circuit to detect the empty state. So the reset phase does not vary with the depth of the combinational block as illustrated in fig. 6. Fig.7 Carry propagate adder 23

Fig.9 Carry generator block TABLE 1. COMPARISON OF PERIOD AND RESET PHASE FOR VARIOUS SCHEMES AND LOGIC DEPTH Logic Depth With both Self and Multiple Reset With Self Reset (no multiple reset) Without Self and Multiple Reset (4 phase protocol) Period (ns) On Time (ns) Period (ns) On Time (ns) Period (ns) On Time (ns) 1 2.86 1.41(94%) 2.86 1.41(94%) 3 1.5(100%) 2 4.07 1.44(52.37%) 5.34 2.67(97%) 5.5 2.75(100%) 3 5.31 1.47(36.75%) 7.84 3.92(98%) 8 4(100%) III.APPLICATION: AN 8-B PARALLEL ADDER We use the Carry Look Ahead structure proposed by [8]. The global organization is shown in Fig. 7, where only the combinational block that performs the addition is depicted. This is a Carry Propagate Adder (CPA). This block by itself is essentially asynchronous. In order to make it work in a synchronous environment, an input register and an output register are added. The combinational block is composed of three stages: a propagate/generate block PG generator, followed by a parallel carry generator, and a sum generator. The maximum clock frequency depends on the cycle time of the gates, while controlling the maximum difference in arrival times of signals at any given stage. The first stage forms the propagate term p and the generate term g, according to the following equations: g i =a i b i i=1,2, n (1) p i =a i ^ b i i=1,2, n (2) The basic cell used in the PG generator is shown in Fig. 8. The resulting p and g bits are the inputs to the next stage, the parallel carry generator. Carries are computed from the recursive equations c i =G i for i=1,2..n (3) Where (G i.p i )= (g i, pi), for i=1 (4) (G i,p i )= (g i +p i *G i-1, pi*p i-1 ) for i=2, 3...n (5) Sums are computed from s i =p i for i=1 (6) s i =p i^ c i-1 for i=2,3 n (7) IV.RESULTS The 8 bit Carry Look Ahead adder was implemented in 0.18 μm technology. The netlist file was generated using Leonardo Spectrum from the design file written in Verilog HDL. The layout design was (floor planning, place and route) carried out in Mentor Graphics IC Station tool. The parasitic extraction was done using Calibre and the post layout simulation was completed using Eldo SPICE. The number of pipeline registers introduced is shown in the fig 9, the no of registers was varied in order to investigate the variation of reset phase with logic depth. The logic depth is the no. of stages of PG-cg cell between two registers. The results of variations of reset phase with increasing logic depth are provided in table 1. Three cases of implementations are considered for verification for above mentioned proposal. The on time of the acknowledge signal generated gives the reset phase of the combinational block between the register. The period of acknowledge is the sum of evaluate and reset phase. As seen from the graph in fig.10 the reset phase of the combinational block in which both the self reset and multiple reset were used, is fairly constant irrespective of logic depth. Whereas, the blocks with no multiple reset, the reset phase increases linearly 24

with logic depth and also is equal to the maximum combinational path delay(evaluation phase). The reset period of the design with self reset logic (without multiple reset) decreased by 6%, 3%, 2% compared to the normal 4 phase protocol design for logic depth of 1,2 and 3 respectively. The reset period of the design with self reset logic and multiple reset decreased by 6%, 47.63% and 63.25% compared to the normal 4 phase protocol design for logic depth of 1, 2 and 3 respectively. This shows that the reduction in reset phase varies inversely with logic depth. The layout of the implemented Carry Propagate Adder is shown in the figure11. The waveforms obtained after post layout simulation using Eldo SPICE is shown in the figure 12. V.CONCLUSION This paper presents a novel pseudo four phase protocol which involves self reset logic for dual rail applications. This protocol eliminates return to zero transitions in the global data lines and is simpler to build functional blocks using four phase dual rail. The self reset logic is used to reset the blocks whenever it has completed its evaluation and data is latched. The reset signals were generated using the completion detection circuit. Also multiple reset signals to all the gates of the combinational block between the two registers reduce the reset phase. The implementation results of 8 bit CPA adder in 0.18 μm technology shows that the reset phase can be reduced upto 63.25% and 47.63% for a logic depth of 3 & 2 compared to the blocks without multiple reset and selfreset. This protocol with self reset and multiple reset concept increases the throughput rate of the system as it keeps the reset phase constant for increasing logic depth. Fig.11 Layout of 8 bit carry propagate adder Fig.10 Variation of reset phase with logic depth for various schemes REFERENCES [1] M. B. Josephs, S. M. Nowick, and C. H. K. van Berkel, Modeling and design of asynchronous circuits, Proc. IEEE, vol. 87, no. 2, pp. 234 242, Feb. 1999. [2] C. L. Seitz, System timing, in Introduction to VLSI Systems, C. A. Mead and L. A. Conway, Eds. Reading, MA: Addison-Wesley, 1980, ch. 7. [3] Amitava Mitra William F. McLaughlin Steven M. Nowick Efficient Asynchronous Protocol Converters for Two- Phase Delay-Insensitive Global Communication IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'07) 0-7695-2771-X/07 [4] W. Hwang et al, Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability, IEEE Journal of Solid-State Circuits, Vol. 34, No.8, pp. 1108-1117, August 1999 [5] A. E. Dooply and K. Y. Yun, Optimal Clocking and Enhanced Testability for High-Performance Self-Resetting Domino Pipelines, Proceedings, 20th Conf. on Advanced Research in VLSI, pp. 200-214, 1999. [6] Abdel Ejnioui and Abdelhalim Alsharqawi, Pipeline Design Base on Self Resetting State Logic,Proceedings of IEEE Computer Society Annual Symposium on VLSI Emerging Trends in VLSI Systems Design (ISVLSI 04), 2004 IEEE [7] J. Sparsø and S. Furber (eds), Principles of asynchronous circuit design - A systems perspective, Kluwer Academic Publishers, 2001 (ISBN 0-7923-7613-7) 25

Fig.12 Post layout simulation waveforms [8] W. Liu, T. Gray, D. Fan, W. J. Farlow, T. A. Hughes, and R. K. Cavin, A250-MHzwave pipelined adder in 2- _mcmos, IEEE J. Solid-StateCircuits, vol. 29, no. 9, pp. 1117 1128, Sep. 1994. 26