Dynamic Logic Families

Dynamic Logic Families C.K. Ken Yang UCLA yangck@ucla.edu Courtesy of MAH,JR 1

Overview. Reading: Rabaey 6.3 (Dynamic), 7.5.2 (NORA). This set of notes covers dynamic logic families in greater detail, in particular domino logic. There is an extensive discussion of the noise issues in dynamic circuits and how they are resolved. A few variants of domino logic are introduced. 2

Domino Logic Family Outline. Dynamic/domino logic; domino logic; timing of domino logic; noise issues and keepers; dual-rail domino logic (Dynamic DCVS) and other domino styles. 3

Review: Pre-charged Logic (1). We saw before that the main disadvantage of pseudo-nMOS logic is the static current that it consumes. One way to get rid of it is to build a dual pMOS stack to cut this static current path (CMOS). Another approach to eliminating this static current is pre-charging. What do we mean by pre-charging? Before each evaluation phase, pre-charge the output high. Execution of the Boolean expression then either discharges the output or leaves it high. A single low-to-high transition on an input is allowed during evaluation, but NOT a high-to-low transition. [Figure: pseudo-nMOS gate (static current), CMOS gate (dual pMOS network), and pre-charged gate, with precharge/evaluate clock phases for inputs A and B; non-overlapping clocks are good, but not always possible.] 4

Review: Precharged Logic (2). Implement the logic function with an nMOS pull-down stack, as in pseudo-nMOS. A single clock signal can be used: clk low = pre-charge, clk high = evaluate. These gates cannot be cascaded directly, even if complementary clocks are used for alternating stages. They are constrained by the low-to-high transition requirement at the input during evaluation, and a pre-charged output starts high and can only fall. We need to put an inverting stage between them: Domino Logic. 5

Review: Domino Logic. [Figure: clocked dynamic stage with pre-charged node X, followed by a static gate whose output rises monotonically; the static gate can be any inverting static CMOS gate (NAND, NOR, etc.).] During pre-charge: the output of the dynamic stage (X) is pre-charged high when clk is low, so a domino gate output driving the input of another is always low during pre-charge. During evaluate: X is conditionally discharged, and the output of the static buffer rises monotonically. It is impossible for the buffer output to go from high to low during evaluation. 6
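The cascading restriction and the domino fix can be illustrated with a small behavioral model. This is a minimal sketch, not a circuit simulation: each dynamic node is a bit that starts at 1 after pre-charge and can only fall, every stage has one unit of delay, and the function name `cascade` and its arguments are hypothetical.

```python
def cascade(a, domino, steps=4):
    """Two cascaded dynamic stages, each computing NOT of its input.

    a      -- primary input (0 or 1), stable during evaluation
    domino -- if True, a static inverter sits between the stages
    Returns the final values of the two dynamic nodes (x1, x2).
    """
    x1 = x2 = 1  # both dynamic nodes are pre-charged high
    for _ in range(steps):
        # Input seen by stage 2, based on the *old* value of x1 (unit delay).
        in2 = (1 - x1) if domino else x1
        # A dynamic node discharges when its pull-down conducts and
        # can never be pulled back up during evaluation.
        x1 = 0 if a == 1 else x1
        x2 = 0 if in2 == 1 else x2
    return x1, x2


for a in (0, 1):
    _, x2_direct = cascade(a, domino=False)
    _, x2_domino = cascade(a, domino=True)
    # Directly cascaded, stage 2 should end at NOT(NOT(a)) = a.
    # With domino, the buffered output NOT(x2) should equal a.
    print(a, "direct x2 =", x2_direct, "| domino out =", 1 - x2_domino)
```

In the direct cascade, stage 2 sees the still-high pre-charged node at the start of evaluation and discharges for either input value, so it is wrong for a = 1. With the inverter in between, stage 2 sees a low input until stage 1 actually discharges.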

Review: Domino Chains. [Figure: chain of domino gates, each with an nMOS pull-down network.] Cascaded gates can be switched from PRECHARGE to EVAL on the same clock edge. Logic decisions propagate through the cascade (or chain) like a row of falling dominos. The length of a domino chain is limited by the EVAL time: logic must propagate to the output before clk falls. Inputs to a domino stage must be held stable during EVAL. Domino gates are ratioless. All domino gates are NONINVERTING (no XOR function). 7

Review: Delay in Domino Circuits. [Figure: clocked domino gate example with transistor widths of 4, 8, and 16.] Eliminating fat, slow pMOS transistors gives less input capacitance for the same drive strength (lower logical effort) and reduces diffusion capacitance. A domino gate has a lower switching threshold, so it starts switching sooner. There is no contention between pull-up and pull-down. 8

Review: Logical Effort of Dynamic Gates. [Figure: dynamic gate schematics annotated with transistor widths (3 and 2) and logical efforts LE = 1 and LE = 2/3.] What about the foot transistor? Does it need to be sized the same? A NAND structure might not need a footing transistor. 9

Review: Precharged NAND Decoder. Decoders are generally built with NAND gates. If you don't use clocked evaluation transistors, you can get lower logical effort. Here we use pre-charged NAND gates with skewed inverters afterward, and assume the inputs are pulses. [Figure: clocked NAND gate and skewed inverter with device widths W/2, W, 2W, and 4W.] Average logical effort is sqrt(2/3 * 5/6) = 0.75. 10
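The average quoted on the slide is the geometric mean of the two per-stage logical efforts. As a quick check, here is a minimal sketch; the helper name `average_logical_effort` is hypothetical, and which stage carries which value is not recoverable from the transcript.

```python
import math


def average_logical_effort(efforts):
    """Geometric mean of a list of per-stage logical efforts."""
    return math.prod(efforts) ** (1 / len(efforts))


# The two stage efforts quoted on the slide.
avg = average_logical_effort([2 / 3, 5 / 6])
print(f"{avg:.3f}")  # prints 0.745, which the slide rounds to 0.75
```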

Monotonic Edge Optimization. We care most about evaluation speed, so skew the static gate to favor the input falling edge (output rising edge): use high-skewed CMOS gates (pMOS >> nMOS). Caveats: degraded noise margins and slower pre-charge time. Structuring logic into dynamic and static gates is an art form. The static gate favors NAND (since series pMOS devices are slow); the dynamic stage allows more series devices. [Figure: dynamic stage followed by static stage, with transistor widths of 8 and 16.] 11

Clocked Evaluation Transistor. The clocked evaluation transistor is not strictly necessary. It can be removed if all the inputs are provably low during pre-charge; other domino gate outputs satisfy this condition. It is also okay if high inputs are in series with a provably low input. Delay the pre-charge edge to reduce the power burned at the start of pre-charge. [Figure: chain of domino gates clocked by successively delayed clocks clk, clkd, and clkdd.] 12

Pre-charge Properties. Many domino gates can evaluate in one half-cycle, so it should be easy to pre-charge a single domino gate in the other half-cycle. But: the domino gate must pre-charge enough to flip the high-skew gate, and the high-skew gate output must then fall below V_t by a sufficient noise margin before evaluation starts again. To speed up domino evaluation, we want a small pre-charge transistor (small diffusion parasitic capacitance), which makes pre-charge slow. The high-skew gate falls very slowly. Delaying the clock to avoid pre-charge contention in un-clocked pull-down stacks reduces the pre-charge time for the clkdd domino gate. Cycles are getting shorter. Advanced domino methodologies stretch the evaluation phase at the expense of pre-charge time. Bottom line: pre-charge time is becoming an important issue. Size for roughly equal pre-charge and evaluate times. 13

Clocking for Domino Circuits (1). Make sure that the half-cycle during pre-charge is not wasted: use clk for one domino chain and clk_b for the second domino chain. Data transfers from one phase (chain) to the next. A latch is needed between the phases since the data is gone during pre-charge. If pre-charge comes early, we may lose the data. [Figure: a clk_b latch feeding alternating clk_b domino and static gates, followed by a clk latch feeding alternating clk domino and static gates. Legend: Domino = one inverting dynamic gate; Static = one inverting static gate; Latch = inverting tristate latch.] Source: D. Harris 15

Clocking for Domino Circuits (2). Domino doesn't look so attractive in the context of a traditional pipeline. We pay clock skew twice in each cycle. Balancing short phases is difficult since there is no time borrowing. Latches become a significant fraction of the cycle time. [Figure: a clk_b latch feeding alternating clk_b domino and static gates, followed by a clk latch feeding alternating clk domino and static gates. Legend: Domino = one inverting dynamic gate; Static = one inverting static gate; Latch = inverting tristate latch.] Source: D. Harris 16

Domino-clocking Evaluation. Let T = cycle time = 16 FO4 delays; t_skew = 2; t_setup = 1. It is difficult to fill the cycle exactly (no time borrowing), so t_imbalance = 1. T_phase-logic = T/2 - t_skew - t_setup - t_imbalance. Baseline design: T_phase-logic = 8 - 2 - 1 - 1 = 4, so 50% of the phase is wasted in overhead! Slower than static! Optimized design: define clock domains and use t_skew-local = 1; work hard to balance logic between phases so that t_imbalance = 0 (optimistic). T_phase-logic = 8 - 1 - 1 - 0 = 6. Still, 25% of the phase is overhead! Source: D. Harris 17
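The overhead percentages follow directly from the phase-budget formula on this slide. A minimal sketch of the arithmetic, using the slide's numbers (in FO4 delays); the function names are hypothetical.

```python
def phase_logic_time(T, t_skew, t_setup, t_imbalance):
    """Time left for logic in one half-cycle phase."""
    return T / 2 - t_skew - t_setup - t_imbalance


def overhead_fraction(T, t_skew, t_setup, t_imbalance):
    """Fraction of the half-cycle phase lost to overhead."""
    return 1 - phase_logic_time(T, t_skew, t_setup, t_imbalance) / (T / 2)


baseline = overhead_fraction(T=16, t_skew=2, t_setup=1, t_imbalance=1)
optimized = overhead_fraction(T=16, t_skew=1, t_setup=1, t_imbalance=0)
print(baseline, optimized)  # prints 0.5 0.25
```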

Early Enhancements. Good designers have recognized this problem for years. The largest problem is the hard edges set by the latches. A variety of latches soften this edge. [Figure: SR latch (driven from domino), dual-monotonic latch, and TSPC latch.] Source: D. Harris 18

Skew-tolerant Domino Clocking. How much clock skew could we tolerate given N clock phases? Divide the logic into N phases of T/N duration each. Overlapping clocks eliminate the need for latches. Extra overlap accommodates clock skew and time borrowing. As with other domino techniques, budget skew on the transition from static to domino. [Figure: alternating static and domino gates clocked by overlapping phases 1 and 2.] 19

Skew Tolerance. T = t_e + t_p, where t_p = t_prech + t_skew and t_e = T/N + t_skew + t_hold. Hence t_skew-max = [T(N-1)/N - t_prech - t_hold] / 2. [Figure: clock phases 1 and 2 (edges 1a, 1b, 2a) driving alternating static and domino gates, showing t_p, t_e, and the effective precharge window; consecutive phases must overlap by t_hold.] 20
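The step from the three definitions to the skew bound is a single substitution. Written out, using only the symbols defined on this slide:

```latex
\begin{align*}
T &= t_e + t_p \\
  &= \left(\frac{T}{N} + t_{skew} + t_{hold}\right) + \left(t_{prech} + t_{skew}\right) \\
2\,t_{skew} &= T - \frac{T}{N} - t_{prech} - t_{hold} \\
t_{skew\text{-}max} &= \frac{1}{2}\left[\frac{T(N-1)}{N} - t_{prech} - t_{hold}\right]
\end{align*}
```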

Time Borrowing. If we overlap the phases some more, we can provide a region that allows time borrowing between the phases. Both phases are high for a longer period of time. This helps with logic granularity. [Figure: phase overlap annotated with t_borrow, t_overlap, t_hold, and t_skew.] 21

Numerical Example. Assume that T_cycle = 16. Let t_prech = 4, long enough to precharge the domino gate and make the subsequent skewed static gate fall below V_t. t_hold is slightly negative for reasonable cell libraries (the next phase can evaluate before precharge ripples through the static gate), so conservatively bound t_hold at 0.

N    t_skew   t_p
2    2        6
3    3.33     7.33
4    4        8
6    4.67     8.67
8    5        9

Sweet spots: N=2 (fewest clocks), N=4 (good tolerance, 50% duty cycle). 22
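The table values can be reproduced from the skew-tolerance bound t_skew-max = [T(N-1)/N - t_prech - t_hold] / 2 together with t_p = t_prech + t_skew. A minimal sketch, with a hypothetical function name and values rounded to two decimals:

```python
def max_skew(T, N, t_prech, t_hold=0.0):
    """Maximum tolerable skew for N overlapping clock phases."""
    return (T * (N - 1) / N - t_prech - t_hold) / 2


T_CYCLE = 16   # cycle time from the slide
T_PRECH = 4    # precharge time from the slide

print("N  t_skew  t_p")
for n in (2, 3, 4, 6, 8):
    skew = max_skew(T_CYCLE, n, T_PRECH)
    t_p = T_PRECH + skew  # precharge window = precharge time + skew
    print(f"{n}  {skew:.2f}    {t_p:.2f}")
```

The skew tolerance grows with N but with diminishing returns, which is consistent with the slide singling out N=2 and N=4.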

Aside: 4-Phase Skew-Tolerant Domino. We don't need to worry about data flowing through phases 1, 2, 3, and 4 within one cycle, so there is no min-delay constraint. There is lots of overlap for skew tolerance and time borrowing. 23

Some Design Issues. State is no longer stored in a latch at the end of a phase. Instead, it is held by the first domino gate in the phase; use a full keeper to allow stop-clock operation. [Figure: first domino gate of phase 2, driven from the phase 1 block, with a weak full keeper.] All systems with overlapping clocks require min-delay checks. Domino paths are presumably critical anyway, so there are few min-delay errors. 4-phase has effectively no min-delay risk: the overlap of all four phases is at most very small, and a minimum of 8 gates are in the cycle anyway. 24

Pulse Stretching and Shrinking. Stretch pulses by 2 inverter delays using an even number of inverters: when the input transitions HIGH, the (inverted) output holds its value for 2 inverter delays after the input changes back. Create a pulse with a width of only 3 inverter delays: when the input transitions HIGH, both gate inputs are HIGH (output LOW) for 3 inverter delays. [Figure: timing diagrams for both circuits; each tick = t_inv.] 25
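Both circuits can be mimicked with a unit-delay model in which each tick is one inverter delay. This is a minimal sketch under stated assumptions: the combining gates are idealized as zero-delay, the stretcher is modeled as an OR of the input with a copy delayed by two inverters, and the pulse generator as a NAND of the input with a copy passed through three inverters. These topologies and all function names are assumptions, since the slide's schematics are not in the transcript.

```python
def inverter_chain(wave, n):
    """Pass a 0/1 waveform through n inverters, one tick of delay each.

    The waveform is assumed to have been steady at wave[0] before t = 0.
    """
    out = list(wave)
    for _ in range(n):
        # Steady-state initial output, then the complement one tick late.
        out = [1 - out[0]] + [1 - v for v in out[:-1]]
    return out


def pulse_stretcher(wave):
    """OR the input with a copy delayed by two inverters."""
    delayed = inverter_chain(wave, 2)
    return [a | b for a, b in zip(wave, delayed)]


def pulse_generator(wave):
    """NAND the input with its inverted copy delayed by three inverters."""
    delayed_inv = inverter_chain(wave, 3)
    return [1 - (a & b) for a, b in zip(wave, delayed_inv)]


pulse = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]   # 3 ticks wide
step = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]    # input transitions HIGH
print(pulse_stretcher(pulse))  # prints [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
print(pulse_generator(step))   # prints [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
```

The stretched pulse is two ticks wider than the input, and the generated pulse is LOW for exactly three ticks after the rising input edge.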

Multiphase Clock Generation. Generating precisely shaped clocks is not easy. Fortunately, it doesn't need to be terribly precise. 2-phase clocking: phases 1 and 2 are non-overlapping. In this design, the length of the ck non-overlap does not scale with frequency. Use pulse stretchers to guarantee overlap, and control the overlap with inverters. 4-phase clocking often needs well-controlled delay lines. [Figure: clock generator driven by ck_in, with a clock complement, quarter-period (1/4 t_per) delays, and pulse wideners producing phases 1 to 4.] 26

Example: 2-Phase Time Borrowing. Time borrowing in the Itanium (Rusu00): use 4 clock phases. Clkd overlaps with both clkb and clk to allow borrowing between Phase 1 and Phase 2, instead of requiring exactly 180° overlapping clocks. 27

N-phase Skew-Tolerant Domino. The idea is to delay the clock along with the data flow. We can't delay by too much: more than T_cycle/2 in case (a), or more than T_cycle in case (b), would cause improper timing. The last phase (phase 6) needs to arrive before the next phase 1 arrives. Phases are not necessarily uniform. 28

Interfacing with Static Logic (1). When a domino output drives static logic, the pre-charge phase must be eliminated. Follow the pre-charged gate with a latch (Itanium 2). It evaluates low when the clock transitions HIGH. When the pre-charged data (X) evaluates, the output transitions HIGH (or stays LOW). The output stays stable during pre-charge because the latch is non-transparent when the clock is LOW. 29

Interfacing with Static Logic (2). When static logic outputs drive the first domino stage, capture the data with a flip-flop or latch so that the data does not transition during evaluate, or constrain it in some way so that only rising edges are allowed. UltraSPARC and Itanium 2 both use a latch that only allows the output to transition from low to high. The latch is pulsed, conducting LOW for only 3 inverter delays. An input that arrives before the rising edge is latched. A rising-edge input that arrives during the pulse is also latched. This essentially gives a small degree of time borrowing. 30

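Noise in Domino Design #1: Charge Leakage. [Figure: dynamic gate with input A and clock CLK, showing subthreshold and junction leakage paths from Out; waveform of V_Out drooping during evaluate after precharge.] Because the stored charge leaks away, there is a minimum clock rate, on the order of kHz. 32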
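A back-of-the-envelope estimate shows where a kHz-scale minimum clock rate can come from: the dynamic node must not droop by more than some allowed voltage before the next pre-charge. This is a minimal sketch in which every numeric value is an illustrative assumption (none appears in the slides), the leakage current is treated as constant, and the function names are hypothetical.

```python
def hold_time(c_node, delta_v, i_leak):
    """Time for a constant leakage current to droop the node by delta_v."""
    return c_node * delta_v / i_leak


def min_clock_frequency(c_node, delta_v, i_leak, duty=0.5):
    """Lowest clock rate whose evaluate phase fits within the hold time."""
    t_eval_max = hold_time(c_node, delta_v, i_leak)
    return duty / t_eval_max


# Assumed values for illustration only.
C_NODE = 10e-15    # 10 fF dynamic node capacitance
DELTA_V = 0.2      # 0.2 V of tolerable droop
I_LEAK = 10e-12    # 10 pA total leakage

f_min = min_clock_frequency(C_NODE, DELTA_V, I_LEAK)
print(f"hold time ~ {hold_time(C_NODE, DELTA_V, I_LEAK):.1e} s, "
      f"f_min ~ {f_min:.0f} Hz")
```

With these assumed numbers the node holds its value for about 0.2 ms, giving a minimum clock rate of a few kHz. Larger leakage or smaller node capacitance pushes the minimum rate up.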

Noise in Domino Design #2: Coupling and Ground Bounce. The output of a dynamic gate is a floating node. Coupling onto the dynamic node can cause the static gate to glitch. Input glitches can discharge the dynamic node; the portion of the glitch above V_t is what matters. Ground bounce can cause a glitch or turn on the nMOS pull-down. [Figure: coupling and ground bounce paths into a dynamic node driving a high-skew gate.] 33

Noise in Domino Design #3: Backgate Coupling. [Figure: dynamic NAND (inputs A and B, output out1) driving a static NAND (inputs out1 and in, output out2); simulated waveforms of out1, in, and out2, voltage versus time in ns.] 34

Domino Noise Margin: Keepers. The dynamic output may be corrupted by subthreshold leakage or α-particles. Use a weak keeper to make the dynamic node static. The keeper doesn't help much with charge sharing and output coupling because it is so small. It also degrades evaluation speed. Prefer a separate inverter for the keeper: this allows complex static gates and minimizes noise coupled onto the keeper. A dual-gate keeper minimizes the load on tiny gates. [Figure: weak keeper, and a keeper for tiny domino gates with devices labeled minimum and long.] 35

Delayed Keepers. Weakened keepers are not as effective at restoring a degraded voltage. To avoid fighting, we can turn on a stronger keeper after a small delay (Alvandpour02), (Allam01), (Jung01). In (b), x floats momentarily. The key is to not delay by too much: restore before too much charge is gone, but do not start the keeper before all the inputs have arrived. This works best with the static logic interface (when all inputs are stable). 36

Issue in Domino Design #5: Charge Sharing. Domino designs often fail due to charge sharing if internal nodes are not considered. It occurs when an internal node was low: a capacitive divider is formed with the output. Reduce charge sharing by reducing the capacitance of internal nodes relative to the capacitance of the load; high-fanout gates suffer least from charge sharing. Pre-charge internal nodes where necessary with secondary pre-charge devices (generally, every other node suffices). [Figure: dynamic gate with output capacitance C_out and internal node x with capacitance C_x, where out goes to a high-skew gate; waveforms of clk, in, out, and x for C_x = C_out.] 37
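The capacitive divider can be quantified with charge conservation: when the pre-charged output node is connected to an internal node, the total charge redistributes over both capacitances. A minimal sketch with a hypothetical function name and normalized values:

```python
def shared_voltage(vdd, c_out, c_x, v_x_initial=0.0):
    """Output voltage after the output (at vdd) shares charge with node x."""
    total_charge = c_out * vdd + c_x * v_x_initial
    return total_charge / (c_out + c_x)


# The slide's case: C_x = C_out, internal node initially low.
print(shared_voltage(vdd=1.0, c_out=1.0, c_x=1.0))  # prints 0.5

# A larger load relative to the internal node reduces the droop.
print(shared_voltage(vdd=1.0, c_out=4.0, c_x=1.0))  # prints 0.8

# Pre-charging the internal node removes the droop entirely.
print(shared_voltage(vdd=1.0, c_out=1.0, c_x=1.0, v_x_initial=1.0))  # prints 1.0
```

With C_x = C_out the output falls to half of Vdd, which is enough to flip a high-skew gate. The other two calls illustrate the two remedies named on the slide.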

Pre-charging Internal Nodes. Normally, internal nodes are pre-charged with small pMOS devices. It is not crucial to get the node to 100% of Vdd, just to reduce noise. Gates actually run faster when some charge sharing occurs, since less capacitance needs to be pulled all the way down. Sometimes an internal node is pre-charged to Vdd - V_t with an nMOS device. We might even pre-discharge an internal node to speed it up: the worst case for speed is with the node high, and the worst case for noise is with the node low, so if we can tolerate the noise with the node low, we might improve the speed by guaranteeing the node is low. Use a small nMOS device (make sure it is off during evaluation). A node can only be pre-discharged if no path to Vdd can possibly exist. Be sure that the noise is tolerable for all cases when doing this! [Figure: example gate with inputs A and B and output O.] 38

Domino Pitfalls Review. There are lots of ways that domino circuits can fail: charge sharing and leakage; noise coupling onto the output (crosstalk); an α-particle hit, sub-threshold leakage, or substrate charge injection on the dynamic node; power supply noise (especially ground bounce). Fortunately, these are all relatively easy to check with ERC (Electrical Rule Check) and DRC (Design Rule Check) tools. Microprocessor companies routinely build reliable domino datapaths these days. 39

Non-monotonic Logic. A domino gate plus high-skew gate pair can only implement non-inverting ("monotonic") functions. Many important functions are non-monotonic, such as XOR. [Figure: clocked XOR pull-down network with inputs a, a_b, b, and b_b.] One solution: push the non-monotonic function to the end of the logic cone. Build the first part of the cone in domino gates, then switch to static or transmission-gate logic for the non-monotonic part. Example: a carry-select adder often uses a static mux. 41
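Whether a function is monotonic in this sense can be checked by brute force: raising any single input from 0 to 1 must never make the output fall. A minimal sketch, with the helper name `is_monotonic` being hypothetical:

```python
from itertools import product


def is_monotonic(f, n_inputs):
    """True if raising any one input from 0 to 1 never lowers f's output."""
    for bits in product((0, 1), repeat=n_inputs):
        for i in range(n_inputs):
            if bits[i] == 0:
                raised = bits[:i] + (1,) + bits[i + 1:]
                if f(*raised) < f(*bits):
                    return False
    return True


print(is_monotonic(lambda a, b: a & b, 2))  # AND: prints True
print(is_monotonic(lambda a, b: a | b, 2))  # OR:  prints True
print(is_monotonic(lambda a, b: a ^ b, 2))  # XOR: prints False
```

AND and OR pass, so they map directly onto domino gates. XOR fails because raising b while a = 1 makes the output fall, which a pre-charged node followed by an inverting static gate cannot do during evaluation.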

Dual-Rail Domino. We can overcome this problem by computing both true and complementary outputs with dual-rail domino, also known as Differential Cascode Voltage Switch (DCVS) logic. Compute out_h and out_l; the two pull-down networks may be able to share transistors (merge into a single pull-down network). out_h is asserted when the output evaluates high; out_l is asserted when the output evaluates low. Asserting both out_h and out_l is illegal. Both out_h and out_l are unasserted during pre-charge. [Figure: dual-rail gate with clocked pull-down networks F and its complement, producing out_h and out_l from inputs a and b.] 42
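The dual-rail encoding can be sketched behaviorally for XOR, the function that single-rail domino cannot build. This is a minimal model, not a transistor-level description: each signal is a pair of rails (h, l), both rails are 0 during pre-charge, and the function names are hypothetical.

```python
def encode(bit):
    """Dual-rail encoding of a valid logic value: (h, l)."""
    return (bit, 1 - bit)


def dual_rail_xor(clk, a, b):
    """Dual-rail domino XOR; a and b are (h, l) rail pairs."""
    if clk == 0:
        return (0, 0)  # pre-charge: both outputs unasserted
    a_h, a_l = a
    b_h, b_l = b
    # Each rail is a monotonic AND/OR function of the input rails.
    out_h = (a_h & b_l) | (a_l & b_h)
    out_l = (a_h & b_h) | (a_l & b_l)
    return (out_h, out_l)


for a in (0, 1):
    for b in (0, 1):
        out_h, out_l = dual_rail_xor(1, encode(a), encode(b))
        print(a, b, "->", out_h, out_l)
```

Exactly one output rail is asserted for every valid input pair, and out_h equals a XOR b. If the inputs are themselves still pre-charged (both rails 0), neither output rail asserts, so the gate waits for its inputs.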

Keepers for DCVS. Keepers are the same idea. Since the outputs are differential, the keepers can be cross-coupled. [Figure: two versions of a pull-down tree with output F and its complement and keeper devices m1 and m2.] 43

Multiple-Output Domino. MODL (Hwang89): opportunistic reuse of logic. This is particularly true of a pre-charged carry-propagate chain, which can be thought of as one big gate. 44

Diode-Footed Domino. [Figure: domino gate with output Out, load C_L, diode foot, current mirror, and clocks CLK and CLK_b.] The stacking reduces leakage. The current mirror and feedback increase the speed. 45

Operation: Pre-Charge Phase. [Figure: diode-footed domino gate during pre-charge, with CLK = 0 and CLK_b = VDD; node transitions are annotated as 0 -> VDD and VDD -> 0.] 46

Operation: Evaluate Phase. [Figure: diode-footed domino gate during evaluate (CLK = VDD, CLK_b = 0), shown for a case where the output stays at VDD and a case where it discharges from VDD to 0, with the foot node voltage V_x marked.] V_x has a finite voltage due to leakage current; the stack of 2 reduces leakage. The initial discharge is due to charge sharing. The current mirror provides a faster discharge path, and feedback provides the remaining discharge. 47

Simulations. Noise immunity test: apply an input noise pulse and increase it until the noise gain reaches unity. [Figure: simulated waveforms for normal operation.] 48

Noise Immunity of DFD 49

Summary. Dynamic logic is based on optimizing for one edge of evaluation. To eliminate the other edge, a pre-charge phase is introduced. Timing is a critical element of the design. Because one of the nodes is dynamic, noise is another critical design constraint. Large internal capacitance can lead to a bad delay-robustness tradeoff. Large fan-in can be challenging (especially ANDs). Monotonicity forces us to build dual-rail, making ANDs unavoidable. Diode-footed domino is one attempt at pushing the tradeoff to a different point. (We'll see many more.) 50