GHz Asynchronous SRAM in 65nm. Jonathan Dama, Andrew Lines Fulcrum Microsystems

Context
- Three generations in production, including:
  - Lowest-latency 24-port 10G L2 Ethernet switch
  - Lowest-latency 24-port 10G L3 switch/router
- Higher frequencies
- Lower latencies

Design Methodology
- Mostly Quasi Delay-Insensitive (QDI)
  - PCHB, PCFB, WCHB templates
  - 18 tpc
- Islands of synchronous standard flow (GALS)
- Additional timing assumptions in key circuits
  - Register files (unacknowledged bit-writes)
  - Dense SRAM (many)
  - TCAM (trickiest)
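The channels named later in the deck (e1of4, e1ofn) use 1-of-N codes. The sketch below is a minimal illustration of a 1-of-4 code with a neutral state and a four-phase handshake. It assumes the usual QDI convention that exactly one rail is asserted for valid data, all rails are low for neutral, and the "e" prefix denotes an enable wire; the function names are hypothetical and not taken from the slides.

```python
from typing import List, Optional, Tuple

NEUTRAL = [0, 0, 0, 0]  # all rails low: no data on the channel


def encode_1of4(value: int) -> List[int]:
    """Return four rails with exactly one asserted for a value in 0..3."""
    if not 0 <= value < 4:
        raise ValueError("a 1of4 code carries values 0..3")
    return [1 if i == value else 0 for i in range(4)]


def decode_1of4(rails: List[int]) -> Optional[int]:
    """Return None for neutral, the value for a valid code, else raise."""
    asserted = [i for i, r in enumerate(rails) if r]
    if not asserted:
        return None
    if len(asserted) > 1:
        raise ValueError("illegal code: more than one rail asserted")
    return asserted[0]


def handshake_trace(value: int) -> List[Tuple[List[int], int]]:
    """List the (rails, enable) states of one four-phase transfer.

    Assumption: enable high means the receiver is ready.
    """
    data = encode_1of4(value)
    return [
        (NEUTRAL, 1),  # idle: data neutral, receiver ready
        (data, 1),     # sender asserts one rail
        (data, 0),     # receiver acknowledges by lowering enable
        (NEUTRAL, 0),  # sender returns the rails to neutral
        (NEUTRAL, 1),  # receiver raises enable again
    ]
```

Because validity and neutrality are both detectable from the rails themselves, a receiver needs no clock to know when data has arrived, which is what lets most of the design stay delay-insensitive.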

Outline
- 10T Register Files
- 6T SRAM Bank and Analog Verification
- Multibank 6T
- Dual-ported SRAMs (SDP/DDP/CDP)
- Design for Test (scan)
- Design for Yield (repair)
- Soft Error Tolerance
- Performance Analysis

10T Memories: Fast, Safe
- 10T state-bit (11T including reset)
- Uses foundry 6T ratios
- Design-rule correct
- Up to 32 bits and 32 addresses
- Supports masked writes
- Single- and dual-ported control versions
- Custom handshakes replace control for particular purposes: FIFOs and SHELFs
[Figure: 10T state-bit schematic; signals _w.0, _w.1, _r.0, _r.1, JW, JR]

10T Memories: Structure
[Figure: block diagram with WRITE (W), BIT ARRAY, READ (R), and DECODE blocks; state-bit signals _w.0, _w.1, _r.0, _r.1, JW, JR; DI interface channels e1ofn JW, e1of1 KW, e1ofn JR, e1of1 KR]

6T Memories: Dense
- 6T state-bit (TSMC)
  - (Carefully) violates DRC
  - Different implant than normal logic
  - Validated ratio assumptions
- Bank: up to 16 bits and 1024 addresses
  - 4-way set muxing
  - 8-way 2nd-level buses
  - 32 bits per bit-line
- Fully pipelined to arbitrary width and depth

6T Bank: Bit and SET
[Figure: bit and set circuit with STATE-BIT, PRECHARGE, WRITE, SET-MUX, and READ stages; signals W0, W1, R0, R1, B0, B1, A0, S0, Go]

6T Bank: Two Chunks
[Figure: layout of two chunks; 2 x 128 addresses in 4 sets]

6T Bank: Top-level Structure
[Figure: four data rows, each with a DATA block on e1of4[2] R and e1of4[2] W channels serving four CHUNKs; a central CTRL block with inputs e1of2 I and e1of4[5] A drives four DEMUXes]

6T Bank: Address Decoding
- DEMUX takes 5 1of4s as input
- 256 address lines decoded with AND4s
- 8 groups (half chunk) of 4 set lines
- Decoder transitions are treated as digitally isochronic
- CHUNKs are power-gated
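The decode numbers above can be checked with a small behavioral sketch. It assumes, purely for illustration, that the least significant 1of4 digit selects the set and the other four digits feed the AND4 that picks one of 256 address lines; the slides do not state which digit plays which role, and all names below are hypothetical.

```python
from itertools import product

SETS = 4                 # 4-way set muxing
BITS_PER_BIT_LINE = 32   # cells on each bit-line
HALF_CHUNKS = 8          # 8 groups of set lines


def one_hot4(value):
    """Encode a value 0..3 as four rails, one asserted."""
    return [int(i == value) for i in range(4)]


def split_address(address):
    """Split a 10-bit address into five 1of4 digits, least significant first."""
    if not 0 <= address < 1024:
        raise ValueError("address must fit in five 1of4 digits")
    return [one_hot4((address >> (2 * i)) & 3) for i in range(5)]


def decode(address):
    """Return (address_lines, set_lines) as one-hot lists."""
    s, d1, d2, d3, d4 = split_address(address)  # assumed digit roles
    # One AND4 per address line: line index = a + 4*b + 16*c + 64*d.
    address_lines = [
        d1[a] & d2[b] & d3[c] & d4[d]
        for d, c, b, a in product(range(4), repeat=4)
    ]
    line = address_lines.index(1)
    half_chunk = line // BITS_PER_BIT_LINE  # 32 address lines per half chunk
    set_lines = [0] * (HALF_CHUNKS * SETS)
    set_lines[half_chunk * SETS + s.index(1)] = 1
    return address_lines, set_lines


# Capacity check: 32 bits per bit-line x 4 sets x 8 half chunks.
assert BITS_PER_BIT_LINE * SETS * HALF_CHUNKS == 1024
```

Under these assumptions every address asserts exactly one of the 256 address lines and one of the 32 set lines, and each pair is unique across the 1024 addresses.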

6T SRAM: Bank

6T Bank: Analog Assumptions
- Common concerns:
  - Bit-line pull-down can overpower the state-bit while the pass-gate is open
  - Bit-lines held at or floating near Vdd don't write the state-bit while the pass-gate is open
  - Cap-coupling, slews, leakage
- Arising from implementation decisions:
  - Precharge interference with reads of unselected sets must hold those bit-lines above the switching threshold of the set-muxing NAND
  - Bit-lines float at Vdd briefly before address lines are asserted

6T Bank: Analog Assumptions
[Figure: Write Overpowers State-Bit (opposite state-bit rail forced to GND). Voltage (V) vs. Time (ns), 2 to 2.5 ns. Traces: state-bit rail (s), bit-line rail (b). Annotations: 15% of Vdd, 6% of Vdd]

6T Bank: Timing Assumptions
- Read data is fully Delay-Insensitive (DI)
- Writes are not checked (~2:1 race)
- Bit-line precharge is not checked (~2:1 race)
- Neutrality of address decoding is implied by input neutrality; the decoded control is not checked
- Everything else is DI!
[Figure: transition-count diagram; labels _w, 2nd-level bus, bit-line; path lengths 3T, 8T, 4T, 4T, 11T, 6T, 8T]

6T Bank: Timing Assumptions, Write and Precharge Margins
[Figure: Write/Precharge Timing Margin. Voltage (V) vs. Time (ns), 11.4 to 12.4 ns. Annotations: state-bit writes successfully; write pull-down; write margin; precharge margin; set/address closes; bit-lines precharge; opens again; once write begins, opposing bit-line and state-node are restored; with pass-gate open, bit-line reads initially]

Dense Dual-Ported Memory Design
[Figure: 8T and 6T state-bit layouts side by side; the 8T cell is 1.8x larger]

Dual-Ported Memories
- 8T state-bit is 1.8x larger
- Address and bit-lines lengthen, reducing read performance
- Overhead increases with fewer bits per line
- Overall scaling worse than 1.8x at high frequencies: bit-line slew dominates
[Figure: Area/Bit (um^2) vs. Frequency (GHz), 0.8 to 1.3 GHz; series: 6T Bank, 8T Bank; annotation: 2x]

6T SRAM: Multi-Bank Structure
[Figure: four 6T banks attached to write buses (WD) and read buses (RD), with an address/control bus and an address/control crossbar; inputs I, A, WI, WA, RI, RA]

Cached Dual-Ported (CDP) SRAM
- Uses same 6T high-current state-bits
- Dual-ported buses, single-ported banks
- Can read and write different banks at once
- Sideband cache SRAM of one bank in size (e.g. 1024 addresses)
- When attempting to read and write the same bank, divert the write to the cache
- Must victimize the old cache entry to the main banks, but this won't conflict with the read

Cached Dual-Ported (CDP) SRAM
[Figure: block diagram with SDP Core, cache data (Data 0, Data 1), and Tags and Control; ports WD, RD, WI, WA, RI, RA]

CDP: Operation
- Write red to 0b10: directed to core
- Write green to 0b01: directed to cache, no eviction needed
[Figure: Tags (Bank 0, Bank 1), Data 0, Data 1, SDP Core, Tags and Control; ports WD, RD, WI, WA, RI, RA]

CDP: Operation
Scenario:
- Read Bank 1, Index 0
- Write blue to Bank 1, Index 1
- Green evicted from cache
- Blue written to cache to allow read of red from bank
[Figure: Tags (Bank 0, Bank 1), Data 0, Data 1, SDP Core, Tags and Control; ports WD, RD, WI, WA, RI, RA; legend: Read, Write, Flush]
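The CDP scenario can be replayed with a small behavioral model. This is a sketch under stated assumptions, not the Fulcrum implementation: the cache is modeled as direct-mapped by index with the bank number as its tag, a read is served before the write in the same cycle, and the class and method names are hypothetical. It does not model the Data 0 / Data 1 split shown in the figures.

```python
class CachedDualPort:
    """Behavioral model: single-ported banks plus a one-bank sideband cache."""

    def __init__(self, banks, depth):
        self.depth = depth
        self.banks = [[None] * depth for _ in range(banks)]
        self.cache_data = [None] * depth
        self.cache_tag = [None] * depth  # bank owning each entry, None if empty

    def _split(self, address):
        return divmod(address, self.depth)  # (bank, index)

    def cycle(self, read_addr=None, write_addr=None, write_data=None):
        """Perform at most one read and one write; return (read_data, log)."""
        log, read_data, busy_bank = [], None, None
        if read_addr is not None:
            rb, ri = self._split(read_addr)
            if self.cache_tag[ri] == rb:
                read_data = self.cache_data[ri]
                log.append("read cache")
            else:
                read_data = self.banks[rb][ri]
                busy_bank = rb  # this bank's single port is taken
                log.append("read bank")
        if write_addr is not None:
            wb, wi = self._split(write_addr)
            if self.cache_tag[wi] == wb:
                self.cache_data[wi] = write_data  # newest copy is in the cache
                log.append("write cache hit")
            elif wb != busy_bank:
                self.banks[wb][wi] = write_data
                log.append("write core")
            else:
                # Conflict: divert to the cache. The victim's tag differs from
                # wb, so its bank is not the one being read.
                victim = self.cache_tag[wi]
                if victim is not None:
                    self.banks[victim][wi] = self.cache_data[wi]
                    log.append(f"evict to bank {victim}")
                self.cache_tag[wi] = wb
                self.cache_data[wi] = write_data
                log.append("write cache")
        return read_data, log


mem = CachedDualPort(banks=2, depth=2)
mem.cycle(write_addr=0b10, write_data="red")                     # to core
# A concurrent read of Bank 0 is assumed here to force green into the cache.
mem.cycle(read_addr=0b00, write_addr=0b01, write_data="green")
data, log = mem.cycle(read_addr=0b10, write_addr=0b11, write_data="blue")
```

In the last cycle the model evicts green to Bank 0, places blue in the cache, and returns red from Bank 1, matching the sequence on the slide. The eviction never targets the bank being read because the victim's tag is, by construction, a different bank from the one the write and read share.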

DUALSRAM16K_16

CDP: Area Scaling
[Figure: Area/Bit at 1.1 GHz, 1 V, 125 C vs. Number of Banks (4, 8, 16); series: Reference 8T, Fulcrum SDP (6T), Fulcrum CDP (6T)]

Post Silicon: Simulation vs. Silicon
[Figure: Frequency (GHz) and Power (mW) vs. Voltage (V), 0.7 to 1.5 V; series: simulated read frequency, measured read frequency, simulated write frequency, measured write frequency, simulated read power, simulated write power]

Conclusions
- Quasi Delay-Insensitive design works as a cost-competitive, production-compatible methodology
- Targeted timing assumptions are still useful for aggressive frequency targets and area reduction
- We can build asynchronous SRAMs as dense as synchronous ones, and faster at similar densities
- 65nm development was successful and the fruits are soon going into production