GHz Asynchronous SRAM in 65nm
Jonathan Dama, Andrew Lines
Fulcrum Microsystems
Context
- Three generations in production, including:
  - Lowest-latency 24-port 10G L2 Ethernet switch
  - Lowest-latency 24-port 10G L3 switch/router
- Higher frequencies, lower latencies
Design Methodology
- Mostly quasi delay-insensitive (QDI): PCHB, PCFB, WCHB templates; 18 transitions per cycle
- Islands of synchronous standard flow (GALS)
- Additional timing assumptions in key circuits:
  - Register files (unacknowledged bit-writes)
  - Dense SRAM (many)
  - TCAM (trickiest)
Outline
- 10T register files
- 6T SRAM bank and analog verification
- Multibank 6T
- Dual-ported SRAMs (SDP/DDP/CDP)
- Design for test (scan)
- Design for yield (repair)
- Soft-error tolerance
- Performance analysis
10T Memories: Fast, Safe
- 10T state-bit (11T including reset)
- Uses foundry 6T ratios; design-rule correct
- Up to 32 bits and 32 addresses
- Supports masked writes
- Single- and dual-ported control versions
- Custom handshakes replace control for particular purposes (FIFOs and SHELFs)
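As an illustration of the masked-write capability listed above, a masked write updates only the bit positions selected by the mask. The following one-liner is an assumed behavioral model, not the circuit:

```python
# Behavioral model of a masked write: bits where mask = 1 take the new
# data, bits where mask = 0 keep the stored word. Widths are arbitrary.
def masked_write(word, data, mask):
    """Return word with the mask-selected bits replaced by data's bits."""
    return (word & ~mask) | (data & mask)

assert masked_write(0b1010, 0b0101, 0b0011) == 0b1001
```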
10T Memories: Structure
[Block diagram: BIT ARRAY with WRITE port W and READ port R, write rails _w.0/_w.1 and read rails _r.0/_r.1, a DI interface of e1ofn/e1of1 channels JW, KW, JR, KR, and a DECODE block]
6T Memories: Dense
- 6T state-bit (TSMC); (carefully) violates DRC
- Different implant than normal logic; validated ratio assumptions
- Bank: up to 16 bits and 1024 addresses
- 4-way set muxing; 8-way 2nd-level buses
- 32 bits per bit-line
- Fully pipelined to arbitrary width and depth
6T Bank: Bit and Set
[Schematic: state-bit with precharge, write, set-mux, and read stages; signals W, R, B, A, S, and Go]
6T Bank: Two Chunks (2x 128 addresses in 4 sets)
6T Bank: Top-level Structure
[Block diagram: CTRL and four DEMUXes feeding four rows of four CHUNKs; per-row e1of4[2] R and W DATA channels; e1of2 I and e1of4[5] A inputs]
6T Bank: Address Decoding
- DEMUX takes 5 1of4s as input
- 256 address lines decoded with AND4s
- 8 groups (half chunk) of 4 set lines
- Decoder transitions are treated as digitally isochronic
- CHUNKs are power-gated
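As a sanity check on the arithmetic, the decode can be sketched in Python. The split of the five 1-of-4 codes, four ANDed into 4^4 = 256 one-hot address lines and one driving the 4 set lines, is an assumption for illustration, not the actual netlist:

```python
from itertools import product

def onehot(i):
    """A 1-of-4 code with rail i asserted."""
    return tuple(1 if j == i else 0 for j in range(4))

def decode(codes):
    """codes: five 1-of-4 codes (4-tuples, exactly one rail high)."""
    assert len(codes) == 5 and all(sum(c) == 1 for c in codes)
    addr_codes, set_code = codes[:4], codes[4]
    # Each of the 256 address lines is the AND4 of one rail per code.
    addr_lines = [all(c[i] for c, i in zip(addr_codes, idx))
                  for idx in product(range(4), repeat=4)]
    set_lines = [bool(r) for r in set_code]
    return addr_lines, set_lines

addr, sets = decode([onehot(0), onehot(0), onehot(0), onehot(3), onehot(2)])
assert sum(addr) == 1 and addr.index(True) == 3   # exactly one address line high
assert sets == [False, False, True, False]        # set 2 selected
```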
6T SRAM: Bank
6T Bank: Analog Assumptions
- Common concerns:
  - Bit-line pull-down can overpower the state-bit while the pass-gate is open
  - Bit-lines held at or floating near Vdd don't write the state-bit while the pass-gate is open
  - Cap-coupling, slews, leakage
- Arising from implementation decisions:
  - Precharge interference with reads of unselected sets must hold those bit-lines above the switching threshold of the set-muxing NAND
  - Bit-lines float at Vdd briefly before address lines are asserted
6T Bank: Analog Assumptions
[Waveform plot, voltage vs. time: write overpowers the state-bit with the opposite state-bit rail forced to GND; state-bit rail (s) and bit-line rail (b) traces shown, with levels of 15% of Vdd and 6% of Vdd marked]
6T Bank: Timing Assumptions
- Read data is fully delay-insensitive (DI)
- Writes are not checked (~2:1 race)
- Bit-line precharge is not checked (~2:1 race)
- Neutrality of address decoding is implied by input neutrality; the decoded control is not checked
- Everything else is DI!
[Diagram: transition-count annotations (3T, 4T, 6T, 8T, 11T) on _w, the bit-line, and the 2nd-level bus]
6T Bank: Timing Assumptions (Write and Precharge Margins)
[Waveform plot, voltage vs. time, with write and precharge margins annotated: set/address closes and the bit-lines' precharge opens again; once the write begins, the opposing bit-line and state node are restored; with the pass-gate open, the bit-line initially reads before the state-bit writes successfully]
Dense Dual-Ported Memory Design
[Cell layouts: the 8T cell is 1.8x larger than the 6T cell]
Dual-Ported Memories
- 8T state-bit is 1.8x larger
- Address and bit-lines lengthen, reducing read performance
- Overhead increases with fewer bits per line
- Overall scaling worse than 1.8x at high frequencies: bit-line slew dominates
[Plot: area/bit (um^2) vs. frequency (GHz) for the 6T bank and the 8T bank]
6T SRAM: Multi-Bank Structure
[Block diagram: four 6T banks with per-bank WD/RD write and read buses, an address/control bus and crossbar, and channels I, A, WI, WA, RI, RA]
Cached Dual-Ported (CDP) SRAM
- Uses the same 6T high-current state-bits
- Dual-ported buses, single-ported banks
- Can read and write different banks at once
- Sideband cache SRAM of one bank in size (e.g. 1024 addresses)
- When attempting to read and write the same bank, divert the write to the cache
- Must victimize the old cache entry to the main banks, but this won't conflict with the read
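The diversion and victimization policy above can be sketched as a toy Python model. The class name, the one-entry-per-index sideband cache, and the busy_bank argument are illustrative assumptions, not Fulcrum's implementation:

```python
class CDPSRAM:
    """Toy model of a cached dual-ported SRAM built from single-ported
    banks: reads go to the banks, and a write aimed at the bank currently
    being read is diverted to a one-bank-sized sideband cache."""

    def __init__(self, num_banks, bank_depth):
        self.banks = [[0] * bank_depth for _ in range(num_banks)]
        self.cache = {}  # index -> (bank, data); at most one entry per index

    def write(self, bank, index, data, busy_bank=None):
        if bank != busy_bank:
            self.banks[bank][index] = data     # no conflict: write the bank
            return
        # Conflict: victimize any old cache entry at this index back to its
        # bank (per the slide, this flush does not conflict with the read),
        # then park the new write in the cache.
        if index in self.cache:
            vbank, vdata = self.cache.pop(index)
            if vbank != bank:                  # same bank+index: superseded
                self.banks[vbank][index] = vdata
        self.cache[index] = (bank, data)

    def read(self, bank, index):
        # The cache holds the newest copy when it has a matching entry.
        if index in self.cache and self.cache[index][0] == bank:
            return self.cache[index][1]
        return self.banks[bank][index]

m = CDPSRAM(num_banks=2, bank_depth=4)
m.write(0, 1, "red")                   # different bank: goes to the core
m.write(1, 0, "green", busy_bank=1)    # same bank as the read: diverted
assert m.read(1, 0) == "green"         # read sees the cached copy
```

The model mirrors the operation scenario on the following slides: a later conflicting write to another index evicts the old cache entry back to its bank, so the bank read and the cache write still proceed in the same cycle.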
Cached Dual-Ported (CDP) SRAM
[Block diagram: SDP core with Data 0 and Data 1, tags and control, and channels WD, RD, WI, WA, RI, RA]
CDP: Operation
- Write red to 0b10: directed to the core
- Write green to 0b01: directed to the cache, no eviction needed
[Diagram: tags, Bank 0, Bank 1, and cache state in the SDP core]
CDP: Operation
- Scenario: read Bank 1, index 0; write blue to Bank 1, index 1
- Green is evicted from the cache; blue is written to the cache to allow the read of red from the bank
[Diagram: tags, banks, and cache state, with the read, write, and flush paths highlighted]
[Layout: DUALSRAM16K_16]
CDP: Area Scaling
[Plot: area/bit at 1.1GHz, 1V, 125C vs. number of banks (4, 8, 16) for a reference 8T, the Fulcrum SDP (6T), and the Fulcrum CDP (6T)]
Post Silicon: Simulation vs. Silicon
[Plot: simulated and measured read/write frequency (GHz), and simulated read/write power (mW), vs. voltage (0.7-1.5V)]
Conclusions
- Quasi delay-insensitive design works as a cost-competitive, production-compatible methodology
- Targeted timing assumptions remain useful for aggressive frequency targets and area reduction
- We can build asynchronous SRAMs as dense as synchronous ones, and faster at similar densities
- 65nm development was successful, and its fruits are soon going into production