ECE 485/585 Microprocessor System Design

Microprocessor System Design
Lecture 4:
- Memory Hierarchy
- Memory Taxonomy
- SRAM Basics
- Memory Organization
- DRAM Basics
Zeshan Chishti
Electrical and Computer Engineering Dept
Maseeh College of Engineering and Computer Science
Source: Lecture based on materials provided by Mark F.

Memory
"640K ought to be enough for anybody." -- Bill Gates, 1981

Outline
- Taxonomy of Memories
- Memory Hierarchy
- SRAM
  - Basic Cell, Devices, Timing
- Memory Organization
  - Multiple banks, interleaving
- DRAM
  - Basic Cell, Timing
- DRAM Evolution
- DRAM modules
- Error Correction
- Memory Controllers

Memory Taxonomy
- Read/Write Memory
  - Volatile
    - Random Access: SRAM, DRAM
    - Non-Random Access: Shift Register, FIFO, CAM
  - Non-Volatile: EPROM, E²PROM, Flash (NAND, NOR), NVRAM
- Read Only: Mask ROM, PROM

Computer Memory Hierarchy
From Hennessy & Patterson, Computer Architecture: A Quantitative Approach (4th edition)
[Figure: hierarchy levels, from closest to the processor to farthest]
- Processor (Datapath, Control) with Registers
- On-Chip Cache
- Second Level Cache (SRAM)
- Cached DRAM
- Third Level Cache (SRAM)
- Main Memory (DRAM)
- Secondary Storage (Disk)
- Tertiary Storage (Tape)
[Figure labels for typical contents: Intermediate results, Instructions, Data, File System, Paging, Cached Files, Archive, Backup]

Register Files
- General Purpose Registers
- Usually have multiple ports
  - Support CPU architecture's datapaths
  - Ability to read two operands, write one
- Operate at CPU speed
- For read operations, the register file is equivalent to a 2-D array of flip-flops with tri-state outputs
- For write operations, we add some additional circuitry to the basic cell
[Figure: Register File block with select inputs sel a, sel b, sel c and data ports data a, data b, data c]

Address Decoding
- Address decoder generates a one-hot code (1-of-n code) from the address (binary to unary)
- The output is used for row selection
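The binary-to-unary conversion above can be illustrated with a minimal behavioral sketch. The function name `decode_one_hot` and its parameters are chosen here for illustration and are not from the lecture.

```python
def decode_one_hot(address: int, n_bits: int) -> list[int]:
    """Return a 1-of-2**n_bits code: exactly one output line is 1."""
    n_lines = 1 << n_bits
    if not 0 <= address < n_lines:
        raise ValueError("address out of range")
    # Line k is asserted only when k equals the binary address.
    return [1 if line == address else 0 for line in range(n_lines)]


row_select = decode_one_hot(2, n_bits=2)
print(row_select)  # [0, 0, 1, 0]
```

Each output line would drive one row (word line) of the array, so an n-bit address needs 2^n decoder outputs.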

Accessing Register Files
- Read: address following
  - Change address
  - Data from new address appears on output
  - Asynchronous
- Write is synchronous
  - If WE, input data is written to selected word on the clock edge
[Figure: Register File block with inputs Clock, RegID, WE, Din and output Dout. Timing diagram: as RegID changes from X to Y, Dout changes from R[X] to R[Y]; the value on Din is written while WE is asserted]
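A small behavioral model may help contrast the two access styles. This is only a sketch: the class `RegisterFile`, the method `clock_edge`, and the 8-register size are assumptions made for illustration.

```python
class RegisterFile:
    """Behavioral sketch: asynchronous read, synchronous write."""

    def __init__(self, n_regs: int = 8):
        self.regs = [0] * n_regs

    def read(self, reg_id: int) -> int:
        # Asynchronous read: the output simply follows the address.
        return self.regs[reg_id]

    def clock_edge(self, reg_id: int, din: int, we: bool) -> None:
        # Synchronous write: takes effect only on a clock edge with WE set.
        if we:
            self.regs[reg_id] = din


rf = RegisterFile()
rf.clock_edge(reg_id=3, din=42, we=True)
rf.clock_edge(reg_id=4, din=99, we=False)  # WE not asserted: nothing written
print(rf.read(3), rf.read(4))  # 42 0
```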

Multi-ported Register File
- A memory unit with two output ports is said to be dual ported
- Two ways to implement a dual-ported register file
  - True ports: Single set of registers with duplicate data paths and access circuitry that enables two registers to be read at a time
  - Two copies: Use 2 memory blocks, each containing one copy of the register file
    - To read two registers, one register can be accessed from each file
    - To write a register, data needs to be written to both copies of that register
[Figure: Register File A and Register File B; Input Data C and Address C go to both files; Address A and Address B each select the output data of one file]
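The "two copies" approach can be sketched as follows. The class and attribute names (`TwoCopyRegisterFile`, `file_a`, `file_b`) are hypothetical; the point is only that reads go to separate copies while every write updates both.

```python
class TwoCopyRegisterFile:
    """Sketch of a dual-read-port register file built from two copies."""

    def __init__(self, n_regs: int = 8):
        self.file_a = [0] * n_regs  # serves read port A
        self.file_b = [0] * n_regs  # serves read port B

    def read(self, addr_a: int, addr_b: int) -> tuple[int, int]:
        # Each read port accesses its own copy, so both reads can
        # happen at the same time.
        return self.file_a[addr_a], self.file_b[addr_b]

    def write(self, addr_c: int, data_c: int) -> None:
        # A write must update both copies to keep them identical.
        self.file_a[addr_c] = data_c
        self.file_b[addr_c] = data_c


rf2 = TwoCopyRegisterFile()
rf2.write(1, 10)
rf2.write(2, 20)
print(rf2.read(1, 2))  # (10, 20)
```

The trade-off is roughly twice the storage in exchange for simpler single-ported cells.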

Static RAMs (SRAM)

SRAM Technology
- SRAM Cell: 6 transistors
- Write
  - Write bit and its complement (bit-bar) onto bit lines
  - Select desired word ("row")
  - Turns on pass transistors
  - Writes new value to cell [One inverter input will be low, turning its output high]
- Read
  - Select desired word ("row")
  - One bit line will be pulled low
  - Other will remain high
- For density and low power, want tiny transistors, but they can't drive long bit lines
  - Solution: Pre-charge bit lines (Vdd/2) before read
  - Sense differential between bit and bit-bar
[Figure: 6-transistor SRAM cell with word line, complementary bit lines, addr and data]

Dual-ported Memory Internals
- Add decoder, another set of read/write logic, bit lines, word lines
- Example cell: SRAM
- Repeat everything but cross-coupled inverters.
- This scheme extends up to a couple more ports, then need to add additional transistors.
[Figure: cell array with two decoders (dec a, dec b), word lines WL1 and WL2, bit line pairs b1 and b2, two sets of r/w logic, address ports and data ports]

Basic SRAM
- Size in bits (organization)
  - 1Mb (256K x 4): 256K words of 4 bits
  - 1Mb (128K x 8): 128K words of 8 bits
- Most control signals are active Low
  - Chip Select (/CS): effectively an enable
  - Write Enable (/WE): controls read/write
- To perform a write
  - /WE is asserted (Low)
  - /CS is asserted (Low)
- To perform a read
  - /WE is de-asserted (High)
  - /CS is asserted (Low)
[Figure: 2^n x b RAM with address inputs A0..A(n-1), data inputs DIN0..DIN(b-1), data outputs DOUT0..DOUT(b-1), and control inputs CS and WE]
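The organization arithmetic and the active-low control rules above can be checked with a short sketch. The helper names (`capacity_bits`, `address_lines`, `sram_operation`) are made up for this example; pin levels are modeled as 0 for Low and 1 for High.

```python
K = 1024


def capacity_bits(words: int, bits_per_word: int) -> int:
    return words * bits_per_word


def address_lines(words: int) -> int:
    # Number of address bits needed to select one of `words` locations.
    return (words - 1).bit_length()


def sram_operation(cs_n: int, we_n: int) -> str:
    """cs_n and we_n are pin levels; 0 (Low) means asserted."""
    if cs_n == 1:
        return "idle"  # chip not selected
    return "write" if we_n == 0 else "read"


# Both organizations hold the same 1Mb, with different address widths.
print(capacity_bits(256 * K, 4), address_lines(256 * K))  # 1048576 18
print(capacity_bits(128 * K, 8), address_lines(128 * K))  # 1048576 17
print(sram_operation(cs_n=0, we_n=0))  # write
print(sram_operation(cs_n=0, we_n=1))  # read
```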

SRAM Variations
- Dedicated Din & Dout
  - Trade pin count ($) for higher performance
  - No bidirectional turnaround time required
- Din & Dout often combined to save pins ($)
  - A new control signal, Output Enable (/OE)
[Figure: two 2^n x b RAM blocks with address inputs A0..A(n-1). One has separate DIN0..DIN(b-1) and DOUT0..DOUT(b-1) pins with CS and WE; the other has shared data pins D0..D(b-1) with CS, OE and WE]

Simplified SRAM timing diagram
- Read: Valid address, then /CS (Chip Select) asserted
  - Access Time: Address good to data valid
  - Cycle Time: Minimum time between subsequent memory operations
- Write: Valid address and data with /WE asserted, then /CS asserted
  - Address must be stable for a setup time before /WE and /CS go low
  - And for a hold time after one of the signals goes high

Internal SRAM Organization (16x4)
[Figure: 16 words x 4 bits of SRAM cells. An address decoder (A0-A3) selects one of Word 0 through Word 15. Each of the 4 columns has a write driver fed by Din 3..Din 0 and controlled by WriteEnable, and a sense amp producing Dout 3..Dout 0]

Example: Cypress SRAM
- Note address following mode
- Key SRAM timing parameters
  - tAA, Address access time: time between a valid address being applied and valid data available on data outputs
  - tRC, Read cycle time: Minimum time that one address must be held on the address lines before a second address can be presented
- tAA represents latency
- tRC represents bandwidth (throughput)
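To make the latency versus throughput distinction concrete, here is a rough calculation. The timing values (10 ns for both parameters) and the 8-bit width are hypothetical numbers picked for the example, not figures from the Cypress data sheet.

```python
def read_latency_ns(t_aa_ns: float) -> float:
    # Latency of a single read is the address access time.
    return t_aa_ns


def peak_read_bandwidth_mbit_s(t_rc_ns: float, width_bits: int) -> float:
    # One read can start every t_RC, so throughput is width / t_RC.
    reads_per_second = 1e9 / t_rc_ns
    return reads_per_second * width_bits / 1e6


# Hypothetical part: tAA = 10 ns, tRC = 10 ns, 8 data bits.
print(read_latency_ns(10.0))                # 10.0
print(peak_read_bandwidth_mbit_s(10.0, 8))  # 800.0
```

A wider part or a shorter tRC raises bandwidth without changing the latency of any individual access.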

What happens as number of bits increases?
- Decoder gets larger and slower
- Bit lines increase in length
  - Large distributed RC load
  - Compensate with larger, slower transistors
- Remember
  - Treat output as differential signal
  - Pre-charge both bit lines high
  - Memory cell pulls only one low
  - Sense bit value by comparing sense lines
- Option: Make array shorter and wider!
[Figure: tall, thin array of n bits addressed by a log2(n)-bit address]

Inside a Tall Thin RAM
[Figure: an array of n = k x m bits. A log2(k)-bit row address selects a row; sense amps feed a mux controlled by a log2(m)-bit column address, producing 1 data bit]

Replicate for Desired Width
[Figure: the n = k x m bit array, with log2(k)-bit row address, sense amps, and a mux controlled by the log2(m)-bit column address, gives 1 data bit; replicating it x 4 gives 4 data bits]

Physical SRAM Array Should Be Square
- Example: 16 x 1 SRAM built as a 4 x 4 array
[Figure: 4 x 4 grid of cells, each with IN, SEL, WR and OUT pins; two 2-to-4 decoders driven by address bits A3-A2 and A1-A0; control inputs /WE, /CS and /OE; data input DI; a 4-to-1 mux producing data output DO]
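The address split behind a square array can be sketched in a few lines. This assumes the upper address bits pick the row and the lower bits pick the column; the function name `split_address` and that bit assignment are assumptions for illustration.

```python
def split_address(address: int, rows: int, cols: int) -> tuple[int, int]:
    """Split a bit address for a rows x cols array (both powers of two).

    Assumption: upper bits select the row, lower bits select the column.
    """
    col_bits = (cols - 1).bit_length()
    row = address >> col_bits
    col = address & (cols - 1)
    if not 0 <= row < rows:
        raise ValueError("address out of range")
    return row, col


# 16 x 1 SRAM as a 4 x 4 array: 2 row bits (2-to-4 decoder) and
# 2 column bits (4-to-1 mux) instead of one 4-to-16 decoder.
print(split_address(0b1101, rows=4, cols=4))  # (3, 1)
```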

Synchronous SRAM
- So far we've been talking about SRAMs w/ asynchronous reads, but there are fully synchronous SRAMs
- Faster than asynchronous SRAMs but need to be clocked
- Microprocessor manufacturers implement synchronous SRAMs for internal caches
- FPGA manufacturers embed dedicated synchronous SRAM blocks in their FPGAs
  - Provides Kbs to Mbs of RAM w/o using flip-flops in FPGA fabric
  - Highly configurable (bit width, memory depth, parity/no parity, input/output latches, pipeline registers, etc.)
  - Single cycle access up to speeds near max for FPGA, depending on FPGA family

Memory Subsystems

Memory Organization
- How do we build memory subsystems out of memory devices?

Making the Memory Deeper
- 256K x 8 Memory System: Use four 64K x 8 RAM chips
- 256K locations require 18 address lines
  - 16 shared address lines to array
  - 2 address lines decoded to provide /CS (one per chip)
- Common R/W and tri-state data outputs
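The chip-select decoding can be sketched as below. It assumes the two decoded lines are the upper two of the 18 address lines, which matches the later "High Order Interleaving at Work" slide; the names `decode_deep`, `CHIP_ADDR_BITS` and `N_CHIPS` are illustrative.

```python
CHIP_ADDR_BITS = 16  # 64K locations per chip
N_CHIPS = 4          # four chips give 256K locations


def decode_deep(address: int) -> tuple[list[int], int]:
    """Return (active-low /CS levels, shared 16-bit address)."""
    if not 0 <= address < (N_CHIPS << CHIP_ADDR_BITS):
        raise ValueError("address out of range")
    chip = address >> CHIP_ADDR_BITS                 # assumed: upper 2 lines
    offset = address & ((1 << CHIP_ADDR_BITS) - 1)   # 16 shared lines
    cs_n = [0 if i == chip else 1 for i in range(N_CHIPS)]  # 0 = selected
    return cs_n, offset


print(decode_deep(0x2ABCD))  # ([1, 1, 0, 1], 43981)
```

Because only one /CS is Low at a time, the unselected chips keep their tri-state outputs off the shared data bus.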

Making the Memory Wider
- 64K x 16 Memory System: Use two 64K x 8 RAM chips
- 16 shared address lines
- Shared control signals
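A behavioral sketch of the wider arrangement: both chips see the same address and control signals, and each supplies half of the 16-bit word. The class names and the choice of which chip holds the low byte are assumptions for illustration.

```python
class Ram64Kx8:
    """One 64K x 8 chip, modeled as a byte array."""

    def __init__(self):
        self.mem = bytearray(64 * 1024)

    def read(self, addr: int) -> int:
        return self.mem[addr]

    def write(self, addr: int, value: int) -> None:
        self.mem[addr] = value & 0xFF


class Wide64Kx16:
    """Two 64K x 8 chips side by side, sharing address and control."""

    def __init__(self):
        self.low = Ram64Kx8()   # assumed to hold data bits 7..0
        self.high = Ram64Kx8()  # assumed to hold data bits 15..8

    def write(self, addr: int, word: int) -> None:
        self.low.write(addr, word & 0xFF)
        self.high.write(addr, (word >> 8) & 0xFF)

    def read(self, addr: int) -> int:
        return (self.high.read(addr) << 8) | self.low.read(addr)


wide = Wide64Kx16()
wide.write(0x1234, 0xBEEF)
print(hex(wide.read(0x1234)))  # 0xbeef
```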

Memory Interleaving
- Access pattern without interleaving: the CPU starts the access for D1, and starts the access for D2 only after D1 is available
- Access pattern with 4-way interleaving: the CPU accesses Bank 0, Bank 1, Bank 2 and Bank 3 in turn; after that, we can access Bank 0 again
[Figure: timelines for CPU and Memory without interleaving, and for CPU and Memory Banks 0 to 3 with 4-way interleaving]
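The benefit can be estimated with a simple timing model. The assumptions here are not from the slide: each bank is busy for a fixed `access_time` per access, the CPU can issue at most one request per cycle, and addresses are sequential so requests rotate through the banks.

```python
def total_cycles(n_accesses: int, access_time: int, n_banks: int) -> int:
    """Cycles until the last of n sequential accesses completes."""
    bank_free = [0] * n_banks  # cycle at which each bank is next idle
    issue = 0                  # earliest cycle for the next request
    finish = 0
    for i in range(n_accesses):
        bank = i % n_banks
        start = max(issue, bank_free[bank])  # wait if the bank is busy
        finish = start + access_time
        bank_free[bank] = finish
        issue = start + 1                    # one request per cycle
    return finish


# 8 sequential accesses, 4-cycle bank access time (hypothetical numbers).
print(total_cycles(8, access_time=4, n_banks=1))  # 32
print(total_cycles(8, access_time=4, n_banks=4))  # 11
```

With as many banks as cycles of access time, a new access can start every cycle once the pipeline is full.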

Memory Interleaving (cont'd)
    for (i = 0; i < 16; i++)
        A[i] = A[i] * c + d;
(assume A[0] at address 0)
Resulting reads: read 00000, read 00001, read 00002, read 00003, read 00004, ...
Addresses held by each bank:
    Bank 0: 0  4  8  12
    Bank 1: 1  5  9  13
    Bank 2: 2  6  10 14
    Bank 3: 3  7  11 15
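The bank table for this loop follows directly from taking the address modulo the number of banks. A minimal sketch, with the helper name `bank_and_row` chosen for illustration:

```python
N_BANKS = 4


def bank_and_row(address: int) -> tuple[int, int]:
    # Low-order interleaving: low bits pick the bank, the rest pick the row.
    return address % N_BANKS, address // N_BANKS


banks = [[] for _ in range(N_BANKS)]
for i in range(16):  # addresses touched by A[0] .. A[15]
    bank, _row = bank_and_row(i)
    banks[bank].append(i)

for b, addrs in enumerate(banks):
    print(f"Bank {b}: {addrs}")
```

Consecutive elements of A land in different banks, so the sequential reads issued by the loop can overlap.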

Memory Interleaving (cont'd)
Low Order Address Interleaving

Memory Interleaving (cont'd)
Low Order Address Interleaving w/ Byte Select
[Figure: address fields include Bank Select and Byte Select]
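One way to picture the address fields is the split below. The field widths are assumptions chosen for the example (4 banks, each 4 bytes wide), with the byte select in the lowest bits and the bank select just above it; the slide's figure may use different widths.

```python
BYTE_BITS = 2  # assumed: 4 bytes per bank word
BANK_BITS = 2  # assumed: 4 banks


def split_low_order(address: int) -> tuple[int, int, int]:
    """Split a byte address into (row, bank, byte) fields."""
    byte = address & ((1 << BYTE_BITS) - 1)
    bank = (address >> BYTE_BITS) & ((1 << BANK_BITS) - 1)
    row = address >> (BYTE_BITS + BANK_BITS)
    return row, bank, byte


# 0x1B = 0b1_10_11: row 1, bank 2, byte 3
print(split_low_order(0x1B))  # (1, 2, 3)
```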

Memory Interleaving (cont'd)
High Order Address Interleaving

High Order Interleaving at Work
- 256K x 8 Memory System: Use four 64K x 8 RAM chips
- 256K locations require 18 address lines
  - 16 shared address lines to array
  - 2 address lines decoded to provide /CS (one per chip)
- Common R/W and tri-state data outputs

Memory Interleaving (cont'd)
High Order Address Interleaving
[Figure: address fields include Bank Select and Byte Select]
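The practical difference between the two schemes shows up on sequential accesses. The sketch below assumes an 18-bit address and 4 banks (as in the 256K x 8 example) and ignores any byte-select field; the function names are illustrative.

```python
ADDR_BITS = 18  # 256K locations
BANK_BITS = 2   # 4 banks


def bank_low_order(address: int) -> int:
    # Bank select taken from the lowest address bits.
    return address & ((1 << BANK_BITS) - 1)


def bank_high_order(address: int) -> int:
    # Bank select taken from the highest address bits.
    return address >> (ADDR_BITS - BANK_BITS)


sequential = range(8)
print([bank_low_order(a) for a in sequential])   # [0, 1, 2, 3, 0, 1, 2, 3]
print([bank_high_order(a) for a in sequential])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Under high-order interleaving each bank holds one contiguous 64K region, so sequential accesses stay in a single bank and do not overlap; low-order interleaving spreads them across all banks.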