Memories: Memory Technology
Z. Jerry Shi
Assistant Professor of Computer Science and Engineering
University of Connecticut
Slides adapted from Blumrich & Gschwind/ELE475 '03, Peh/ELE475

Memory Hierarchy
Outline
  Survey of various types of memory technologies
    SRAMs (caches, register files)
    DRAMs (main memory)
    Variants of DRAMs
  Latency vs. throughput again

Main Memory Background
  Random Access Memory (vs. Serial Access Memory)
  Different flavors at different levels
    Physical makeup (CMOS, DRAM)
    Low-level architectures (FPM, EDO, BEDO, SDRAM)
  Cache uses SRAM: Static Random Access Memory
    Fast: 8-16 times as fast as DRAM (also 8-16 times as costly...)
    Small: 1/4-1/8 the capacity of DRAM
    No refresh needed, but volatile (contents are lost on power loss)
  Main memory is DRAM: Dynamic Random Access Memory
    Slow and big (relative to SRAM)
    Dynamic: needs to be refreshed periodically (every 8 ms, 1% of the time)
    Addresses divided into 2 halves (memory as a 2D matrix):
      RAS or Row Access Strobe
      CAS or Column Access Strobe
Basic Set-Associative Cache Structure (from CACTI)
  [Figure: set-associative cache structure]

Typical SRAM Organization: 16-word x 4-bit
  [Figure: 16-word x 4-bit array. Inputs Din 0 to Din 3 feed one write driver and precharger per column (controlled by Precharge and WrEn); an address decoder on A0-A3 selects Word 0 through Word 15; one sense amp per column drives Dout 0 to Dout 3.]
Basic Static RAM Cell
  6-transistor SRAM cell
  [Figure: 6-transistor cell with a word (row select) line and complementary bit lines, bit and bitbar; annotation: "replaced with pullup to save area"]
  Write:
    1. Drive bit lines (bit=1, bitbar=0)
    2. Select row
  Read:
    1. Precharge bit and bitbar to Vdd
    2. Select row
    3. Cell pulls one line low
    4. Sense amp on column detects difference between bit and bitbar

Multi-ported SRAMs
  p = total number of ports
  w = register cell width without ports
  h = register cell height without ports
  So each cell is (w+p)(h+p) in area
  How many ports are needed per register per functional unit?
    2 reads, 1 write
    External port to cache = x
  How large is the register file given N functional units?
    Number of registers scales with N
    Size of each register scales with the square of the total number of ports, p = (3+x)N
    So the area of the register file scales with N^3
  [Rixner et al., Register organization for media processing, in HPCA 2000]
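The N^3 scaling argument above can be checked numerically. This is only a sketch under stated assumptions: the function name, the default cell dimensions (w = h = 10, measured in units of one port's wire pitch), the 8 registers per functional unit, and the 32-bit register width are all illustrative choices, not values from the slides or from Rixner et al.

```python
def register_file_area(n_units, x=1, w=10, h=10, regs_per_unit=8, bits=32):
    """Relative register file area for n_units functional units.

    Assumed (hypothetical) parameters: x external ports per unit,
    w and h in units of one port's wire pitch, regs_per_unit and bits
    chosen only for illustration.
    """
    ports = (3 + x) * n_units              # total ports: p = (3 + x) * N
    cell_area = (w + ports) * (h + ports)  # each cell is (w + p)(h + p)
    n_registers = regs_per_unit * n_units  # register count scales with N
    return n_registers * bits * cell_area


# Doubling N multiplies the area by a factor that approaches 2^3 = 8
# once the port wiring dominates the base cell size.
for n in (1, 10, 100, 1000):
    ratio = register_file_area(2 * n) / register_file_area(n)
    print(f"N = {n:5d} -> area(2N) / area(N) = {ratio:.2f}")
```

For small N the base cell dimensions w and h still matter, so the ratio is below 8; it approaches 8 as N grows, which is the cubic behavior the slide describes.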
Example: Toshiba
  [Figure]

Problems with SRAM
  [Figure: 6-transistor cell with Select = 1, showing the on/off states of transistors P1, P2, N1, N2 and the values on the two bit lines]
  Six transistors use up a lot of area
  Consider when a zero is stored in the cell:
    Transistor N1 will try to pull bit to 0
    Transistor P2 will try to pull bit bar to 1
1-Transistor Memory Cell (DRAM)
  [Figure: one-transistor cell with a row select line and a bit line]
  Write:
    1. Drive bit line
    2. Select row
  Read:
    1. Precharge bit line to Vdd/2
    2. Select row
    3. Cell and bit line share charges
       Very small voltage changes on the bit line
    4. Sense (fancy sense amp)
       Can detect changes of ~10-100k electrons
       Amplifies and recharges cell
    5. Write: restore the value
  Refresh
    Basically a dummy read to an entire row

DRAM logical organization (4 Mbit)
  Square root of bits per RAS/CAS
  [Figure: 11 address pins (A0-A10) feed an address buffer, which drives the row decoder and the column decoder of a 2,048 x 2,048 memory array; each word line selects a row of storage cells, and the sense amps & I/O sit on the columns.]
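The 4 Mbit organization above can be sketched as an address split. The 11-bit row and column widths come from the 2,048 x 2,048 array on the slide; the function name and the choice of the upper half of the address as the row are assumptions made for illustration.

```python
import math

TOTAL_BITS = 4 * 2**20               # 4 Mbit part
SIDE = math.isqrt(TOTAL_BITS)        # square array: 2,048 rows x 2,048 columns
ROW_BITS = COL_BITS = SIDE.bit_length() - 1   # 11 address pins, used twice


def split_address(bit_addr):
    """Split a 22-bit cell address into (row, column).

    Assumption: the high 11 bits are sent with RAS and the low 11 bits
    with CAS. A real controller may map the bits differently.
    """
    if not 0 <= bit_addr < TOTAL_BITS:
        raise ValueError("address out of range for a 4 Mbit array")
    row = bit_addr >> COL_BITS
    col = bit_addr & ((1 << COL_BITS) - 1)
    return row, col


print(SIDE, ROW_BITS)            # array side and pins per half-address
print(split_address(2048))       # first cell of the second row
```

Multiplexing the two halves over the same pins is what lets a 4 Mbit part get by with 11 address pins instead of 22.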
Logic Diagram of a Typical DRAM
  [Figure: 256K x 8 DRAM with control inputs RAS_L, CAS_L, WE_L, OE_L, a 9-bit address bus A, and an 8-bit data bus D]
  Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
  Din and Dout are combined (D is bidirectional):
    WE_L is asserted (low), OE_L is deasserted (high): D serves as the data input pin
    WE_L is deasserted (high), OE_L is asserted (low): D is the data output pin
  Row and column addresses share the same pins (A)
    RAS_L goes low: pins A are latched in as row address
    CAS_L goes low: pins A are latched in as column address
    RAS/CAS are edge-sensitive

Cycle Time vs. Access Time: Latency vs. Throughput again
  [Figure: timeline showing the access time as the first part of the longer cycle time]
  DRAM (read/write) cycle time >> DRAM (read/write) access time
  DRAM (read/write) cycle time: how frequently can you initiate an access?
  DRAM (read/write) access time: how quickly will you get what you want once you initiate an access?
  DRAM bandwidth limitation: how much data can you get from the memory?
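The distinction between access time (latency) and cycle time (throughput) can be made concrete with a small calculation. The 60 ns access time and 110 ns cycle time below are illustrative values, and the function names are hypothetical; the 8-bit width matches the 256K x 8 part in the logic diagram.

```python
def accesses_per_second(cycle_time_ns):
    """Throughput limit: a new access can start only once per cycle time."""
    return 1e9 / cycle_time_ns


def peak_bandwidth_bytes_per_s(width_bits, cycle_time_ns):
    """Peak bandwidth of one device: data width delivered once per cycle time."""
    return (width_bits / 8) * accesses_per_second(cycle_time_ns)


ACCESS_NS = 60    # illustrative: delay until the requested data is available
CYCLE_NS = 110    # illustrative: minimum spacing between two accesses

print(f"latency of one access: {ACCESS_NS} ns")
print(f"accesses per second:   {accesses_per_second(CYCLE_NS):,.0f}")
print(f"peak bandwidth (x8):   {peak_bandwidth_bytes_per_s(8, CYCLE_NS) / 1e6:.2f} MB/s")
```

Note that the bandwidth is set by the cycle time, not the access time: shortening only the access time would return each item sooner without raising the number of bytes delivered per second.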
Main Memory Organizations
  Simple: CPU, cache, bus, memory same width (32 or 64 bits)
  Wide: CPU/mux 1 word; mux/cache, bus, memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
  Banked & interleaved: CPU, cache, bus 1 word; memory N modules (4 modules); example is word interleaved

Increasing Bandwidth - Interleaving
  Access pattern without interleaving:
    [Figure: CPU and a single memory; timeline marks: start access for D1, D1 available, start access for D2]
  Access pattern with 4-way interleaving:
    [Figure: CPU and memory banks 0-3; timeline marks: access bank 0, access bank 1, access bank 2, access bank 3, then bank 0 can be accessed again]
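Word interleaving across 4 modules can be sketched as a simple address mapping. The function name and the modulo mapping (consecutive word addresses go to consecutive banks) are assumptions consistent with the "word interleaved" example, not a description of any particular memory controller.

```python
N_BANKS = 4  # 4 modules, as in the banked example


def bank_and_index(word_addr, n_banks=N_BANKS):
    """Map a word address to (bank, index within that bank).

    Assumption: low-order interleaving, so consecutive words land in
    consecutive banks and a sequential stream visits each bank in turn.
    """
    return word_addr % n_banks, word_addr // n_banks


for addr in range(8):
    bank, index = bank_and_index(addr)
    print(f"word {addr} -> bank {bank}, index {index}")
```

With this mapping a sequential stream returns to bank 0 only after banks 1 to 3 have each been started, which is what lets bank 0 finish its cycle in the meantime.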
Main Memory Performance
  Timing model (word size is 32 bits)
    4 to send address, 56 access time per word, 4 send time per word
    Cache block is 4 words
  Simple M.P. (miss penalty) = 4 x (4 + 56 + 4) = 256
  Wide M.P. = 4 + 56 + 4 = 64 (4-word wide)
  Interleaved M.P. = 4 + 56 + 4 x 4 = 76
    4-way interleaved memory, optimized for sequential accesses

DRAM Performance
  A 60 ns (tRAC) DRAM can
    perform a row access only every 110 ns (tRC)
    perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
      In practice, external address delays and turning around buses make it 40 to 50 ns
  These times do not include the time to drive the addresses off the microprocessor or the memory controller overhead!
  Can it be made faster?
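The three miss penalties in the timing model (4 to send the address, 56 access time per word, 4 to send a word, 4-word cache block) can be reproduced with a few lines. The function names are hypothetical, and the time unit is left unspecified as it is on the slide.

```python
SEND_ADDR = 4     # time to send the address
ACCESS = 56       # access time per word
SEND_WORD = 4     # time to send one word back
BLOCK_WORDS = 4   # cache block size in words


def simple_miss_penalty(words=BLOCK_WORDS):
    # One-word-wide memory: every word pays address + access + transfer.
    return words * (SEND_ADDR + ACCESS + SEND_WORD)


def wide_miss_penalty():
    # Block-wide memory and bus: the whole block moves in one access.
    return SEND_ADDR + ACCESS + SEND_WORD


def interleaved_miss_penalty(words=BLOCK_WORDS):
    # Banks are accessed in parallel; only the word transfers are serialized.
    return SEND_ADDR + ACCESS + words * SEND_WORD


print(simple_miss_penalty(), wide_miss_penalty(), interleaved_miss_penalty())
```

Interleaving recovers most of the benefit of the wide organization (76 vs. 64) while keeping a one-word bus, because the 56-unit access time is overlapped across the four banks.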
Improvements on DRAM
  Fast page mode
    Send row address once for multiple column addresses
  Extended Data Out (EDO)
    Keep data available even after CAS_L is high
  Synchronous DRAM (SDRAM)
  Double Data Rate SDRAM (DDR SDRAM)
    PC2100: 2.1 Gbytes/second = 8 bytes x 133 MHz x 2
  Direct Rambus DRAM (DRDRAM)
    16-bit internal bus clocked at 400 MHz (1.6 Gbytes/second)

Fast Page Mode DRAM
  Page: all bits on the same row (spatial locality)
  Don't need to wait for wordline to recharge
  Toggle CAS with new column address
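The two peak-bandwidth figures quoted above follow from bus width x clock rate x transfers per clock. The helper below is a hypothetical name; the factor of 2 for DRDRAM assumes data is transferred on both clock edges, which is what the quoted 1.6 Gbytes/second implies for a 16-bit bus at 400 MHz.

```python
def peak_bandwidth(bus_bytes, clock_hz, transfers_per_clock):
    """Peak transfer rate in bytes per second."""
    return bus_bytes * clock_hz * transfers_per_clock


# PC2100 DDR SDRAM: 8-byte (64-bit) bus, 133 MHz clock, both clock edges.
pc2100 = peak_bandwidth(8, 133_000_000, 2)

# Direct Rambus: 2-byte (16-bit) bus, 400 MHz clock, both clock edges (assumed).
drdram = peak_bandwidth(2, 400_000_000, 2)

print(f"PC2100: {pc2100 / 1e9:.2f} GB/s")
print(f"DRDRAM: {drdram / 1e9:.2f} GB/s")
```

These are peak rates only; sustained bandwidth is lower once row activations, bank conflicts, and bus turnaround are taken into account.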
Extended Data Out (EDO)
  Add a latch between sense amps and output pins

EDO DRAM
  Last accessed row data still available in latch, so precharge can be started sooner
  Variant: Burst EDO (BEDO)
    Read or write cycles batched in bursts of 4
    Address is incremented internally as CAS toggles
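The internal address increment in a BEDO burst can be sketched as follows. The function name is hypothetical, the 2,048-column row is borrowed from the 4 Mbit example, and wrapping at the end of the row is an assumption made to keep the sketch well defined; actual parts define their own burst ordering.

```python
def burst_columns(start_col, burst_len=4, n_cols=2048):
    """Column addresses touched by one burst.

    The controller supplies only start_col; the remaining addresses are
    generated inside the DRAM, one per CAS toggle. Wrapping modulo n_cols
    is an assumption, not taken from the slides.
    """
    return [(start_col + i) % n_cols for i in range(burst_len)]


print(burst_columns(100))
print(burst_columns(2046))
```

Because only the first column address crosses the bus, the remaining three transfers in the burst avoid the external address delay.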
Synchronous DRAM
  Has a clock input
  Data output is in bursts, with each element clocked
  In the past: read the whole row, then select a small number of useful bits
  SDRAM: don't throw away the bits; use an arbitrary number of bits from each row
    Register holds how many bytes per request, up to an entire row

Synchronous DDR DRAMs
  Double Data Rate: data is driven and received on both the rising and falling edges of the clock
  This DDR signalling technique is used in both DDR DRAMs and Rambus DRDRAMs
SDRAM and Direct RDRAM (Rambus)
  DRDRAM:
    Regular interconnect: high-frequency bus
    Three-component bus: 1 data, 2 address
    Memory controller can request the components of a large block in any order and can schedule accesses

SDRAM and Direct RDRAM
  [Figure: SDRAM and Direct RDRAM]
DRDRAM Performance Comparison
  Bandwidth impairment for SDRAM, DDR SDRAM:
    1. Bank conflict (banks share sense amps)
       25% probability of sequential memory accesses hitting the same bank
    2. Constraints on address command bus
    3. Two-cycle addressing problem
       Caused by uneven capacitive loading of command/data bus
  RDRAM:
    1. 32 banks
       Reduces bank contention
    2. Command/data channel uniformly routed to each device (equal load)
    3. Row/column addresses on separate buses (can be sent in the same cycle)
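The bank-conflict figures can be related to the bank count with a short enumeration. This sketch assumes two back-to-back accesses that each pick a bank uniformly and independently; under that assumption the 25% figure corresponds to 4 banks, which is an inference and not stated on the slide. The function name is hypothetical.

```python
from fractions import Fraction


def same_bank_probability(n_banks):
    """Probability that two independent, uniformly distributed accesses
    land in the same bank, found by enumerating every (first, second) pair."""
    pairs = [(a, b) for a in range(n_banks) for b in range(n_banks)]
    hits = sum(1 for a, b in pairs if a == b)
    return Fraction(hits, len(pairs))


for banks in (4, 32):
    p = same_bank_probability(banks)
    print(f"{banks:2d} banks: {p} = {float(p):.1%}")
```

Real access streams are not uniformly random, so these values are only a first-order estimate of how much the 32 banks of RDRAM reduce contention.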
Timing Diagram: SDRAM Bank Conflicts
  [Figure: SDRAM timing diagram with bank conflicts]

DRDRAM Timing
  [Figure: DRDRAM timing diagram]
Independent Memory Banks
  Parallel access instead of sequential access (multi-issue vs. pipelined)
  Multiple controllers, arrays
  Scheduling accesses to multiple banks (Rixner et al., ISCA 2000)
    (bank, row, column)
Some numbers...

DRAM History
  DRAMs: capacity +40-60%/yr, cost (1 MB) -40%/yr
    2.5x cells/area, 1.5x die size in 3 years
  1998 DRAM fab line costs $2B
    DRAM only: density, leakage vs. speed
  Rely on increasing no. of computers & memory per computer (60% market)
    SIMM or DIMM is replaceable unit => computers use any generation DRAM
  Commodity, second-source industry => high volume, low profit, conservative
    Little organization innovation in 20 years
    Don't want to be chip foundries (bad for RDRAM)
  Order of importance: 1) cost/bit 2) capacity
    First RAMBUS: 10X BW, +30% cost => little initial impact