EEM 486: Computer Architecture
Lecture 9: Memory
The Big Picture
(Figure: a processor with Control and Datapath, connected to Memory, Input, and Output; carried over from "Designing a Multiple Clock Cycle Datapath")
The following slides belong to Prof. Onur Mutlu
Lec 9.2
Main Memory
Main Memory in the System
(Figure: die shot of a four-core chip — Cores 0-3, private L2 Caches 0-3, a Shared L3 Cache, the DRAM Memory Controller, and the DRAM Interface leading off-chip to the DRAM Banks)
4
Ideal Memory
Zero access time (latency)
Infinite capacity
Zero cost
Infinite bandwidth (to support multiple accesses in parallel)
5

The Problem
Ideal memory's requirements oppose each other:
Bigger is slower
  Bigger → takes longer to determine the location
Faster is more expensive
  Memory technology: SRAM vs. DRAM
Higher bandwidth is more expensive
  Need more banks, more ports, higher frequency, or faster technology
6
The Memory Chip/System Abstraction 7 Main Memory Overview 8
Memory Bank Organization and Operation
Read access sequence:
1. Decode row address & drive word-lines
2. Selected bits drive bit-lines (entire row read)
3. Amplify row data
4. Decode column address & select subset of row (send to output)
5. Precharge bit-lines (for next access)
9

Memory Technology: DRAM
Dynamic random access memory
Capacitor charge state indicates the stored value
  Whether the capacitor is charged or discharged indicates 1 or 0
  1 capacitor, 1 access transistor
The capacitor leaks through the RC path
  The DRAM cell loses charge over time, so the cell needs to be refreshed
  Refresh: a DRAM controller must periodically read all rows within the allowed refresh time (tens of ms) so that charge is restored in the cells
(Figure: DRAM cell — an access transistor gated by the row enable signal connects the capacitor to the _bitline)
10
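As a quick sketch of the refresh-rate arithmetic implied above — the 64 ms window and 8192-rows-per-bank figures below are typical DDR3-era values assumed for illustration, not taken from the slide:

```python
# Assumed, typical values: 64 ms refresh window, 8192 rows per bank.
REFRESH_WINDOW_MS = 64
ROWS_PER_BANK = 8192

# To refresh every row once per window, the controller must refresh
# one row roughly every trefi_us microseconds on average.
trefi_us = REFRESH_WINDOW_MS * 1000 / ROWS_PER_BANK
print(f"refresh one row every {trefi_us:.4f} us")  # 7.8125 us
```

This matches the common DDR3 refresh interval of about 7.8 µs between per-row refresh commands.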
Memory Technology: SRAM
Static random access memory
Two cross-coupled inverters store a single bit
Feedback path enables the stored value to persist in the cell
4 transistors for storage, 2 transistors for access
(Figure: 6T SRAM cell — row select gates the cell onto bitline and _bitline)
11

DRAM vs. SRAM
DRAM:
  Slower access (capacitor)
  Higher density (1T-1C cell)
  Lower cost
  Requires refresh (power, performance, circuitry)
  Manufacturing requires putting capacitor and logic together
SRAM:
  Faster access (no capacitor)
  Lower density (6T cell)
  Higher cost
  No need for refresh
  Manufacturing compatible with logic process (no capacitor)
12
The Problem
Bigger is slower:
  SRAM, 512 bytes, sub-nanosecond
  SRAM, KByte-MByte, ~nanosecond
  DRAM, gigabyte, ~50 nanoseconds
  Hard disk, terabyte, ~10 milliseconds
Faster is more expensive (dollars and chip area):
  SRAM, < $10 per megabyte
  DRAM, < $1 per megabyte
  Hard disk, < $1 per gigabyte
  These sample values scale with time
13

DRAM: Memory Access Protocol
(Figure: a 2^n-row × 2^m-column bit-cell array, addressed by n row bits and m column bits; n ≈ m minimizes overall latency; 2^m sense amps and a mux narrow the row to 1 output)
A DRAM die is comprised of multiple such arrays
Five basic commands: Activate, Read, Write, Precharge, Refresh
To reduce pin count, row and column share the same address pins
  RAS: Row address strobe
  CAS: Column address strobe
14
DRAM: Basic Operation
Access address sequence: (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0)
Resulting commands: Activate 0; Read 0; Read 1 (HIT); Read 85 (HIT); Precharge; Activate 1 (CONFLICT); Read 0
(Figure: the row decoder selects Row 0 or Row 1 into the Row Buffer; the column mux selects column 0, 1, or 85 onto Data)
15

DRAM: Basic Operation
A DRAM bank is a 2D array of cells: rows × columns
A DRAM row is also called a DRAM page
Sense amplifiers are also called the row buffer
Each address is a <row, column> pair
Access to a closed row:
  Activate command opens the row (placed into the row buffer)
  Read/write command reads/writes a column in the row buffer
  Precharge command closes the row and prepares the bank for the next access
Access to an open row:
  No need for an activate command
  Read/write command reads/writes a column in the row buffer
16
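The command stream on this slide can be reproduced with a small sketch of an open-row (open-page) policy; the function name and command strings here are illustrative, not a real controller interface:

```python
def dram_commands(accesses):
    """Translate a list of (row, column) accesses into DRAM commands,
    assuming an open-row policy: the activated row stays in the row
    buffer until an access to a different row forces a precharge."""
    commands = []
    open_row = None  # row currently in the row buffer (None = bank closed)
    for row, col in accesses:
        if open_row is None:               # closed bank: activate first
            commands.append(f"ACTIVATE {row}")
        elif open_row != row:              # row-buffer conflict
            commands.append("PRECHARGE")
            commands.append(f"ACTIVATE {row}")
        # else: row-buffer hit, no extra command needed
        open_row = row
        commands.append(f"READ {col}")
    return commands

# The slide's access sequence:
print(dram_commands([(0, 0), (0, 1), (0, 85), (1, 0)]))
# ['ACTIVATE 0', 'READ 0', 'READ 1', 'READ 85', 'PRECHARGE', 'ACTIVATE 1', 'READ 0']
```

The two reads after the first are row-buffer hits; the final access to row 1 is the conflict that costs a precharge plus an activate.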
The DRAM Chip Consists of multiple banks (2-16 in Synchronous DRAM) Banks share command/address/data buses The chip itself has a narrow interface (4-16 bits per read) 17 DRAM: Banks 18
128M x 8-bit DRAM Chip 19 The DRAM Bank Structure 20
DDR3 SDRAM
Introduced in 2007
SDRAM = Synchronous DRAM = clocked
DDR = Double Data Rate: data transferred on both clock edges
  e.g., 400 MHz clock = 800 MT/s
x4, x8, x16 datapath widths
Minimum burst length of 8
8 banks
1 Gb, 2 Gb, 4 Gb capacities common
Relative to SDR/DDR/DDR2: + bandwidth, ~ latency
21

Main Memory Overview
22
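The transfer-rate arithmetic above can be checked quickly; the 8-byte bus width assumed here is that of a 64-bit DIMM (covered on later slides), not of a single chip:

```python
# DDR3-800: 400 MHz I/O clock, data transferred on both clock edges.
clock_hz = 400e6
transfers_per_s = clock_hz * 2          # double data rate
bus_bytes = 8                            # assumed 64-bit DIMM data bus
peak_bytes_per_s = transfers_per_s * bus_bytes
print(f"{transfers_per_s/1e6:.0f} MT/s -> {peak_bytes_per_s/1e9:.1f} GB/s peak")
# 800 MT/s -> 6.4 GB/s peak
```

This is a peak figure; sustained bandwidth is lower because of activate/precharge and refresh overheads.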
DRAM Modules
DRAM chips have a narrow interface (typically x4, x8, x16)
Multiple chips are put together to form a wide interface
DIMM: Dual Inline Memory Module
  To get a 64-bit DIMM, we access 8 chips with 8-bit interfaces
  Chips share command/address lines, but not data lines
Advantages:
  Acts like a high-capacity DRAM chip with a wide interface
  8x capacity, 8x bandwidth, same latency
Disadvantages:
  Granularity: accesses cannot be smaller than the interface width
  8x power
23

A 64-bit Wide DIMM (physical view)
(Figure: eight DRAM chips side by side; all share the Command lines, and each drives 8 bits of the 64-bit Data bus)
24
A 64-bit Wide DIMM (logical view)
25

DRAM Ranks
A DIMM may include multiple ranks
  e.g., a 64-bit DIMM using 8 chips with x16 interfaces has 2 ranks
Each 64-bit group of chips is called a rank
All chips in a rank respond to a single command
Different ranks share command/address/data lines
  Select between ranks with the Chip Select signal
Ranks provide more banks across multiple chips (but don't confuse rank and bank!)
26
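The rank-count arithmetic can be written out as a tiny sketch (the function name is mine; the x16 case is the slide's example):

```python
def ranks_on_dimm(num_chips, chip_width_bits, bus_width_bits=64):
    """Number of ranks on a DIMM: each rank is a group of chips whose
    data widths together cover the full data bus."""
    chips_per_rank = bus_width_bits // chip_width_bits
    assert num_chips % chips_per_rank == 0, "chips must form whole ranks"
    return num_chips // chips_per_rank

print(ranks_on_dimm(8, 8))    # eight x8 chips  -> 1 rank
print(ranks_on_dimm(8, 16))   # eight x16 chips -> 2 ranks (slide example)
```

With x16 chips, only 4 chips are needed per 64-bit rank, so 8 chips form 2 ranks selected by Chip Select.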
The DRAM Subsystem: The Top-Down View
DRAM Subsystem Organization: Channel → DIMM → Rank → Chip → Bank → Row/Column
28
The DRAM subsystem
Channel
DIMM (Dual in-line memory module)
(Figure: a processor with two memory channels, each populated with DIMMs)

DRAM Channels
Channel: a set of DIMMs in series
  All DIMMs get the same command; one of the ranks replies
System options:
  Single-channel system
  Multiple dependent (lock-step) channels
    Single controller with a wider interface (faster cache line refill!)
    Sometimes called "gang mode"
    Only works if the DIMMs are identical (organization, timing)
  Multiple independent channels
Tradeoffs:
  Requires multiple controllers
  Cost: pins, wires, controllers
  Benefit: higher bandwidth, capacity, flexibility
30
DRAM Channel Options
(Figure: lock-step channels — one CPU, one MC per channel — vs. independent channels; and an old-school multi-CPU topology with CPUs on a front-side bus behind an external memory controller)
Old school (front-side bus):
  The external memory controller adds latency
  Capacity does not grow with # of CPUs
NUMA Topology (modern)
(Figure: two CPUs, each with its own memory controllers (MC), linked by QPI)
Capacity grows with # of CPUs
NUMA: Non-uniform Memory Access

Breaking down a DIMM
DIMM (Dual in-line memory module)
(Figure: side view, front of DIMM, back of DIMM)
Breaking down a DIMM
DIMM (Dual in-line memory module)
(Figure: side view, front of DIMM, back of DIMM)
Rank 0: collection of 8 chips (front of DIMM); Rank 1 (back of DIMM)
(Figure: Rank 0 and Rank 1 each drive Data <0:63>; the memory channel carries Addr/Cmd, CS <0:1>, and Data <0:63>)
Breaking down a Rank
(Figure: Rank 0 = Chips 0, 1, ..., 7; each chip drives 8 bits of Data <0:63>, e.g., Chip 1 drives <8:15> and Chip 7 drives <56:63>)

Breaking down a Chip
(Figure: Chip 0 contains Bank 0, ...)
Breaking down a Bank
(Figure: Bank 0 is an array of 16K rows (row 0 to row 16K-1), each row 2 kB wide; a column is 1 B; the Row buffer holds one 2 kB row, accessed 1 B at a time)

DRAM Subsystem Organization: Channel → DIMM → Rank → Chip → Bank → Row/Column
40
Example: Transferring a cache block
(Figure: physical memory space from 0x00 to 0xFFFF...F; the 64 B cache block at address 0x40 is mapped to Channel 0, DIMM 0, Rank 0)

Example: Transferring a cache block
(Figure: the block maps to Rank 0, Chips 0-7; each chip drives 8 bits of Data <0:63>, e.g., Chip 1 drives <8:15> and Chip 7 drives <56:63>)
Example: Transferring a cache block
(Figure: Row 0, Column 0 is activated in Rank 0, Chips 0-7)
Reading Row 0, Column 0 delivers the first 8 B of the 64 B cache block: each chip contributes 1 B onto Data <0:63>
Example: Transferring a cache block
(Figure: Row 0, Column 1 in Rank 0, Chips 0-7)
Reading Row 0, Column 1 delivers the next 8 B of the 64 B cache block onto Data <0:63>
Example: Transferring a cache block
(Figure: Row 0, Column 1 in Rank 0, Chips 0-7; another 8 B of the 64 B block on Data <0:63>)
A 64 B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.

Address Mapping (Single Channel)
Page/row interleaving: consecutive rows of memory in consecutive banks
(Figure: Pages 0-3 in Banks 0-3, then Pages 4-7 wrap back to Banks 0-3)
Address format: | page index (r bits) | bank (k bits) | page offset (p bits) |
48
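The transfer arithmetic above can be spelled out in a short sketch, assuming the organization from the preceding slides (64-bit channel, one rank of eight x8 chips):

```python
BLOCK_BYTES = 64   # cache block size
BUS_BYTES = 8      # a 64-bit channel moves 8 B per I/O cycle
CHIPS = 8          # one rank of x8 chips: each supplies 1 B per cycle

beats = BLOCK_BYTES // BUS_BYTES       # I/O cycles to move one block
columns_read = list(range(beats))      # column accessed in each cycle
bytes_per_chip = BUS_BYTES // CHIPS    # each chip's share of a beat
print(beats, columns_read, bytes_per_chip)
# 8 cycles, columns 0..7, 1 byte per chip per cycle
```

In each cycle, every chip in the rank reads the same column of the open row, and their 8 one-byte outputs form one 8 B beat of the block.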
Address Mapping (Single Channel)
Single-channel system with an 8-byte memory bus
2 GB memory; 8 banks; 16K rows & 2K columns per bank
  | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |
49

Address Mapping (Single Channel)
Cache block interleaving: consecutive cache block addresses in consecutive banks
(Figure: cache lines 0-3 in Banks 0-3, then cache lines 4-7 wrap back to Banks 0-3)
Address format: | page index (r bits) | page offset, high part (p-b bits) | bank (k bits) | page offset, low part (b bits) |
50
Address Mapping (Single Channel)
Single-channel system with an 8-byte memory bus
2 GB memory; 8 banks; 16K rows & 2K columns per bank
Row interleaving: consecutive rows of memory in consecutive banks
  | Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits) |
Cache block interleaving: consecutive cache block addresses in consecutive banks (64-byte cache blocks)
  | Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Column (3 bits) | Byte in bus (3 bits) |
51
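Both mappings can be sketched as bit-field extractions using the exact field widths from the slide (the function names are mine, for illustration only):

```python
def decode_row_interleaved(addr):
    """Row interleaving: | Row (14) | Bank (3) | Column (11) | Byte (3) |"""
    byte = addr & 0x7
    col  = (addr >> 3)  & 0x7FF
    bank = (addr >> 14) & 0x7
    row  = (addr >> 17) & 0x3FFF
    return row, bank, col, byte

def decode_block_interleaved(addr):
    """Cache block interleaving (64 B blocks):
       | Row (14) | High Col (8) | Bank (3) | Low Col (3) | Byte (3) |"""
    byte     = addr & 0x7
    low_col  = (addr >> 3)  & 0x7
    bank     = (addr >> 6)  & 0x7
    high_col = (addr >> 9)  & 0xFF
    row      = (addr >> 17) & 0x3FFF
    return row, bank, (high_col << 3) | low_col, byte

# Consecutive 64 B blocks hit consecutive banks only under block interleaving:
blocks = [b * 64 for b in range(4)]
print([decode_row_interleaved(a)[1]   for a in blocks])  # [0, 0, 0, 0]
print([decode_block_interleaved(a)[1] for a in blocks])  # [0, 1, 2, 3]
```

This shows the tradeoff directly: block interleaving spreads adjacent cache blocks across banks (bank-level parallelism for streaming), while row interleaving keeps a whole page in one bank (more row-buffer hits within a page).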