Caches
Jin-Soo Kim (jinsookim@skku.edu)
Computer Systems Laboratory, Sungkyunkwan University
http://csl.skku.edu
Memory Technology
- Static RAM (SRAM): 0.5-2.5 ns, $2000-$5000 per GB
- Dynamic RAM (DRAM): 50-70 ns, $20-$75 per GB
- Magnetic disk: 5-20 ms, $0.20-$2 per GB
- Ideal memory: the access time of SRAM with the capacity and cost/GB of disk
DRAM Organization (1)
[Figure: Micron MT4LC16M4T8 (16M x 4-bit) DRAM chip organization]
DRAM Organization (2)
DRAM configuration:
- Asynchronous: no clock
- Large capacity: 1-4 Gbit
- Arranged as a 2D matrix: minimizes wire length and maximizes refresh efficiency
- Narrow data interface: 1-16 bits (x1, x4, x8, x16); few bus pins keep packages cheap, since pins are expensive
- Narrow address interface: multiplexed address lines carry the row and column address, signaled by RAS# and CAS# respectively
DRAM Operation (1)-(3)
Read operation, steps (1)-(3)
[Figures: the read operation proceeds through the word line and bit line]
DRAM Operation (4)
[Figure: DRAM read cycle timing]
DRAM Operation (5)
Timing parameters:
- tRC: minimum time from the start of one row access to the start of the next (cycle time)
- tRAC: minimum time from RAS# falling to valid data output (access time); this used to be quoted as the nominal speed of a DRAM chip
- tCAC: minimum time from CAS# falling to valid data output

  Model          | tRC    | tRAC  | tCAC
  ---------------+--------+-------+------
  MT4LC16M4T8-5  | 90 ns  | 50 ns | 13 ns
  MT4LC16M4T8-6  | 110 ns | 60 ns | 15 ns
DRAM Generations

  Year | Capacity | $/GB
  -----+----------+-----------
  1980 | 64 Kbit  | $1,500,000
  1983 | 256 Kbit | $500,000
  1985 | 1 Mbit   | $200,000
  1989 | 4 Mbit   | $50,000
  1992 | 16 Mbit  | $15,000
  1996 | 64 Mbit  | $10,000
  1998 | 128 Mbit | $4,000
  2000 | 256 Mbit | $1,000
  2004 | 512 Mbit | $250
  2007 | 1 Gbit   | $50

[Figure: tRAC and tCAC (ns) across generations, 1980-2007]
Principle of Locality
Programs access only a small portion of their address space at any time.
- Temporal locality: items accessed recently are likely to be accessed again soon (e.g., instructions in a loop, induction variables)
- Spatial locality: items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access, array data)
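To make the two kinds of locality concrete, here is a minimal C sketch (illustrative only; the array name and size are arbitrary): the loop variables exhibit temporal locality, while the sequential array walk exhibits spatial locality.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    int sum = 0;                   /* "sum" and "i" show temporal locality:   */
    for (int i = 0; i < N; i++)    /* they are reused on every iteration      */
        a[i] = i;
    for (int i = 0; i < N; i++)    /* a[0], a[1], ... show spatial locality:  */
        sum += a[i];               /* consecutive addresses, so a miss on one */
                                   /* element brings its neighbors in too     */
    printf("%d\n", sum);
    return 0;
}
```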
Memory Hierarchy (1)
"We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
-- A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, June 1946.

Taking advantage of locality:
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to a smaller DRAM memory (main memory)
- Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory (cache memory)
Memory Hierarchy (2)
Smaller, faster, and costlier (per byte) at the top; larger, slower, and cheaper (per byte) at the bottom:
- L0: CPU registers, which hold words retrieved from the L1 cache
- L1: on-chip L1 cache (SRAM), which holds cache lines retrieved from the L2 cache
- L2: on-chip L2 cache (SRAM), which holds cache lines retrieved from main memory
- L3: main memory (DRAM), which holds disk blocks retrieved from local disks
- L4: local secondary storage (local magnetic disks), which holds files retrieved from disks on remote network servers
- L5: remote secondary storage (distributed file systems, Web servers)
Memory Hierarchy (3)
Terminology:
- Block (aka line): the unit of copying; may be multiple words
- Hit: the accessed data is present in the upper level, so the access is satisfied by the upper level
  - Hit ratio = hits / accesses
- Miss: the accessed data is absent, so the block is copied from the lower level
  - Time taken: miss penalty
  - Miss ratio = misses / accesses = 1 - hit ratio
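A small worked instance of these definitions (a sketch with hypothetical counts, not measurements): with 950 hits out of 1000 accesses, the hit ratio is 0.95 and the miss ratio is 1 - 0.95 = 0.05.

```c
#include <stdio.h>

/* Hit and miss ratios as defined above; the counts are hypothetical. */
int main(void) {
    long accesses = 1000, hits = 950;
    double hit_ratio  = (double)hits / accesses;
    double miss_ratio = 1.0 - hit_ratio;   /* same as misses / accesses */
    printf("hit ratio = %.2f, miss ratio = %.2f\n", hit_ratio, miss_ratio);
    return 0;
}
```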
Caches (1)
Cache memory: the level of the memory hierarchy closest to the CPU.
Given accesses X_1, ..., X_{n-1}, X_n:
- How do we know if the data is present?
- Where do we look?
Caches (2)
Direct mapped cache: the location is determined by the address, and there is only one choice:
- (Block address) modulo (#Blocks in cache)
- Since #Blocks is a power of 2, this is just the low-order address bits
Caches (3)
Tags and valid bits:
- How do we know which particular block is stored in a cache location?
  - Store the block address as well as the data
  - Actually, only the high-order bits are needed; these are called the tag
  - Address = [ Tag | Index | Offset ]
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present; initially 0
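A small sketch of how the three fields can be extracted with shifts and masks; the bit widths OFFSET_BITS and INDEX_BITS are assumed parameters chosen for illustration, not fixed by the slides.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 2^INDEX_BITS blocks of 2^OFFSET_BITS bytes each. */
#define OFFSET_BITS 2   /* e.g., 4-byte (1-word) blocks */
#define INDEX_BITS  3   /* e.g., 8 blocks               */

int main(void) {
    uint32_t addr   = 0x160;                                   /* any address */
    uint32_t offset =  addr                 & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS)  - 1);
    uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```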
Cache Example (1)
8 blocks, 1 word/block, direct mapped. Initial state:

  Index | V | Tag | Data
  ------+---+-----+-----
  000   | N |     |
  001   | N |     |
  010   | N |     |
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | N |     |
  111   | N |     |
Cache Example (2)

  Word addr | Binary addr | Hit/miss | Cache block
  22        | 10 110      | Miss     | 110

  Index | V | Tag | Data
  ------+---+-----+-----------
  000   | N |     |
  001   | N |     |
  010   | N |     |
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |
Cache Example (3)

  Word addr | Binary addr | Hit/miss | Cache block
  26        | 11 010      | Miss     | 010

  Index | V | Tag | Data
  ------+---+-----+-----------
  000   | N |     |
  001   | N |     |
  010   | Y | 11  | Mem[11010]
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |
Cache Example (4)

  Word addr | Binary addr | Hit/miss | Cache block
  22        | 10 110      | Hit      | 110
  26        | 11 010      | Hit      | 010

  Index | V | Tag | Data
  ------+---+-----+-----------
  000   | N |     |
  001   | N |     |
  010   | Y | 11  | Mem[11010]
  011   | N |     |
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |
Cache Example (5)

  Word addr | Binary addr | Hit/miss | Cache block
  16        | 10 000      | Miss     | 000
  3         | 00 011      | Miss     | 011
  16        | 10 000      | Hit      | 000

  Index | V | Tag | Data
  ------+---+-----+-----------
  000   | Y | 10  | Mem[10000]
  001   | N |     |
  010   | Y | 11  | Mem[11010]
  011   | Y | 00  | Mem[00011]
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |
Cache Example (6)

  Word addr | Binary addr | Hit/miss | Cache block
  18        | 10 010      | Miss     | 010

  Index | V | Tag | Data
  ------+---+-----+-----------
  000   | Y | 10  | Mem[10000]
  001   | N |     |
  010   | Y | 10  | Mem[10010]
  011   | Y | 00  | Mem[00011]
  100   | N |     |
  101   | N |     |
  110   | Y | 10  | Mem[10110]
  111   | N |     |

Note that the old block Mem[11010] at index 010 is replaced by Mem[10010].
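The whole walkthrough can be replayed with a tiny simulator. This is a sketch assuming word addresses and the 8-entry, 1-word-per-block direct-mapped geometry above; the hits and misses it prints match the tables.

```c
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8                /* 8 blocks, 1 word/block, direct mapped */

struct line { bool valid; unsigned tag; };

int main(void) {
    struct line cache[NBLOCKS] = {0};          /* all valid bits start at 0 */
    unsigned trace[] = {22, 26, 22, 26, 16, 3, 16, 18};
    for (int i = 0; i < 8; i++) {
        unsigned addr  = trace[i];
        unsigned index = addr % NBLOCKS;       /* low-order 3 bits  */
        unsigned tag   = addr / NBLOCKS;       /* remaining high bits */
        bool hit = cache[index].valid && cache[index].tag == tag;
        if (!hit) {                            /* miss: fill (or replace) */
            cache[index].valid = true;
            cache[index].tag   = tag;
        }
        printf("addr %2u -> index %u: %s\n", addr, index, hit ? "hit" : "miss");
    }
    return 0;
}
```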
Address Subdivision
[Figure: how an address is subdivided into tag, index, and offset fields to access a direct-mapped cache]
Block Size (1)
Example with a larger block size: 64 blocks, 16 bytes/block. To what block number does address 1200 map?
- Block address = 1200 / 16 = 75
- Block number = 75 modulo 64 = 11
- Address fields: Tag = bits [31:10] (22 bits), Index = bits [9:4] (6 bits), Offset = bits [3:0] (4 bits)
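The same arithmetic in a few lines of C (a sketch; the constants come straight from the example above):

```c
#include <stdio.h>

int main(void) {
    unsigned addr       = 1200;
    unsigned block_size = 16;                    /* bytes per block */
    unsigned nblocks    = 64;
    unsigned block_addr = addr / block_size;     /* 1200 / 16 = 75  */
    unsigned block_num  = block_addr % nblocks;  /* 75 mod 64 = 11  */
    printf("block address %u maps to cache block %u\n", block_addr, block_num);
    return 0;
}
```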
Block Size (2)
Considerations: larger blocks should reduce the miss rate, due to spatial locality.
[Figure: miss rate vs. block size for several cache sizes, SPEC92]
Block Size (3)
Considerations (cont'd): but in a fixed-sized cache,
- Larger blocks mean fewer of them, so more competition and an increased miss rate
- Larger blocks also cause pollution
- Larger miss penalty, which can override the benefit of the reduced miss rate; early restart and critical-word-first can help
Main Memory & Caches
Use DRAMs for main memory:
- Fixed width (e.g., 1 word)
- Connected by a fixed-width clocked bus; the bus clock is typically slower than the CPU clock

Example cache block read:
- 1 bus cycle for the address transfer
- 15 bus cycles per DRAM access
- 1 bus cycle per data transfer

For a 4-word block and 1-word-wide DRAM:
- Miss penalty = 1 + 4 x 15 + 4 x 1 = 65 bus cycles
- Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
Increasing Memory Bandwidth
- 4-word-wide memory:
  - Miss penalty = 1 + 15 + 1 = 17 bus cycles
  - Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
- 4-bank interleaved memory:
  - Miss penalty = 1 + 15 + 4 x 1 = 20 bus cycles
  - Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
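The three configurations can be compared directly with the bus-cycle model above; a sketch assuming the stated costs (1 cycle for the address, 15 per DRAM access, 1 per bus data transfer, 16-byte blocks):

```c
#include <stdio.h>

/* Bus-cycle model from the slides applied to a 4-word (16-byte) block. */
static void report(const char *name, int penalty) {
    printf("%-22s: %2d cycles, %.2f B/cycle\n", name, penalty, 16.0 / penalty);
}

int main(void) {
    report("1-word-wide DRAM",   1 + 4*15 + 4*1);  /* 65 cycles */
    report("4-word-wide memory", 1 + 15   + 1);    /* 17 cycles */
    report("4-bank interleaved", 1 + 15   + 4*1);  /* 20 cycles */
    return 0;
}
```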
Advanced DRAM Organization
- Bits in a DRAM are organized as a rectangular array; DRAM accesses an entire row
- Burst mode: supply successive words from a row with reduced latency
- Synchronous DRAM (SDRAM): adds a clock signal to the DRAM interface
- Double data rate (DDR) DRAM: transfers on both the rising and falling clock edges