Fundamentals of Computer Systems: Caches
Martha A. Kim, Columbia University, Fall 2015
Illustrations Copyright 2007 Elsevier
Computer Systems
Performance depends on whichever is slowest: the processor or the memory system.
[Figure: a processor connected to a memory over address, write-data, and read-data buses, with CLK, MemWrite, and WE control signals.]
Memory Speeds Haven't Kept Up
[Figure: relative performance of CPU vs. memory, 1980-2005; processor performance has grown far faster than memory performance.]
Our single-cycle memory assumption has been wrong since 1980.
Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.
Your Choice of Memories
Fast, cheap, large: pick two.
On-chip SRAM is fast; commodity DRAM is cheap and large; a supercomputer's memory is fast and large, but not cheap.
Memory Hierarchy
An essential trick that makes big memory appear fast.

Technology   Cost ($/Gb)   Access Time (ns)   Density (Gb/cm^2)
SRAM         30            0.5                0.25
DRAM         1             10                 1.16
Flash*       2             30                 8.32
Hard Disk    0.1           10,000,000         1.52

* Read speed; writing is much, much slower.
A Modern Memory Hierarchy
A desktop machine: AMD Phenom 9600, quad-core, 2.3 GHz, 1.1-1.25 V, 95 W, 65 nm.

Level            Size             Technology
L1 Instruction   64 KB per core   SRAM
L1 Data          64 KB per core   SRAM
L2               512 KB per core  SRAM
L3               2 MB             SRAM
Memory           4 GB             DRAM
Disk             500 GB           Magnetic
A Simple Memory Hierarchy
First level: small, fast storage (typically SRAM).
Last level: large, slow storage (typically DRAM).
[Figure: memory containing the strings "Hip hip hooray!\0" and "hip\0".]
We can fit a subset of the lower level in the upper level, but which subset?
Locality Example: Substring Matching
Searching for "hip" in "Hip hip hooray!": the sequence of addresses accessed over time (the reference stream) repeatedly revisits the pattern while walking through the text.

Temporal locality: if we needed X recently, we are likely to need X again soon.
Spatial locality: if we need X, we are likely to also need something near X.
Cache
The highest levels of the memory hierarchy. Fast: level 1 typically has a 1-cycle access time. With luck, it supplies most data.

Cache design questions (and typical answers):
What data does it hold?    Recently accessed data.
How is data found?         A simple address hash.
What data is replaced?     Often the oldest.
What is Held in the Cache?
An ideal cache would always correctly guess what you want before you want it. Real caches are never that smart. Instead, caches exploit:
Temporal locality: copy newly accessed data into the cache, replacing the oldest if necessary.
Spatial locality: copy nearby data into the cache at the same time. Specifically, always read and write a block, also called a line, at a time (e.g., 64 bytes), never a single byte.
Memory Performance
Hit: the data is found in that level of the memory hierarchy.
Miss: the data is not found; look in the next level.

Hit Rate = (Number of hits) / (Number of accesses)
Miss Rate = (Number of misses) / (Number of accesses)
Hit Rate + Miss Rate = 1

The expected access time E_L for a memory level L with latency t_L and miss rate M_L:

E_L = t_L + M_L x E_{L+1}
Memory Performance Example
Two-level hierarchy: cache and main memory. A program executes 1000 loads & stores; 750 of these are found in the cache.

What are the cache hit and miss rates?
Hit Rate = 750 / 1000 = 75%
Miss Rate = 1 - 0.75 = 25%

If the cache takes 1 cycle and the main memory 100 cycles, what's the expected access time?
Expected access time of main memory: E_1 = 100 cycles
Access time for the cache: t_0 = 1 cycle
Cache miss rate: M_0 = 0.25
E_0 = t_0 + M_0 x E_1 = 1 + 0.25 x 100 = 26 cycles
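The formula E_L = t_L + M_L x E_{L+1} folds over any number of levels, from the slowest level up. A small Python sketch (my own helper, not part of the slides) that reproduces the 26-cycle answer:

```python
def expected_access_time(levels):
    """Expected access time in cycles.

    levels: (latency_cycles, miss_rate) pairs ordered fastest to slowest;
    the last level must have miss_rate 0 (it always hits).
    Folds E_L = t_L + M_L * E_{L+1} from the bottom of the hierarchy up.
    """
    e = 0.0
    for latency, miss_rate in reversed(levels):
        e = latency + miss_rate * e
    return e

# The example above: 1-cycle cache, 25% miss rate, 100-cycle main memory
print(expected_access_time([(1, 0.25), (100, 0.0)]))  # 26.0
```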
A Direct-Mapped Cache
A 2^30-word main memory (mem[0x0] through mem[0xFFFFFFFC]) and a 2^3-word cache.

This simple cache has:
8 sets
1 block per set
4 bytes per block

To simplify answering "is this data in the cache?", each byte is mapped to exactly one set: Set 7 (111), Set 6 (110), Set 5 (101), Set 4 (100), Set 3 (011), Set 2 (010), Set 1 (001), Set 0 (000).
Direct-Mapped Cache Hardware
Memory address fields:
Bits 1:0   byte within block (byte offset)
Bits 4:2   set number (3 bits)
Bits 31:5  block tag (27 bits)

The cache is an 8-entry x (1 + 27 + 32)-bit SRAM holding a valid bit, tag, and data word per set. Cache hit if, in the set selected by the address, the block is valid (V = 1) and the stored tag matches address bits 31:5.
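The three address fields can be extracted with shifts and masks. A sketch for this 8-set, 4-bytes-per-block cache (the function name is my own):

```python
def split_address(addr, block_bytes=4, num_sets=8):
    """Split a 32-bit byte address into (tag, set, byte offset).

    For block_bytes=4 and num_sets=8 this matches the slide:
    bits 1:0 byte offset, bits 4:2 set number, bits 31:5 tag.
    """
    offset_bits = (block_bytes - 1).bit_length()    # 2
    set_bits = (num_sets - 1).bit_length()          # 3
    offset = addr & (block_bytes - 1)               # bits 1:0
    set_number = (addr >> offset_bits) & (num_sets - 1)  # bits 4:2
    tag = addr >> (offset_bits + set_bits)          # bits 31:5
    return tag, set_number, offset

print(split_address(0x4))   # (0, 1, 0)
print(split_address(0x24))  # (1, 1, 0): same set as 0x4, different tag
```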
Direct-Mapped Cache Behavior
A dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0xC($0)
      lw    $t3, 0x8($0)
      addiu $t0, $t0, -1
      j     l1
done:

When reading 0x4 for the last time, sets 1, 2, and 3 hold mem[0x...4], mem[0x...8], and mem[0x...C] (each valid, tag 0...0); the other sets are empty.

Assuming the cache starts empty, what's the miss rate?

Accesses: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
          M M M H H H H H H H H H H H H

3/15 = 0.2 = 20%
Direct-Mapped Cache: Conflict
A dumber loop: repeat 5 times: load from 0x4; load from 0x24.

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0x24($0)
      addiu $t0, $t0, -1
      j     l1
done:

0x4 and 0x24 both map to set 1 (they differ only in the tag), so each load evicts the other's block.

Assuming the cache starts empty, what's the miss rate?

Accesses: 4 24 4 24 4 24 4 24 4 24
          M M  M M  M M  M M  M M

10/10 = 1 = 100%. Oops. These are conflict misses.
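Both loops can be checked with a few lines of simulation. This is my own sketch of the slides' direct-mapped cache (8 sets, 1 block per set, 4-byte blocks), tracking only valid bits and tags:

```python
class DirectMappedCache:
    """Direct-mapped cache model: 8 sets, 1 block/set, 4-byte blocks."""

    def __init__(self, num_sets=8, block_bytes=4):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.tags = [None] * num_sets   # None means V = 0

    def access(self, addr):
        """Return True on a hit; on a miss, install the new tag."""
        block = addr // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag          # evict whatever was in this set
        return False

def miss_rate(trace, cache):
    misses = sum(0 if cache.access(a) else 1 for a in trace)
    return misses / len(trace)

# Dumb loop: 3 compulsory misses out of 15 accesses -> 20%
print(miss_rate([0x4, 0xC, 0x8] * 5, DirectMappedCache()))  # 0.2
# Dumber loop: 0x4 and 0x24 both map to set 1 -> 100% misses
print(miss_rate([0x4, 0x24] * 5, DirectMappedCache()))      # 1.0
```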
No Way! Yes Way! A 2-Way Set Associative Cache
Memory address fields: tag (28 bits, address bits 31:4), set (2 bits, bits 3:2), byte offset (bits 1:0).

Four sets, two ways: each set holds two (V, Tag, Data) entries. The set field selects a set; both ways' tags are compared against the address tag in parallel, producing Hit1 and Hit0, which select which way's 32-bit data to return. Hit = Hit1 OR Hit0.
2-Way Set Associative Behavior

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0x24($0)
      addiu $t0, $t0, -1
      j     l1
done:

Both 0x4 and 0x24 map to set 1, but the set now has two ways: one way holds mem[0x...4] and the other holds mem[0x...24].

Assuming the cache starts empty, what's the miss rate?

Accesses: 4 24 4 24 4 24 4 24 4 24
          M M  H H  H H  H H  H H

2/10 = 0.2 = 20%. Associativity reduces conflict misses.
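Adding associativity to the same kind of model shows the conflict disappearing. A 2-way version with LRU replacement, sketched by me (the slides don't specify the replacement policy; LRU is an assumption):

```python
from collections import OrderedDict

class TwoWayCache:
    """2-way set associative model: 4 sets, 4-byte blocks, LRU replacement."""

    def __init__(self, num_sets=4, ways=2, block_bytes=4):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        # One OrderedDict per set, keyed by tag; insertion order tracks LRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        block = addr // self.block_bytes
        s = self.sets[block % self.num_sets]
        tag = block // self.num_sets
        if tag in s:
            s.move_to_end(tag)        # mark as most recently used
            return True
        if len(s) == self.ways:
            s.popitem(last=False)     # evict the least recently used way
        s[tag] = True
        return False

cache = TwoWayCache()
trace = [0x4, 0x24] * 5
misses = sum(0 if cache.access(a) else 1 for a in trace)
print(misses / len(trace))  # 0.2: only the two compulsory misses remain
```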
An Eight-Way Fully Associative Cache
A single set with eight ways: a (V, Tag, Data) entry in each. No conflict misses: only compulsory or capacity misses. Either very expensive or slow because of all the associativity (eight tag comparisons per access).
Exploiting Spatial Locality: Larger Blocks
Now: 2 sets, 1 block per set (direct-mapped), 4 words per block.

Memory address fields:
Bits 1:0   byte offset (byte within a word)
Bits 3:2   block offset (word within the block)
Bit 4      set number
Bits 31:5  tag (27 bits)

On a hit, the block offset selects one of the four 32-bit words in the block.
Example: address 0x8000009C has byte offset 00, block offset 11, set 1, and tag 100...0100.
Direct-Mapped Cache Behavior with a 4-Word Block
The dumb loop again: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0xC($0)
      lw    $t3, 0x8($0)
      addiu $t0, $t0, -1
      j     l1
done:

0x0, 0x4, 0x8, and 0xC all lie in the same 16-byte block, so the first load of 0x4 brings mem[0x...0] through mem[0x...C] into set 0 and every later access hits.

Assuming the cache starts empty, what's the miss rate?

Accesses: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
          M H H H H H H H H H H H H H H

1/15 = 0.067 = 6.7%. Larger blocks reduce compulsory misses by exploiting spatial locality.
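The same style of model with the block size raised to four words reproduces the 1-in-15 miss rate (again my own sketch, not from the slides):

```python
def miss_rate(trace, num_sets=2, words_per_block=4):
    """Direct-mapped, num_sets sets, 4-byte words: one tag per set."""
    block_bytes = 4 * words_per_block
    tags = [None] * num_sets
    misses = 0
    for addr in trace:
        block = addr // block_bytes       # all of 0x0-0xF share block 0
        index = block % num_sets
        tag = block // num_sets
        if tags[index] != tag:            # miss: fetch the whole block
            tags[index] = tag
            misses += 1
    return misses / len(trace)

# One compulsory miss fetches mem[0x0..0xC]; the other 14 accesses hit.
print(round(miss_rate([0x4, 0xC, 0x8] * 5), 3))  # 0.067
```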
Stephen's Desktop Machine Revisited
On-chip caches of the AMD Phenom 9600 (quad-core, 2.3 GHz, 1.1-1.25 V, 95 W, 65 nm):

Cache   Size             Sets   Ways     Block
L1I     64 KB per core   512    2-way    64-byte
L1D     64 KB per core   512    2-way    64-byte
L2      512 KB per core  512    16-way   64-byte
L3      2 MB             1024   32-way   64-byte
Intel On-Chip Caches

Chip          Year   Freq. (MHz)   L1 Instr              L1 Data   L2
80386         1985   16-25         off-chip              -         none
80486         1989   25-100        8K unified            -         off-chip
Pentium       1993   60-300        8K                    8K        off-chip
Pentium Pro   1995   150-200       8K                    8K        256K-1M (MCM)
Pentium II    1997   233-450       16K                   16K       256K-512K (Cartridge)
Pentium III   1999   450-1400      16K                   16K       256K-512K
Pentium 4     2001   1400-3730     12k-uop trace cache   8-16K     256K-2M
Pentium M     2003   900-2130      32K                   32K       1M-2M
Core 2 Duo    2005   1500-3000     32K                   32K       2M-6M