Fundamentals of Computer Systems

Size: px

Start display at page:

Download "Fundamentals of Computer Systems"

Tyler McCormick
5 years ago
Views:

1 Fundamentals of Computer Systems Caches Martha A. Kim Columbia University Fall 215 Illustrations Copyright 27 Elsevier 1 / 23

2 Computer Systems Performance depends on which is slowest: the processor or the memory system CLK Processor MemWrite Address Write WE Memory Read 2 / 23

3 Memory Speeds Haven t Kept Up 1, 1, Performance 1 1 CPU Year Memory Our single-cycle memory assumption has been wrong since 198. Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, / 23

4 Your Choice of Memories Fast Cheap Large On-Chip SRAM Commodity DRAM Supercomputer 4 / 23

5 Memory Hierarchy An essential trick that makes big memory appear fast Technology Cost Access Time Density ($/Gb) (ns) (Gb/cm2) SRAM DRAM Flash Hard Disk Read speed; writing much, much slower 5 / 23

6 A Modern Memory Hierarchy A desktop machine: AMD Phenom 96 Quad-core 2.3 GHz V 95 W 65 nm Level Size Tech. L1 Instruction 64 K SRAM L1 64 K SRAM L2 512 K SRAM L3 2 MB SRAM Memory 4 GB DRAM Disk 5 GB Magnetic per core 6 / 23

7 A Simple Memory Hierarchy???? First level: small, fast storage (typically SRAM) H i p h i p h o o r a y! \ h i p \ Last level: large, slow storage (typically DRAM) Can fit a subset of lower level in upper level, but which subset? 7 / 23

8 Locality Example: Substring Matching H i p h i p h o o r a y! \ h i p \ Addresses accessed over time aka reference stream h H h i h p h h h i i p p h h temporal locality: if needed X recently, likely to need X again soon spatial locality: if need X, likely also need something near X 8 / 23

9 Cache Highest levels of memory hierarchy Fast: level 1 typically 1 cycle access time With luck, supplies most data Cache design questions: What data does it hold? How is data found? What data is replaced? Recently accessed Simple address hash Often the oldest 9 / 23

10 What is Held in the Cache? Ideal cache: always correctly guesses what you want before you want it. Real cache: never that smart Caches Exploit Temporal Locality Copy newly accessed data into cache, replacing oldest if necessary Spatial Locality Copy nearby data into the cache at the same time Specifically, always read and write a block, also called a line, at a time (e.g., 64 bytes), never a single byte. 1 / 23

Memory Performance Hit: is found in the level of memory hierarchy Miss: not found; will look in next level Hit Rate = Miss Rate = Number of hits Number of accesses Number

11 Memory Performance Hit: is found in the level of memory hierarchy Miss: not found; will look in next level Hit Rate = Miss Rate = Number of hits Number of accesses Number of misses Number of accesses Hit Rate + Miss Rate = 1 The expected access time E L for a memory level L with latency t L and miss rate M L : E L = t L + M L E L+1 11 / 23

12 Memory Performance Example Two-level hierarchy: Cache and main memory Program executes 1 loads & stores 75 of these are found in the cache What s the cache hit and miss rate? 12 / 23

13 Memory Performance Example Two-level hierarchy: Cache and main memory Program executes 1 loads & stores 75 of these are found in the cache What s the cache hit and miss rate? Hit Rate = 75 1 = 75% Miss Rate = 1.75 = 25% If the cache takes 1 cycle and the main memory 1, What s the expected access time? 12 / 23

14 Memory Performance Example Two-level hierarchy: Cache and main memory Program executes 1 loads & stores 75 of these are found in the cache What s the cache hit and miss rate? Hit Rate = 75 1 = 75% Miss Rate = 1.75 = 25% If the cache takes 1 cycle and the main memory 1, What s the expected access time? Expected access time of main memory: E 1 = 1 cycles Access time for the cache: t = 1 cycle Cache miss rate: M =.25 E = t + M E 1 = = 26 cycles 12 / 23

15 A Direct-Mapped Cache Address mem[xfffffffc] mem[xfffffff8] mem[xfffffff4] mem[xfffffff] mem[xffffffec] mem[xffffffe8] mem[xffffffe4] mem[xffffffe]...11 mem[x24]...1 mem[x2] mem[x1c]...11 mem[x18]...11 mem[x14]...1 mem[x1]...11 mem[xc]...1 mem[x8]...1 mem[x4]... mem[x] 2 3 -Word Main Memory This simple cache has 8 sets 1 block per set 4 bytes per block To simplify answering is this data in the cache?, each byte is mapped to exactly one set Word Cache Set 7 (111) Set 6 (11) Set 5 (11) Set 4 (1) Set 3 (11) Set 2 (1) Set 1 (1) Set () 13 / 23

16 Direct-Mapped Cache Hardware Memory Address Tag Set 27 3 = Byte Offset V Tag Address bits: 1: byte within block 2 4: set number 5 31: block tag Set 7 Set 6 Set 5 Set 4 Set 3 Set 2 Set 1 Set Cache hit if 8-entry x ( )-bit SRAM in the set of the address, Hit block is valid (V=1) tag (address bits 5 31) matches 14 / 23

17 Direct-Mapped Cache Behavior A dumb loop: repeat 5 times load from x4; load from xc; load from x8. li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, xc($) lw $t3, x8($) addiu $t, $t, -1 j l1 done: Memory Address Tag... Byte Set Offset 1 3 V Tag 1... mem[x...c] 1... mem[x...8] 1... mem[x...4] Set 7 (111) Set 6 (11) Set 5 (11) Set 4 (1) Set 3 (11) Set 2 (1) Set 1 (1) Set () Cache when reading x4 last time Assuming the cache starts empty, what s the miss rate? 15 / 23

18 Direct-Mapped Cache Behavior A dumb loop: repeat 5 times load from x4; load from xc; load from x8. li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, xc($) lw $t3, x8($) addiu $t, $t, -1 j l1 done: Memory Address Tag... Byte Set Offset 1 3 V Tag 1... mem[x...c] 1... mem[x...8] 1... mem[x...4] Set 7 (111) Set 6 (11) Set 5 (11) Set 4 (1) Set 3 (11) Set 2 (1) Set 1 (1) Set () Cache when reading x4 last time Assuming the cache starts empty, what s the miss rate? 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 M M M H H H H H H H H H H H H 3/15 =.2 = 2% 15 / 23

19 Direct-Mapped Cache: Conflict A dumber loop: repeat 5 times load from x4; load from x24 Memory Address li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, x24($) addiu $t, $t, -1 j l1 done: Tag Set Byte Offset V Tag Set 7 (111) Set 6 (11) Set 5 (11) Set 4 (1) Set 3 (11) Set 2 (1) Set 1 (1) Set () Cache State Assuming the cache starts empty, what s the miss rate? These are conflict misses 16 / 23

20 Direct-Mapped Cache: Conflict A dumber loop: repeat 5 times load from x4; load from x24 Memory Address li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, x24($) addiu $t, $t, -1 j l1 done: Tag Set Byte Offset V 1 Tag... mem[x...4] mem[x...24] Set 7 (111) Set 6 (11) Set 5 (11) Set 4 (1) Set 3 (11) Set 2 (1) Set 1 (1) Set () Cache State Assuming the cache starts empty, what s the miss rate? M M M M M M M M M M 1/1 = 1 = 1% Oops These are conflict misses 16 / 23

21 No Way! Yes Way! 2-Way Set Associative Cache Memory Address Tag Set 28 2 Byte Offset V Tag Way 1 Way V Tag Set 3 Set 2 Set 1 Set = = Hit 1 Hit 1 Hit 1 32 Hit 17 / 23

22 2-Way Set Associative Behavior li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, x24($) addiu $t, $t, -1 j l1 done: Assuming the cache starts empty, what s the miss rate? M M H H H H H H H H 2/1 =.2 = 2% Associativity reduces conflict misses Way 1 Way V Tag V Tag 1... mem[x...24] mem[x...4] Set 3 Set 2 Set 1 Set 18 / 23

23 An Eight-way Fully Associative Cache Way 7 Way 6 Way 5 Way 4 Way 3 Way 2 Way 1 Way V Tag V Tag V Tag V Tag V Tag V Tag V Tag V Tag No conflict misses: only compulsory or capacity misses Either very expensive or slow because of all the associativity 19 / 23

24 Exploiting Spatial Locality: Larger Blocks Block Byte Tag Set Offset Offset Memory x8 9C: Address 8 9 C Memory Address Block Byte Tag Set Offset Offset 27 2 V Tag Set 1 Set = 32 Hit 2 sets 1 block per set (Direct Mapped) 4 words per block 2 / 23

25 Direct-Mapped Cache Behavior w/ 4-word block The dumb loop: repeat 5 times load from x4; load from xc; load from x8. Block Byte Tag Set Offset Offset Memory Address V Tag 1... mem[x...c] mem[x...8] mem[x...4] mem[x...] Set 1 Set Cache when reading xc li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, xc($) lw $t3, x8($) addiu $t, $t, -1 j l1 done: Assuming the cache starts empty, what s the miss rate? 21 / 23

26 Direct-Mapped Cache Behavior w/ 4-word block The dumb loop: repeat 5 times load from x4; load from xc; load from x8. Block Byte Tag Set Offset Offset Memory Address V Tag 1... mem[x...c] mem[x...8] mem[x...4] mem[x...] Set 1 Set Cache when reading xc li $t, 5 l1: beq $t, $, done lw $t1, x4($) lw $t2, xc($) lw $t3, x8($) addiu $t, $t, -1 j l1 done: Assuming the cache starts empty, what s the miss rate? 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 M H H H H H H H H H H H H H H 1/15 =.666 = 6.7% Larger blocks reduce compulsory misses by exploting spatial locality 21 / 23

27 Stephen s Desktop Machine Revisited On-chip caches: AMD Phenom 96 Quad-core 2.3 GHz V 95 W 65 nm Cache Size Sets Ways Block L1I 64 K way 64-byte L1D 64 K way 64-byte L2 512 K way 64-byte L3 2 MB way 64-byte per core 22 / 23

28 Intel On-Chip Caches Chip Year Freq. (MHz) off-chip none K unified off-chip Pentium K 8K off-chip Pentium Pro K 8K 256K 1M (MCM) Pentium II K 16K 256K 512K (Cartridge) Pentium III K 16K 256K 512K Pentium Pentium M K 32K 1M 2M K 32K 2M 6M23 / 23 Core 2 Duo K L1 Instr L2 12k op 256K 2M trace cache

Fundamentals of Computer Systems

Fundamentals of Computer Systems Caches Stephen A. Edwards Columbia University Summer 217 Illustrations Copyright 27 Elsevier Computer Systems Performance depends on which is slowest: the processor or