Lecture 15: Cache Design (in Isolation) James C. Hoe Department of ECE Carnegie Mellon University


1 Lecture 15: Cache Design (in Isolation) James C. Hoe Department of ECE Carnegie Mellon University S18 L15 S1, James C. Hoe, CMU/ECE/CALCM, 2018

2 Housekeeping
Your goal today
- recover from Spring Break
- understand ABC of caches
- understand 3 C's of caches
Notices
- Lab 3, due next week
- HW4 out on Wed
- Midterm 1 regrade in by Friday
Readings
- P&H Ch

3 The Basic Problem
Potentially $M = 2^m$ bytes of memory: how to keep copies of the most frequently used locations in C bytes of fast storage, where $C \ll M$?
Basic issues (intertwined)
(1) when to cache a copy of a memory location
(2) where in fast storage to keep the copy
(3) how to find the copy later on (LW and SW only give indices into M)

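4 Direct Mapped Cache (v1)
[Figure: the $\lg_2 M$-bit address is split into tag (t bits), idx ($\lg_2(C/G)$ bits) and g. idx selects one line in the Tag Bank (C/G lines by t bits, plus valid) and in the Data Bank (C/G lines by G bytes). The stored tag is compared (=) with the address tag to produce hit?; the Data Bank supplies G bytes of data.]
let $t = \lg_2 M - \lg_2 C$
What about writes?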

5 Storage Overhead and Block Size
For each cache block of G bytes, also storing t+1 bits of tag and valid (where $t = \lg_2 M - \lg_2 C$)
- if $M = 2^{32}$, G = 4, C = 16K = $2^{14}$: t = 18 bits for each 4-byte block
- 60% overhead; 16KB cache actually 25.5KB SRAM
Solution: amortize tag over larger B-byte block
- manage B/G consecutive words as an indivisible unit
- if $M = 2^{32}$, B = 16, G = 4, C = 16K: t = 18 bits for each 16-byte block
- 15% overhead; 16KB cache actually 18.4KB SRAM
- spatial locality also says this is a good idea
Larger caches want even bigger blocks
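The overhead figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes one valid bit per block and power-of-two sizes; the function name `tag_overhead` is hypothetical, not from the lecture.

```python
def tag_overhead(m_bits, capacity, block):
    """Return (t, overhead fraction, total SRAM bytes) for a direct-mapped cache.

    Assumes t = lg2(M) - lg2(C) tag bits plus one valid bit per block,
    and that capacity and block are powers of two (sizes in bytes).
    """
    t = m_bits - (capacity.bit_length() - 1)    # lg2(M) - lg2(C)
    bits_per_block = t + 1                      # tag plus valid bit
    overhead = bits_per_block / (block * 8)     # relative to data bits in a block
    total_bytes = capacity * (1 + overhead)
    return t, overhead, total_bytes

# G = 4 byte blocks versus B = 16 byte blocks, both with M = 2^32 and C = 16K
small = tag_overhead(32, 16 * 1024, 4)
large = tag_overhead(32, 16 * 1024, 16)
print(small[0], f"{small[1]:.1%}", small[2] / 1024)   # 18 59.4% 25.5
print(large[0], f"{large[1]:.1%}", large[2] / 1024)   # 18 14.8% 18.375
```

The tag width is the same in both cases; only the number of data bits it is amortized over changes.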

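6 Direct Mapped Cache (final)
[Figure: the $\lg_2 M$-bit address is split into tag (t bits), idx ($\lg_2(C/B)$ bits), bo ($\lg_2(B/G)$ bits) and g. idx selects one entry in the Tag Bank (C/B by t bits, plus valid) and in the Data Bank (C/B by B bytes). The stored tag is compared (=) with the address tag to produce hit?; bo selects G bytes of data out of the B-byte block.]
let $t = \lg_2 M - \lg_2 C$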
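As a minimal sketch of the address split in this design, the function below extracts the four fields from a byte address. It assumes C, B and G are powers of two with G <= B <= C; the name `split_address` and the sample address are illustrative only.

```python
def split_address(addr, m, C, B, G):
    """Split an m-bit byte address into (tag, idx, bo, g) for a direct-mapped cache."""
    assert 0 <= addr < (1 << m)
    g_bits = G.bit_length() - 1             # lg2(G)
    bo_bits = (B // G).bit_length() - 1     # lg2(B/G)
    idx_bits = (C // B).bit_length() - 1    # lg2(C/B)
    g = addr & (G - 1)
    bo = (addr >> g_bits) & ((B // G) - 1)
    idx = (addr >> (g_bits + bo_bits)) & ((C // B) - 1)
    tag = addr >> (g_bits + bo_bits + idx_bits)   # remaining t = m - lg2(C) bits
    return tag, idx, bo, g

tag, idx, bo, g = split_address(0x12345678, m=32, C=16 * 1024, B=16, G=4)
print(hex(tag), hex(idx), bo, g)   # 0x48d1 0x167 2 0
```

Only idx is used to select a line; tag is what gets stored and compared, and bo picks the G-byte word out of the block.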

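7 Is this okay?
[Figure: same direct mapped cache as the final design, but the address fields are ordered bo, idx, tag, g. idx ($\lg_2(C/B)$ bits) selects one entry in the Tag Bank (C/B by t bits, plus valid) and in the Data Bank (C/B by B bytes); bo ($\lg_2(B/G)$ bits) selects G bytes of data; the t-bit tag compare (=) produces hit?.]
let $t = \lg_2 M - \lg_2 C$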

8 Is this okay?
[Figure: address fields tag, idx, bo, g. idx is now $\lg_2(C/B) + k$ bits and bo is $\lg_2(B/G) - k$ bits. The Tag Bank is still C/B by t bits (plus valid); the Data Bank is $2^k \cdot C/B$ entries by $B/2^k$ bytes. The t-bit tag compare (=) produces hit?; the Data Bank supplies G bytes of data.]
let $t = \lg_2 M - \lg_2 C$

9 Direct Mapped Cache
C bytes of storage managed as C/B cache blocks
A given block address directly maps to exactly one choice of cache block (by block index field)
Block addresses with the same block index field map to the same cache block
- of $2^t$ such addresses, hold only one at a time
- even if C > working set size, conflict is possible (working set is not one continuous region)
- probability that 2 random addresses conflict is 1/(C/B); likelihood of conflict decreases with increasing number of blocks
[Figure: hit rate versus working set size (W), with C marked on the W axis]
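The point that conflicts can occur even when the working set is smaller than C can be seen in a toy model. This is a sketch under simplifying assumptions (tags only, no data, byte addresses, allocate on every miss); the class name `DirectMappedCache` and the sizes used are hypothetical.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache with C/B blocks."""

    def __init__(self, C, B):
        self.B = B
        self.num_blocks = C // B
        self.tags = [None] * self.num_blocks   # None marks an invalid block

    def access(self, addr):
        """Return True on a hit; on a miss, the new block displaces the old one."""
        block_addr = addr // self.B
        idx = block_addr % self.num_blocks      # block index field
        tag = block_addr // self.num_blocks     # remaining upper bits
        hit = self.tags[idx] == tag
        self.tags[idx] = tag
        return hit

# Working set of two 16-byte blocks in a 64-byte cache (4 blocks)
conflicting = DirectMappedCache(C=64, B=16)
print([conflicting.access(a) for a in [0, 64, 0, 64]])   # [False, False, False, False]
friendly = DirectMappedCache(C=64, B=16)
print([friendly.access(a) for a in [0, 16, 0, 16]])      # [False, False, True, True]
```

Addresses 0 and 64 share the same index, so they keep displacing each other even though three of the four blocks stay empty.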

10 C, B and $m_i$
Increasing B has prefetching benefit
- pay miss penalty only once per cache block
- works especially well in instruction caches
- effective up to the limit of spatial locality
Increasing B too much
- wastes capacity
- increases probability for conflict
[Figure: hit rate versus B]

11 B and $T_{i+1}$
Loading a large cache block increases $T_{i+1}$
Solution 1: critical-word-first reload
- $L_{i+1}$ returns requested word first, then rotates around the complete block
- supply requested word to pipeline ASAP
Solution 2: sub-blocking (at very large C and B)
- valid bit per sub-block, but still common tag
- reload only requested sub-block on demand
- reduce $T_{i+1}$; reduce BW at $L_{i+1}$
- but loses prefetching; wastes capacity
[Figure: one tag followed by sub-block 0, sub-block 1, ..., each with its own v and s bits]
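The "requested word first, then rotate around the block" order of Solution 1 can be written down directly. This is a minimal sketch; the function name `critical_word_first_order` is hypothetical, and words are numbered by their offset within the block.

```python
def critical_word_first_order(requested_word, words_per_block):
    """Return the order in which the words of a block come back on a reload.

    The requested word is returned first, then the transfer wraps around
    the block until every word has been delivered once.
    """
    return [(requested_word + i) % words_per_block for i in range(words_per_block)]

# Miss on word 2 of a 4-word block
print(critical_word_first_order(2, 4))   # [2, 3, 0, 1]
```

Every word of the block is still transferred, so the prefetching benefit of a large B is kept while the pipeline gets its word after the first transfer.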

12 Now for the general case

13 Set Associative Cache
[Figure: address fields tag, idx, bo, g. There are a banks, each a C/a-byte direct mapped cache with a Tag array (C/a/B by t bits, plus valid) and a Data array (C/a/B by B bytes). Each bank has its own comparator (=); some kind of mux selects the data; the compare results produce hit?.]
$t = \lg_2 M - \lg_2 (C/a)$

14 a-way Set Associative Cache
C bytes of storage divided into a direct mapped banks (aka "ways")
- each way has (C/a)/B cache blocks
- a given block address maps to exactly one choice per way; the a choices constitute the set
- direct mapped is the special case a = 1
- overhead: a comparators and an a-to-1 multiplexer
Block addresses with the same index map to the same set
- $2^t$ such addresses; hold a different ones at a time
- if C > working set size, higher degree of associativity means fewer conflicts
What if C < working set size?
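A toy model makes the effect of a visible on a fixed trace. The sketch below assumes LRU replacement within a set (one possible choice, not something this slide prescribes), tracks tags only, and uses hypothetical names `SetAssociativeCache` and `hit_count`.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tag-only model of an a-way set associative cache with LRU replacement."""

    def __init__(self, C, B, a):
        self.B = B
        self.a = a
        self.num_sets = C // a // B
        # One OrderedDict per set: keys are tags, ordered from LRU to MRU
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, addr):
        """Return True on a hit; on a miss, evict the LRU block if the set is full."""
        block_addr = addr // self.B
        idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        ways = self.sets[idx]
        if tag in ways:
            ways.move_to_end(tag)       # mark as most recently used
            return True
        if len(ways) == self.a:
            ways.popitem(last=False)    # drop the least recently used tag
        ways[tag] = True
        return False

def hit_count(cache, trace):
    """Count hits over a list of byte addresses."""
    return sum(cache.access(addr) for addr in trace)

print(hit_count(SetAssociativeCache(64, 16, a=1), [0, 64] * 4))       # 0
print(hit_count(SetAssociativeCache(64, 16, a=2), [0, 64] * 4))       # 6
print(hit_count(SetAssociativeCache(64, 16, a=2), [0, 64, 128] * 3))  # 0
```

With a = 2 the two conflicting blocks coexist in one set; once three blocks compete for the same two ways, LRU evicts each block just before it is reused.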

15 Replacement Policy
New block displaces an existing block from the set
- pick the one that is least recently used (LRU); exact LRU is expensive for a > 2
- pick any one except the most recently used
- pick the most recently used one
- pick one based on some part of the address bits
- pick the one used again furthest in the future
- pick a (pseudo) random one
No real best choice; second-order impact only
- if actively using fewer than a blocks in a set, any sensible replacement policy will quickly converge
- if actively using more than a blocks in a set, no replacement policy can help you

16 Pseudo Associative Cache
Associativity is a placement policy
- it says a block address could be placed in one of a different blocks
- it doesn't say ways are parallel look-up banks
Pseudo a-way associativity
- given a direct mapped array with C/B blocks, logically partition into C/B/a sets
- given an address A, index into the set and sequentially search its ways
- optimization: record the most recently used way (MRU) to check first, e.g., used by MIPS R10K L2
[Figure: one array whose consecutive entries are set0 way0, set0 way1, ..., then set1 way0, set1 way1, ...]
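The sequential search with an MRU hint can be sketched for a single set as follows. This is an illustrative model only (not the R10K implementation); the class `PseudoAssociativeSet` and its `fill` method are hypothetical, and the probe count stands in for lookup latency.

```python
class PseudoAssociativeSet:
    """One set of a pseudo-associative cache: ways are probed one at a time."""

    def __init__(self, a):
        self.tags = [None] * a   # None marks an empty way
        self.mru = 0             # way to probe first

    def lookup(self, tag):
        """Return (hit, probes): probe the MRU way first, then the rest in order."""
        order = [self.mru] + [w for w in range(len(self.tags)) if w != self.mru]
        for probes, way in enumerate(order, start=1):
            if self.tags[way] == tag:
                self.mru = way
                return True, probes
        return False, len(order)

    def fill(self, tag, way):
        """Place a tag into a chosen way and make that way the MRU."""
        self.tags[way] = tag
        self.mru = way

s = PseudoAssociativeSet(a=2)
s.fill(10, way=0)
s.fill(20, way=1)
print(s.lookup(20))   # (True, 1): found in the MRU way
print(s.lookup(10))   # (True, 2): found on the second probe
print(s.lookup(10))   # (True, 1): way 0 is now the MRU
print(s.lookup(30))   # (False, 2): miss after probing every way
```

A hit in the MRU way costs the same as a direct mapped lookup; other hits and all misses pay for the extra probes.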

17 Skewed Associative Cache
[Figure: address fields tag, idx, bo, g. A different hash for each way ($\text{hash}_0$ ... $\text{hash}_{a-1}$) produces the index into that way. There are a banks, each a C/a-byte direct mapped cache with a Tag array (C/a/B by t bits, plus valid) and a Data array (C/a/B by B bytes), each with its own comparator (=); outputs are hit? and data.]
$t = \lg_2 M - \lg_2 (C/a)$
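To illustrate why a different hash per way helps, the sketch below uses two made-up index functions: plain low-order bits for way 0 and an XOR-folded index for way 1. These particular hashes are assumptions chosen for illustration, not the ones from the lecture or from any specific design; `way_index` is a hypothetical name.

```python
def way_index(block_addr, way, num_sets):
    """Return the index used in the given way (num_sets must be a power of two).

    Way 0 uses the low-order bits of the block address; every other way
    XORs in the next-higher bits first (an illustrative hash only).
    """
    s = num_sets.bit_length() - 1
    if way == 0:
        return block_addr % num_sets
    return (block_addr ^ (block_addr >> s)) % num_sets

blocks = [0, 4, 8]   # block addresses that all collide in way 0 when num_sets = 4
print([way_index(b, 0, 4) for b in blocks])   # [0, 0, 0]
print([way_index(b, 1, 4) for b in blocks])   # [0, 1, 2]
```

Blocks that compete for the same entry in one way are spread over different entries in the other, so three mutually conflicting blocks no longer all fight over one set.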

18 Fully Associative Cache: a = C/B
[Figure: address fields tag (t bits), bo, g; no index field. There are C/B blocks, each with a 1-by-t-bit tag, a valid bit v, 1-by-B bytes of data and its own comparator (=); outputs are hit? and data.]
let $t = \lg_2 M - \lg_2 B$

19 Fully Associative Cache: a = C/B
A "content addressable" memory
- no index bits used in lookup
- present tag to find a block with matching tag, or else miss
Any block address can go into any of the C/B cache blocks
- if C > working set size, no conflicts
Requires 1 comparator per cache block, a huge multiplexer, and many long wires
- expensive/difficult for more than a few tens of blocks at L1 speed
- few reasons for very large fully associative caches
[Figure: hit rate versus a, leveling off around a ≈ 5]

20 3 C's of Cache Misses

21 Compulsory Miss
First reference to a block address always misses
Dominates when locality is poor
- for example, in a streaming data access pattern where many addresses are visited, but each is used only once
Main design factor: B and prefetching
[Figure: hit rate versus B]

22 Capacity Miss
Cache is too small to hold everything needed
Defined as the misses that would occur in a fully associative cache of the same capacity using optimum (Belady) replacement
Dominates when C < W
- for example, the L1 cache is usually not big enough due to the cycle time tradeoff
Main design factor: C
[Figure: hit rate (up to 100%) versus working set size (W), with C marked on the W axis]

23 Conflict Miss
Miss to a previously visited block address displaced due to conflict under direct mapped or set associative allocation
Defined as a miss that is neither compulsory nor capacity
Dominates when C ≈ W or when C/B is small
Main design factor: a
[Figure: hit rate versus a, leveling off around a ≈ 5]

24 3 C worksheet: a=1, B=1, C=2

addr  set#  which C?     set[2] before  set[2] after  F.A. + Belady before  F.A. + Belady after
0     0     compulsory   [ , ]          [0, ]         { }                   {0}

25 3 C worksheet: a=1, B=1, C=2
(The addr and set# columns are inferred from the recorded cache states.)

addr  set#  which C?     set[2] before  set[2] after  F.A. + Belady before  F.A. + Belady after
0     0     compulsory   [ , ]          [0, ]         { }                   {0}
2     0     compulsory   [0, ]          [2, ]         {0}                   {0,2}
0     0     conflict     [2, ]          [0, ]         {0,2}                 hit
2     0     conflict     [0, ]          [2, ]         {0,2}                 hit
1     1     compulsory   [2, ]          [2,1]         {0,2}                 {0,1}
0     0     conflict     [2,1]          [0,1]         {0,1}                 hit
2     0     capacity     [0,1]          [2,1]         {0,1}                 {0,2}
0     0     conflict     [2,1]          [0,1]         {0,2}                 hit
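The worksheet can be reproduced mechanically by running a direct mapped cache next to a fully associative cache of the same size with Belady replacement. The sketch below is one way to do that; `classify_misses` is a hypothetical name, block addresses are used directly (B = 1), and the trace 0, 2, 0, 2, 1, 0, 2, 0 is inferred from the cache states recorded in the worksheet rather than given explicitly.

```python
def classify_misses(trace, num_blocks):
    """Label each reference of a block-address trace for a direct-mapped cache.

    A miss is compulsory on the first reference to a block, capacity if a
    fully-associative cache of the same size with Belady replacement also
    misses, and conflict otherwise.
    """
    dm = [None] * num_blocks   # direct-mapped contents, one block per set
    fa = set()                 # fully-associative contents
    seen = set()               # every block referenced so far
    labels = []
    for i, blk in enumerate(trace):
        # Reference cache: fully associative with Belady replacement
        fa_hit = blk in fa
        if not fa_hit:
            if len(fa) == num_blocks:
                future = trace[i + 1:]
                # Evict the block whose next use is furthest away (or never)
                victim = max(fa, key=lambda b: future.index(b) if b in future else len(future))
                fa.remove(victim)
            fa.add(blk)
        # Cache under study: direct mapped
        idx = blk % num_blocks
        if dm[idx] == blk:
            labels.append("hit")
        elif blk not in seen:
            labels.append("compulsory")
        elif not fa_hit:
            labels.append("capacity")
        else:
            labels.append("conflict")
        dm[idx] = blk
        seen.add(blk)
    return labels

for addr, label in zip([0, 2, 0, 2, 1, 0, 2, 0],
                       classify_misses([0, 2, 0, 2, 1, 0, 2, 0], num_blocks=2)):
    print(addr, label)
```

The labels come out as compulsory, compulsory, conflict, conflict, compulsory, conflict, capacity, conflict, matching the "which C?" column.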

26 Recap: Basic Cache Parameters
ISA
- $M = 2^m$: size of address space in bytes; sample values: $2^{32}$, $2^{64}$
- $G = 2^g$: cache access granularity in bytes; sample values: 4, 8
Implementation
- C: capacity of cache in bytes; sample values: 16 KByte (L1), 1 MByte (L2)
- $B = 2^b$: block size in bytes; sample values: 16 (L1), >64 (L2)
- a: associativity of the cache; sample values: 1, 2, 4, 5(?), ..., C/B
C/a should be a power of 2

27 Recap: Address Fields
[Figure: the $\lg_2 M$-bit address divided into tag, index and B.O. fields]

28 Recap: $M = 2^{32}$, a=2, C=1K, B=4, G=2

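29 $M = 2^{32}$, a=2, C=1K, B=4, G=2: basic solution
Address fields: tag = PA[31:9], idx = PA[8:2], b.o. = PA[1], g = PA[0]
[Figure: idx addresses two tag arrays (tag0, tag1; 128 lines by 23 bits each), two valid arrays (v0, v1; 128 lines by 1 bit each) and two data arrays (data0, data1; 128 lines by 4 bytes each). Two 23-bit comparators (=) produce hit0 and hit1, which combine into HIT. In each way a 2:1 mux controlled by b.o. picks 2 bytes out of the 4-byte line; a final 2:1 mux controlled by hit0/hit1 drives the 16-bit DATA output.]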
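The bit ranges in this example follow from the parameters alone. The sketch below derives them for any power-of-two configuration in which every field is at least one bit wide; `field_bit_ranges` is a hypothetical helper name.

```python
def field_bit_ranges(m, C, B, G, a):
    """Return {field: (high_bit, low_bit)} for an a-way set associative cache.

    Assumes C, B, G and C/a are powers of two and that each field
    (g, bo, idx, tag) is at least one bit wide.
    """
    g_bits = G.bit_length() - 1               # lg2(G)
    bo_bits = (B // G).bit_length() - 1       # lg2(B/G)
    idx_bits = (C // a // B).bit_length() - 1 # lg2(C/a/B)
    lo_bo = g_bits
    lo_idx = lo_bo + bo_bits
    lo_tag = lo_idx + idx_bits
    return {
        "g": (g_bits - 1, 0),
        "bo": (lo_idx - 1, lo_bo),
        "idx": (lo_tag - 1, lo_idx),
        "tag": (m - 1, lo_tag),
    }

ranges = field_bit_ranges(m=32, C=1024, B=4, G=2, a=2)
for name, (hi, lo) in ranges.items():
    print(f"{name}: PA[{hi}:{lo}] ({hi - lo + 1} bits)")
# g: PA[0:0] (1 bits)
# bo: PA[1:1] (1 bits)
# idx: PA[8:2] (7 bits)
# tag: PA[31:9] (23 bits)
```

The 7-bit idx corresponds to the 128 lines per way, and the 23-bit tag matches the width of the tag arrays and comparators.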

30 Same cache parameters but tuned for narrower data SRAMs
Address fields: tag = PA[31:9], idx = PA[8:2] (7 bits), b.o. = PA[1], g = PA[0]
[Figure: the tag side is unchanged (tag0 and tag1, 128 lines by 23 bits; v0 and v1, 1 bit; two 23-bit comparators producing hit0, hit1 and HIT). The two data SRAMs are 2 bytes wide and are addressed by the 8-bit {idx, bo}; a mux controlled by hit0/hit1 drives the 16-bit DATA output.]
Can you play the same trick on the tag SRAMs?

31 Same cache parameters but tuned for fatter data SRAMs
Address fields: tag = PA[31:9], idx = PA[8:2] (7 bits), b.o. = PA[1], g = PA[0]
[Figure: the tag side is unchanged (tag0 and tag1, 128 lines by 23 bits; v0 and v1, 1 bit; two 23-bit comparators producing hit0, hit1 and HIT). The data SRAMs data0 and data1 are 64 lines by 8 bytes each and are addressed by the 6-bit PA[8:3]. In each way a 4:1 mux controlled by {PA[2], b.o.} picks 2 bytes; a 2:1 mux controlled by hit0/hit1 drives the 16-bit DATA output.]
Can you play the same trick on the tag SRAMs?

32 Same cache parameters but each block frame is interleaved over 2 SRAM banks
Address fields: tag = PA[31:9], idx = PA[8:2] (7 bits), b.o. = PA[1], g = PA[0]
[Figure: the tag side is unchanged (tag0 and tag1, 128 lines by 23 bits; v0 and v1, 1 bit; two 23-bit comparators producing h0 and h1, which combine into HIT). The two data SRAM banks are 4 bytes wide and are each addressed by the 7-bit idx. The select signals for the 2:1 muxes are formed from products of h0, h1 and bo; the muxes drive the 16-bit DATA output.]
