EECS 470. Lecture 14 Advanced Caches. DEC Alpha. Fall Jon Beaumont

1 Lecture 14: Advanced Caches
Fall 2018, Jon Beaumont
[Figure: DEC Alpha, with the Instruction Cache, BIU, and Data Cache labeled]
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Mudge, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.

2 Administrative
- HW #4 due Friday (11/16) night
- Milestone II due Sunday (11/18) night
- Meetings next Monday

3 Basic Caches (Last Time)
- Memory is much slower than the processor
- However, data accesses are predictable
  - Specifically, many programs exhibit spatial and temporal locality
- When a piece of data is accessed:
  - Store it in the cache in case it is needed later (temporal locality)
  - Read the whole line/block around it into the cache as well (spatial locality)
- Cache considerations: cache size, block size, associativity

4 Today: Survey of other cache topics
- Classifying and reducing misses
- Software techniques
- Hardware hashing
- Victim cache
- Subblocking
- Multi-level caches
- Inclusivity
- Non-blocking caches

5 Write Policies (Cont.)
- Common misconception: that the two write-policy choices are tied together
- (Write-through vs. write-back) and (write-allocate vs. no-write-allocate) are two independent design decisions
- One does not preclude the other

6 Address Bits
- [Figure: address bit layout] Why this?
- [Figure: alternative address bit layout] Why not, for example, this?

7 Miss Classification (3+1 C's)
- Compulsory
  - cold miss on first access to a block
  - defined as: miss in an infinite cache
- Capacity
  - misses occur because the cache is not large enough
  - defined as: miss in a fully-associative cache (with optimal replacement policy)
- Conflict
  - misses occur because of a restrictive mapping strategy
  - only in set-associative or direct-mapped caches
  - defined as: not attributable to compulsory or capacity
- Coherence
  - misses occur because of sharing among multiprocessors (more in a couple of weeks)

8 Miss Classification: Poll Question
- Consider a 2-entry direct-mapped cache and the following memory access pattern: A, B, C, A
- Assume A and B map to index 0, and C maps to index 1
- Does the second reference to A hit or miss?
  a) Hit
  b) Compulsory Miss
  c) Capacity Miss
  d) Conflict Miss
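The definitions on the previous slide can be turned into a small classifier. The sketch below is illustrative only: `classify_misses` and its parameters are hypothetical names, and the fully-associative reference cache uses Belady's optimal replacement, which is one reading of the slide's "optimal" policy. With an LRU reference cache instead, the last access in this trace would be labeled a capacity miss, so the label depends on that assumption.

```python
from collections import OrderedDict

def classify_misses(trace, index_of, num_sets, ways=1):
    """Label each access 'hit', 'compulsory', 'capacity', or 'conflict'."""
    capacity = num_sets * ways
    real = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
    full = set()   # fully-associative reference cache of the same capacity
    seen = set()   # blocks touched so far (the "infinite cache")
    labels = []
    for t, block in enumerate(trace):
        # Reference cache: evict the block reused farthest in the future.
        fa_hit = block in full
        if not fa_hit:
            if len(full) >= capacity:
                future = trace[t + 1:]
                def next_use(b):
                    return future.index(b) if b in future else len(future) + 1
                full.remove(max(full, key=next_use))
            full.add(block)
        # Real cache: set chosen by index_of, LRU within the set.
        s = real[index_of(block)]
        if block in s:
            s.move_to_end(block)
            labels.append("hit")
        else:
            if len(s) >= ways:
                s.popitem(last=False)
            s[block] = True
            if block not in seen:
                labels.append("compulsory")
            elif not fa_hit:
                labels.append("capacity")
            else:
                labels.append("conflict")
        seen.add(block)
    return labels

# The poll's setup: A and B map to index 0, C maps to index 1.
poll_index = {"A": 0, "B": 0, "C": 1}
poll_labels = classify_misses(list("ABCA"), poll_index.__getitem__, num_sets=2)
```

Under these assumptions the three first touches are compulsory misses and the second reference to A is labeled a conflict miss.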

9 How to Reduce Conflict Misses?
- Why do conflict misses occur? (assuming no capacity concerns)
  - Uneven distribution of active block frames into sets
  - Active frames with the same index map to the same set
  [Figure: Active Block Frames vs. Index(Address)]
- What causes uneven distribution?
  - Probabilistic vs. programmatic (e.g. multi-dimensional array traversal)
- How do we redistribute blocks into block frames?

10 Software Approach: Restructuring Loops and Array Layout
- If column-major:
  - x[i+1,j] follows x[i,j] in memory
  - x[i,j+1] is a long way from x[i,j]
  - => conflict misses + loss of spatial locality
- Poor code:
    for i = 1, rows
      for j = 1, columns
        sum = sum + x[i,j]
- Better code:
    for j = 1, columns
      for i = 1, rows
        sum = sum + x[i,j]
- Optimizations: merging arrays, data layout, padding array dimensions
  - but there can be conflicts among different arrays
  - array sizes may be unknown at compile/programming time
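To see the effect of the two loop orders, here is a minimal sketch that replays both traversals through a toy direct-mapped cache. The cache geometry (8 lines of 4 words), the 32x32 array, and the names `count_misses` and `addr` are assumptions chosen for illustration, and indices start at 0 rather than 1.

```python
def count_misses(addresses, num_lines=8, line_words=4):
    """Count misses in a direct-mapped cache; addresses are word addresses."""
    tags = [None] * num_lines
    misses = 0
    for a in addresses:
        block = a // line_words          # which memory block the word is in
        idx = block % num_lines          # direct-mapped placement
        if tags[idx] != block:
            tags[idx] = block
            misses += 1
    return misses

ROWS, COLS = 32, 32

def addr(i, j):
    """Column-major layout: x[i+1, j] is the word right after x[i, j]."""
    return i + j * ROWS

# "Poor" order: i outer, j inner, so consecutive accesses are ROWS words apart.
poor = [addr(i, j) for i in range(ROWS) for j in range(COLS)]
# "Better" order: j outer, i inner, so consecutive accesses are adjacent words.
better = [addr(i, j) for j in range(COLS) for i in range(ROWS)]

print(count_misses(poor), count_misses(better))   # prints "1024 256"
```

With this geometry every access in the poor order maps to the same cache line as its predecessor but to a different block, so all 1024 accesses miss; the better order misses once per 4-word block.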

11 Exposing Caches to Software
- Controlling the size of the cache
  - If the program is known to have a small footprint, turn off half of the L1 or turn off the entire L2 to save power
- Controlling the associativity
  - Assign individual L1 banks to different software threads so they don't thrash each other
  - Designate a specific L1 bank for streaming references only so you don't displace the rest of the cache contents
- Fine-grain cache control instructions, e.g.
  - Lock a particular cache line from displacement
  - Prefetch an address up to just L2 but not L1
  - Start a new stream buffer prefetch

12 Software-Assisted Cache Hierarchies
- ISA provides separate storage for temporal vs. non-temporal data, each with its own multi-level hierarchy
- Load and store instructions can give hints about where cached copies should be held after a cache miss:
  - temporal
  - non-temporal-L1
  - non-temporal-L2
  - non-temporal-all
[Figure: temporal hierarchy L1, L2, L3 alongside non-temporal hierarchy NT L1, NT L2, NT L3, both backed by Main Memory]

13 Hardware Approach: Use Hash Functions
- Current mapping function only uses index bits
  - conflict happens for blocks with identical index bits
- Can we use other parts of the address (tag bits) to distinguish?
  - i.e. use more than n bits to index into a $2^n$-line SRAM
  - For $f_{index}(addr_0) = f_{index}(addr_1)$, can choose $f_{hash}$ so that $f_{hash}(addr_0) \neq f_{hash}(addr_1)$
- Pros:
  1. Can lead to better effective use of the cache
- Cons:
  1. What is a good hash function? Do we need program-specific hash functions?
  2. How expensive is it on the critical path?
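As a rough illustration of the idea, the sketch below compares plain index bits with one possible hash that XORs the low tag bits into the index. The particular hash, the 64-set size, and the function names are assumptions for this example, not a function given in the lecture.

```python
N_BITS = 6
SETS = 1 << N_BITS          # 2^n lines, direct-mapped

def f_index(block):
    """Conventional mapping: the low n bits of the block address."""
    return block & (SETS - 1)

def f_hash(block):
    """Assumed hash: XOR the low n tag bits into the index bits."""
    tag = block >> N_BITS
    return (block ^ tag) & (SETS - 1)

def misses(trace, index_fn):
    """Count misses; each line stores the full block address as its tag."""
    lines = [None] * SETS
    count = 0
    for block in trace:
        i = index_fn(block)
        if lines[i] != block:
            lines[i] = block
            count += 1
    return count

# Four blocks whose index bits are identical, touched repeatedly.
strided = [0, 64, 128, 192] * 10
print(misses(strided, f_index), misses(strided, f_hash))   # prints "40 4"
```

The hash spreads this stride across four sets, but it also creates new collisions: blocks 1 and 64 have different index bits yet both hash to set 1, which is the "what is a good hash function?" concern.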

14 Regular Set-Associative Cache
[Figure: address split into Tag, Index, and Block offset; the index selects the same set in Bank1 and Bank2 for all addresses with the same index]

15 Regular Set-Associative Cache
- If a 2-way set-associative cache has 2 sets and uses LRU replacement, what is the probability that the memory access stream A, B, C, A misses on the second A access?
  a) 0
  b) 1/8
  c) 1/4
  d) 1/2
  e) 3/4
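One way to check the poll is to enumerate every mapping of A, B, and C onto the two sets and simulate each case. This sketch assumes each block lands in either set independently with probability 1/2, which the slide does not state explicitly; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def second_a_misses(set_of):
    """Simulate A, B, C, A on a 2-set, 2-way LRU cache; True if the last A misses."""
    sets = {0: [], 1: []}            # each list is in LRU order, oldest first
    hit = False
    for blk in "ABCA":
        lru = sets[set_of[blk]]
        hit = blk in lru
        if hit:
            lru.remove(blk)          # refresh recency on a hit
        elif len(lru) == 2:
            lru.pop(0)               # evict the least recently used block
        lru.append(blk)
    return not hit

def miss_probability():
    """Fraction of the 8 equally likely mappings in which the second A misses."""
    outcomes = [second_a_misses(dict(zip("ABC", m)))
                for m in product((0, 1), repeat=3)]
    return Fraction(sum(outcomes), len(outcomes))

print(miss_probability())            # prints "1/4"
```

The second A misses only when all three blocks share a set, which happens in 2 of the 8 mappings.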

16 Seznec's Skewed-Associative Cache
- Use a different hash function for each way in a set-associative cache
- E.g. for a 4-way skewed-associative cache consider:
  - bank0: A1 xor A2
  - bank1: shuffle(A1) xor A2
  - bank2: shuffle(shuffle(A1)) xor A2
  - bank3: shuffle(shuffle(shuffle(A1))) xor A2
- Implementation only adds XORs to the cache access path
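The four bank functions can be sketched as follows. The slide does not define A1, A2, or shuffle, so this example assumes A1 is the low n bits of the block address, A2 is the next n bits, and shuffle is a one-bit left rotation of an n-bit field; all three are illustrative choices.

```python
N_BITS = 4
MASK = (1 << N_BITS) - 1

def shuffle(x):
    """Assumed bit permutation: rotate the n-bit field left by one."""
    return ((x << 1) | (x >> (N_BITS - 1))) & MASK

def bank_index(bank, block):
    """Set index used by `bank`: shuffle applied `bank` times to A1, then XOR A2."""
    a1 = block & MASK                  # assumed A1: low n bits
    a2 = (block >> N_BITS) & MASK      # assumed A2: next n bits
    for _ in range(bank):
        a1 = shuffle(a1)
    return a1 ^ a2

# Blocks 1 and 16 collide in bank0 but are spread apart in the other banks.
for block in (1, 16):
    print(block, [bank_index(b, block) for b in range(4)])
# prints "1 [1, 2, 4, 8]" then "16 [1, 1, 1, 1]"
```

Two blocks that conflict in one bank usually index different sets in the others, so a conflict in bank0 does not force an eviction in every way.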

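17 Seznec's Skewed-Associative Cache
[Figure: address split into tag, idx, and b.o.; function f0 indexes Bank0 and f1 indexes Bank1. Addresses with the same index fall in the same set in one bank and are redistributed to different sets in the other]
- Can we get better utilization with less associativity? Average case? Worst case?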

18 Pseudo-Associativity: e.g. MIPS R10K 2-way L2
- Classic associativity is too expensive for external SRAMs (chip count and routing)
- N-way associativity is a placement policy
  - it means an address can be mapped to N different locations in the cache
  - it doesn't mean all lookups need to be done in parallel
- Pseudo N-way associativity:
  - Given a direct-mapped array with K cache blocks
  - Implement K/N sets
  - Given address Addr, sequentially look up: {0, Addr[lg(K/N)-1:0]}, {1, Addr[lg(K/N)-1:0]}, ..., {N-1, Addr[lg(K/N)-1:0]}
- Optimization: look up the most recently used (MRU) way first
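A minimal sketch of the sequential lookup, including the MRU-first optimization, might look like this. The class name, the use of whole block addresses as tags, and the fill policy on a miss (overwrite the way probed last) are assumptions made to keep the example short.

```python
class PseudoAssocCache:
    """Pseudo N-way cache built on one direct-mapped array of K blocks."""

    def __init__(self, k_blocks, n_ways):
        self.sets = k_blocks // n_ways          # K/N sets
        self.n = n_ways
        self.array = [None] * k_blocks          # the direct-mapped array
        self.mru = [0] * self.sets              # most recently used way per set

    def lookup(self, block):
        """Return (hit, number_of_sequential_probes)."""
        s = block % self.sets                   # Addr[lg(K/N)-1:0]
        first = self.mru[s]
        order = [first] + [w for w in range(self.n) if w != first]
        for probes, way in enumerate(order, start=1):
            if self.array[way * self.sets + s] == block:   # entry {way, index}
                self.mru[s] = way
                return True, probes
        # Miss: assumed policy, fill the way that was probed last.
        victim = order[-1]
        self.array[victim * self.sets + s] = block
        self.mru[s] = victim
        return False, self.n

cache = PseudoAssocCache(k_blocks=8, n_ways=2)
results = [cache.lookup(b) for b in (0, 4, 0, 0)]
print(results)
# prints "[(False, 2), (False, 2), (True, 2), (True, 1)]"
```

Blocks 0 and 4 share a set but coexist in different ways; the first re-reference to block 0 needs two probes, and the next one hits on the first probe because the MRU way is tried first.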

19 Way-Predicting Set-Associative Cache
- Set-associative structure
- Predict the way (from PC, address, etc.)
- Look up that way only
- If miss, do a regular SA lookup: probe as a set-associative cache
- Hmm... still goes through the MUX. How does this help?
[Figure: tag and idx fields feed the decoders; ways 1 through n each have a tag-match comparator; the predicted way selects the data block; output is hit]

20 Norm Jouppi's Victim Cache
- Situation: the working set fits inside the cache, except for a few memory lines which map to the same cache block and keep knocking each other out
- Don't want to increase associativity, since most data accesses don't have conflict misses

21 Norm Jouppi's Victim Cache
[Figure: DM $ backed by a small fully-associative Victim $, which sits in front of the next level]
- Motivation: the working set fits inside the cache, except for a small number of memory lines which map to the same cache block and keep knocking each other out
- Lines evicted from the direct-mapped cache due to collisions are stored in the victim cache

22 Norm Jouppi's Victim Cache
- Targets conflict misses
- Victim cache: a small fully-associative cache capturing victims
  - victims are blocks evicted by conflicts in a set of a DM or low-associativity cache
  - LRU replacement
- A miss in the cache + a hit in the victim cache
  - move the line to the main cache
  - is effectively equal to fast miss handling (or a slow hit)
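The behavior described on the victim cache slides can be sketched with a direct-mapped array plus a small LRU-ordered victim store. The class name and sizes are hypothetical, and the model ignores timing; it only tracks where each block lives.

```python
from collections import OrderedDict

class VictimCachedDM:
    """Direct-mapped cache backed by a small fully-associative victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.lines = [None] * num_lines
        self.victim = OrderedDict()      # LRU order, oldest entry first
        self.capacity = victim_entries

    def access(self, block):
        idx = block % len(self.lines)
        resident = self.lines[idx]
        if resident == block:
            return "hit"
        if block in self.victim:
            del self.victim[block]       # line moves back to the main cache
            outcome = "victim hit"
        else:
            outcome = "miss"             # would be fetched from the next level
        self.lines[idx] = block
        if resident is not None:
            self.victim[resident] = True # the displaced line becomes a victim
            if len(self.victim) > self.capacity:
                self.victim.popitem(last=False)
        return outcome

# Blocks 0 and 8 map to the same line of an 8-line direct-mapped cache.
with_victim = VictimCachedDM(num_lines=8, victim_entries=1)
without = VictimCachedDM(num_lines=8, victim_entries=0)
print([with_victim.access(b) for b in (0, 8, 0, 8)])
print([without.access(b) for b in (0, 8, 0, 8)])
```

With a single victim entry the ping-ponging pair turns into victim hits after the two cold misses; with no victim entries every access is a miss.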

23 Victim Cache's Performance
- Removing conflict misses
  - even one entry helps some benchmarks
  - helps I-cache more than D-cache
- Compared to cache size
  - generally, victim cache helps more for smaller caches
- Compared to line size
  - helps more with larger line size (why?)

24 Store Buffers
- Write stores into a buffer when the cache is busy
  - (Like a store queue, but after retirement)
- Allows reads to proceed
- What happens when a dependent load/store is issued?

25 Writeback Buffers
- Sit between a write-back cache and the next level
  1. Move replaced, dirty blocks to the buffer
  2. Read the new line
  3. Move replaced data to memory
- Usually only need 1 or 2 write-back buffer entries

26 Large Blocks and Subblocking
- Large cache blocks can take a long time to refill
  - refill the cache line critical word first
  - restart the cache access before the refill completes
- Large cache blocks can waste bus bandwidth if the block size is larger than the spatial locality
  - divide a block into subblocks
  - associate a separate valid bit with each subblock
  - only load a subblock on access, but still have reduced tag overhead
[Figure: one tag followed by several (valid bit, subblock) pairs]

27 Multi-Level Caches
- Processors are getting faster w.r.t. main memory
  - larger caches to reduce the frequency of more costly misses
  - but larger caches are too slow for the processor
  - => gradually reduce miss cost with multiple levels
- $t_{avg} = t_{hit} + \text{miss ratio} \times t_{miss}$

28 Multi-Level Cache Design
[Figure: Proc connected to L1I and L1D, both backed by L2]
- Different technology and different requirements at each level => different choice of
  - capacity
  - block size
  - associativity
- $t_{avg\text{-}L1} = t_{hit\text{-}L1} + \text{miss-ratio}_{L1} \times t_{avg\text{-}L2}$
- $t_{avg\text{-}L2} = t_{hit\text{-}L2} + \text{miss-ratio}_{L2} \times t_{memory}$
- What is the L2 miss ratio?
  - global: L2 misses / L1 accesses
  - local: L2 misses / L1 misses
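The two-level formulas and the local vs. global miss ratios can be worked through with a few lines of arithmetic. All latencies and counts below are made-up numbers for illustration, and the function names are hypothetical. Note that the L2 miss ratio in the average-time formula is the local one, since it is applied per L2 access.

```python
def amat(t_hit_l1, miss_ratio_l1, t_hit_l2, local_miss_ratio_l2, t_memory):
    """Average access time of a two-level hierarchy, in cycles."""
    t_avg_l2 = t_hit_l2 + local_miss_ratio_l2 * t_memory
    return t_hit_l1 + miss_ratio_l1 * t_avg_l2

def l2_miss_ratios(l1_accesses, l1_misses, l2_misses):
    """Return (global, local) L2 miss ratios from raw event counts."""
    return l2_misses / l1_accesses, l2_misses / l1_misses

# Assumed counts: 1000 L1 accesses, 50 L1 misses, 10 L2 misses.
global_mr, local_mr = l2_miss_ratios(1000, 50, 10)

# Assumed latencies: 1-cycle L1 hit, 10-cycle L2 hit, 100-cycle memory.
cycles = amat(t_hit_l1=1, miss_ratio_l1=50 / 1000,
              t_hit_l2=10, local_miss_ratio_l2=local_mr, t_memory=100)
print(global_mr, local_mr, cycles)
```

With these numbers the L2 looks poor by its local ratio (0.2) but good by its global ratio (0.01), and the average access time comes to about 2.5 cycles: 1 + 0.05 x (10 + 0.2 x 100).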

29 The Inclusion Property
- Inclusion means L2 is a superset of L1 (ditto for L3)
- Why?
  - if an address is in L1, then it must be frequently used
  - makes L1 writeback simpler
  - L2 can handle external coherence checks without L1
- Inclusion takes effort to maintain
  - L2 must track what is cached in L1
  - On L2 replacement, must flush the corresponding blocks from L1
- How can an L2 replacement hit a block that is still in L1? Consider:
  1. L1 block size < L2 block size
  2. different associativity in L1
  3. L1 filters the L2 access sequence, which affects the LRU replacement order

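30 Possible Inclusion Violation
- Setup: 2-way set-associative L1, direct-mapped L2
  - a, b, c have the same L1 idx bits
  - b, c have the same L2 idx bits
  - a, {b, c} have different L2 idx bits
- Step 1: L1 miss on c
- Step 2: a displaced to L2
- Step 3: b replaced by c
- Result: c takes b's place in L2 while b is still in L1, so L2 is no longer a superset of L1
[Figure: contents of the L1 set and the L2 lines for blocks a, b, and c]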

31 Non-blocking Caches
- Also known as lock-up free caches
- Instead of stalling pending accesses to the cache on a miss, keep track of misses in special registers and keep handling new requests
- Key implementation problems
  - handle reads to a pending miss
  - handle writes to a pending miss
  - keep multiple requests straight
- Example (non-blocking $ response to a memory access stream):
  - ld A: hit
  - ld B: miss
  - ld C: miss
  - ld D: hit
  - st B: miss (pend.)
  - Miss Status Holding Registers: B, C
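The example stream can be replayed against a toy model of Miss Status Holding Registers. This is only a sketch: the class, the two-entry MSHR limit, the "stall" outcome when the MSHRs are full, and the `fill` method are assumptions, and the model merges a second request to a pending block into the existing entry.

```python
class NonBlockingCache:
    """Toy non-blocking cache: misses are tracked in MSHRs instead of stalling."""

    def __init__(self, resident, num_mshrs=2):
        self.resident = set(resident)    # blocks currently in the cache
        self.mshrs = {}                  # pending block -> ops waiting on it
        self.num_mshrs = num_mshrs

    def access(self, op, block):
        if block in self.resident:
            return "hit"
        if block in self.mshrs:          # read or write to a pending miss
            self.mshrs[block].append(op)
            return "miss (pend.)"
        if len(self.mshrs) == self.num_mshrs:
            return "stall"               # assumed behavior when MSHRs run out
        self.mshrs[block] = [op]         # allocate a new MSHR
        return "miss"

    def fill(self, block):
        """The miss data returns: install the block and release waiting ops."""
        ops = self.mshrs.pop(block)
        self.resident.add(block)
        return ops

cache = NonBlockingCache(resident={"A", "D"})
stream = [("ld", "A"), ("ld", "B"), ("ld", "C"), ("ld", "D"), ("st", "B")]
outcomes = [cache.access(op, blk) for op, blk in stream]
print(outcomes)
# prints "['hit', 'miss', 'miss', 'hit', 'miss (pend.)']"
```

After the stream the MSHRs hold B and C; calling `cache.fill("B")` returns `['ld', 'st']`, the two requests that were waiting on that block, in arrival order.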
