EECS 470 Lecture 13. Basic Caches. Fall 2018 Jon Beaumont


1 Basic Caches Fall 2018 Jon Beaumont Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin. Slide 1

3 Announcements. Project memory discussion this week during normal lab sections. HW #4 due next Friday (11/16). Milestone II due next Sunday (11/18): just a report, no graded submission. Slide 3

4 Readings For Today: H&P 2.1 For Wednesday: N. Jouppi. Improving direct-mapped cache performance Slide 4

5 Memory Systems: Basic Caches Slide 5

6 Memory Systems. Basic caches (start today): introduction; fundamental questions; cache size, block size, associativity. Advanced caches. Main memory. Virtual memory. Slide 6

7 Where does this fit in? Most speedup techniques can be categorized as one of the following. Reducing the amount of work: mostly software / compiler optimizations. Reducing the latency of operations: circuit and logic design, μarchitecture optimizations. Doing more useful things simultaneously (i.e. parallelism): pipelining, OoO, speculation, etc. Most of the material so far in 470 has focused on increasing parallelism. Today we look at reducing the latency of the average memory operation. Slide 7

8 Roadmap [Diagram; node labels: Speedup Programs; Reduce number of instructions; Reduce Instruction Latency; Parallelize; Reduce average memory latency; Instruction Level Parallelism; Caching. Annotations: First 2 months; Lecture 11.] Slide 8

9 Motivation [Plot: performance of Processor vs. Memory] Want memory to appear as fast as the CPU and as large as required by all of the running applications. Slide 9

10 Motivation. This would be a problem, except... why? Memory accesses are NOT random. Specifically, most programs exhibit spatial locality (e.g. indexing through an array) and temporal locality (e.g. reading a variable shortly after referencing it). Idea: include smaller (and faster) memory structures to hold the data more likely to be accessed. Slide 10

11 Memory Hierarchy. Make the common case fast. Common: temporal & spatial locality. Fast: smaller, more expensive memory. [Figure: hierarchy with directions labeled Larger and Faster] Slide 11

12 Storage Hierarchies. Storages are layered into hierarchies in order of: increasing latency ($t_i < t_{i+1}$); increasing size ($s_i < s_{i+1}$); decreasing unit cost ($c_i > c_{i+1}$); decreasing bandwidth ($b_i > b_{i+1}$); increasing transfer unit ($x_i < x_{i+1}$). Level 0: Registers. Level 1: (n levels of) Caches. Level 2: Main Memory (Primary Storage). Level 3: Disks (Secondary Storage). Level 4: Tape Backup (Tertiary Storage). Level 2.5: Flash? [Figure annotations: ISA feature; Memory Abstractions] Slide 12

13 Processor/Memory Boundaries [Diagram: Processor (I-Unit, E-Unit, Regs); L1 I-Cache with I-TLB; L1 D-Cache with D-TLB; L2 Cache (SRAM, on-chip); L3 Cache (SRAM, off-chip); Main Memory (DRAM)] Slide 13

14 Locality Example - Poll Question How do the code samples rank w.r.t. temporal locality? Spatial? Slide 14

15 Caches: an automatically managed hierarchy. "A hiding place, esp. of goods, treasure, etc." (OED). Keep recently accessed blocks: temporal locality. Break memory into blocks (several bytes) and transfer data to/from the cache in blocks: spatial locality. A lot of architectures opt for software-managed scratch-pad memory instead, e.g. Cray-1, embedded processors. Why? [Figure: CPU, $ (cache), Memory] Slide 15

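16 Cache (Abstractly). Keep each recently accessed block in a block frame. A block frame holds: state (e.g., valid), address tag, data. The state and address tag are bookkeeping overhead; storing multiple bytes per block frame amortizes that overhead. [Figure: frame fields labeled address, state, data] Slide 16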

17 Cache (Abstractly). On memory read: if the incoming address corresponds to one of the stored address tags, then HIT: return data; else MISS: choose & displace a current block in use, fetch the new (referenced) block from memory into the frame, return data. Fundamental questions: Where and how to look for a block? (Block placement) Which block is replaced on a miss? (Block replacement) What happens on a write? (Write strategy, later) What is kept? (Bookkeeping, data) Slide 17
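
The read procedure on this slide can be sketched in a few lines of Python. This is only an illustrative model, not anything from the lecture: the class name `TinyCache`, its parameters, and the choice to displace the oldest-inserted block are all assumptions made for the example, and the cache is treated as fully associative so that placement does not get in the way.

```python
class TinyCache:
    """Abstract cache: a fixed number of block frames, each holding (tag, data).

    Hypothetical model for illustration; names and policy are assumptions.
    """

    def __init__(self, num_frames, block_size, memory):
        self.num_frames = num_frames
        self.block_size = block_size
        self.memory = memory      # backing store, e.g. a bytes object
        self.frames = {}          # address tag -> block data (valid frames only)
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        # The tag identifies the block; the offset selects a byte inside it.
        tag, offset = divmod(addr, self.block_size)
        if tag in self.frames:
            self.hits += 1        # HIT: data is already in a frame
        else:
            self.misses += 1      # MISS: displace a block if the cache is full
            if len(self.frames) == self.num_frames:
                victim = next(iter(self.frames))   # oldest-inserted block (assumed policy)
                del self.frames[victim]
            start = tag * self.block_size
            self.frames[tag] = self.memory[start:start + self.block_size]
        return self.frames[tag][offset]
```

Reading two addresses in the same block (say 5 and 6 with 4-byte blocks) produces one miss followed by one hit, which is the spatial-locality benefit of transferring whole blocks.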

18 Terminology. block (cache line): minimum unit that may be present. hit: block is found in the cache. miss: block is not found in the cache. miss ratio: fraction of references that miss. hit time: time to access the cache. miss penalty: time to replace a block in the cache + deliver to upper level. access time: time to get the first word. transfer time: time for the remaining words. Slide 18

19 Cache Performance Poll Question Given a cache with a particular: Hit time (HT or AT) Miss ratio (MR) Miss penalty (MP) What is the mean access time for the cache? Slide 19

21 Cache Performance. Assume: cache access time is 1 cycle; cache miss ratio is 0.01; cache miss penalty is 20 cycles. Mean access time = ? Typically level-1 is 16K-512K, level-2 is 512K-16M, memory is 128M-16G. Level-1 is as fast as the processor (increasingly 2 cycles). Level-1 is 1/10000 of the capacity but contains 98% of references. Slide 21

22 Cache Performance. Assume: cache access time is 1 cycle; cache miss ratio is 0.01; cache miss penalty is 20 cycles. Mean access time = cache access time + miss ratio × miss penalty = 1 + 0.01 × 20 = 1.2 cycles. Typically level-1 is 16K-512K, level-2 is 512K-16M, memory is 128M-16G. Level-1 is as fast as the processor (increasingly 2 cycles). Level-1 is 1/10000 of the capacity but contains 98% of references. Slide 22
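
As a quick check of the arithmetic on this slide, the mean access time formula can be written as a one-line function. The function name `mean_access_time` is just a label chosen for this sketch.

```python
def mean_access_time(hit_time, miss_ratio, miss_penalty):
    """Mean access time = hit time + miss ratio * miss penalty (all times in cycles)."""
    return hit_time + miss_ratio * miss_penalty


# The slide's numbers: 1-cycle access, 1% miss ratio, 20-cycle miss penalty.
print(mean_access_time(1, 0.01, 20))
```

With the slide's numbers this comes to about 1.2 cycles. Even a 1% miss ratio adds 20% to the average access time, which is why the miss penalty matters so much.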

23 Fundamental Cache Parameters that affect miss rate: Cache size (C), Block size (b), Cache associativity (a). Slide 23

24 Cache Size. Cache size is the total data (not including tag) capacity. Bigger can exploit temporal locality better, but is not ALWAYS better. Too large a cache: smaller is faster => bigger is slower; access time may degrade the critical path. Too small a cache: doesn't exploit temporal locality well; useful data constantly replaced. [Plot: hit rate vs. C, holding b and a constant; annotation: working set size] Slide 24

25 Block Size. Block size is the data that is associated with an address tag; not necessarily the unit of transfer between hierarchies (sub-blocking). Too small blocks: don't exploit spatial locality well; have inordinate tag overhead. Too large blocks: useless data transferred; useful data permanently replaced; too few total # blocks. [Plot vs. b, holding C and a constant] Slide 25

26 Associativity. Where does block 12 (b'1100) go? Fully-associative: a block goes in any frame (think: all frames in 1 set). Set-associative: a block goes in any frame in exactly one set (frames grouped into sets). Direct-mapped: a block goes in exactly one frame (think: 1 frame per set). [Figure: three frame arrays with columns labeled Block, Set/Block, Set] Slide 26
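
The three placement rules differ only in how many frames share a set. The sketch below assumes an 8-frame cache and the usual modulo mapping of block number to set. Neither is stated in the transcript, and `candidate_frames` is a hypothetical helper name.

```python
def candidate_frames(block_number, num_frames, ways):
    """Return the frame indices where a block may be placed.

    Assumes set = block_number mod num_sets, and that set s owns the
    contiguous frames [s * ways, (s + 1) * ways).
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets
    return list(range(set_index * ways, (set_index + 1) * ways))


# Block 12 in an assumed 8-frame cache:
print(candidate_frames(12, 8, ways=1))   # direct-mapped: 8 sets of 1 frame
print(candidate_frames(12, 8, ways=2))   # 2-way set-associative: 4 sets of 2 frames
print(candidate_frames(12, 8, ways=8))   # fully associative: 1 set of 8 frames
```

Under these assumptions, block 12 maps to a single frame when direct-mapped (12 mod 8 = 4), to either frame of set 0 when 2-way set-associative (12 mod 4 = 0), and to any of the 8 frames when fully associative.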

27 Impact of Associativity. Typical values for associativity: 1, 2-, 4-, 8-way associative. 5-way associative 20 KByte on SuperSPARC. Why? Larger associativity: lower miss rate, less variation among programs; only important for small C/b. Smaller associativity: lower cost, faster hit time. [Plot: hit rate vs. a, holding C and b constant; annotation: ~5] Slide 27

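28 Direct Mapped Caches [Figure: the address is split into tag, idx (block index), and b.o. (block offset); the index drives decoders into the tag and data arrays; a comparator (=) checks the stored tag against the address tag (tag match = hit?); a multiplexor selects the data.] Don't forget to check the valid/state bits. Slide 28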

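29 Fully Associative Cache [Figure: the address is split into tag and blk.offset; the tag is compared (=) against every stored tag in parallel (associative search); a multiplexor selects the data.] Slide 29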

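30 N-Way Set Associative Cache [Figure: the address is split into tag, idx, and b.o.; decoders index each way (bank); a set spans the ways; each way has a tag-match comparator (=); a multiplexor selects the data.] Cache Size = $N \times 2^{B+b}$ Slide 30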

31 Mark Hill's DM vs. SA: Bigger & Dumber is Better. $t_{avg} = t_{hit} + \text{miss ratio} \times t_{miss}$. Compare DM and SA caches with the same $t_{miss}$. The associativity that minimizes $t_{avg}$ is often smaller than the associativity that minimizes miss ratio. Remember: diff($t_{cache}$) = $t_{cache}$(SA) - $t_{cache}$(DM) ≥ 0 (SA needs a slower clock); diff(miss) = miss(SA) - miss(DM) ≤ 0 (DM misses more). E.g., if diff($t_{cache}$) = 0 => SA better, but assuming diff(miss) = -1% and $t_{miss}$ = 20 cycles, if diff($t_{cache}$) > 0.2 cycle then SA loses. Slide 31
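
The break-even point in this comparison follows directly from the t_avg formula: with the same t_miss for both caches, t_avg(SA) - t_avg(DM) = diff(t_cache) + diff(miss) × t_miss. A small sketch, with function names chosen only for this example:

```python
def t_avg(t_hit, miss_ratio, t_miss):
    """Average access time in cycles."""
    return t_hit + miss_ratio * t_miss


def sa_wins(diff_t_cache, diff_miss, t_miss):
    """True when the set-associative cache has the lower t_avg.

    diff_t_cache = t_cache(SA) - t_cache(DM), expected >= 0
    diff_miss    = miss(SA) - miss(DM),       expected <= 0
    Assumes both caches see the same t_miss.
    """
    return diff_t_cache + diff_miss * t_miss < 0


# Slide's example: SA removes 1% of misses, each miss costs 20 cycles.
print(sa_wins(0.0, -0.01, 20))   # no hit-time cost
print(sa_wins(0.1, -0.01, 20))   # small hit-time cost
print(sa_wins(0.3, -0.01, 20))   # hit-time cost above the 0.2-cycle break-even
```

The first two cases favor SA. The third favors DM, because the extra 0.3 cycle on every access outweighs the 0.2 cycle saved on average by missing less.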

32 Associative Block Replacement. Which block in a set to replace on a miss? Ideally: Belady's algorithm, replace the block that will be accessed furthest in the future. How do you implement it? Approximations: Least recently used (LRU): optimized for (assumed) temporal locality; expensive for more than 2-way. Not most recently used (NMRU): track the MRU block and select randomly from the others; a good compromise. Random: nearly as good as LRU, simpler (usually pseudo-random). How much can block replacement policy matter? Slide 32
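
LRU replacement within a single set can be modeled in software with an ordered dictionary. This is a behavioral sketch only: the class name `LRUSet` is an assumption, and it says nothing about how hardware tracks recency (the cost the slide calls expensive beyond 2-way).

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (software model, not a hardware design)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is filled on a miss)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used tag
        self.blocks[tag] = None
        return False
```

For a 2-way set, the access sequence 1, 2, 1, 3 evicts tag 2 rather than tag 1, because tag 1 was touched more recently.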

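33 Example: a=2, C=1KB, b=4B, word-size=2B. Basic Solution. Address fields: tag = PA[31:9] (23 bits), idx = PA[8:2] (7 bits), b.o. = PA[1:0]. [Figure: two ways, each with a 128-line × 23-bit tag array (tag0, tag1), a 128-line × 1-bit valid array (v0, v1), and a 128-line × 4-byte data array (data 0, data 1), all indexed by the 7-bit idx; per-way comparators (=) produce hit0 and hit1, which form HIT; 2-to-1 muxes controlled by b.o. and hit0/hit1 select the 16-bit DATA.] Slide 33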
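
The field widths in this example can be derived from the cache parameters. The sketch below assumes power-of-two sizes and a 32-bit physical address, as in the slide. The helper names `address_fields` and `split_address` are made up for the illustration.

```python
def address_fields(capacity, block_size, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for power-of-two parameters."""
    num_sets = capacity // (block_size * ways)
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split a physical address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Slide's parameters: C = 1 KB, b = 4 bytes, a = 2.
tag_bits, index_bits, offset_bits = address_fields(1024, 4, 2)
print(tag_bits, index_bits, offset_bits)
print(split_address(0x1234, index_bits, offset_bits))
```

For C = 1 KB, b = 4 B, a = 2 this gives 128 sets, so 7 index bits, 2 offset bits, and 23 tag bits, matching PA[31:9], PA[8:2], and PA[1:0] on the slide.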

34 Write Policies. Writes are more interesting: on reads, data can be accessed in parallel with the tag compare; on writes, two steps are needed. Is turn-around time important for writes? Cache optimizations often defer writes for reads. Choices of write policies. On write hits, update memory? Yes: write-through (+ no coherence issue, + immediate observability, - more bandwidth). No: write-back. On write misses, allocate a cache block frame? Yes: write-allocate. No: no-write-allocate. Slide 34

35 Write Policies (Cont.). Write-through: update memory on each write; keeps memory up-to-date; traffic/reference = $f_{writes}$, e.g independent of cache performance (miss rate). Write-back: update memory only on block replacement; many cache lines are only read and never written to; add a dirty bit to the status word, cleared initially after replacement and set when the block frame is written to; only write back a dirty block, and drop clean blocks w/o memory update; traffic/reference = $f_{dirty} \times \text{miss} \times B$, e.g., traffic/reference = 1/2 × 0.05 × 4 = 0.1. Slide 35
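
The two traffic formulas can be compared numerically. In this sketch B is assumed to be the block size in transfer units (e.g. words per block), which the transcript does not define. The function names and the 0.15 write fraction in the usage lines are illustrative assumptions, not values from the lecture.

```python
def write_through_traffic(f_writes):
    """Memory writes per reference: every write goes to memory."""
    return f_writes


def write_back_traffic(f_dirty, miss_ratio, block_units):
    """Memory traffic per reference: only dirty blocks are written back on replacement.

    block_units is B in the slide's formula (assumed: transfer units per block).
    """
    return f_dirty * miss_ratio * block_units


# Slide's write-back example: half of replaced blocks dirty, 5% miss ratio, B = 4.
print(write_back_traffic(0.5, 0.05, 4))
# Hypothetical write-through case: 15% of references are writes (assumed value).
print(write_through_traffic(0.15))
```

The write-back example evaluates to about 0.1 units of traffic per reference. Write-back traffic falls as the miss ratio improves, whereas write-through traffic depends only on the fraction of writes.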
