12 Cache-Organization 1



2 Caches: Memory: 64M, 500 cycles. L1 cache: 64K, 1 cycle, 1-5% misses. L2 cache: 4M, 10 cycles, 10-20% misses. L3 cache: 16M, 20 cycles. Memory: 256MB, 500 cycles.

3 Improving Miss Penalty: When caches first became popular, the miss penalty was ~10 processor clock cycles. Today: a 500 MHz (!) processor (2 ns per clock cycle) and 200 ns to go to DRAM => 100 processor clock cycles! [Diagram: Proc, L1, L2, DRAM] Solution: another cache between memory and the processor cache: a second-level (L2) cache.

4 Cache Organization

5 Analyzing Multi-level Cache Hierarchy
[Diagram: Proc, L1, L2, DRAM, annotated with L1 hit time, L1 miss rate, L1 miss penalty, L2 hit time, L2 miss rate, L2 miss penalty]
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * L1 Miss Penalty

6 Example: without L2-Cache. Assume L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L1 Miss Penalty = 100 cycles. Avg mem access time = 1 + 0.05 * 100 = 6 cycles.

7 Example: with L2-Cache. Assume L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L2 Hit Time = 5 cycles, L2 Miss Rate = 15% (% of L1 misses that also miss in L2), L2 Miss Penalty = 100 cycles. L1 miss penalty = 5 + 0.15 * 100 = 20 cycles. Avg mem access time = 1 + 0.05 * 20 = 2 cycles. 3x faster with L2 cache.
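
The two examples can be checked with a short calculation. This is a minimal sketch of the formulas from slide 5; the function names are made up for illustration.

```python
def l1_miss_penalty(l2_hit_time, l2_miss_rate, l2_miss_penalty):
    """L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty."""
    return l2_hit_time + l2_miss_rate * l2_miss_penalty


def avg_mem_access_time(l1_hit_time, l1_miss_rate, l1_miss_penalty):
    """Avg mem access time = L1 hit time + L1 miss rate * L1 miss penalty."""
    return l1_hit_time + l1_miss_rate * l1_miss_penalty


# Slide 6: no L2, every L1 miss goes to DRAM (100 cycles).
without_l2 = avg_mem_access_time(1, 0.05, 100)

# Slide 7: an L1 miss first tries L2 (5 cycles, 15% local miss rate).
penalty = l1_miss_penalty(5, 0.15, 100)
with_l2 = avg_mem_access_time(1, 0.05, penalty)

print(without_l2, with_l2, without_l2 / with_l2)
```

The printed values should be roughly 6 cycles, 2 cycles, and a ratio of 3, matching the slides.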

8 Cache Organisation

9 Accessing Memory vs Cache: The cache sits in front of memory and holds recent address/value bindings. Read: give it an address; if the address/value is in the cache, get back the value; if not, go get the value from memory. Write: give it an address and data; if the address is in the cache, bind it to the new value, and update memory. [Diagram: Proc, cache holding M[a] and M[b], memory]

10 Questions: How do you tell if the value associated with an address is IN a cache? => Record the address with the value, and compare the desired address with those IN the cache. How many places do you have to look? [Diagram: Address, Value, Tag Store SROM]

11 Address Map: Divide the memory address into 3 portions: tag, index, and byte offset within the block. Which memory block is in the cache? tttttttttttttttt iiiiiiiiii oooo The index tells where in the cache to look, the offset tells which byte in the block is the start of the desired data, and the tag tells whether the data in the cache corresponds to the memory address being looked for.
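
The address split can be sketched with shifts and masks. The field widths below (4 offset bits, 10 index bits) are assumptions read off the "iiiiiiiiii oooo" pattern on the slide; `split_address` is a hypothetical helper name.

```python
OFFSET_BITS = 4   # assumed: 16-byte blocks ("oooo")
INDEX_BITS = 10   # assumed: 1024 cache lines ("iiiiiiiiii")


def split_address(addr):
    """Return (tag, index, offset) for a memory address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # lowest bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # middle bits
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # remaining high bits
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))
```

Two addresses with the same index but different tags compete for the same cache line, which is why the tag must be stored next to the data.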

12 Direct Mapped Cache: How do you know if something is in the cache? How do you find it if it is in the cache? In a direct mapped cache, each memory address is associated with one possible block (also called a line) within the cache. Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.

13 Simplest Cache: Direct Mapped. [Diagram: memory addresses (up to F) mapped by cache index onto a small direct mapped cache] Cache location 0 can be occupied by data from memory locations 0, 4, 8, ... In general: any memory location whose address has 0s in its 2 rightmost bits. Address & 0x3 = Cache index.

14 One Reason for Misses, and Solutions: Conflict misses are misses caused by different memory locations that map to the same cache index being accessed almost simultaneously. Solution 1: make the cache bigger. Solution 2: multiple cache entries for the same cache index.
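
A small simulation makes conflict misses concrete. This is a sketch under simplifying assumptions: a cache with one-byte blocks and 4 lines, as in the "Address & 0x3" example, where each line simply remembers the address it holds. `count_misses` is a hypothetical name.

```python
NUM_LINES = 4  # index = address & 0x3, as on slide 13


def count_misses(addresses, num_lines=NUM_LINES):
    """Count misses in a direct mapped cache with one-byte blocks."""
    lines = [None] * num_lines      # None marks an invalid (empty) line
    misses = 0
    for addr in addresses:
        index = addr % num_lines    # equals addr & 0x3 when num_lines == 4
        if lines[index] != addr:    # empty line or a different address: miss
            misses += 1
            lines[index] = addr     # replace whatever was there
    return misses


print(count_misses([0, 4, 0, 4, 0, 4]))               # 0 and 4 share index 0
print(count_misses([0, 1, 0, 1, 0, 1]))               # different indices
print(count_misses([0, 4, 0, 4, 0, 4], num_lines=8))  # Solution 1: bigger cache
```

Addresses 0 and 4 keep evicting each other in the 4-line cache, so every access misses; with 8 lines they no longer collide and only the first two accesses miss.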

15 Extreme Example: Fully Associative Cache (e.g., 32 B block). Forget about the cache index; compare all cache tags in parallel. Address layout: Cache Tag (27 bits long) in bits 31 to 5, Byte Offset in bits 4 to 0. [Diagram: each entry holds Valid, Cache Tag, and Cache Data (B 31 ... B 1, B 0), with one comparator (=) per entry]

16 Compromise: N-way Set Associative Cache. N-way set associative: N entries for each cache index; N direct mapped caches operate in parallel, and the one that gets a hit is selected. Example: 2-way set associative cache. The cache index selects a set from the cache; the 2 tags in the set are compared in parallel; data is selected based on the tag result (whichever tag matched the address).
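
The lookup can be sketched in software as follows. The sizes (4 sets, 2 ways, 16-byte blocks) and the names `split` and `lookup` are assumptions for illustration; real hardware compares the tags of a set in parallel, whereas this loop checks them one after another.

```python
NUM_SETS = 4     # assumed
WAYS = 2         # 2-way set associative
BLOCK_SIZE = 16  # assumed block size in bytes


def split(addr):
    """Return (tag, index) for an address; the byte offset is dropped."""
    block = addr // BLOCK_SIZE
    return block // NUM_SETS, block % NUM_SETS


def lookup(cache, addr):
    """Return the way that hits, or None on a miss."""
    tag, index = split(addr)
    for way, stored_tag in enumerate(cache[index]):  # hardware does this in parallel
        if stored_tag == tag:                        # None (invalid) never matches
            return way
    return None


# cache[index][way] holds the stored tag, or None if the entry is invalid.
cache = [[None] * WAYS for _ in range(NUM_SETS)]
cache[0][0] = split(0x000)[0]   # tag 0 in set 0, way 0
cache[0][1] = split(0x040)[0]   # tag 1 in set 0, way 1

print(lookup(cache, 0x000), lookup(cache, 0x040), lookup(cache, 0x080))
```

Addresses 0x000 and 0x040 map to the same set but can both stay resident, which a direct mapped cache of the same size could not do.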

17 2-way Set Associative Cache [Diagram: the cache index selects one set; each of the two ways holds Valid, Cache Tag, and Cache Data (Block 0, ...); each way's tag is compared with the address tag; the compare results are ORed to produce Hit and select the cache block through a multiplexor (mux)]

18 4-way set associative cache [Diagram: the address supplies a 22-bit tag and an 8-bit index; each of the four ways holds V, Tag, and Data; a 4-to-1 multiplexor selects the data; outputs are Hit and Data]

19 ARM710T Cache Organisation [Diagram: the virtual address feeds four 128-entry tag RAMs, each with a comparator (=?); the encoded hit result and address bits select a word in the 2048 x 32-bit data RAM; outputs are hit and data]

20 Decreasing miss ratio with associativity (parallelism) [Diagram: the same cache organized as one-way set associative (direct mapped), two-way set associative, four-way set associative, and eight-way set associative (fully associative); each entry holds Tag and Data]

21 3Cs Absolute Miss Rate (SPEC92) [Plot: miss rate vs. cache size (KB) for 1-way, 2-way, 4-way, and 8-way associativity, split into conflict, capacity, and compulsory misses; compulsory misses are vanishingly small]

22 Ways to reduce Cache miss rate: (a) Larger cache: limited by cost and technology; hit time of first-level cache <= cycle time. (b) More places in the cache to put each block of memory (associativity). Fully associative: any block can go in any line. k-way set associative: k places for each block. Direct mapped: k = 1.

23 Performance [Plot: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes of 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, and 128 KB]

24 What to do on a write hit? Write-Through. Make sure a processor read gets the last value written. Simplest approach: write-through. Write hit => update the word in the cache and in memory. Write miss => just update memory; a later read will bring it into the cache.

25 What to do on a write hit? Write-Through. Write-through: update the word in the cache block and in main memory => each write adds memory cycles. Performance trade-offs? Write-through buffer.

26 What to do on a write hit? Write-back. Write-back: update the word in the cache block and allow the memory word to be stale. Add a dirty bit to each line, indicating that memory needs to be updated when the block is replaced. The OS flushes the cache before I/O!!! Performance trade-offs?
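
The trade-off between the two write-hit policies can be sketched by counting memory updates. This is a toy model of a single cached word, not a full cache; the class `CachedWord` and the helper `run` are hypothetical names.

```python
class CachedWord:
    """One cached word; `write_back` selects the write-hit policy."""

    def __init__(self, memory, addr, write_back):
        self.memory = memory
        self.addr = addr
        self.write_back = write_back
        self.value = memory[addr]    # fill the line from memory
        self.dirty = False
        self.memory_writes = 0

    def write(self, value):
        self.value = value
        if self.write_back:
            self.dirty = True        # memory is now stale
        else:
            self._update_memory()    # write-through: memory updated on every write

    def evict(self):
        if self.dirty:               # write-back: update memory only on replacement
            self._update_memory()
            self.dirty = False

    def _update_memory(self):
        self.memory[self.addr] = self.value
        self.memory_writes += 1


def run(write_back, n_writes=10):
    """Return (memory writes, was memory stale before eviction, final memory value)."""
    memory = {0x100: 0}
    line = CachedWord(memory, 0x100, write_back)
    for i in range(1, n_writes + 1):
        line.write(i)
    stale = memory[0x100] != line.value
    line.evict()
    return line.memory_writes, stale, memory[0x100]


print(run(write_back=False))
print(run(write_back=True))
```

In this model write-through pays one memory update per write but memory is never stale, while write-back pays a single update at eviction and leaves memory stale in between, which is why the cache must be flushed before I/O.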

27 What to do on a Write Miss? Option 1: just like a read, bring the whole block into the cache and then modify the bytes needed (Write Allocate). Option 2: only update slower memory, nothing in the cache (No Write Allocate).

28 Block Replacement Policy: N-way set associative and fully associative caches have a choice of where to place a block, that is, which block to replace. Of course, if there is an invalid block, replace it. Whenever there is a cache hit, record the cache block that was touched.

29 Replacement Policy: When a block can go into more than one place, how do you choose which? Random? FIFO? Extra hardware: counter per set.

30 Block Replacement Policy: When you need to evict a cache block, choose one that hasn't been touched recently: Least Recently Used (LRU). History suggests it is the least likely of the choices to be used soon. Flip side of temporal locality.
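
LRU within one set can be sketched with Python's `collections.OrderedDict`, which keeps entries in insertion order and can move a touched entry to the end. The class name `LRUSet` is made up for illustration; hardware typically uses a few status bits per set rather than an ordered list.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement; tags are ordered oldest first."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then brought in)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # record that this block was touched
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used block
        self.blocks[tag] = None
        return False


s = LRUSet(ways=2)
results = [s.access(t) for t in ["A", "B", "A", "C", "A", "B"]]
print(results)
```

When C arrives the set is full; B was touched less recently than A, so B is evicted, A still hits afterwards, and the final access to B misses.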

31 Block Replacement Policy: Random. Sometimes it is hard to keep track of LRU. Second-choice policy: pick one at random and replace that block. Advantages: very simple to implement; predictable behavior; no worst-case behavior.

32 Conclusion: Tag, index, offset: find matching data, support larger blocks, reduce misses. Where in the cache? Direct mapped cache: conflict misses if memory addresses compete. Fully associative: lets memory data be in any block, so no conflict misses. Set associative: a compromise, with simpler hardware than fully associative and fewer misses than direct mapped. LRU: use history to predict replacement.


33 Examples
Characteristic        Intel Pentium Pro                    PowerPC 604
Cache organization    Split instruction and data caches    Split instruction and data caches
Cache size            8 KB each for instructions/data      16 KB each for instructions/data
Cache associativity   Four-way set associative             Four-way set associative
Replacement           Approximated LRU replacement         LRU replacement
Block size            32 bytes                             32 bytes
Write policy          Write-back                           Write-back or write-through

34 Examples: Cache Sizes

35 Improving Cache Performance: 1. Reduce the miss rate, 2. Reduce the miss penalty, 3. Reduce the time to hit in the cache.
CPU time = IC * (CPI_Execution + (Memory accesses / Instruction) * Miss rate * Miss penalty) * Clock cycle time
Improvements in miss penalty can be just as beneficial. Technology trends have improved CPU speed faster than DRAM access time, making the relative cost of miss penalties increase over time.
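
The effect of the miss penalty on CPU time can be sketched numerically. The formula used is CPU time = IC * (CPI_Execution + memory accesses per instruction * miss rate * miss penalty) * clock cycle time; all input numbers below are hypothetical, chosen only to show the effect, and `cpu_time` is a made-up helper name.

```python
def cpu_time(ic, cpi_execution, mem_accesses_per_instr,
             miss_rate, miss_penalty, clock_cycle_time):
    """CPU time in seconds, including memory stall cycles."""
    stall_cpi = mem_accesses_per_instr * miss_rate * miss_penalty
    return ic * (cpi_execution + stall_cpi) * clock_cycle_time


# Hypothetical workload: 1 million instructions, 2 ns clock cycle.
base = cpu_time(1_000_000, 1.0, 1.5, 0.02, 100, 2e-9)
half_penalty = cpu_time(1_000_000, 1.0, 1.5, 0.02, 50, 2e-9)
half_miss_rate = cpu_time(1_000_000, 1.0, 1.5, 0.01, 100, 2e-9)

print(base, half_penalty, half_miss_rate)
```

Halving the miss penalty and halving the miss rate shrink the stall term by the same factor, which is the sense in which miss-penalty improvements can be just as beneficial.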

36 Multiprocessor Cache Coherence

37 Caches and Cache Coherence: Caches play a key role in all multiprocessor systems. They reduce average data access time and reduce bandwidth demands placed on the shared interconnect.

38 Multiprocessors

39 Caches are Critical for Performance: Reduce average latency. Reduce average bandwidth demands. Many processors can share data efficiently. What happens when store & load are executed on different processors? [Diagram: P, P, P]

40 Caches and Cache Coherence: Private processor caches create a problem: copies of a variable can be present in multiple caches, and a write by one processor may not become visible to the others -> they'll keep accessing the stale value in their caches => cache coherence problem. What do we do about it? Organize the memory hierarchy to make it go away, or detect and take actions to eliminate the problem.
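
The stale-value problem can be reproduced with a toy model in which each private cache is a dictionary. This is a sketch only: the helpers `read`, `write`, and `demo` are hypothetical, writes are write-through, and invalidation stands in for one possible "detect and take action" approach.

```python
def read(cache, memory, var):
    """Read through a private cache, filling it from memory on a miss."""
    if var not in cache:
        cache[var] = memory[var]
    return cache[var]


def write(cache, memory, var, value, other_caches=()):
    """Write-through store; optionally invalidate copies in other caches."""
    for other in other_caches:
        other.pop(var, None)      # invalidate the other processor's copy
    cache[var] = value
    memory[var] = value


def demo(invalidate):
    """P1 and P2 both cache x, P1 writes x = 1; return what P2 reads next."""
    memory = {"x": 0}
    p1, p2 = {}, {}
    read(p1, memory, "x")
    read(p2, memory, "x")
    write(p1, memory, "x", 1, other_caches=[p2] if invalidate else [])
    return read(p2, memory, "x")


print(demo(invalidate=False))  # P2 still sees its old cached copy
print(demo(invalidate=True))   # P2 misses and refetches the new value
```

Even though memory is updated in both runs, P2 only observes the new value when its cached copy is invalidated.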

41 Snooping Caches

42 Contention for Cache Tags: The cache controller must monitor both the bus and the processor. Can view it as two controllers: bus-side and processor-side. With a single-level cache: dual tags (not data) or dual-ported tag RAM; the tags must be reconciled when updated, but usually they are only looked up. Respond to bus transactions. [Diagram: cached data with two sets of tags: tags used by the processor and tags used by the bus snooper]

43 Snoopy Cache-Coherence Protocols: [Diagram: processors P1 ... Pn, each with a cache ($) holding State, Address, and Data, on a shared bus with Mem and I/O devices; bus snoop and cache-memory transaction] The cache controller snoops all transactions on the shared bus. A transaction is relevant if it is for a block the cache contains. Take action to ensure coherence: invalidate, update, or supply the value, depending on the state of the block and the protocol.

44 Design Choices: The controller updates the state of blocks in response to processor and snoop events and generates bus transactions. A snoopy protocol consists of a set of states, a state-transition diagram, and actions. Basic choices: write-through vs. write-back; invalidate vs. update. [Diagram: Processor Ld/St, Cache Controller, Snoop, cache entries with State, Tag, Data]

45 MESI

46 Reporting Snoop Results: For the MESI protocol, need to know: Is the block dirty, i.e., should memory respond or not? Is the block shared, i.e., transition to E or S state on a read miss? Three wired-OR signals. Shared: asserted if any cache has a copy. Dirty: asserted if some cache has a dirty copy; needn't know which, since it will do what's necessary. Snoop-valid: asserted when it is OK to check the other two signals; actually inhibit until OK to check.
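
How the Shared and Dirty signals drive a read miss can be sketched as below. This is a simplified model, not a full MESI implementation: the state letters follow MESI, the wired-OR lines are modeled with `any()`, the Snoop-valid timing signal is omitted, and the function names are made up.

```python
def snoop_signals(other_states):
    """Model the wired-OR Shared and Dirty lines from the other caches' states."""
    shared = any(s in ("M", "E", "S") for s in other_states)  # any cache has a copy
    dirty = any(s == "M" for s in other_states)               # some cache has a dirty copy
    return shared, dirty


def on_read_miss(other_states):
    """Return (new state of the requesting cache, who supplies the data)."""
    shared, dirty = snoop_signals(other_states)
    new_state = "S" if shared else "E"        # E only if no other cache has a copy
    supplier = "cache" if dirty else "memory" # a dirty copy means memory must not respond
    return new_state, supplier


print(on_read_miss(["I", "I"]))   # nobody else has the block
print(on_read_miss(["S", "I"]))   # a clean copy exists elsewhere
print(on_read_miss(["M", "I"]))   # a dirty copy exists elsewhere
```

The requester does not need to know which cache holds the dirty copy; it only needs the combined signal, since the owning cache does what is necessary.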

47 Basic Design [Diagram: processor P connected by Addr, Cmd, and Data to the processor-side controller, with tags and state for P; bus-side controller with tags and state for snoop; cache data RAM; comparators feeding the controllers; write-back buffer with tag; data buffer; snoop state; Addr and Cmd lines to the system bus]

48 Multilevel Cache Hierarchies: [Diagram: three processors P, each with its own L1 and L2] Independent snoop hardware for each level? Processor pins for the shared bus; contention for processor cache access? Snoop only at L2 and propagate relevant transactions.

49 Multilevel Cache Hierarchies: [Diagram: three processors P, each with its own L1 and L2] Inclusion property: (1) the contents of L1 are a subset of L2; (2) any block in modified state in L1 is in modified state in L2. (1) => all transactions relevant to L1 are relevant to L2. (2) => on BusRd, L2 can wave off the memory access and inform L1.

50 Shared Cache: Cache placement identical to a single cache: only one copy of any cached block. Fine-grain sharing. Potential for positive interference: one processor prefetches data for another. Smaller total storage: only one copy of code/data used by both processors. Can share data within a line without ping-pong. [Diagram: P1 ... Pn, Switch, (Interleaved) Cache, (Interleaved) Main Memory]

51 Disadvantages: Fundamental bandwidth limitation. Increases latency of all accesses: X-bar, larger cache; hit time determines processor cycle time!!! Potential for negative interference: one processor flushes data needed by another. [Diagram: P1 ... Pn, Switch, (Interleaved) Cache, (Interleaved) Main Memory]

52 COMA: KSR Cache Only Memory Architecture

53 KSR-2
