Lecture 11 Cache. Peng Liu.

Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1

Associative Cache Example 2

Associative Cache Example 3

Associativity Example Compare 4-block caches: direct mapped, 2-way set associative, fully associative. Block access sequence: 0, 8, 0, 6, 8. Direct-mapped index = block address modulo 4: (0 modulo 4) = 0, (6 modulo 4) = 2, (8 modulo 4) = 0.
Direct mapped:
  Block address | Cache index | Hit/miss | Cache content after access (indices 0, 1, 2, 3)
  0             | 0           | miss     | Mem[0]
  8             | 0           | miss     | Mem[8]
  0             | 0           | miss     | Mem[0]
  6             | 2           | miss     | Mem[0], Mem[6]
  8             | 0           | miss     | Mem[8], Mem[6]
4

Associativity Example (continued) 2-way set associative (all accessed blocks map to set 0; set 1 stays empty):
  Block address | Cache index | Hit/miss | Set 0 content after access
  0             | 0           | miss     | Mem[0]
  8             | 0           | miss     | Mem[0], Mem[8]
  0             | 0           | hit      | Mem[0], Mem[8]
  6             | 0           | miss     | Mem[0], Mem[6]
  8             | 0           | miss     | Mem[8], Mem[6]
Fully associative:
  Block address | Hit/miss | Cache content after access
  0             | miss     | Mem[0]
  8             | miss     | Mem[0], Mem[8]
  0             | hit      | Mem[0], Mem[8]
  6             | miss     | Mem[0], Mem[8], Mem[6]
  8             | hit      | Mem[0], Mem[8], Mem[6]
5
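
The hit/miss pattern in these tables can be reproduced with a small simulation. Below is a minimal C sketch (my own illustration, not from the slides; all names are invented) that models a 4-block cache with configurable associativity and LRU replacement and replays the access sequence 0, 8, 0, 6, 8. It reports 5, 4, and 3 misses for the direct-mapped, 2-way, and fully associative organizations, matching the tables above.

    #include <stdio.h>

    #define NUM_BLOCKS 4

    /* Replay an access sequence against a 4-block cache with LRU replacement.
     * assoc = 1 (direct mapped), 2 (2-way set associative), 4 (fully associative). */
    static int simulate(int assoc, const int *seq, int n) {
        int num_sets = NUM_BLOCKS / assoc;
        int tag[NUM_BLOCKS], valid[NUM_BLOCKS] = {0}, last_use[NUM_BLOCKS] = {0};
        int misses = 0;

        for (int t = 0; t < n; t++) {
            int addr = seq[t];
            int base = (addr % num_sets) * assoc;   /* first way of the selected set */
            int way = -1;

            for (int w = 0; w < assoc; w++)         /* search the set for a hit */
                if (valid[base + w] && tag[base + w] == addr)
                    way = w;

            int hit = (way >= 0);
            if (!hit) {                             /* miss: evict the least recently used way */
                misses++;
                way = 0;
                for (int w = 1; w < assoc; w++)
                    if (last_use[base + w] < last_use[base + way])
                        way = w;
                tag[base + way] = addr;
                valid[base + way] = 1;
            }
            last_use[base + way] = t + 1;           /* untouched ways keep last_use == 0 */
            printf("  block %2d -> %s\n", addr, hit ? "hit" : "miss");
        }
        return misses;
    }

    int main(void) {
        const int seq[] = {0, 8, 0, 6, 8};
        for (int assoc = 1; assoc <= NUM_BLOCKS; assoc *= 2) {
            printf("%d-way (1 = direct mapped, 4 = fully associative):\n", assoc);
            printf("  misses = %d\n", simulate(assoc, seq, 5));
        }
        return 0;
    }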

Set Associative Cache Organization 6

Tag & Index with Set-Associative Caches Assume a 2^n-byte cache with 2^m-byte blocks that is 2^a-way set-associative. Which bits of the address are the tag and which are the index? The m least significant bits are the byte select within the block. Basic idea: the cache contains 2^n / 2^m = 2^(n-m) blocks, and each cache way contains 2^(n-m) / 2^a = 2^(n-m-a) blocks, so the cache index is the (n-m-a) bits just above the byte select; the same index is used with all cache ways. Observation: for a fixed cache size, the length of the tags increases with the associativity, so associative caches incur more overhead for tags. 7
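
As a concrete instance of this breakdown, the following C sketch (my own illustration; the 16 KB / 16-byte-block / 2-way parameters are just example values) derives the offset, index, and tag widths and extracts the three fields from a 32-bit address.

    #include <stdio.h>
    #include <stdint.h>

    /* log2 of an exact power of two */
    static unsigned log2u(unsigned x) {
        unsigned r = 0;
        while (x > 1) { x >>= 1; r++; }
        return r;
    }

    int main(void) {
        /* Hypothetical parameters: 16 KB cache, 16-byte blocks, 2-way set associative. */
        unsigned cache_bytes = 16 * 1024;   /* 2^n, n = 14 */
        unsigned block_bytes = 16;          /* 2^m, m = 4  */
        unsigned ways        = 2;           /* 2^a, a = 1  */

        unsigned m = log2u(block_bytes);                    /* byte-select (offset) bits */
        unsigned a = log2u(ways);
        unsigned index_bits = log2u(cache_bytes) - m - a;   /* (n - m - a) index bits    */
        unsigned tag_bits   = 32 - index_bits - m;          /* rest of a 32-bit address  */

        uint32_t addr   = 0x12345678u;
        uint32_t offset = addr & ((1u << m) - 1);
        uint32_t index  = (addr >> m) & ((1u << index_bits) - 1);
        uint32_t tag    = addr >> (m + index_bits);

        printf("offset = %u bits, index = %u bits, tag = %u bits\n", m, index_bits, tag_bits);
        printf("addr 0x%08x -> tag 0x%x, index %u, offset %u\n", addr, tag, index, offset);
        return 0;
    }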

Placement Policy (diagram: memory blocks 0-31 mapping into an 8-block cache, shown fully associative, 2-way set associative with 4 sets, and direct mapped) Memory block 12 can be placed: fully associative - anywhere; (2-way) set associative - anywhere in set 0 (12 mod 4); direct mapped - only into cache block 4 (12 mod 8). 8

Direct-Mapped Cache (diagram: the address is split into a t-bit Tag, a k-bit Index, and a b-bit Block Offset; the index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared with the address tag to generate HIT, and the offset selects the data word or byte) 9

2-Way Set-Associative Cache (diagram: the same Tag / Index / Block Offset split, but the index selects a set of two lines; both stored tags are compared with the address tag in parallel, and on a HIT the matching way supplies the data word or byte) 10

Fully Associative Cache (diagram: the address is split only into a Tag and a Block Offset, with no index; every line's tag is compared with the address tag in parallel, and the matching line supplies the data word or byte on a HIT) 11

Replacement Methods Which line do you replace on a miss? Direct mapped: easy, you have only one choice - replace the line at the index you need. N-way set associative: need to choose which way to replace. Random: choose one at random. Least Recently Used (LRU): replace the one used least recently; exact LRU is often difficult to track, so implementations use approximations that really only pick a line that was not recently used. 12

Replacement Policy In an associative cache, which block from a set should be evicted when the set becomes full? Random. Least Recently Used (LRU): LRU cache state must be updated on every access; a true implementation is only feasible for small sets (2-way); a pseudo-LRU binary tree is often used for 4- to 8-way caches. First In, First Out (FIFO), a.k.a. Round-Robin: used in highly associative caches. Not Least Recently Used (NLRU): FIFO with an exception for the most recently used block or blocks. This is a second-order effect. Why? Replacement only happens on misses. 13
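
The tree-based pseudo-LRU mentioned above fits in a few lines of code. Here is a sketch of the common 3-bit tree scheme for one 4-way set (my own illustration; the slides do not give an implementation): each access flips the tree bits to point away from the way just used, and a victim is found by following the bits toward the older side.

    #include <stdio.h>

    /* Tree pseudo-LRU state for one 4-way set: 3 bits.
     * b0 selects which pair (ways 0/1 vs 2/3) holds the older data,
     * b1 selects the older way within ways 0/1, b2 within ways 2/3. */
    typedef struct { unsigned char b0, b1, b2; } plru4_t;

    /* On an access, point the tree bits AWAY from the way just used. */
    static void plru_touch(plru4_t *s, int way) {
        if (way < 2) { s->b0 = 1; s->b1 = (way == 0); }  /* other pair / other left way is now older */
        else         { s->b0 = 0; s->b2 = (way == 2); }
    }

    /* Follow the bits toward the older side to pick a victim (approximate LRU). */
    static int plru_victim(const plru4_t *s) {
        if (s->b0) return s->b2 ? 3 : 2;   /* right pair looks older */
        return s->b1 ? 1 : 0;              /* left pair looks older  */
    }

    int main(void) {
        plru4_t s = {0, 0, 0};
        int accesses[] = {0, 1, 2, 3, 0};
        for (int i = 0; i < 5; i++)
            plru_touch(&s, accesses[i]);
        /* Prints way 2: not the true LRU way (way 1), but a not-recently-used one. */
        printf("pseudo-LRU victim: way %d\n", plru_victim(&s));
        return 0;
    }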

Block Size and Spatial Locality The block is the unit of transfer between the cache and memory (diagram: a tag plus a 4-word block, Word0-Word3, with b = 4 offset bits). The CPU address is split into a block address (32-b bits) and an offset (b bits); 2^b = block size, a.k.a. line size, in bytes. Larger block size has distinct hardware advantages: less tag overhead, and it exploits fast burst transfers from DRAM and over wide busses. What are the disadvantages of increasing block size? Fewer blocks => more conflicts; it can waste bandwidth. 14

CPU-Cache Interaction (5-stage pipeline) (diagram: the PC feeds the primary instruction cache in the fetch stage; loads and stores access the primary data cache in the memory stage; each cache produces a hit? signal; refill data comes from lower levels of the memory hierarchy via the memory control) Stall the entire CPU on a data cache miss. 15

Improving Cache Performance Average memory access time = Hit time + Miss rate x Miss penalty To improve performance: reduce the hit time, reduce the miss rate, or reduce the miss penalty. What is the simplest design strategy? The biggest cache that doesn't increase hit time past 1-2 cycles (approx 8-32KB in modern technology). [Design issues are more complex with out-of-order superscalar processors.] 16
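
To see how the three terms trade off, it helps to plug numbers into the formula. A minimal C sketch follows (my own example; the two design points are invented numbers, not from the lecture).

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty (cycles). */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Hypothetical design points: (hit time, miss rate, miss penalty). */
        printf("small, fast cache:    %.2f cycles\n", amat(1.0, 0.05, 20.0));  /* 2.00 */
        printf("larger, slower cache: %.2f cycles\n", amat(2.0, 0.03, 20.0));  /* 2.60 */
        return 0;
    }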

Serial-versus-Parallel Cache and Memory Access a is the HIT RATIO (fraction of references that hit in the cache); 1 - a is the MISS RATIO (the remaining references). (diagram: in the serial organization the processor probes the cache first and accesses main memory only on a miss; in the parallel organization the address goes to the cache and main memory at the same time) Average access time for serial search: t_cache + (1 - a) * t_mem. Average access time for parallel search: a * t_cache + (1 - a) * t_mem. Savings are usually small, since t_mem >> t_cache and the hit ratio a is high; parallel search requires high bandwidth on the memory path, and the complexity of handling parallel paths can slow t_cache. 17

Causes for Cache Misses Compulsory: first-reference to a block a.k.a. cold start misses - misses that would occur even with infinite cache Capacity: cache is too small to hold all data needed by the program - misses that would occur even under perfect replacement policy Conflict: misses that occur because of collisions due to block-placement strategy - misses that would not occur with full associativity 18

Effect of Cache Parameters on Performance Larger cache size + reduces capacity and conflict misses - hit time will increase Higher associativity + reduces conflict misses - may increase hit time Larger block size + reduces compulsory and capacity (reload) misses - increases conflict misses and miss penalty 19

Multilevel Caches A memory cannot be both large and fast, so the cache size increases at each level (CPU -> L1$ -> L2$ -> DRAM). Local miss rate = misses in this cache / accesses to this cache. Global miss rate = misses in this cache / CPU memory accesses. Misses per instruction = misses in this cache / number of instructions. 20
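
To make the local versus global distinction concrete, here is a small C sketch (my own example; all rates and latencies are invented) that turns L1 and L2 local miss rates into an L2 global miss rate and a two-level average memory access time.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical two-level hierarchy. */
        double l1_local_miss = 0.05;   /* misses in L1 / accesses to L1 */
        double l2_local_miss = 0.20;   /* misses in L2 / accesses to L2 */
        double l1_hit_time   = 1.0;    /* cycles */
        double l2_hit_time   = 10.0;   /* cycles */
        double mem_latency   = 100.0;  /* cycles */

        /* L2 is accessed only on an L1 miss, so its global miss rate is the product. */
        double l2_global_miss = l1_local_miss * l2_local_miss;

        /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 local miss rate * memory latency). */
        double amat = l1_hit_time + l1_local_miss * (l2_hit_time + l2_local_miss * mem_latency);

        printf("L2 global miss rate = %.3f\n", l2_global_miss);  /* 0.010 */
        printf("AMAT = %.2f cycles\n", amat);                    /* 2.50  */
        return 0;
    }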

Multilevel Caches Primary (L1) caches attached to CPU Small, but fast Focusing on hit time rather than hit rate Level-2 cache services misses from primary cache Larger, slower, but still faster than main memory Unified instruction and data Focusing on hit rate rather than hit time Main memory services L2 cache misses Some high-end systems include L3 cache 21

A Typical Memory Hierarchy (diagram: the CPU with a multiported register file, split L1 instruction and data primary caches in on-chip SRAM, a large unified secondary (L2) cache in on-chip SRAM, and multiple interleaved memory banks in off-chip DRAM) 22

What About Writes? Where do we put the data we want to write? In the cache? In main memory? In both? Caches have different policies for this question Most systems store the data in the cache (why?) Some also store the data in memory as well (why?) Interesting observation Processor does not need to wait until the store completes 23

Cache Write Policies: Major Options Write-through (write data goes to cache and memory): main memory is updated on each cache write; replacing a cache entry is simple (the new block just overwrites the old one); memory writes cause significant delay if the pipeline must stall. Write-back (write data only goes to the cache): only the cache entry is updated on each cache write, so main memory and cache data are inconsistent; add a dirty bit to the cache entry to indicate whether the data in the entry must be committed to memory; replacing a dirty cache entry requires writing the data back to memory before replacing the entry. 24

Write Policy Trade-offs Write-through: misses are simpler and cheaper (no write-back to memory); easier to implement; requires buffering to be practical; uses a lot of bandwidth to the next level of memory. Write-back: writes are fast on a hit; multiple writes within a block require only one writeback later; efficient block transfer on write-back to memory at eviction. 25

Write Policy Choices On a cache hit: write through - write both cache & memory; generally higher traffic, but simplifies cache coherence. write back - write the cache only (memory is written only when the entry is evicted); a dirty bit per block can further reduce the traffic. On a cache miss: no write allocate - only write to main memory. write allocate (a.k.a. fetch on write) - fetch the block into the cache. Common combinations: write through and no write allocate; write back with write allocate. 26
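
The two common combinations can be sketched as the store path of a toy cache model. This is my own illustration (the line layout, helper names, and 16-byte block size are invented), showing where the dirty bit, the memory write, and the allocate decision come in.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[16]; } line_t;

    /* Stand-ins for the lower level of the memory hierarchy. */
    static void mem_write_word(uint32_t addr, uint32_t value) {
        printf("  memory write [0x%08x] = %u\n", addr, value);
    }
    static void mem_read_block(uint32_t block_addr, uint8_t *dst) {
        printf("  memory read block 0x%08x\n", block_addr);
        memset(dst, 0, 16);
    }

    /* Write-back + write-allocate: the store updates only the cache; a miss
     * fetches the block (writing back a dirty victim first); memory is
     * brought up to date only when the dirty line is later evicted. */
    static void store_wb_walloc(line_t *line, uint32_t addr, uint32_t tag, uint32_t value) {
        if (!line->valid || line->tag != tag) {            /* write miss */
            if (line->valid && line->dirty)
                printf("  write back victim block with tag 0x%x\n", line->tag);
            mem_read_block(addr & ~0xFu, line->data);      /* allocate: fetch block */
            line->tag = tag; line->valid = true;
        }
        memcpy(&line->data[addr & 0xCu], &value, 4);       /* update the cached word */
        line->dirty = true;                                /* memory is now stale    */
    }

    /* Write-through + no-write-allocate: memory is always updated (via a write
     * buffer in practice); a write miss does not bring the block into the cache. */
    static void store_wt_noalloc(line_t *line, uint32_t addr, uint32_t tag, uint32_t value) {
        if (line->valid && line->tag == tag)               /* write hit: keep cache copy current */
            memcpy(&line->data[addr & 0xCu], &value, 4);
        mem_write_word(addr, value);
    }

    int main(void) {
        line_t a = {0}, b = {0};
        printf("write-back, write-allocate:\n");
        store_wb_walloc(&a, 0x1008, 0x1, 42);
        printf("write-through, no-write-allocate:\n");
        store_wt_noalloc(&b, 0x1008, 0x1, 42);
        return 0;
    }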

Write Buffer to Reduce Read Miss Penalty (diagram: CPU and register file, data cache, write buffer, unified L2 cache) The write buffer holds evicted dirty lines for a write-back cache OR all writes for a write-through cache. The processor is not stalled on writes, and read misses can go ahead of writes to main memory. Problem: the write buffer may hold the updated value of a location needed by a read miss. Simple scheme: on a read miss, wait for the write buffer to go empty. Faster scheme: check the write buffer addresses against the read miss address; if there is no match, allow the read miss to go ahead of the writes, else return the value in the write buffer. 27

Write Buffers for Write-Through Caches (diagram: Processor -> Cache and Write Buffer -> Lower Level Memory) The write buffer holds data awaiting write-through to lower level memory. Q. Why a write buffer? A. So the CPU doesn't stall. Q. Why a buffer, why not just one register? A. Bursts of writes are common. Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer addresses. 28

Avoiding the Stalls for Write-Through Use write buffer between cache and memory Processor writes data into the cache and the write buffer Memory controller slowly drains buffer to memory Write buffer: a first-in-first-out buffer (FIFO) Typically holds a small number of writes Can absorb small bursts as long as the long term rate of writing to the buffer does not exceed the maximum rate of writing to DRAM 29
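
The FIFO write buffer and the address check from the previous slides can be sketched as follows (my own sketch; the 4-entry size and all names are hypothetical). Stores are enqueued, the memory controller drains the head, and a read miss either forwards a matching buffered value or is allowed to go ahead of the pending writes.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4

    typedef struct { uint32_t addr, data; } wb_entry_t;

    typedef struct {
        wb_entry_t e[WB_ENTRIES];
        int head, count;                     /* simple circular FIFO */
    } write_buffer_t;

    /* Enqueue a store; returns false (caller must stall) if the buffer is full. */
    static bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data) {
        if (wb->count == WB_ENTRIES) return false;
        wb->e[(wb->head + wb->count) % WB_ENTRIES] = (wb_entry_t){addr, data};
        wb->count++;
        return true;
    }

    /* The memory controller drains one entry per call when the bus is free. */
    static void wb_drain_one(write_buffer_t *wb) {
        if (wb->count == 0) return;
        wb_entry_t w = wb->e[wb->head];
        printf("  DRAM write [0x%08x] = %u\n", w.addr, w.data);
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
    }

    /* On a read miss, check buffered addresses: forward a match, otherwise the
     * read may go ahead of the queued writes. */
    static bool wb_check_read(const write_buffer_t *wb, uint32_t addr, uint32_t *data) {
        for (int i = wb->count - 1; i >= 0; i--) {       /* newest match wins */
            const wb_entry_t *w = &wb->e[(wb->head + i) % WB_ENTRIES];
            if (w->addr == addr) { *data = w->data; return true; }
        }
        return false;
    }

    int main(void) {
        write_buffer_t wb = {.head = 0, .count = 0};
        wb_push(&wb, 0x1000, 42);
        wb_push(&wb, 0x2000, 7);

        uint32_t v;
        if (wb_check_read(&wb, 0x1000, &v))
            printf("read miss to 0x1000 forwarded from write buffer: %u\n", v);
        else
            printf("read miss may bypass pending writes\n");

        wb_drain_one(&wb);
        wb_drain_one(&wb);
        return 0;
    }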

Cache Write Policy: Allocation Options What happens on a cache write that misses? It's actually two subquestions. Do you allocate space in the cache for the address? Write-allocate vs. no-write-allocate. Actions: select a cache entry, evict the old contents, update the tags, ... Do you fetch the rest of the block contents from memory? Of interest if you do write allocate; remember that a store updates up to 1 word of a wider block. Fetch-on-miss vs. no-fetch-on-miss. For no-fetch-on-miss you must remember which words are valid - use fine-grain valid bits in each cache line. 30

Typical Choices Write-back caches: write-allocate, fetch-on-miss. Write-through caches: write-allocate, fetch-on-miss; write-allocate, no-fetch-on-miss; or no-write-allocate, write-around. Modern HW supports multiple policies, selected by the OS at some coarse granularity. 31

Splitting Caches Most processors have separate caches for instructions & data, often denoted I$ and D$. Advantages: extra access port; can customize to specific access patterns; low hit time. Disadvantages: capacity utilization; miss rate. 32

Cache Design: Datapath + Control Most design errors come from incorrect specification of state machine behavior! Common bugs: stalls, block replacement, write buffer. (diagram: a control state machine sits between the CPU and lower level memory, steering the Addr/Din/Dout ports of the tag and data block arrays on both sides) 33

Cache Controller Example cache characteristics: direct-mapped, write-back, write allocate; block size: 4 words (16 bytes); cache size: 16KB (1024 blocks); 32-bit byte addresses; valid bit and dirty bit per block; blocking cache - the CPU waits until the access is complete. (diagram: address breakdown into tag, index, and offset fields) 34

Signals between the Processor and the Cache 35

Finite-state Machine Controllers Use an FSM to sequence the control steps. Set of states, with a transition on each clock edge. State values are binary encoded; the current state is stored in a register. Next state = fn(current state, current inputs). Control output signals = fo(current state). 36

Cache Controller FSM Idle state: waiting for a valid read or write request from the processor. Compare Tag state: testing whether the access is a hit or a miss; if hit, set Cache Ready after the read or write -> Idle state; if miss, update the cache tag, then if the old block is dirty -> Write-Back state, else -> Allocate state. Write-Back state: writing the 128-bit block to memory; waiting for the ready signal from memory -> Allocate state. Allocate state: fetching the new block from memory. 37
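
The four states above can be written down directly as an enum and a next-state function. This is a skeleton only (my own sketch; cpu_request, hit, dirty, and mem_ready are placeholders for the real handshake signals, and the datapath actions are omitted).

    #include <stdbool.h>

    typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cache_state_t;

    /* One step of the cache controller FSM: the next state is a function of
     * the current state and the current inputs (outputs omitted). */
    cache_state_t next_state(cache_state_t s,
                             bool cpu_request,   /* valid read/write from the processor */
                             bool hit,           /* tag comparison result               */
                             bool dirty,         /* victim block needs write-back       */
                             bool mem_ready)     /* lower-level memory handshake        */
    {
        switch (s) {
        case IDLE:
            return cpu_request ? COMPARE_TAG : IDLE;
        case COMPARE_TAG:
            if (hit) return IDLE;                         /* set Cache Ready, done      */
            return dirty ? WRITE_BACK : ALLOCATE;         /* miss: handle the old block */
        case WRITE_BACK:
            return mem_ready ? ALLOCATE : WRITE_BACK;     /* wait for memory            */
        case ALLOCATE:
            return mem_ready ? COMPARE_TAG : ALLOCATE;    /* re-check tag after refill  */
        }
        return IDLE;
    }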

Main Memory Supporting Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width clocked bus Bus clock is typically slower than CPU clock Example cache block read 1 bus cycle for address transfer 15 bus cycles per DRAM access 1 bus cycle per data transfer For 4-word block, 1-word-wide DRAM Miss penalty = 1 + 4x15 + 4x1 = 65 bus cycles Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle 38
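
The miss penalty arithmetic on this slide is easy to parameterize. The C sketch below reproduces the 1-word-wide calculation and, as a hypothetical what-if not taken from the slide, a 4-word-wide organization under the same timing assumptions.

    #include <stdio.h>

    int main(void) {
        /* Timing from the slide: 1 bus cycle for the address, 15 bus cycles per
         * DRAM access, 1 bus cycle per data transfer. */
        int addr_cycles = 1, dram_cycles = 15, xfer_cycles = 1;
        int block_words = 4, bytes_per_word = 4;

        /* 1-word-wide DRAM and bus: each of the 4 words needs its own DRAM
         * access and its own transfer. */
        int penalty = addr_cycles + block_words * dram_cycles + block_words * xfer_cycles;
        double bw = (double)(block_words * bytes_per_word) / penalty;
        printf("1-word-wide: miss penalty = %d bus cycles, bandwidth = %.2f B/cycle\n",
               penalty, bw);                      /* 65 cycles, ~0.25 B/cycle */

        /* Hypothetical 4-word-wide organization (not from the slide): one DRAM
         * access and one transfer move the whole block. */
        int penalty_wide = addr_cycles + dram_cycles + xfer_cycles;
        printf("4-word-wide (what-if): miss penalty = %d bus cycles, bandwidth = %.2f B/cycle\n",
               penalty_wide, (double)(block_words * bytes_per_word) / penalty_wide);
        return 0;
    }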

Measuring Performance 39

Measuring Performance The memory system is important for performance. Cache access time often determines the overall system clock cycle time, since it is often the slowest pipeline stage. Memory stalls are a large contributor to CPI: stalls due to instructions & data, reading & writing; stalls include both cache miss stalls and write buffer stalls. Memory system & performance: CPU Time = (CPU Cycles + Memory Stall Cycles) * Cycle Time; MemStallCycles = Read Stall Cycles + Write Stall Cycles; CPI = CPIpipe + AvgMemStallCycles; CPIpipe = 1 + HazardStallCycles. 40

Memory Performance Read stalls are fairly easy to understand: Read Stall Cycles = Reads/Prog * ReadMissRate * ReadMissPenalty. Write stalls depend upon the write policy. Write-through: Write Stall Cycles = (Writes/Prog * WriteMissRate * WriteMissPenalty) + Write Buffer Stalls. Write-back: Write Stall Cycles = (Writes/Prog * WriteMissRate * WriteMissPenalty). The write miss penalty can be complex: it can be partially hidden if the processor can continue executing, and it can include extra time to write back a value we are evicting. 41

Worst-Case Simplicity Assume that write and read misses cause the same delay. MemoryStallCycles = (MemoryAccesses / Program) x MissRate x MissPenalty = (Instructions / Program) x (Misses / Instruction) x MissPenalty In a single-level cache system, MissPenalty = latency of DRAM. In a multi-level cache system, MissPenalty is the latency of the L2 cache, etc.; calculate it by considering MissRateL2, MissPenaltyL2, etc. Watch out: global vs local miss rate for L2. 42

Simple Cache Performance Example Consider the following: miss rate for instruction access is 5%; miss rate for data access is 8%; data references per instruction are 0.4; CPI with a perfect cache is 2; read and write miss penalty is 20 cycles, including possible write buffer stalls. What is the performance of this machine relative to one without misses? Always start by considering execution times (IC*CPI*CCT), but IC and CCT are the same here, so focus on CPI. CCT: Clock Cycle Time. IC: Instruction Count. 43

Performance Solution Find the CPI for the base system without misses: CPI no misses = CPIperfect = 2. Find the CPI for the system with misses: Misses/inst = I-cache misses + D-cache misses = 0.05 + (0.08 * 0.4) = 0.082; Memory Stall Cycles = Misses/inst * MissPenalty = 0.082 * 20 = 1.64 cycles/inst; CPI with misses = CPIperfect + Memory Stall Cycles = 2 + 1.64 = 3.64. Compare the performance: n = Performance no misses / Performance with misses = CPI with misses / CPI no misses = 3.64 / 2 = 1.82. 44
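
The same calculation is easy to double-check in code. The following C sketch simply plugs in the numbers from the example above.

    #include <stdio.h>

    int main(void) {
        double cpi_perfect   = 2.0;
        double i_miss_rate   = 0.05;   /* instruction cache miss rate     */
        double d_miss_rate   = 0.08;   /* data cache miss rate            */
        double d_refs_per_in = 0.4;    /* data references per instruction */
        double miss_penalty  = 20.0;   /* cycles, reads and writes alike  */

        double misses_per_inst = i_miss_rate + d_miss_rate * d_refs_per_in;   /* 0.082 */
        double stall_cycles    = misses_per_inst * miss_penalty;              /* 1.64  */
        double cpi_with_misses = cpi_perfect + stall_cycles;                  /* 3.64  */

        printf("CPI with misses = %.2f\n", cpi_with_misses);
        printf("slowdown vs perfect cache = %.2fx\n", cpi_with_misses / cpi_perfect); /* 1.82 */
        return 0;
    }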

Another Cache Problem Given the following data: base CPI of 1.5; 1 instruction reference per instruction fetch; 0.27 loads/instruction; 0.13 stores/instruction; a 64KB cache with 4-word block size has a miss rate of 1.7%; memory access time = 4 cycles + 1 cycle per word transferred. Suppose the cache uses a write-through, write-around (no-write-allocate) write strategy without a write buffer. How much faster would the machine be with a perfect write buffer? CPUtime = Instruction Count * (CPIbase + CPImemory) * ClockCycleTime, so performance is proportional to CPI = 1.5 + CPImemory. 45

No Write Buffer CPU Cache Lower Level Memory CPI memory = reads/inst.*miss rate * read miss penalty + writes/inst.* write penalty read miss penalty = 4 cycles + 4 words * 1cycle/word = 8 cycles write penalty = 4 cycles + 1word * 1cycle/word = 5 cycles CPI memory = (1 if + 0.27 ld)(1/inst.)*(0.017)*8 cycles + (0.13st)(1/inst.)*5cycles CPI memory = 0.17 cycles/inst. + 0.65 cycles/inst. = 0.82 cycles/inst. CPI overall = 1.5 cycles/inst. + 0.82 cycles/inst. = 2.32 cycles/inst. 46

Perfect Write Buffer CPU Cache Wbuff Lower Level Memory CPI memory = reads/inst.*miss rate * 8 cycle read miss penalty + writes/inst.* (1- miss rate) * 1 cycle hit penalty A hit penalty is required because on hits we must Access the cache tags during the MEM cycle to determine a hit Stall the processor for a cycle to update a hit cache block CPI memory = 0.17 cycles/inst. + (0.13st)(1/inst.)*( 1-0.017)*1cycle CPI memory = 0.17 cycles/inst. + 0.13 cycles/inst. = 0.30 cycles/inst. CPI overall = 1.5 cycles/inst. + 0.30 cycles/inst. = 1.80 cycles/inst. 47

Perfect Write Buffer + Cache Write Buffer (diagram: CPU -> Cache with a cache write buffer (CWB) -> write buffer (WBuff) -> Lower Level Memory) CPI memory = reads/inst. * miss rate * 8 cycle read miss penalty. Avoid a hit penalty on writes by adding a one-entry write buffer to the cache itself and writing the last store hit to the data array during the next store's MEM cycle. Hazard: on loads, the CWB must be checked along with the cache! CPI memory = 0.17 cycles/inst. CPI overall = 1.5 cycles/inst. + 0.17 cycles/inst. = 1.67 cycles/inst. 48
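
All three design points in this worked problem can be compared with a few lines of arithmetic. The C sketch below restates the slide's numbers (nothing new is assumed beyond the 1-cycle hit penalty already described) and prints the overall CPI for each case.

    #include <stdio.h>

    int main(void) {
        double cpi_base      = 1.5;
        double ifetch_per_in = 1.0;      /* instruction references per instruction */
        double loads_per_in  = 0.27;
        double stores_per_in = 0.13;
        double miss_rate     = 0.017;
        double read_miss_pen = 4.0 + 4.0;   /* 4 cycles + 4 words * 1 cycle/word */
        double write_pen     = 4.0 + 1.0;   /* 4 cycles + 1 word  * 1 cycle/word */

        double reads_per_in = ifetch_per_in + loads_per_in;

        /* 1) Write-through, write-around, no write buffer: every store pays the
         *    full memory write penalty. */
        double cpi_mem_nobuf = reads_per_in * miss_rate * read_miss_pen
                             + stores_per_in * write_pen;

        /* 2) Perfect write buffer: stores only pay a 1-cycle hit penalty to
         *    update the cache block on write hits. */
        double cpi_mem_wbuf = reads_per_in * miss_rate * read_miss_pen
                            + stores_per_in * (1.0 - miss_rate) * 1.0;

        /* 3) Perfect write buffer + one-entry cache write buffer: the hit
         *    penalty disappears; only read misses remain. */
        double cpi_mem_cwb = reads_per_in * miss_rate * read_miss_pen;

        printf("no write buffer : CPI = %.2f\n", cpi_base + cpi_mem_nobuf);  /* ~2.32 */
        printf("perfect wbuf    : CPI = %.2f\n", cpi_base + cpi_mem_wbuf);   /* ~1.80 */
        printf("wbuf + CWB      : CPI = %.2f\n", cpi_base + cpi_mem_cwb);    /* ~1.67 */
        return 0;
    }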

Acknowledgements These slides contain material from courses: UCB CS152 Stanford EE108B 49