ECSE 425 Lecture 21: More Cache Basics; Cache Performance
H&P Appendix C
2011 Gross, Hayward, Arbel, Vu, Meyer; textbook figures
Last Time
- Two questions:
  - Q1: Block placement
  - Q2: Block identification
ECSE 425, Fall 2011, Lecture 21
Today
- Two more questions:
  - Q3: Block replacement
  - Q4: Write strategy
- Performance of CPUs with cache
- Appendix C.2
Q3: Block Replacement
Which block should be replaced on a miss?
- Direct mapped: no choice! The index specifies the cache block to replace
  - Advantage: simple logic
  - Disadvantage: higher miss rates
- Fully associative or set associative: several blocks to choose from
  - Make good choices to reduce miss rates
Three Replacement Strategies
- Random: simple!
- Least Recently Used (LRU)
  - Recently accessed blocks will likely be accessed again
  - The converse also holds: blocks not accessed recently are less likely to be accessed again
  - True LRU is too expensive; estimate it, e.g., by recording when blocks are accessed
- First In, First Out (FIFO)
  - Replace the oldest block
  - Another approximation of LRU
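The LRU policy above can be sketched for a single cache set. This is a toy model, not hardware: the class name and structure are mine, and an ordered dictionary stands in for the recency-tracking logic a real cache would implement in silicon.

```python
from collections import OrderedDict

class LRUSet:
    """Toy model of LRU replacement within one set of a set-associative cache."""

    def __init__(self, ways):
        self.ways = ways            # associativity of the set
        self.tags = OrderedDict()   # tags in recency order; LRU tag is first

    def access(self, tag):
        """Return True on a hit; on a miss, install tag, evicting the LRU block if full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)     # touched: now most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the least recently used tag
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
s.access('A'); s.access('B')   # set now holds A, B (B most recent)
s.access('A')                  # hit; A becomes most recent
s.access('C')                  # miss; evicts B (the LRU block), not A
print(list(s.tags))            # ['A', 'C']
```

A FIFO set would differ only in dropping the `move_to_end` call on a hit: blocks would then be evicted in installation order regardless of reuse, which is why FIFO is only an approximation of LRU.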
Q4: Write Strategy
- Most cache accesses are reads
  - All instruction accesses are reads
  - MIPS: 10% stores, 26% loads
  - Writes represent ~7% of overall memory traffic, but 28% of traffic to the data cache
- Two conflicting design principles:
  - Make the common case fast: optimize for reads
  - Amdahl's law: don't neglect writes
Reads vs. Writes
- Reads are the easiest to make fast
  - Read the block at the same time as the tag check
  - Discard the data if the tags don't match
  - Power is the only drawback
- Why are writes slow?
  - Writes cannot begin until after the tag check
  - Writes can be of variable width
    - Reads too, but power is the only drawback of reading more than needed
Two Write Strategies
- Write back
  - Write information only to the block in the cache
  - Write the modified cache block to main memory only when it is replaced
- Write through
  - Write information to the block in the cache, and
  - Write to the block in lower-level memory
Write Back
- Advantages:
  - Writes occur at the speed of the cache
  - Multiple writes to a block are coalesced into one write to main memory
    - Reduces pressure on memory bandwidth
    - Reduces power dissipated in the memory system
- A dirty bit is kept for each block
  - Indicates whether the block is dirty (modified while in the cache) or clean (not modified)
  - On replacement, clean blocks are discarded; only dirty blocks trigger a write to the next level
Write Through
- Simple, easier to implement
- Cache is always clean
  - Read misses never result in writes to the lower level
- Main memory always has current data
  - Reduces the complexity of cache coherency
Stalling on Writes
- Avoid stalling on writes by using a write buffer
  - Once data is written to the buffer, execution continues
- Loads must then check the write buffer for data
  - If the data is cached but the write is still buffered, the cache doesn't contain the current value
Two Write Miss Strategies
- Write allocate
  - Make write misses act like read misses
  - Retrieve the block, then proceed as if the access was a hit
- No-write allocate
  - Don't retrieve the block; modify it directly in lower-level memory instead
- Normally, WB caches write allocate (to benefit from locality), while WT caches don't write allocate (to avoid redundant writes to multiple levels of memory)
Average Memory Access Time
- Miss rate = accesses that miss / total accesses
  - Convenient metric: independent of the speed of the hardware
- A better measure: Average Memory Access Time (AMAT)
  AMAT = Hit time + Miss rate × Miss penalty
Example: Unified vs. Split Caches
- Instruction and data streams are different
  - Instructions are fixed size and read-only
  - The instruction stream is predictable
- Divide cache capacity between two caches:
  - An instruction cache (I$), accessed during IF
  - A data cache (D$), accessed during MEM
Example: Unified vs. Split Caches, Cont'd
- 16 KB I$ and 16 KB D$ vs. 32 KB U$
  - 16 KB I$: 3.83 misses per 1000 instructions
  - 16 KB D$: 40.9 misses per 1000 instructions
  - 32 KB U$: 43.3 misses per 1000 instructions
- Assume:
  - 10% of instructions are stores, 26% are loads
  - 1 cycle hit time, 100 cycle miss penalty
  - 1 extra cycle penalty for the U$ structural hazard
- What is the AMAT in each case? Does miss rate predict AMAT?
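The slide's question can be worked out directly from the AMAT formula. The sketch below follows the standard method of converting misses per 1000 instructions into per-access miss rates; variable names are mine.

```python
hit, penalty = 1, 100          # cycles
data_per_instr = 0.10 + 0.26   # stores + loads
refs_per_instr = 1 + data_per_instr  # 1 instruction fetch + data accesses

# Per-access miss rates, derived from misses per 1000 instructions
mr_i = 3.83 / 1000 / 1.0               # I$: one fetch per instruction
mr_d = 40.9 / 1000 / data_per_instr    # D$
mr_u = 43.3 / 1000 / refs_per_instr    # U$

f_i = 1 / refs_per_instr               # fraction of accesses that are fetches
f_d = data_per_instr / refs_per_instr  # fraction that are data accesses

amat_split = f_i * (hit + mr_i * penalty) + f_d * (hit + mr_d * penalty)
# Unified cache: data accesses pay 1 extra cycle for the structural hazard
amat_unified = f_i * (hit + mr_u * penalty) + f_d * (hit + 1 + mr_u * penalty)

print(round(amat_split, 2), round(amat_unified, 2))
```

This yields roughly 4.29 cycles for the split caches versus 4.45 for the unified cache, even though the split configuration's combined miss rate ((3.83 + 40.9)/1000/1.36 ≈ 3.3%) is higher than the unified cache's (≈ 3.2%). Miss rate alone does not predict AMAT.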
CPU Performance with Imperfect Caches
- When caches are perfect:
  CPU time = IC × CPI_base × Clock cycle time
  - One cycle of hit time is included in CPI_base
- When caches aren't perfect: stalls!
CPU Performance Example 1
- Assume:
  - In-order processor
  - Miss penalty of 200 clock cycles
  - CPI = 1 (when we ignore memory stalls)
  - 1.5 memory accesses per instruction on average
  - 30 misses per 1000 instructions
- Compare CPU time with, and without, a cache
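Since IC and clock cycle time are the same in both cases, comparing CPU times reduces to comparing effective CPIs. A short calculation with the slide's numbers (variable names are mine):

```python
cpi_base = 1.0
penalty = 200            # cycles per miss
refs_per_instr = 1.5
misses_per_instr = 30 / 1000

# With cache: only the misses stall the pipeline
cpi_cache = cpi_base + misses_per_instr * penalty   # 1 + 0.03 * 200 = 7
# Without a cache, every memory reference pays the full memory latency
cpi_nocache = cpi_base + refs_per_instr * penalty   # 1 + 1.5 * 200 = 301

print(cpi_cache, cpi_nocache, cpi_nocache / cpi_cache)  # 7.0 301.0 43.0
```

The cache makes the machine 43 times faster here, and memory stalls still account for 6 of the 7 cycles per instruction, which motivates the next slide's point about the growing relative cost of misses.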
Cache Impact on Performance
- As CPI decreases, the relative penalty of a cache miss increases
- With faster clocks, a fixed memory delay costs more stall cycles!
- Amdahl's law: if we decrease CPI but average memory access time stays fixed, the overall speedup is limited by the fraction of total execution time spent computing
CPU Performance Example 2
- With a perfect cache: CPI = 1.6, CC = 0.35 ns, 1.4 memory references per instruction
- Compare two 128 KB caches with 64-byte blocks, miss penalty = 65 ns, hit time = 1 CC
  - Direct mapped: miss rate = 2.1%
  - 2-way set associative: CC_2-way = 1.35 × CC_1-way, miss rate = 1.9%
- What is the CPU time in each case? Does AMAT predict CPU time?
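Working in nanoseconds (the 65 ns miss penalty is fixed regardless of clock), both AMAT and CPU time per instruction follow from the formulas on the earlier slides. A sketch with the slide's numbers; the helper function is mine:

```python
cpi_exec = 1.6
cc = 0.35                # ns, direct-mapped clock cycle
refs_per_instr = 1.4
penalty_ns = 65          # miss penalty in ns, independent of clock speed

def stats(miss_rate, clock):
    """Return (AMAT in ns, CPU time per instruction in ns)."""
    amat = clock + miss_rate * penalty_ns   # hit time is 1 cycle
    cpu = cpi_exec * clock + refs_per_instr * miss_rate * penalty_ns
    return amat, cpu

amat_1w, cpu_1w = stats(0.021, cc)          # direct mapped
amat_2w, cpu_2w = stats(0.019, cc * 1.35)   # 2-way: slower clock

print(amat_1w, amat_2w)   # 2-way has the slightly lower AMAT...
print(cpu_1w, cpu_2w)     # ...but the higher CPU time per instruction
```

The 2-way cache wins on AMAT (about 1.71 ns vs. 1.72 ns) but loses on CPU time (about 2.49 ns vs. 2.47 ns per instruction), because the slower clock taxes every instruction, not just memory accesses.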
AMAT is not CPU Time!
- In the previous example:
  - AMAT is lower for the 2-way cache
  - CPU time is lower for the 1-way (direct-mapped) cache
- Full simulation is the best predictor of performance
- The 2-way set associative cache increases the clock cycle for ALL instructions
  - Slower ALU operations
  - Slower hits (the common case)
  - Degrades performance despite improving miss rate
Out-of-Order Execution
- Miss penalty does not mean the same thing
  - The machine does not totally stall on a cache miss
  - Other instructions are allowed to proceed
- What is the miss penalty for OOO execution?
  - The full latency of the memory miss? No
  - The non-overlapped latency, when the CPU must stall
- Because of OOO execution, define:
  Memory stall cycles / instruction = Misses / instruction × (Total miss latency − Overlapped miss latency)
Summary
- Block replacement: Random, LRU, FIFO
- Write strategy: write-back, write-through
- Write misses: write allocate, no-write allocate
- Performance with caches: miss rate, average memory access time, CPU time
- OOO-E vs. IO-E: some of the miss penalty can be overlapped
Next Time
- Basic cache optimizations
- Appendix C.3