Computer Architecture ELEC3441
Lecture 9: Cache (2)
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering

Causes of Cache Misses: The 3 C's
- Compulsory: first reference to a line (a.k.a. cold-start misses); misses that would occur even with an infinite cache
- Capacity: cache is too small to hold all data needed by the program; misses that would occur even under a perfect replacement policy
- Conflict: misses that occur because of collisions due to the line-placement strategy; misses that would not occur with ideal full associativity

AMAT
- Average Memory Access Time: AMAT = Hit Time + Miss Rate × Miss Penalty

Example
- Processor runs at 2 GHz with CPI = 1. Miss penalty to memory is 50 clock cycles. The L1 cache returns data in 1 cycle on a cache hit. On a particular program, the instruction miss rate is 1%. Loads/stores make up 30% of dynamic instructions and have a miss rate of 5%. Assume read/write penalties are the same and ignore other stalls.
- What is the AMAT for instructions/data?
- What is the average CPI given the above memory access times?
Example: AMAT
- Instruction cache: AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 1% × 50 = 1.5 cycles
- Data cache: AMAT = Hit Time + Miss Rate × Miss Penalty = 1 + 5% × 50 = 3.5 cycles

Average CPI (with Memory)
- Instruction memory miss cycles per instruction = 1% × 50 = 0.5
- Data memory miss cycles per instruction = 30% × 5% × 50 = 0.75
- Total memory stall cycles per instruction = 0.5 + 0.75 = 1.25
- Average CPI = 1 + 1.25 = 2.25

CPU-Cache Interaction (5-stage pipeline)
[Figure: 5-stage pipeline datapath with a primary instruction cache in the fetch stage and a primary data cache in the memory stage; cache refills bring data from lower levels of the memory hierarchy; the entire CPU stalls on a data cache miss.]

Improving Cache Performance
- Average memory access time (AMAT) = Hit time + Miss rate × Miss penalty
- To improve performance:
  - reduce the hit time
  - reduce the miss rate
  - reduce the miss penalty
- What is the best cache design for a 5-stage pipeline? The biggest cache that doesn't increase hit time past 1 cycle (approx. 8-32KB in modern technology)
- [Design issues are more complex with deeper pipelines and/or out-of-order superscalar processors]
Effect of Cache Parameters on Performance
- Larger cache size
  + reduces capacity and conflict misses
  - hit time will increase
- Higher associativity
  + reduces conflict misses
  - may increase hit time
- Larger line size
  + reduces compulsory and capacity (reload) misses
  - increases conflict misses and miss penalty

Performance vs. Associativity
[Figure: miss rate per type vs. cache size (4KB-1024KB), separating capacity misses from conflict misses for one-way, two-way, four-way, and eight-way associativity; and miss rate (0-15%) vs. associativity for cache sizes from 1KB to 128KB.]
- 1-way → 2-way: significant drop in miss rate
- 2-way → 4-way: less significant drop
- Effect of associativity is significant in small caches

Write Policy Choices
- Cache hit:
  - write through: write both cache & memory; generally higher traffic but simpler pipeline & cache design
  - write back: write cache only; memory is written only when the entry is evicted; a dirty bit per line further reduces write-back traffic; must handle 0, 1, or 2 accesses to memory for each load/store
- Cache miss:
  - no write allocate: only write to main memory
  - write allocate (aka fetch on write): fetch into cache
- Common combinations:
  - write through and no write allocate
  - write back with write allocate

Write Hit (Cache Writing)
[Figure: cache write-hit datapath; the address splits into tag, index, and offset bits; on a tag match (HIT), the write enable updates the selected data word or byte in the indexed line. Write back: done. Write through: write also to memory.]
Write Through via Write Buffer
[Figure: Processor writes into both the cache ($) and a write buffer; the write buffer drains to DRAM.]
- Processor writes to both $ and write buffer
- Memory write completes as soon as data is in the write buffer
- Memory controller completes the write to DRAM offline
- Writing too fast may saturate the write buffer

Read Miss with Write Buffer
- On a read miss, need to read memory to fill the cache
- But data may still be in the write buffer pending write to DRAM
- 2 solutions:
  - Flush the write buffer before the read
  - Check all pending writes in the write buffer and return the latest write data if the address matches
- Q: Would there be data in the write buffer that needs to be forwarded on a read hit?

Write Miss
- A write miss happens when the write location is not in the cache
- Write allocate:
  - At the end of the write, the cache contains a full line of data
  - Need to read from memory
  - Write back: must have write allocate
  - Write through: may or may not
- No write allocate: data go straight to memory

Multilevel Caches
- Problem: a memory cannot be large and fast
- Solution: increasing sizes of cache at each level
  CPU → L1$ → L2$ → DRAM
- Local miss rate = misses in cache / accesses to cache
- Global miss rate = misses in cache / CPU memory accesses
- Misses per instruction = misses in cache / number of instructions
Presence of L2 influences L1 design
- Use a smaller L1 if there is also an L2
  - Trade increased L1 miss rate for reduced L1 hit time
  - Backup L2 reduces L1 miss penalty
  - Reduces average access energy
- Use a simpler write-through L1 with on-chip L2
  - Write-back L2 cache absorbs write traffic, doesn't go off-chip
  - At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
  - Simplifies coherence issues
  - Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

Inclusion Policy
- Inclusive multilevel cache: inner cache can only hold lines also present in outer cache
  - External coherence snoop accesses need only check the outer cache
- Exclusive multilevel caches: inner cache may hold lines not in outer cache
  - Swap lines between inner/outer caches on miss
  - Used in AMD Athlon with 64KB primary and 256KB secondary cache
- Why choose one type or the other?

L1 vs L2 Miss Rate
[Figure: L1 vs. L2 data cache miss rates (0-25%) for the ARM Cortex-A8 running MinneSPEC benchmarks: twolf, bzip2, gzip, parser, gap, vpr, perlbmk, gcc, crafty, vortex, eon, mcf.]
- Miss rate of L2$ is usually much lower than L1$
- L2 usually has:
  - Higher capacity
  - Higher associativity
- Only missed L1 accesses arrive at L2

Itanium-2 On-Chip Caches (Intel/HP, 2002)
- Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve-cycle latency
Power 7 On-Chip Caches [IBM 2009]
- 32KB L1 I$/core
- 32KB L1 D$/core, 3-cycle latency
- 256KB unified L2$/core, 8-cycle latency
- 32MB unified shared L3$, embedded DRAM (eDRAM), 25-cycle latency to local slice

IBM z196 Mainframe Caches 2010
- 96 cores (4 cores/chip, 24 chips/system)
- Out-of-order, 3-way superscalar @ 5.2GHz
- L1: 64KB I-$/core + 128KB D-$/core
- L2: 1.5MB private/core (144MB total)
- L3: 24MB shared/chip (eDRAM) (576MB total)
- L4: 768MB shared/system (eDRAM)

Prefetching
- Speculate on future instruction and data accesses and fetch them into cache(s)
  - Instruction accesses are easier to predict than data accesses
- Varieties of prefetching
  - Hardware prefetching
  - Software prefetching
  - Mixed schemes
- What types of misses does prefetching affect?

Issues in Prefetching
- Usefulness: should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution
[Figure: CPU with register file, L1 instruction and L1 data caches, and a unified L2 cache holding prefetched data.]
Hardware Instruction Prefetching
- Instruction prefetch in Alpha AXP 21064
  - Fetch two lines on a miss: the requested line (i) and the next consecutive line (i+1)
  - Requested line placed in cache, and next line in the instruction stream buffer
  - If miss in cache but hit in stream buffer, move stream buffer line into cache and prefetch next line (i+2)
[Figure: CPU fetches the requested line from the L1 instruction cache; prefetched instruction lines flow from the unified L2 cache into a stream buffer alongside L1.]

Hardware Data Prefetching
- Prefetch-on-miss: prefetch b + 1 upon a miss on b
- One-Block Lookahead (OBL) scheme
  - Initiate a prefetch for block b + 1 when block b is accessed
  - Why is this different from doubling the block size?
  - Can extend to N-block lookahead
- Strided prefetch
  - If a sequence of accesses to lines b, b+N, b+2N is observed, then prefetch b+3N, etc.
- Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access

Software Prefetching

for(i=0; i < N; i++) {
  prefetch( &a[i + 1] );
  prefetch( &b[i + 1] );
  SUM = SUM + a[i] * b[i];
}

Software Prefetching Issues
- Timing is the biggest issue, not predictability
  - If you prefetch very close to when the data is required, you might be too late
  - Prefetch too early, and you cause pollution
  - Estimate how long it will take for the data to come into L1, so we can set P appropriately. Why is this hard to do?

for(i=0; i < N; i++) {
  prefetch( &a[i + P] );
  prefetch( &b[i + P] );
  SUM = SUM + a[i] * b[i];
}

- Must consider the cost of the prefetch instructions themselves
Compiler Optimizations
- Restructuring code affects the data access sequence
  - Group data accesses together to improve spatial locality
  - Re-order data accesses to improve temporal locality
- Prevent data from entering the cache
  - Useful for variables that will only be accessed once before being replaced
  - Needs a mechanism for software to tell hardware not to cache data ("no-allocate" instruction hints or page table bits)
- Kill data that will never be used again
  - Streaming data exploits spatial locality but not temporal locality
  - Replace into dead cache locations

Loop Interchange

for(j=0; j < N; j++) {
  for(i=0; i < M; i++) {
    x[i][j] = 2 * x[i][j];
  }
}

for(i=0; i < M; i++) {
  for(j=0; j < N; j++) {
    x[i][j] = 2 * x[i][j];
  }
}

What type of locality does this improve?

Loop Fusion

for(i=0; i < N; i++)
  a[i] = b[i] * c[i];
for(i=0; i < N; i++)
  d[i] = a[i] * c[i];

for(i=0; i < N; i++) {
  a[i] = b[i] * c[i];
  d[i] = a[i] * c[i];
}

What type of locality does this improve?

Matrix Multiply, Naive Code

for(i=0; i < N; i++)
  for(j=0; j < N; j++) {
    r = 0;
    for(k=0; k < N; k++)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

[Figure: access patterns in y, z, and x during the naive multiply; the legend distinguishes untouched, old, and new accesses.]
Matrix Multiply with Cache Tiling

for(jj=0; jj < N; jj=jj+B)
  for(kk=0; kk < N; kk=kk+B)
    for(i=0; i < N; i++)
      for(j=jj; j < min(jj+B,N); j++) {
        r = 0;
        for(k=kk; k < min(kk+B,N); k++)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }

What type of locality does this improve?
[Figure: blocked access patterns in y, z, and x during the tiled multiply.]

Acknowledgements
- These slides contain material developed and copyright by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
  - John Lazzaro (UCB)
- MIT material derived from course 6.823
- UCB material derived from course CS152, CS252