Cache Performance
Samira Khan
March 28, 2017

Agenda
- Review from last lecture
  - Cache access
  - Associativity
  - Replacement
- Cache performance

Cache Abstraction and Metrics
- Address → Tag store (is the address in the cache? + bookkeeping) → hit/miss?
- Data store (stores memory blocks)
- Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
- Average memory access time (AMAT) = (hit-rate × hit-latency) + (miss-rate × miss-latency)

Direct-Mapped Cache: Placement and Access
- Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks
- Assume cache: 64 bytes, 8 blocks
- Direct-mapped: a block can go to only one location
- 8-bit address = 2-bit tag | 3-bit index | 3-bit byte-in-block
- The index selects the cache row, the stored tag is compared against the address tag, and a MUX selects the byte within the block
- Addresses with the same index contend for the same location
  - Cause conflict misses
Direct-Mapped Cache: Conflict Example
- Access pattern: A, B, A, B, A, B, ...
- A = 0b00 000 xxx and B = 0b01 000 xxx: same index (000), different tags (00 vs. 01)
- Access A: tag mismatch → MISS; fetch A's block into index 0, update tag
- Access B: tag mismatch → MISS; fetch B's block into index 0, update tag (evicting A)
- Access A again: tag mismatch → MISS; fetch block A, update tag (evicting B)
- Every access misses: A and B keep evicting each other

Set-Associative Cache: Associativity (and Tradeoffs)
- With 2 blocks per set, A and B can reside in the same set simultaneously, so after the initial misses the A, B, A, B, ... pattern HITs
- 8-bit address = 3-bit tag | 2-bit index | 3-bit byte-in-block (4 sets of 2 blocks)
- Degree of associativity: How many blocks can map to the same index (or set)?
- Higher associativity
  + Higher hit rate
  -- Slower cache access time (hit latency and data access latency)
  -- More expensive hardware (more comparators)
- Diminishing returns from higher associativity
- [Plot: hit rate vs. associativity; the curve flattens as associativity grows]
Issues in Set-Associative Caches
- Think of each block in a set as having a priority
  - Indicating how important it is to keep the block in the cache
- Key issue: How do you determine/adjust block priorities?
- There are three key decisions in a set: insertion, promotion, eviction (replacement)
- Insertion: What happens to priorities on a cache fill?
  - Where to insert the incoming block, whether or not to insert the block
- Promotion: What happens to priorities on a cache hit?
  - Whether and how to change block priority
- Eviction/replacement: What happens to priorities on a cache miss?
  - Which block to evict and how to adjust priorities

Eviction/Replacement Policy
- Which block in the set to replace on a cache miss?
  - Any invalid block first
  - If all are valid, consult the replacement policy
    - Random
    - FIFO
    - Least recently used (how to implement?)
    - Not most recently used
    - Least frequently used
    - Hybrid replacement policies

LRU Example (4-way set, access pattern: A C B D)
- Each block carries an LRU rank: 0 = most recently used, -3 = least recently used
- After accessing A, C, B, D the ranks are A(-3) C(-2) B(-1) D(0): A is the LRU block
- Next access E misses; the set is full, so a block must be evicted
LRU Example (access pattern: A C B D E)
- E misses: the LRU block A is evicted and E takes its place
- E is inserted with the highest priority (rank 0); the ranks of B, C, D are each decremented
- Resulting ranks: C(-3) B(-2) D(-1) E(0): C is now the LRU block
LRU Example (access pattern: A C B D E B)
- B hits: promote B to the highest priority (rank 0)
- Blocks that were more recently used than B (D and E) are each decremented
- Resulting ranks: C(-3) D(-2) E(-1) B(0): C is still the LRU block

Implementing LRU
- Idea: Evict the least recently accessed block
- Problem: Need to keep track of the access ordering of blocks
- Question: 2-way set associative cache: What do you need to implement LRU perfectly?
  (Hint: a single bit per set suffices)
- Question: 16-way set associative cache: What do you need to implement LRU perfectly?
  (There are 16! possible orderings, so at least log2(16!) ≈ 45 bits per set)
- What is the logic needed to determine the LRU victim?
Approximations of LRU
- Most modern processors do not implement "true LRU" (also called "perfect LRU") in highly-associative caches
- Why?
  - True LRU is complex
  - LRU is an approximation to predict locality anyway (i.e., not the best possible cache management policy)
- Examples: Not MRU (not most recently used)

Cache Replacement Policy: LRU or Random
- LRU vs. Random: Which one is better?
  - Example: 4-way cache, cyclic references to A, B, C, D, E
    - 0% hit rate with LRU policy
- Set thrashing: When the program working set in a set is larger than set associativity
  - Random replacement policy is better when thrashing occurs
- In practice: Depends on workload
  - Average hit rates of LRU and Random are similar
- Best of both worlds: hybrid of LRU and Random
  - How to choose between the two? Set sampling
  - See Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.

What Is In A Tag Store Entry?
- Valid bit
- Tag
- Replacement policy bits
- Dirty bit?
  - Write-back vs. write-through caches

Handling Writes (I)
- When do we write the modified data in a cache to the next level?
  - Write through: At the time the write happens
  - Write back: When the block is evicted
- Write-back
  + Can consolidate multiple writes to the same block before eviction
    + Potentially saves bandwidth between cache levels + saves energy
  -- Needs a bit in the tag store indicating the block is dirty/modified
- Write-through
  + Simpler
  + All levels are up to date; consistent
  -- More bandwidth intensive; no coalescing of writes
Handling Writes (II)
- Do we allocate a cache block on a write miss?
  - Allocate on write miss
  - No-allocate on write miss
- Allocate on write miss
  + Can consolidate writes instead of writing each of them individually to the next level
  + Simpler because write misses can be treated the same way as read misses
  -- Requires (?) transfer of the whole cache block
- No-allocate
  + Conserves cache space if locality of writes is low (potentially better cache hit rate)

Instruction vs. Data Caches: Separate or Unified?
- Unified:
  + Dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split I and D caches)
  -- Instructions and data can thrash each other (i.e., no guaranteed space for either)
  -- I and D are accessed in different places in the pipeline. Where do we place the unified cache for fast access?
- First-level caches are almost always split
  - Mainly for the last reason above
- Second and higher levels are almost always unified

Multi-level Caching in a Pipelined Design
- First-level caches (instruction and data)
  - Decisions very much affected by cycle time
  - Small, lower associativity; tag and data store accessed in parallel
- Second-level, third-level caches
  - Decisions need to balance hit rate and access latency
  - Usually large and highly associative; latency less critical; tag and data store accessed serially
- Serial vs. parallel access of levels
  - Serial: second-level cache accessed only if first level misses
  - Second level does not see the same accesses as the first
    - First level acts as a filter (filters some temporal and spatial locality)
    - Management policies are therefore different

Cache Performance
Cache Parameters vs. Miss/Hit Rate
- Cache size
- Block size
- Associativity
- Replacement policy
- Insertion/placement policy

Cache Size
- Cache size: total data (not including tag) capacity
  - Bigger can exploit temporal locality better
  - Not ALWAYS better
- Too large a cache adversely affects hit and miss latency
  - Smaller is faster => bigger is slower
  - Access time may degrade critical path
- Too small a cache
  - Doesn't exploit temporal locality well
  - Useful data replaced often
- Working set: the whole set of data the executing application references
  - Within a time interval
- [Plot: hit rate vs. cache size; hit rate climbs until the cache covers the working set, then flattens]

Block Size
- Block size is the data that is associated with an address tag
- Too small blocks
  - Don't exploit spatial locality well
  - Have larger tag overhead
- Too large blocks
  - Too few total # of blocks → less temporal locality exploitation
  - Waste of cache space and bandwidth/energy if spatial locality is not high
- Will see more examples later
- [Plot: hit rate vs. block size; hit rate peaks at an intermediate block size]

Associativity
- How many blocks can map to the same index (or set)?
- Larger associativity
  - Lower miss rate, less variation among programs
  - Diminishing returns, higher hit latency
- Smaller associativity
  - Lower cost
  - Lower hit latency
    - Especially important for L1 caches
- Power of 2 associativity required?
- [Plot: hit rate vs. associativity; diminishing returns]
Higher Associativity
- 3-way example: 8-bit address = 4-bit tag | 1-bit index | 3-bit byte-in-block
- More tag comparators per set; a wider MUX selects among the ways
- [Diagram: tag comparison and data selection for the higher-associativity cache]

Classification of Cache Misses
- Compulsory miss
  - First reference to an address (block) always results in a miss
  - Subsequent references should hit unless the cache block is displaced for the reasons below
- Capacity miss
  - Cache is too small to hold everything needed
  - Defined as the misses that would occur even in a fully-associative cache (with optimal replacement) of the same capacity
- Conflict miss
  - Defined as any miss that is neither a compulsory nor a capacity miss

How to Reduce Each Miss Type
- Compulsory
  - Caching cannot help
  - Prefetching
- Conflict
  - More associativity
  - Other ways to get more associativity without making the cache associative
    - Victim cache
    - Hashing
    - Software hints?
- Capacity
  - Utilize cache space better: keep blocks that will be referenced
  - Software management: divide working set such that each "phase" fits in cache
Cache Performance with Code Examples: Matrix Sum

int sum1(int matrix[4][8]) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 8; ++j) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

- Access pattern: matrix[0][0], [0][1], [0][2], ..., [1][0], ...

Exploiting Spatial Locality
- 8B cache block, 4 blocks, LRU, 4B integer (2 integers per block)
- Access pattern: matrix[0][0], [0][1], [0][2], ..., [1][0], ...
  - [0][0] → miss  (fetch block [0][0]-[0][1])
  - [0][1] → hit
  - [0][2] → miss  (fetch block [0][2]-[0][3])
  - [0][3] → hit
  - [0][4] → miss  (fetch block [0][4]-[0][5])
  - [0][5] → hit
  - [0][6] → miss  (fetch block [0][6]-[0][7])
  - [0][7] → hit
  - [1][0] → miss  (replace [0][0]-[0][1] with [1][0]-[1][1])
  - [1][1] → hit
- Block size and spatial locality
  - Larger blocks exploit spatial locality
  - But larger blocks at the same cache size means fewer blocks: less good at exploiting temporal locality
Alternate Matrix Sum

int sum2(int matrix[4][8]) {
    int sum = 0;
    // swapped loop order
    for (int j = 0; j < 8; ++j) {
        for (int i = 0; i < 4; ++i) {
            sum += matrix[i][j];
        }
    }
    return sum;
}

- Access pattern: matrix[0][0], [1][0], [2][0], [3][0], [0][1], [1][1], [2][1], [3][1], ...

Bad at Exploiting Spatial Locality
- 8B cache block, 4 blocks, LRU, 4B integer
- Access pattern: matrix[0][0], [1][0], [2][0], [3][0], [0][1], [1][1], [2][1], [3][1], ...
  - [0][0] → miss  (fetch [0][0]-[0][1])
  - [1][0] → miss  (fetch [1][0]-[1][1])
  - [2][0] → miss  (fetch [2][0]-[2][1])
  - [3][0] → miss  (fetch [3][0]-[3][1])
  - [0][1] → hit
  - [1][1] → hit
  - [2][1] → hit
  - [3][1] → hit
  - [0][2] → miss  (replace [0][0]-[0][1] with [0][2]-[0][3])
  - [1][2] → miss  (replace [1][0]-[1][1] with [1][2]-[1][3])
- The hits only happen because all four rows' blocks fit in the cache at once; with more rows than blocks, every access would miss

A Note on Matrix Storage
- An N × N matrix can be represented as a 1D (flattened) array; this makes dynamic sizes easier:
  - float A_2d_array[N][N];
  - float *A_flat = malloc(N * N * sizeof(float));
  - A_flat[i * N + j] is the same element as A_2d_array[i][j]

Matrix squaring: B[i][j] = Σ_k A[i][k] * A[k][j]

// version 1: inner loop is k, middle is j
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            B[i*N + j] += A[i*N + k] * A[k*N + j];
Walking through version 1 (i-j-k order), computing row 0 of B:

B[0][0] = Σ_k A[0][k] * A[k][0]
        = (A[0][0] * A[0][0]) + (A[0][1] * A[1][0]) + (A[0][2] * A[2][0]) + (A[0][3] * A[3][0])
- The A[0][k] terms walk along row 0 of A: A[i][k] has spatial locality
- The A[k][0] terms walk down column 0 of A: A[k][j] has NO spatial locality (stride-N accesses)

B[0][1] = Σ_k A[0][k] * A[k][1]
        = (A[0][0] * A[0][1]) + (A[0][1] * A[1][1]) + (A[0][2] * A[2][1]) + (A[0][3] * A[3][1])
B[0][2] = Σ_k A[0][k] * A[k][2]
        = (A[0][0] * A[0][2]) + (A[0][1] * A[1][2]) + (A[0][2] * A[2][2]) + (A[0][3] * A[3][2])
- Moving across row 0 of B, row 0 of A is read again each time: A[i][k] also has temporal locality

Conclusion (version 1, i-j-k order)
- A[i][k] has spatial locality
- A[k][j] has poor spatial locality (column-order, stride-N)
- B has temporal locality (each B[i][j] is accumulated across the whole inner loop)
B[i][j] = Σ_k A[i][k] * A[k][j]

// version 2: outer loop is k, middle is i
for (int k = 0; k < N; ++k)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i*N + j] += A[i*N + k] * A[k*N + j];

Access pattern, k = 0, i = 0:
  B[0][0] += A[0][0] * A[0][0]
  B[0][1] += A[0][0] * A[0][1]
  B[0][2] += A[0][0] * A[0][2]
  B[0][3] += A[0][0] * A[0][3]
Access pattern, k = 0, i = 1:
  B[1][0] += A[1][0] * A[0][0]
  B[1][1] += A[1][0] * A[0][1]
  B[1][2] += A[1][0] * A[0][2]
  B[1][3] += A[1][0] * A[0][3]

k-i-j order:
- B and A[k][j] are accessed along rows: B, A[k][j] have spatial locality
- A[i][k] is a single fixed element during the inner loop: A[i][k] has temporal locality
Which order is better?
- The k-i-j order (version 2) performs much better: every array touched in its inner loop is accessed with spatial locality