Lecture 4: Cache Memories

Topics
  Generic cache memory organization
  Direct-mapped caches
  Set associative caches
  Impact of caches on performance

Cache Memories
  Cache memories are small, fast SRAM-based memories managed automatically in hardware.
    They hold frequently accessed blocks of main memory.
  The CPU looks first for data in L1, then in L2, then in main memory.
  Typical bus structure:
  [Figure: the CPU chip holds the register file, ALU, L1 cache, and bus interface. The L2 cache is attached over the cache bus. The bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.]

F4 2 Datorarkitektur 28

Inserting an L1 Cache Between the CPU and Main Memory
  The transfer unit between the CPU register file and the cache is a 4-byte block.
  The transfer unit between the cache and main memory is a 4-word block (16 bytes).
  The tiny, very fast CPU register file has room for four 4-byte words.
  The small, fast L1 cache has room for two 4-word blocks (line 0 and line 1).
  The big, slow main memory has room for many 4-word blocks.
  [Figure: main memory blocks 10 (a b c d ...), 21 (p q r s ...), and 30 (w x y z ...) below the two-line L1 cache and the register file.]

General Organization of a Cache Memory
  A cache is an array of sets: S = 2^s sets, numbered set 0 to set S-1.
  Each set contains one or more lines: E lines per set.
  Each line holds a block of data: B = 2^b bytes per cache block.
  Each line also carries 1 valid bit and t tag bits.
  Cache size: C = B x E x S data bytes
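The size formula can be checked with a few lines of C. This is a minimal sketch: the struct `cache_geometry` and the functions `cache_capacity` and `tag_bits` are hypothetical names introduced here, and the relation t = m - (s + b) assumes an m-bit address split into tag, set index, and block offset with no bits left over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the parameters that fix a cache's geometry. */
struct cache_geometry {
    unsigned s;   /* set index bits, so S = 2^s sets */
    unsigned b;   /* block offset bits, so B = 2^b bytes per block */
    unsigned E;   /* lines per set */
};

/* Data capacity in bytes: C = B x E x S (valid and tag bits are not counted). */
size_t cache_capacity(struct cache_geometry g)
{
    assert(g.E > 0);
    size_t S = (size_t)1 << g.s;
    size_t B = (size_t)1 << g.b;
    return B * g.E * S;
}

/* Tag bits for an m-bit address: t = m - (s + b). */
unsigned tag_bits(struct cache_geometry g, unsigned m)
{
    assert(m >= g.s + g.b);
    return m - (g.s + g.b);
}
```

For example, a cache with 128 sets (s = 7), 32-byte blocks (b = 5), and 4 lines per set holds 32 x 4 x 128 = 16384 data bytes, which is 16 KB.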
Addressing Caches
  An m-bit address A is divided into three fields, from bit m-1 down to bit 0:
    <tag> (t bits), <set index> (s bits), <block offset> (b bits)
  The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.
  The word contents begin at offset <block offset> bytes from the beginning of the block.
  [Figure: sets 0 through S-1, each line holding a valid bit, a tag, and a block of B bytes.]

Direct-Mapped Cache
  Simplest kind of cache.
  Characterized by exactly one line per set (E = 1 lines per set).
  [Figure: sets 0 through S-1, each with a single line made up of a valid bit, a tag, and a cache block.]

Accessing Direct-Mapped Caches
  Set selection
    Use the set index bits to determine the set of interest.
    [Figure: the s set index bits of the address pick the selected set out of sets 0 through S-1.]
  Line matching and word selection
    Line matching: find a valid line in the selected set with a matching tag.
    Word selection: then extract the word.
    (1) The valid bit must be set.
    (2) The tag bits in the cache line must match the tag bits in the address.
    (3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
    [Figure: selected set (i) with an 8-byte block, bytes 0 to 7, holding the word bytes w0 to w3. The valid bit and the tag are each checked with an =? comparison against the address fields.]
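The three access steps (set selection, line matching, word selection) map directly onto shifts and masks. The sketch below is illustrative only: `addr_fields`, `split_address`, `dm_line`, and `dm_hit` are hypothetical names, and addresses are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout: | tag (t bits) | set index (s bits) | block offset (b bits) | */
struct addr_fields {
    uint32_t tag;
    uint32_t set_index;
    uint32_t block_offset;
};

/* Split an address into its three fields given s and b. */
struct addr_fields split_address(uint32_t addr, unsigned s, unsigned b)
{
    assert(s + b < 32);
    struct addr_fields f;
    f.block_offset = addr & ((1u << b) - 1u);          /* low b bits */
    f.set_index    = (addr >> b) & ((1u << s) - 1u);   /* middle s bits */
    f.tag          = addr >> (s + b);                  /* remaining high bits */
    return f;
}

/* One line per set: a valid bit and a tag (the data block is omitted). */
struct dm_line {
    bool valid;
    uint32_t tag;
};

/* Hit if the selected set's line is valid and its tag matches the address tag. */
bool dm_hit(const struct dm_line *sets, uint32_t addr, unsigned s, unsigned b)
{
    struct addr_fields f = split_address(addr, s, b);
    const struct dm_line *line = &sets[f.set_index];
    return line->valid && line->tag == f.tag;
}
```

With s = 2 and b = 1, address 13 (binary 1101) splits into tag 1, set index 2, and block offset 1.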
Direct-Mapped Cache Simulation
  M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
  Address fields: t = 1 tag bit, s = 2 set index bits, b = 1 block offset bit
  Address trace (reads): 0 [0000₂], 1 [0001₂], 13 [1101₂], 8 [1000₂], 0 [0000₂]
  (1) 0 [0000₂] (miss): set 0 is loaded with v = 1, tag = 0, data = M[0-1]
  (2) 1 [0001₂] (hit): same block as address 0
  (3) 13 [1101₂] (miss): set 2 is loaded with v = 1, tag = 1, data = M[12-13]
  (4) 8 [1000₂] (miss): set 0 is replaced with v = 1, tag = 1, data = M[8-9]
  (5) 0 [0000₂] (miss): set 0 is replaced again with v = 1, tag = 0, data = M[0-1]

Why Use Middle Bits as Index?
  [Figure: a 4-line cache (lines 00, 01, 10, 11) next to 16 memory lines, colored once by high-order bit indexing and once by middle-order bit indexing.]
  High-order bit indexing
    Adjacent memory lines would map to the same cache entry.
    Poor use of spatial locality.
  Middle-order bit indexing
    Consecutive memory lines map to different cache lines.
    Can hold a C-byte region of the address space in the cache at one time.

Set Associative Caches
  Characterized by more than one line per set.
  [Figure: sets 0 through S-1 with E = 2 lines per set, each line holding a valid bit, a tag, and a cache block.]

Accessing Set Associative Caches
  Set selection
    Identical to a direct-mapped cache: the s set index bits of the address pick the selected set.
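The direct-mapped simulation can be replayed in code. The sketch below assumes the parameters of that example (4-bit addresses, s = 2, b = 1, one line per set) and the read trace 0, 1, 13, 8, 0; `simulate_direct_mapped` and the `line` struct are hypothetical names, and only valid bits and tags are modeled, not the data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { S_BITS = 2, B_BITS = 1, NUM_SETS = 1 << S_BITS };

struct line {
    bool valid;
    unsigned tag;
};

/* Run a read trace through a cold direct-mapped cache.
   hit[i] is set to true when access i hits. Returns the number of misses. */
size_t simulate_direct_mapped(const unsigned *trace, size_t n, bool *hit)
{
    assert(trace != NULL && hit != NULL);
    struct line cache[NUM_SETS] = {{false, 0}};   /* cold cache: all lines invalid */
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> B_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[i] >> (S_BITS + B_BITS);

        hit[i] = cache[set].valid && cache[set].tag == tag;
        if (!hit[i]) {
            /* On a miss the block is fetched and replaces whatever the set held. */
            cache[set].valid = true;
            cache[set].tag = tag;
            misses++;
        }
    }
    return misses;
}
```

Addresses 0 and 8 both map to set 0 with different tags, so they keep evicting each other. That is why the final read of address 0 misses even though the block was loaded at the start of the trace.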
Accessing Set Associative Caches
  Line matching and word selection
    Must compare the tag in each valid line in the selected set.
    (1) The valid bit must be set.
    (2) The tag bits in one of the cache lines must match the tag bits in the address.
    (3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
    [Figure: selected set (i) with two lines. The matching line holds an 8-byte block, bytes 0 to 7, containing the word bytes w0 to w3.]

Multi-Level Caches
  Options: separate data and instruction caches, or a unified cache.
  [Figure: Processor (Regs, L1 d-cache, L1 i-cache), Unified L2 Cache, Memory, disk.]

                Regs     L1 d-cache / i-cache   Unified L2 cache   Memory        Disk
    size:       200 B    8-64 KB                1-4 MB SRAM        128 MB DRAM   30 GB
    speed:      3 ns     3 ns                   6 ns               60 ns         8 ms
    $/Mbyte:                                    $100/MB            $1.50/MB      $0.05/MB
    line size:  8 B      32 B                   32 B               8 KB

  From left to right, each level is larger, slower, and cheaper.

Intel Pentium Cache Hierarchy
  Processor chip
    Regs.
    L1 Data: 16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency
    L1 Instruction: 16 KB, 4-way, 32B lines
  L2 Unified: 128KB-2 MB, 4-way assoc, write-back, write allocate, 32B lines
  Main Memory: up to 4GB

Cache Performance Metrics
  Miss Rate
    Fraction of memory references not found in cache (misses/references).
    Typical numbers:
      3-10% for L1
      can be quite small (e.g., < 1%) for L2, depending on size, etc.
  Hit Time
    Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache).
    Typical numbers:
      1 clock cycle for L1
      3-8 clock cycles for L2
  Miss Penalty
    Additional time required because of a miss.
    Typically 25-100 cycles for main memory.
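Line matching in a set associative cache differs from the direct-mapped case only in that every line of the selected set has to be checked. The following is a minimal sketch under assumed names (`sa_line`, `sa_set`, `sa_lookup`) for a 2-way cache; it models valid bits and tags only and does not implement any replacement policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { E_LINES = 2 };   /* assumption: 2 lines per set, as in the E = 2 figure */

struct sa_line {
    bool valid;
    uint32_t tag;
};

struct sa_set {
    struct sa_line lines[E_LINES];
};

/* Return the index of the matching line in the selected set, or -1 on a miss. */
int sa_lookup(const struct sa_set *sets, uint32_t addr, unsigned s, unsigned b)
{
    assert(s + b < 32);
    uint32_t set_index = (addr >> b) & ((1u << s) - 1u);   /* set selection */
    uint32_t tag = addr >> (s + b);
    const struct sa_set *set = &sets[set_index];

    /* Line matching: compare the tag in each valid line of the set. */
    for (int i = 0; i < E_LINES; i++) {
        if (set->lines[i].valid && set->lines[i].tag == tag)
            return i;
    }
    return -1;
}
```

With s = 1 and b = 1, this cache has the same 8-byte capacity as a direct-mapped cache with 4 sets of 2-byte blocks, yet the blocks holding addresses 0 and 8 can sit side by side in set 0 instead of evicting each other.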
Writing Cache Friendly Code
  Repeated references to variables are good (temporal locality).
  Stride-1 reference patterns are good (spatial locality).
  Examples: cold cache, 4-byte words, 4-word cache blocks.

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    Miss rate = 1/4 = 25%

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

    Miss rate = 100%

The Memory Mountain
  Read throughput (read bandwidth)
    Number of bytes read from memory per second (MB/s).
  Memory mountain
    Measured read throughput as a function of spatial and temporal locality.
    Compact way to characterize memory system performance.

Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
        return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
    }

Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    /* fcyc2(), mhz(), and init_data() are helper routines not shown on the slides. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
    #define MAXBYTES (1 << 23)  /* ... up to 8 MB */
    #define MAXSTRIDE 16        /* Strides range from 1 to 16 */
    #define MAXELEMS MAXBYTES/sizeof(int)

    int data[MAXELEMS];         /* The array we'll be traversing */

    int main()
    {
        int size;        /* Working set size (in bytes) */
        int stride;      /* Stride (in array elements) */
        double Mhz;      /* Clock frequency */

        init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
        Mhz = mhz(0);              /* Estimate the clock frequency */
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run(size, stride, Mhz));
            printf("\n");
        }
        exit(0);
    }
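The 25% and 100% miss rates quoted for the row-wise and column-wise array sums can be reproduced with a small cache model instead of a timing run. This is a sketch under stated assumptions: a cold direct-mapped cache of 64 sets with 16-byte blocks (1 KB in total), an array of 4-byte ints stored in row-major order starting at address 0, and a hypothetical function name `count_misses`. The cache size is chosen to be much smaller than the array, which is what makes the column-wise walk miss on every access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    BLOCK_BYTES = 16,   /* 4-word blocks of 4-byte words */
    NUM_SETS    = 64    /* assumption: 64 sets x 16 bytes = 1 KB direct-mapped cache */
};

/* Count misses in a cold direct-mapped cache while reading every element of a
   rows x cols int array once, either row by row or column by column. */
size_t count_misses(size_t rows, size_t cols, bool by_rows)
{
    assert(rows > 0 && cols > 0);
    bool valid[NUM_SETS] = {false};
    size_t tags[NUM_SETS] = {0};
    size_t misses = 0;
    size_t outer_n = by_rows ? rows : cols;
    size_t inner_n = by_rows ? cols : rows;

    for (size_t outer = 0; outer < outer_n; outer++) {
        for (size_t inner = 0; inner < inner_n; inner++) {
            size_t i = by_rows ? outer : inner;
            size_t j = by_rows ? inner : outer;
            size_t addr  = 4 * (i * cols + j);   /* byte offset of a[i][j] */
            size_t block = addr / BLOCK_BYTES;
            size_t set   = block % NUM_SETS;
            size_t tag   = block / NUM_SETS;

            if (!valid[set] || tags[set] != tag) {
                valid[set] = true;
                tags[set] = tag;
                misses++;
            }
        }
    }
    return misses;
}
```

For a 64 x 64 array (4096 reads), the row-wise walk misses once per 4-element block, and the column-wise walk finds that every block has been evicted before it is touched again.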
The Memory Mountain
  [Figure: the memory mountain for a Pentium III Xeon (550 MHz, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache). Axes: read throughput (MB/s), stride (words, s1 to s15), and working set size (bytes, 2k to 8m). Labeled features: slopes of spatial locality, ridges of temporal locality, and the L1, L2, and mem regions.]

Ridges of Temporal Locality
  A slice through the memory mountain with stride=1 illuminates the read throughputs of the different caches and of memory.
  [Figure: read throughput (MB/s) against working set size (bytes, 8m down to 1k), divided into the main memory region, the L2 cache region, and the L1 cache region.]

A Slope of Spatial Locality
  A slice through the memory mountain with size=256 KB shows the cache block size.
  [Figure: read throughput (MB/s) against stride (words, s1 to s16). The curve falls and then flattens at the point marked "one access per cache line".]

Matrix Multiplication Example
  Major cache effects to consider
    Total cache size
      Exploit temporal locality and keep the working set small (e.g., by using blocking).
    Block size
      Exploit spatial locality.
  Description:
    Multiply N x N matrices.
    O(N^3) total operations.
    Accesses
      N reads per source element.
      N values summed per destination, but may be able to hold in a register.

    /* ijk */
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;                   /* variable sum held in register */
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }
Miss Rate Analysis for Matrix Multiply
  Assume:
    Line size = 32B (big enough for four 64-bit words).
    Matrix dimension (N) is very large, so approximate 1/N as 0.0.
    Cache is not even big enough to hold multiple rows.
  Analysis method:
    Look at the access pattern of the inner loop.
    [Figure: A indexed by (i, k), B indexed by (k, j), C indexed by (i, j).]

Layout of C Arrays in Memory (review)
  C arrays are allocated in row-major order.
    Each row is in contiguous memory locations.
  Stepping through columns in one row:
      for (i = 0; i < N; i++)
        sum += a[0][i];
    Accesses successive elements.
    If block size (B) > 4 bytes, exploit spatial locality.
      Compulsory miss rate = 4 bytes / B
  Stepping through rows in one column:
      for (i = 0; i < n; i++)
        sum += a[i][0];
    Accesses distant elements.
    No spatial locality!
      Compulsory miss rate = 1 (i.e. 100%)

Matrix Multiplication (ijk)

    /* ijk */
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

  Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
  Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0

Matrix Multiplication (jik)

    /* jik */
    for (j=0; j<n; j++) {
      for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

  Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
  Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (kij)

    /* kij */
    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

  Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
  Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Matrix Multiplication (ikj)

    /* ikj */
    for (i=0; i<n; i++) {
      for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

  Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
  Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25

Matrix Multiplication (jki)

    /* jki */
    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

  Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
  Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0

Matrix Multiplication (kji)

    /* kji */
    for (k=0; k<n; k++) {
      for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

  Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
  Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
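All six loop orders compute the same product; only the memory access pattern changes. The sketch below makes that concrete for one representative of each class (ijk, kij, jki). The fixed size `N = 4`, the function names, and the helper `same_matrix` are assumptions made for illustration. Unlike the slide fragments, the kij and jki versions zero `c` first, because they accumulate into it.

```c
#include <assert.h>

enum { N = 4 };   /* assumption: a tiny fixed size, just to compare results */

/* Set every element of c to 0.0. */
static void zero_matrix(double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
}

/* ijk: row of a times column of b, accumulated in a local variable. */
void mm_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != c && b != c);   /* output must not alias an input */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* kij: a[i][k] is fixed in the inner loop; b and c are walked row-wise. */
void mm_kij(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != c && b != c);
    zero_matrix(c);
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* jki: b[k][j] is fixed in the inner loop; a and c are walked column-wise. */
void mm_jki(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != c && b != c);
    zero_matrix(c);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}

/* Return 1 if x and y are element-wise equal, 0 otherwise. */
int same_matrix(double x[N][N], double y[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (x[i][j] != y[i][j])
                return 0;
    return 1;
}
```

Exact comparison is safe here only when the inputs are small integers stored as doubles. With general floating-point data the orders can differ in the last bits, because the additions happen in a different sequence.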
Summary of Matrix Multiplication

  ijk (& jik): 2 loads, 0 stores, misses/iter = 1.25

    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

  kij (& ikj): 2 loads, 1 store, misses/iter = 0.5

    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

  jki (& kji): 2 loads, 1 store, misses/iter = 2.0

    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

Pentium Matrix Multiply Performance
  Miss rates are helpful but not perfect predictors.
    Code scheduling matters, too.
  [Figure: cycles/iteration against array size (n), n from 25 to 400, with one curve per loop order: kji, jki, kij, ikj, jik, ijk.]

Improving Temporal Locality by Blocking
  Example: blocked matrix multiplication
    "Block" (in this context) does not mean "cache block".
    Instead, it means a sub-block within the matrix.
    Example: N = 8; sub-block size = 4

      [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
      [ A21 A22 ] X [ B21 B22 ] = [ C21 C22 ]

  Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

      C11 = A11 B11 + A12 B21      C12 = A11 B12 + A12 B22
      C21 = A21 B11 + A22 B21      C22 = A21 B12 + A22 B22

Blocked Matrix Multiply (bijk)

    for (jj=0; jj<n; jj+=bsize) {
      for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
          c[i][j] = 0.0;
      for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
          for (j=jj; j < min(jj+bsize,n); j++) {
            sum = 0.0;
            for (k=kk; k < min(kk+bsize,n); k++) {
              sum += a[i][k] * b[k][j];
            }
            c[i][j] += sum;
          }
        }
      }
    }
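The bijk fragment can be turned into a self-contained function and checked against the plain ijk version. This is a sketch under assumptions: the matrix size is fixed at `N = 8` to match the example, `min_int` stands in for the `min` used on the slide, and `mm_bijk` and `mm_ijk` are names chosen here.

```c
#include <assert.h>

enum { N = 8 };   /* assumption: matches the N = 8 example */

static int min_int(int x, int y)
{
    return x < y ? x : y;
}

/* Blocked multiply (bijk): works on bsize-wide column strips of b and c so
   that one bsize x bsize block of b is reused across all n rows of a. */
void mm_bijk(double a[N][N], double b[N][N], double c[N][N], int bsize)
{
    assert(bsize > 0);
    for (int jj = 0; jj < N; jj += bsize) {
        /* Clear the strip of c that this jj iteration accumulates into. */
        for (int i = 0; i < N; i++)
            for (int j = jj; j < min_int(jj + bsize, N); j++)
                c[i][j] = 0.0;
        for (int kk = 0; kk < N; kk += bsize) {
            for (int i = 0; i < N; i++) {
                for (int j = jj; j < min_int(jj + bsize, N); j++) {
                    double sum = 0.0;
                    for (int k = kk; k < min_int(kk + bsize, N); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;   /* partial sum for this kk block */
                }
            }
        }
    }
}

/* Unblocked reference version (ijk) used only for comparison. */
void mm_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```

The `min_int` bounds let `bsize` be any positive value, including one that does not divide N evenly, at the cost of a shorter final block.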
Blocked Matrix Multiply Analysis
  The innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into a 1 X bsize sliver of C.
  The loop over i steps through n row slivers of A and C, using the same block of B.

    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {   /* innermost loop pair starts here */
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }

  [Figure: A (row i, columns from kk), B (rows from kk, columns from jj), C (row i, columns from jj). The row sliver of A is accessed bsize times, the block of B is reused n times in succession, and successive elements of the sliver of C are updated.]

Pentium Blocked Matrix Multiply Performance
  Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik).
    Relatively insensitive to array size.
  [Figure: cycles/iteration against array size (n), n from 25 to 400, with curves for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]

Concluding Observations
  The programmer can optimize for cache performance:
    How data structures are organized.
    How data are accessed.
      Nested loop structure.
      Blocking is a general technique.
  All systems favor "cache friendly code".
    Getting absolute optimum performance is very platform specific.
      Cache sizes, line sizes, associativities, etc.
    Can get most of the advantage with generic code.
      Keep the working set reasonably small (temporal locality).
      Use small strides (spatial locality).