DECstation 5000 Miss Rates

[Figure: miss rate (%) vs. cache size (1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB) for the instruction cache, data cache, and a unified cache. Direct-mapped caches with 32-byte blocks; 75% of references are instruction references.]

Cache Performance Measures

- Hit rate: fraction of accesses found in that level
  - Usually so high that we talk about the miss rate instead
  - Miss-rate fallacy: miss rate is to memory-hierarchy performance what MIPS is to CPU performance, a number that can mislead
- Average memory-access time = Hit time + Miss rate * Miss penalty (ns or clock cycles)
- Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
  - Access time: time to reach the lower level = f(latency of the lower level)
  - Transfer time: time to transfer the block = f(bandwidth between the levels)

Cache Performance Improvements

Which has the lower miss rate:
- a 16-KB instruction cache together with a 16-KB data cache, or
- a 32-KB unified cache?

Assume:
- Mix: 75% instruction references, 25% data references
- Miss rates: 0.64% for the 16-KB instruction cache, 6.47% for the 16-KB data cache, 1.99% for the 32-KB unified cache
- Hit time = 1 cycle
- Miss penalty = 50 cycles
- A load/store hit takes 2 cycles on the unified cache, because its single port makes the data access wait behind the instruction fetch

Overall miss rate for the split caches = 0.75 * 0.64% + 0.25 * 6.47% = 2.10%
Miss rate for the unified cache = 1.99%, so the unified cache has the lower miss rate.

Average memory access times:
- Split = 0.75 * (1 + 0.64% * 50) + 0.25 * (1 + 6.47% * 50) = 2.05 cycles
- Unified = 0.75 * (1 + 1.99% * 50) + 0.25 * (1 + 1 + 1.99% * 50) = 2.24 cycles

Despite its higher miss rate, the split organization has the lower average memory access time. (The short C sketch after the miss-type list below reproduces this arithmetic.)

Average memory-access time = Hit time + Miss rate * Miss penalty

Cache optimizations:
- Reducing the miss rate
- Reducing the miss penalty
- Reducing the hit time

Cache Performance Equations

CPU time = (CPU execution cycles + Memory stall cycles) * Cycle time
Memory stall cycles = Memory accesses * Miss rate * Miss penalty
CPU time = IC * (CPI_execution + Memory accesses per instruction * Miss rate * Miss penalty) * Cycle time
Misses per instruction = Memory accesses per instruction * Miss rate
CPU time = IC * (CPI_execution + Misses per instruction * Miss penalty) * Cycle time

Types of Cache Misses

- Compulsory: first-reference or cold-start misses
- Capacity: the working set is too big for the cache; these misses occur even in a fully associative cache
- Conflict (collision): too many blocks map to the same block frame (line); affects set-associative and direct-mapped caches
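As a check on the split-vs-unified example above, here is a minimal C sketch (mine, not from the slides) that plugs the assumed miss rates into the average-memory-access-time formula; the helper name amat and the hard-coded constants are purely illustrative.

    #include <stdio.h>

    /* Average memory-access time = hit time + miss rate * miss penalty */
    static double amat(double hit, double miss_rate, double penalty) {
        return hit + miss_rate * penalty;
    }

    int main(void) {
        const double penalty = 50.0;  /* cycles */
        /* Split: 16-KB I-cache (0.64%) and 16-KB D-cache (6.47%) */
        double split = 0.75 * amat(1.0, 0.0064, penalty)
                     + 0.25 * amat(1.0, 0.0647, penalty);
        /* Unified 32-KB cache (1.99%); a load/store hit costs 2 cycles
           because the data access waits behind the instruction fetch. */
        double unified = 0.75 * amat(1.0, 0.0199, penalty)
                       + 0.25 * amat(2.0, 0.0199, penalty);
        /* Prints roughly 2.05 and 2.24, matching the slides. */
        printf("split = %.2f, unified = %.2f cycles\n", split, unified);
        return 0;
    }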
3Cs Absolute and Relative Miss Rates

[Figures: miss rate vs. cache size (1 KB to 128 KB), split into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way associativity; one plot shows absolute miss rates, the other the same data scaled to 100%. The conflict component shrinks as associativity grows, illustrating the 2:1 cache rule.]

Reducing the 3Cs Cache Misses

1. Larger block size
2. Higher associativity
3. Victim caches
4. Pseudo-associative caches
5. Hardware prefetching
6. Compiler-controlled prefetching
7. Compiler optimizations

Execution time is the only final measure.

1. Larger Block Size

Effects of larger block sizes:
- Fewer compulsory misses, thanks to spatial locality
- Higher miss penalty (longer transfer time)
- Fewer blocks in the cache: potential increase in conflict and capacity misses

Latency and bandwidth of the lower-level memory:
- High latency and high bandwidth favor large block sizes, since a bigger block adds only a small increase to the miss penalty

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K; for the smaller caches the miss rate falls and then rises again as blocks grow large relative to the cache.]

2. Higher Associativity

- Eight-way set associative is, for practical purposes, as good as fully associative
- 2:1 Cache Rule: the miss rate of a direct-mapped cache of size N is about that of a 2-way set-associative cache of size N/2
- Higher associativity can increase the clock cycle time: hit time for 2-way vs. 1-way is about +10% for an external cache, +2% for an internal cache
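To make the conflict-miss mechanism concrete, here is a small C sketch of my own (the 16-KB, 32-byte-block geometry is an assumption for illustration): it computes the set index and tag for a given associativity and shows how two addresses that collide in a direct-mapped cache can coexist in one set of a 2-way cache.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 5      /* 32-byte blocks */
    #define NUM_BLOCKS 512    /* 16-KB cache: 512 block frames */

    /* With more ways there are fewer sets, so each set holds more
       blocks and a collision on one frame becomes survivable. */
    static void map(uint32_t addr, int ways) {
        unsigned sets  = NUM_BLOCKS / ways;
        unsigned block = addr >> BLOCK_BITS;
        printf("addr 0x%08x -> set %u, tag 0x%x (%d-way)\n",
               (unsigned)addr, block % sets, block / sets, ways);
    }

    int main(void) {
        /* Two addresses exactly 16 KB apart map to the same set in
           both configurations, but the direct-mapped cache has only
           one frame per set, so they evict each other; the 2-way
           set keeps both. */
        map(0x00000000, 1); map(0x00004000, 1);
        map(0x00000000, 2); map(0x00004000, 2);
        return 0;
    }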
Average Memory Access Time vs. Associativity

Assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to 1.00 for direct mapped, and a miss penalty of 10 cycles.

Average memory access time (cycles):

    Cache size (KB)   1-way   2-way   4-way   8-way
      1               2.33    2.15    2.07    2.01
      2               1.98    1.86    1.76    1.68
      4               1.72    1.67    1.61    1.53
      8               1.46    1.48    1.47    1.43
     16               1.29    1.32    1.32    1.32
     32               1.20    1.24    1.25    1.27
     64               1.14    1.20    1.21    1.23
    128               1.10    1.17    1.18    1.20

From 16 KB up, the slower clock of the more associative designs outweighs their lower miss rates, and direct mapped gives the lowest average memory access time.

3. Victim Caches

- A small, fully associative cache that holds blocks recently evicted from the main cache
- Reduces conflict misses without impairing the clock rate

[Figure: direct-mapped cache (tag and data arrays) with a victim cache beside it; CPU address and data-in/data-out paths at the top, write buffer and lower-level memory below. On a miss, the victim cache is checked before going to the lower level.]

4. Pseudo-Associative Caches

- Goal: the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache
- Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)

[Figure: direct-mapped cache with three access times marked: hit time, pseudo hit time, and miss penalty; write buffer and lower-level memory below.]

- Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; the UltraSPARC is similar

Compare direct-mapped, 2-way set-associative, and pseudo-associative caches using average memory access time:
- It takes two extra cycles to find an entry in the alternative location (one cycle to check and one cycle to swap)
- Miss rate_pseudo = Miss rate_2-way
- Miss penalty_pseudo = Miss penalty_1-way + 1
- Hit time_pseudo = Hit time_1-way + Alternate hit rate_pseudo * 2
- Alternate hit rate_pseudo = Hit rate_2-way - Hit rate_1-way = Miss rate_1-way - Miss rate_2-way

5. Hardware Prefetching

- Instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a stream buffer, which is checked on a later miss
- Works with data blocks too: one data stream buffer caught 25% of the misses from a 4-KB direct-mapped cache; 4 stream buffers caught 43%
- For scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64-KB, 4-way set-associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
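The stream-buffer idea fits in a few lines of C. This is a toy simulation of my own (a one-entry buffer in front of a direct-mapped cache, with invented names and sizes); it counts the demand misses a sequential sweep generates when each miss also prefetches the next block.

    #include <stdbool.h>
    #include <stdio.h>

    #define NBLOCKS 1024                /* toy direct-mapped cache */
    static int  tag[NBLOCKS];
    static bool valid[NBLOCKS];

    static int  sb_block;               /* one-entry stream buffer */
    static bool sb_valid;
    static int  demand_misses;          /* misses the CPU waits for */

    static void access_block(int b) {
        int i = b % NBLOCKS;
        if (valid[i] && tag[i] == b)
            return;                     /* cache hit */
        if (!(sb_valid && sb_block == b))
            demand_misses++;            /* not in the stream buffer either */
        valid[i] = true;                /* fill from buffer or memory */
        tag[i] = b;
        sb_block = b + 1;               /* prefetch next sequential block */
        sb_valid = true;                /* (the prefetch still uses bandwidth) */
    }

    int main(void) {
        for (int b = 0; b < 100; b++)   /* sequential sweep of 100 blocks */
            access_block(b);
        printf("demand misses: %d\n", demand_misses);  /* prints 1 */
        return 0;
    }

The sweep stalls only on the first block; every later block streams in ahead of use. The prefetch traffic itself is why the slides note that spare memory bandwidth is a prerequisite.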
6. Software Prefetching

- The compiler inserts data-prefetch instructions:
  - Register prefetch: load the data into a register (HP PA-RISC loads)
  - Cache prefetch: load into the cache only (MIPS IV, PowerPC)
- Special prefetch instructions cannot cause faults; they are a form of speculative execution
- Requires a nonblocking cache, so execution can overlap with the prefetch
- Issuing prefetch instructions takes time: is the cost of issuing prefetches smaller than the savings in reduced misses?
- Wider superscalar issue makes the extra instruction bandwidth easier to find

7. Compiler Optimizations

- Avoid hardware changes
- Instructions: use profiling to look at conflicts between groups of instructions
- Data:
  - Merging arrays: improve spatial locality with a single array of compound elements instead of 2 separate arrays
  - Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop fusion: combine 2 independent loops that have the same looping structure and overlapping variables
  - Blocking: improve temporal locality by accessing blocks of data repeatedly instead of walking whole columns or rows

Merging Arrays

    /* Before: 2 sequential arrays */
    int key[SIZE];
    int val[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int key;
        int val;
    };
    struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange

    /* Before */
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

    /* After */
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality. Same number of executed instructions.

Loop Fusion

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

2 misses per access to a and c vs. one miss per access; improves temporal locality.

Blocking (1/2)

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }

The two inner loops:
- read all N x N elements of z[]
- read N elements of one row of y[] repeatedly
- write N elements of one row of x[]

Capacity misses are a function of N and the cache size: if all three N x N matrices of 4-byte words fit (3 * N * N * 4 bytes), there are no capacity misses.

Idea: compute on a B x B submatrix that fits in the cache.
Blocking (2/2)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B-1,N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B-1,N); k = k+1)
                        r = r + y[i][k]*z[k][j];
                    x[i][j] = x[i][j] + r;
                }

- B is called the blocking factor
- Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
- Conflict misses too? (A runnable version of this loop nest appears after the summary below.)

Conflict Misses and Blocking

[Figure: miss rate vs. blocking factor for a direct-mapped cache and a fully associative cache; the direct-mapped cache suffers extra conflict misses at some blocking factors, so B must be chosen with conflicts as well as capacity in mind.]

Compiler Optimization Performance Summary

[Figure: performance improvement (factors of roughly 1 to 3) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

Summary

CPU time = IC * (CPI_execution + Memory accesses per instruction * Miss rate * Miss penalty) * Clock cycle time

3 Cs: compulsory, capacity, and conflict misses

Reducing the miss rate:
1. Larger block size
2. Higher associativity
3. Victim caches
4. Pseudo-associativity
5. Hardware prefetching of instructions and data
6. Software prefetching of data
7. Compiler optimizations

Danger of concentrating on just one parameter when evaluating performance.
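To close the blocking discussion, here is a runnable version of the blocked loop nest, a sketch of my own: N and B are arbitrary choices, and B is picked to divide N so the min() clamps of the slide version are not needed.

    #include <stdio.h>

    #define N 512
    #define B 64                     /* blocking factor */

    static double x[N][N], y[N][N], z[N][N];

    /* Blocked matrix multiply, following the slide's loop structure:
       jj and kk walk B-wide strips so the B x B piece of z and the
       strips of y and x stay resident in the cache while reused. */
    static void matmul_blocked(void) {
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = 0; i < N; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double r = 0;
                        for (int k = kk; k < kk + B; k++)
                            r += y[i][k] * z[k][j];
                        x[i][j] += r;   /* accumulate partial products */
                    }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                y[i][j] = 1.0;
                z[i][j] = 1.0;
            }
        matmul_blocked();
        printf("x[0][0] = %g (expect %d)\n", x[0][0], N);
        return 0;
    }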
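Finally, a minimal sketch of compiler-controlled prefetching (technique 6), using GCC/Clang's __builtin_prefetch to stand in for a machine prefetch instruction; the prefetch distance PFDIST is a made-up tuning parameter, not a value from the slides.

    #include <stdio.h>

    #define N 1024
    #define PFDIST 16                 /* prefetch distance: tune to latency */

    static double a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            b[i] = i;
        double sum = 0;
        for (int i = 0; i < N; i++) {
            /* Fetch b[i+PFDIST] while working on b[i], so the load
               latency overlaps with computation; as the slides note,
               this needs a nonblocking cache to pay off. */
            if (i + PFDIST < N)
                __builtin_prefetch(&b[i + PFDIST], 0 /* read */, 1);
            a[i] = 2 * b[i];
            sum += a[i];
        }
        printf("sum = %g\n", sum);    /* 2 * (0 + 1 + ... + N-1) */
        return 0;
    }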