CS222: Cache Performance Improvement

Arleen Jennings
5 years ago

1 CS222: Cache Performance Improvement Dr. A. Sahu Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati

2 Outline: Eleven advanced cache performance optimizations. Prev: reducing hit time and increasing bandwidth. Prev: reducing miss penalty. Reducing miss rate. Reducing miss penalty * miss rate.

3 Eleven Advanced Optimizations for Cache Performance: reducing hit time; reducing miss penalty; reducing miss rate; reducing miss penalty * miss rate. Ref: Section 5.2, Computer Architecture: A Quantitative Approach, Hennessy and Patterson, 4th Edition. PDF version available on the course website (intranet).

4 Reducing Hit Time: small and simple caches; pipelined cache access; trace caches; avoiding time loss in address translation (out of scope of this course: first read OS). Virtually indexed, physically tagged cache: a simple and effective approach, possible only if the cache is not too large. Virtually addressed cache: protection? multiple processes? aliasing? I/O?
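The "not too large" condition for a virtually indexed, physically tagged cache can be made concrete. The index and block-offset bits must all come from the page offset, which holds when the bytes per way (cache size divided by associativity) do not exceed the page size. The sketch below checks that condition; the function name `vipt_index_fits` and its parameters are illustrative, not taken from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the set index and block offset both fit inside the
 * page-offset bits, so the cache can be indexed with the virtual address
 * while the tag is compared against the physical address.
 * Condition: bytes per way (cache_bytes / ways) <= page_bytes.
 * (Hypothetical helper, not part of the original slides.) */
bool vipt_index_fits(size_t cache_bytes, size_t ways, size_t page_bytes)
{
    if (ways == 0 || cache_bytes % ways != 0)
        return false;               /* reject a malformed configuration */
    return cache_bytes / ways <= page_bytes;
}
```

For instance, with 4 KB pages a 32 KB cache passes the check when it is 8-way set associative (4 KB per way) but not when it is 4-way (8 KB per way).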

5 Reducing Miss Penalty: multi-level caches; critical word first and early restart; giving priority to read misses over writes; merging write buffer; victim caches.

6 Reducing Miss Rate: larger block size; larger cache; higher associativity; way prediction and pseudo-associative cache; compiler optimizations.

7 Large Block Size: takes advantage of spatial locality and reduces compulsory misses. If the block size is too large, misses increase. Miss penalty also increases.

8 Large Cache: reduces capacity misses, but hit time increases. Keep a small L1 cache and a large L2 cache.

9 Higher Associativity: reduces conflict misses; 8-way is almost like fully associative. Hit time increases. What to do? Pseudo-associativity.

10 Way Prediction and Pseudo-associative Cache. Way prediction: the low miss rate of a set-associative (SA) cache with the hit time of a direct-mapped (DM) cache. Only one tag is compared initially; extra bits are kept for prediction; hit time in case of misprediction is high. Pseudo-associative (or column-associative) cache: get the advantage of an SA cache in a DM cache by checking sequentially in a pseudo-set, which gives a fast hit and a slow hit.

11 Compiler Optimizations. Loop interchange: improve spatial locality by scanning arrays row-wise. Blocking: improve temporal and spatial locality.
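Blocking is often illustrated with a tiled matrix multiplication, so here is a minimal sketch of one. It assumes flat row-major arrays and a tile edge `TILE` chosen so that the working tiles fit in the cache; `TILE`, `min_sz`, and `matmul_blocked` are illustrative names, not taken from the slides. Within a tile the innermost loop walks `b` and `c` row-wise (spatial locality), and each tile is reused before it is evicted (temporal locality).

```c
#include <stddef.h>

#define TILE 4   /* hypothetical tile edge; tune so the active tiles fit in cache */

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* c (L x M) += a (L x N) * b (N x M); all matrices are row-major flat arrays.
 * The three outer loops pick a tile; the three inner loops stay inside it. */
void matmul_blocked(size_t L, size_t M, size_t N,
                    const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < L; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < M; jj += TILE)
                for (size_t i = ii; i < min_sz(ii + TILE, L); i++)
                    for (size_t k = kk; k < min_sz(kk + TILE, N); k++) {
                        double aik = a[i * N + k];   /* reused across the j loop */
                        for (size_t j = jj; j < min_sz(jj + TILE, M); j++)
                            c[i * M + j] += aik * b[k * M + j];
                    }
}
```

The caller is expected to zero `c` first if a plain product is wanted, since the routine accumulates into it.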

12 Improving Locality: Matrix Multiplication example. [C] = [A] [B], where C is L x M, A is L x N, and B is N x M.

13 Cache Organization for the example: Cache line (or block) = 4 matrix elements. Matrices are stored row-wise. The cache can't accommodate a full row/column: L, M and N are so large w.r.t. the cache size that, after an iteration along any of the three indices, an element that is accessed again results in a miss. Ignore misses due to conflicts between matrices, as if there were a separate cache for each matrix.

14 Matrix Multiplication: Code I

    for (i = 0; i < L; i++)
        for (j = 0; j < M; j++)
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

                C       A        B
    accesses    LM      LMN      LMN
    misses      LM/4    LMN/4    LMN

    Total misses = LM(5N+1)/4

15 Matrix Multiplication: Code II

    for (k = 0; k < N; k++)
        for (i = 0; i < L; i++)
            for (j = 0; j < M; j++)
                C[i][j] += A[i][k] * B[k][j];

                C        A     B
    accesses    LMN      LN    LMN
    misses      LMN/4    LN    LMN/4

    Total misses = LN(2M+4)/4

16 Matrix Multiplication: Code III

    for (i = 0; i < L; i++)
        for (k = 0; k < N; k++)
            for (j = 0; j < M; j++)
                C[i][j] += A[i][k] * B[k][j];

                C        A       B
    accesses    LMN      LN      LMN
    misses      LMN/4    LN/4    LMN/4

    Total misses = LN(2M+1)/4
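To compare the three loop orders for concrete sizes, the totals from the slides can be evaluated directly. The sketch below only encodes those formulas under the stated assumptions (4 elements per cache line, row-wise storage, no reuse across iterations, no conflicts between matrices); the function names are illustrative.

```c
/* Total misses for each loop order, as given on the slides.
 * Doubles are used so that the division by 4 does not truncate. */

/* Code I: i, j, k order. */
double misses_code1(double L, double M, double N)
{
    return L * M * (5.0 * N + 1.0) / 4.0;
}

/* Code II: k, i, j order. */
double misses_code2(double L, double M, double N)
{
    return L * N * (2.0 * M + 4.0) / 4.0;
}

/* Code III: i, k, j order. */
double misses_code3(double L, double M, double N)
{
    return L * N * (2.0 * M + 1.0) / 4.0;
}
```

With L = M = N = 100 these give 1,252,500, 510,000 and 502,500 misses respectively, so Code III is the best of the three and Code I is roughly 2.5 times worse.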

17 Reducing Miss Rate * Miss Penalty

18 Reducing Miss Penalty * Miss Rate: non-blocking cache; hardware prefetching; compiler-controlled prefetching.

19 Non-blocking Cache: used in out-of-order (OOO) processors. Hit under a miss: the complexity of the cache controller increases. Hit under multiple misses, or miss under a miss: memory should be able to handle multiple misses.

20 Hardware Prefetching: prefetch items before they are requested (both data and instructions). What and when to prefetch? Fetch two blocks on a miss (requested + next). Where to keep prefetched information? In the cache, or in a separate buffer (most common case).

21 Prefetch Buffer/Stream Buffer [Figure: block diagram with the labels "to proc", "Cache", "prefetch buffer" and "from mem".]

22 Compiler-Controlled Prefetching: semantically invisible (no change in registers or memory contents). Makes sense only if the processor doesn't stall while prefetching (non-blocking cache). The overhead of the prefetch instructions should not exceed the benefit.

23 SW Prefetch Example: 8 KB direct-mapped, write-back data cache with 16-byte blocks. a is 3 x 100, b is 101 x 3; each array element is 8 bytes.

    for (i = 0; i < 3; i++)
        for (j = 0; j < 100; j++)
            a[i][j] = b[j][0] * b[j+1][0];

    misses in array a = 3 * 100 / 2 = 150
    misses in array b = 101
    total misses = 251
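The miss counts for the loop without prefetching can be checked with a small direct-mapped cache model. This is only a sketch under stated assumptions: b has 101 rows of 3 elements (as in the Hennessy and Patterson version of this example), a starts at byte address 0 with b placed immediately after it, and a write miss allocates a block. The type `dm_cache` and the functions `cache_access` and `count_misses` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_BYTES 8192u
#define BLOCK_BYTES 16u
#define NUM_LINES   (CACHE_BYTES / BLOCK_BYTES)   /* 512 lines */
#define ELEM_BYTES  8u

typedef struct {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
    unsigned misses;
} dm_cache;

/* Look up one byte address; on a miss, count it and install the block. */
static void cache_access(dm_cache *c, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t line  = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;
    if (!c->valid[line] || c->tag[line] != tag) {
        c->misses++;
        c->valid[line] = true;
        c->tag[line]   = tag;
    }
}

/* Replays the address stream of the original loop nest and returns
 * the total number of misses. */
unsigned count_misses(void)
{
    const uint32_t a_base = 0;                       /* assumed layout */
    const uint32_t b_base = 3u * 100u * ELEM_BYTES;  /* b right after a */
    dm_cache c;
    memset(&c, 0, sizeof c);

    for (uint32_t i = 0; i < 3; i++)
        for (uint32_t j = 0; j < 100; j++) {
            cache_access(&c, b_base + (j * 3u) * ELEM_BYTES);        /* b[j][0]   */
            cache_access(&c, b_base + ((j + 1u) * 3u) * ELEM_BYTES); /* b[j+1][0] */
            cache_access(&c, a_base + (i * 100u + j) * ELEM_BYTES);  /* a[i][j]   */
        }
    return c.misses;
}
```

Under this layout both arrays fit in the cache without conflicts, so the model reproduces the 150 + 101 = 251 misses stated on the slide.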

24 SW Prefetch Example contd. Suppose we need to prefetch 7 iterations in advance.

    for (j = 0; j < 100; j++) {
        prefetch(b[j+7][0]);
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i++)
        for (j = 0; j < 100; j++) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }

    misses in first loop = 7 (for b[0..6][0]) + 4 (for a[0][0..6])
    misses in second loop = 4 (for a[1][0..6]) + 4 (for a[2][0..6])
    total misses = 19, total prefetches = 400
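The restructured loops can be written as compilable C using the `__builtin_prefetch` builtin provided by GCC and Clang. This is a sketch, not the slides' exact code: it assumes b is 101 x 3, and it clamps the prefetch index so that no out-of-range address is ever formed, which means the last few iterations re-prefetch an element that is already close by. `PREFETCH`, `clamp_index`, and `compute_with_prefetch` are illustrative names.

```c
#include <stddef.h>

#define ROWS_A 3
#define COLS_A 100
#define ROWS_B 101
#define COLS_B 3      /* assumed column count for b */
#define AHEAD  7      /* prefetch distance in iterations */

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))   /* no-op on compilers without the builtin */
#endif

/* Keep the prefetch target inside the array. */
static size_t clamp_index(size_t idx, size_t last)
{
    return idx < last ? idx : last;
}

void compute_with_prefetch(double a[ROWS_A][COLS_A], double b[ROWS_B][COLS_B])
{
    /* First loop: row 0 of a, prefetching both b and a. */
    for (size_t j = 0; j < COLS_A; j++) {
        PREFETCH(&b[clamp_index(j + AHEAD, ROWS_B - 1)][0]);
        PREFETCH(&a[0][clamp_index(j + AHEAD, COLS_A - 1)]);
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    /* Remaining rows: b is already cached, so only a is prefetched. */
    for (size_t i = 1; i < ROWS_A; i++)
        for (size_t j = 0; j < COLS_A; j++) {
            PREFETCH(&a[i][clamp_index(j + AHEAD, COLS_A - 1)]);
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}
```

The prefetches are hints only, so the values written to a are identical to those of the loop without prefetching; whether the hints help depends on the target machine.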

25 SW Prefetch Example contd. Performance improvement? Assume no capacity and conflict misses, and that prefetches overlap with each other and with misses. Cycles per iteration: original loop 7, prefetch loops 9 and 8. Miss penalty = 100 cycles.

    Original loop = 300*7 + 251*100 = 27,200 cycles
    1st prefetch loop = 100*9 + 11*100 = 2,000 cycles
    2nd prefetch loop = 200*8 + 8*100 = 2,400 cycles
    Speedup = 27,200 / (2,000 + 2,400) = 6.2
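The cycle counts follow from iterations times cycles per iteration plus misses times miss penalty. The short sketch below redoes that arithmetic with the figures from the example (251 misses without prefetching, 11 and 8 misses in the two prefetch loops); `prefetch_speedup` is an illustrative name.

```c
/* Recomputes the speedup of the prefetching version over the original loop. */
double prefetch_speedup(void)
{
    const long miss_penalty = 100;                     /* cycles per miss */

    long original = 300 * 7 + 251 * miss_penalty;      /* 27,200 cycles */
    long loop1    = 100 * 9 + 11 * miss_penalty;       /*  2,000 cycles */
    long loop2    = 200 * 8 + 8 * miss_penalty;        /*  2,400 cycles */

    return (double)original / (double)(loop1 + loop2); /* 27,200 / 4,400 */
}
```

The ratio is 27,200 / 4,400, about 6.18, which the slide rounds to 6.2.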

Advanced optimizations of cache performance (Section 2.2)

1. Small and Simple Caches to reduce hit time. Critical timing path: address tag memory, then compare tags, then select set. Lower associativity. Direct-mapped