CSCI 402: Computer Architectures
Memory Hierarchy (5)
Fengguang Song, Department of Computer & Information Science, IUPUI

Performance of a Multilevel Cache

CPU -> L1 cache -> L2 cache -> Main memory

Given: CPU base CPI = 1, clock rate = 4 GHz
Miss rate per instruction = 2% (an average)
Main memory access time = 100 ns

Q: What is the effective CPI?
Effective CPI = Base CPI + Miss cycles per instruction
             = Base CPI + Miss rate per instruction x Miss penalty

For now, suppose there is only an L1 cache (i.e., no L2):
Miss penalty = 100 ns / 0.25 ns = 400 cycles
Effective CPI = 1 + 2% x 400 = 9 cycles
Multilevel Cache Example

CPU -> L1 cache (2% miss) -> L2 cache (0.5% global miss) -> Main memory

Now, suppose we add an L2 cache:
L2 cache access time = 5 ns
L2 global miss rate = 0.5%

L2 hit time, in cycles? 5 ns -> 20 cycles (4 GHz CPU)
L2 miss time, in cycles? 100 ns -> 400 cycles

Effective CPI = 1 + 2% x 20 + 0.5% x 400 = 3.4 cycles
Performance ratio = 9 / 3.4 = 2.6x

Effective CPI = Base CPI + L1 miss rate x L2 hit cycles + L2 global miss rate x L2 miss cycles
AMAT = L1 hit time + L1 miss rate x L2 hit time + L2 global miss rate x L2 miss time

How Do We Get the AMAT Formula?

The original version:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
and Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
=> AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
Local miss rate: misses in a cache divided by the total number of accesses made to that cache (i.e., Miss Rate_L2). When people say "cache miss rate" without qualification, they usually mean the local miss rate.
Global miss rate: misses in a cache divided by the total number of accesses generated by the CPU (i.e., Miss Rate_L1 x Miss Rate_L2).
Note: the global miss rate is DIFFERENT from the local miss rate!
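As a sanity check, the two effective-CPI calculations above can be written as small helper functions. This is a minimal sketch; the function names are this example's own, and rates are expressed as fractions (0.02 = 2%):

```c
#include <assert.h>
#include <math.h>

/* Effective CPI with only an L1 cache:
   base CPI + L1 miss rate x memory miss penalty (in cycles). */
double effective_cpi_l1(double base_cpi, double l1_miss_rate,
                        double mem_penalty_cycles) {
    return base_cpi + l1_miss_rate * mem_penalty_cycles;
}

/* Effective CPI with an added L2 cache:
   base CPI + L1 miss rate x L2 hit cycles
            + L2 global miss rate x memory miss penalty. */
double effective_cpi_l2(double base_cpi, double l1_miss_rate,
                        double l2_hit_cycles, double l2_global_miss_rate,
                        double mem_penalty_cycles) {
    return base_cpi + l1_miss_rate * l2_hit_cycles
                    + l2_global_miss_rate * mem_penalty_cycles;
}
```

Plugging in the slide's numbers (2% L1 miss rate, 400-cycle memory penalty, 20-cycle L2 hit time, 0.5% global miss rate) reproduces the 9-cycle and 3.4-cycle results.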
Multilevel Cache Design Considerations

L1 cache: focus on the minimal hit time (as fast as possible).
L2 cache: focus on a low miss rate (to avoid going to main memory); hit time has less overall impact in the L2 cache.
Result: the L1 cache is usually smaller than the L2 cache.

How Do Caches Affect Software Performance?

Time complexity is not the whole story. Cache misses depend on your code's memory access patterns, which in turn depend on:
- your algorithm design
- compiler optimization for memory access

Wall-clock time example: radix sort (see next slide) vs. quicksort.
(Radix sort example: two digits -> two rounds.)

Another Example: DGEMM

Assumptions:
- Cache block size = 32 B (i.e., big enough for 4 doubles: 4 x 8 = 32)
- n is very large, so approximate 1/n as 0.0
- The cache is not even big enough to hold 2 rows

Analysis method: look at the memory access pattern of the inner loop.
(Figure: access patterns of A, B, and C when the inner loop variable is k.)
Matrix multiplication (ijk)

    /* ijk: inner loop over k */
    for (k = 0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];

Inner loop accesses: a -> (i,*) row-wise; b -> (*,j) column-wise; c -> (i,j) fixed.

Approx. cache miss rates (assuming n is very large):
a: 0.25 (1 miss every 4th access, row-wise)
b: 1.0 (1 miss every access, column-wise)
c: 0.0 (fixed)

Matrix multiplication (jik)

    /* jik: same inner loop over k */
    for (k = 0; k < n; k++)
        c[i][j] += a[i][k] * b[k][j];

Approx. miss rates: a: 0.25 (row-wise), b: 1.0 (column-wise), c: 0.0 (fixed)
Matrix multiplication (kij)

    /* kij: inner loop over j */
    r = a[i][k];   /* keep in a register */
    for (j = 0; j < n; j++)
        c[i][j] += r * b[k][j];

Inner loop accesses: a -> (i,k) fixed; b -> (k,*) row-wise; c -> (i,*) row-wise. Generates partial sums for C.

Approx. miss rates: a: 0.0 (fixed), b: 0.25 (row-wise), c: 0.25 (row-wise)

Matrix multiplication (ikj)

    /* ikj: same inner loop over j */
    r = a[i][k];
    for (j = 0; j < n; j++)
        c[i][j] += r * b[k][j];

Approx. miss rates: a: 0.0 (fixed), b: 0.25 (row-wise), c: 0.25 (row-wise)
Matrix multiplication (jki)

    /* jki: inner loop over i */
    r = b[k][j];
    for (i = 0; i < n; i++)
        c[i][j] += a[i][k] * r;

Inner loop accesses: a -> (*,k) column-wise; b -> (k,j) fixed; c -> (*,j) column-wise.

Approx. miss rates: a: 1.0 (column-wise), b: 0.0 (fixed), c: 1.0 (column-wise)

Matrix multiplication (kji)

    /* kji: same inner loop over i */
    r = b[k][j];
    for (i = 0; i < n; i++)
        c[i][j] += a[i][k] * r;

Approx. miss rates: a: 1.0 (column-wise), b: 0.0 (fixed), c: 1.0 (column-wise)
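All six orderings compute the same result; only the access pattern, and hence the miss rate, changes. Here is a minimal sketch contrasting ijk with kij (the function names and the small demo size N are assumptions of this example; the slides assume n is very large):

```c
#include <assert.h>
#include <math.h>

#define N 4  /* small demo size */

/* ijk order: the k inner loop walks row-wise over a (miss rate ~0.25)
   and column-wise over b (miss rate ~1.0). */
void matmul_ijk(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] += sum;
        }
}

/* kij order: a[i][k] stays in a register; the j inner loop walks b and c
   row-wise (miss rate ~0.25 each), the best pattern on the slides. */
void matmul_kij(double a[N][N], double b[N][N], double c[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];  /* fixed for the whole inner loop */
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}
```

Both functions accumulate C += A*B; only the traversal order differs, which is exactly why the miss-rate analysis can be done per inner loop.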
Summary of Matrix Multiplication

ijk / jik:
    sum = 0.0;
    for (k = 0; k < n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;

kij / ikj (best):
    r = a[i][k];
    for (j = 0; j < n; j++)
        c[i][j] += r * b[k][j];

jki / kji:
    r = b[k][j];
    for (i = 0; i < n; i++)
        c[i][j] += a[i][k] * r;

Software Optimization via the Technique of Blocking

Goal: maximize accesses to data before it is replaced in the cache.
Consider the inner loops of DGEMM (the ijk version):

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double cij = C[i+j*n];
            for (int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n];
            C[i+j*n] = cij;
        }
Cache-Blocked DGEMM

    #define BLOCKSIZE 32

    void do_block(int n, int si, int sj, int sk,
                  double *A, double *B, double *C)
    {
        for (int i = si; i < si + BLOCKSIZE; ++i)
            for (int j = sj; j < sj + BLOCKSIZE; ++j)
            {
                double cij = C[i+j*n];            /* cij = C[i][j] */
                for (int k = sk; k < sk + BLOCKSIZE; k++)
                    cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
                C[i+j*n] = cij;                   /* C[i][j] = cij */
            }
    }

    void dgemm(int n, double *A, double *B, double *C)
    {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }

Blocked DGEMM Access Pattern

(Figure: unoptimized vs. blocked access patterns. The blocked version is about 2x faster, because each block of data is reused while it is still resident in cache.)
Dependability and Memory

A system alternates between two states of delivered service:
- Service accomplishment: service delivered as specified
- Service interruption: delivered service differs from the specified service
A failure takes the system from accomplishment to interruption; a restoration takes it back.

Dependability Measures

How do we measure how dependable a system is? Two related terms:
- Reliability: mean time to failure (MTTF), a measure of service accomplishment
- Service interruption time: mean time to repair (MTTR), a measure of service interruption

Mean time between failures: MTBF = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)

To improve availability:
- Increase MTTF: fault avoidance, fault tolerance, fault forecasting
- Reduce MTTR: improved tools and processes for diagnosis and repair
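The availability formula is simple enough to encode directly. This is a sketch; the function name and the sample MTTF/MTTR numbers below are invented for illustration:

```c
#include <assert.h>
#include <math.h>

/* Availability = MTTF / (MTTF + MTTR).
   MTTF and MTTR must use the same time unit (e.g., hours);
   the denominator MTTF + MTTR is the MTBF. */
double availability(double mttf, double mttr) {
    return mttf / (mttf + mttr);
}
```

For example, a hypothetical system with MTTF = 999 hours and MTTR = 1 hour is 99.9% available. Raising MTTF or lowering MTTR each increases availability, matching the two improvement strategies above.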
Memory Errors and the Hamming SEC Code

Soft errors (or transient errors): one or more bits may flip.
Why? In a modern chip, devices are so small that cosmic rays or alpha particles can change the value of bits stored in registers or caches, or even while they are simply moving across the chip. Low-voltage, low-power CPUs are even more vulnerable because of the small voltage difference between 0 and 1.
Hence: SEC (Single Error Correcting) and DED (Double Error Detecting) codes.

Hamming distance: the minimum number of bit positions in which two bit patterns differ. If a code's minimum distance is 3, it can provide 1-bit error correction.

What is the Memory Fault Rate Today?

(Figure: relative monthly SRAM fault rates for the L2 and L3 data arrays of the Jaguar and Cielo supercomputers, split into permanent vs. transient faults; transient faults dominate by roughly two orders of magnitude.)
Most SRAM faults are transient, especially in mature process technologies.
Source: Sridharan et al., "Feng Shui of Supercomputer Memory," SC 2013; "Memory Errors in Modern Systems," October 2, 2014.
Encoding for SEC

To calculate the Hamming code:
- Number the bits from 1, starting on the left (i.e., 1, 2, 3, ..., 16, ..., 32).
- All positions that are a power of 2 are parity bits. E.g., use 12 bits to encode 8 data bits (4 parity bits).
- Each parity bit checks certain bit positions:
  - Parity bit 1 covers all bit positions whose rightmost (1st) bit is set: bits 1 (the parity bit itself), 3, 5, 7, 9, etc.
  - Parity bit 2 covers all bit positions whose 2nd-from-the-right bit is set: bits 2 (the parity bit itself), 3, 6, 7, 10, 11, etc.
  - Parity bit 4 covers all bit positions whose 3rd-from-the-right bit is set: bits 4-7, 12-15, 20-23, etc.
  - Parity bit 8 covers all bit positions whose 4th-from-the-right bit is set: bits 8-15, 24-31, 40-47, etc.
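The parity-coverage rule above can be sketched in code for the 12-bit case: 8 data bits go in the non-power-of-2 positions, and parity bits at positions 1, 2, 4, and 8 make each group's parity even. The function names and the even-parity convention are this example's assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* SEC Hamming encode: place 8 data bits into positions 3, 5, 6, 7,
   9, 10, 11, 12 (1-indexed), then set the parity bits at positions
   1, 2, 4, 8 so that every parity group has even parity. */
uint16_t hamming12_encode(uint8_t data) {
    uint16_t code = 0;
    int pos = 1;
    for (int i = 0; i < 8; ) {
        if ((pos & (pos - 1)) == 0) { pos++; continue; } /* skip powers of 2 */
        if ((data >> i) & 1) code |= 1u << (pos - 1);
        i++; pos++;
    }
    for (int p = 1; p <= 8; p <<= 1) {   /* parity positions 1, 2, 4, 8 */
        int parity = 0;
        for (int b = 1; b <= 12; b++)
            if (b & p) parity ^= (code >> (b - 1)) & 1;
        if (parity) code |= 1u << (p - 1);
    }
    return code;
}

/* Syndrome: OR together the parity positions whose check fails.
   0 means no error; otherwise it is the 1-indexed flipped bit. */
int hamming12_syndrome(uint16_t code) {
    int syn = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int b = 1; b <= 12; b++)
            if (b & p) parity ^= (code >> (b - 1)) & 1;
        if (parity) syn |= p;
    }
    return syn;
}
```

A single flipped bit at position k fails exactly the parity checks whose position bits appear in k, so the syndrome spells out k directly.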
Note: syndrome bits = 0000 (binary) indicates no error.

DED Code

DED: Double Error Detecting.
Coding: add one additional parity bit over the whole word (call it pn), so that the Hamming distance becomes 4.
Decoding: let H be the original group of SEC parity bits.
- H even, pn even: no error
- H odd, pn odd: correctable single-bit error
- H even, pn odd: error in the pn bit itself
- H odd, pn even: a double error occurred (detected, but cannot be corrected)

Assumption: 1-bit errors matter most; 2-bit errors are rare; 3-bit errors are so rare that we can ignore them.
Note: current ECC DRAM uses SEC/DED, with 8 check bits protecting each 64 data bits (therefore, ECC DIMMs are 72 bits wide).
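The four-way decoding decision above can be written as a small classifier (a sketch; the enum and function names are assumed for illustration):

```c
#include <assert.h>

/* SEC/DED decoding outcome, from the four cases above.
   h_odd:  nonzero if the SEC parity group H signals an error (H is odd)
   pn_odd: nonzero if the whole-word parity bit pn signals an error */
typedef enum {
    NO_ERROR,           /* H even, pn even */
    SINGLE_CORRECTABLE, /* H odd,  pn odd  */
    PN_BIT_ERROR,       /* H even, pn odd: the pn bit itself flipped */
    DOUBLE_ERROR        /* H odd,  pn even: detected, not correctable */
} sec_ded_result;

sec_ded_result sec_ded_classify(int h_odd, int pn_odd) {
    if (!h_odd && !pn_odd) return NO_ERROR;
    if (h_odd && pn_odd)   return SINGLE_CORRECTABLE;
    if (!h_odd && pn_odd)  return PN_BIT_ERROR;
    return DOUBLE_ERROR;   /* h_odd && !pn_odd */
}
```

Note the asymmetry: a double error flips two SEC groups' membership in a way that leaves pn even while H is odd, which is how it is distinguished from a single error.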
We have finished the cache subject!
Remark: cache may be the most important topic in computer architecture.
Next topic: Virtual Memory

A New, Deeper Memory Hierarchy

Registers  <- instructions and operands
Cache      <- cache blocks
Memory     <- pages
Disk

The cache is the "cache" for main memory; in virtual memory, main memory is the "cache" for disk.