UNIVERSIDADE TÉCNICA DE LISBOA - INSTITUTO SUPERIOR TÉCNICO
Departamento de Engenharia Informática
Architectures for Embedded Computing - MEIC-A, MEIC-T, MERC
Lecture Slides - Version 3.0 - English

Lecture 14
Title: Cache Memory - Cache Performance Optimization
Summary: Miss penalty reduction (multi-level caches, greater priority to reads than to writes, victim caches); miss rate reduction (analysis of the misses, increase of the block size, increase of the cache capacity, increase of the associativity level and way prediction).
2010/2011 Nuno.Roma@ist.utl.pt
Architectures for Embedded Computing
Cache Memory: Cache Performance Optimization
Prof. Nuno Roma - ACE 2010/11 - DEI-IST

Previous Class
In the previous class...
Memory systems;
Program access patterns;
Cache memories: operation principles; internal organization; cache management policies.
Road Map

Summary
Today:
Miss penalty reduction: multi-level caches; greater priority to reads than to writes; victim caches.
Miss rate reduction: analysis of the misses; increase the block size; increase the cache capacity; increase the associativity level; way prediction.
Bibliography: Computer Architecture: A Quantitative Approach, Sections 5.2 and C.3
Caches: Objective
Objective: minimize the mean memory access time, from the processor's point of view:

t_access = t_hit + p_miss × t_penalty

Hit Time (t_hit): hardware designers make every effort so that the cache responds in a single clock cycle;
Miss Rate (p_miss): maximize the probability of finding the requested data in the cache;
Miss Penalty (t_penalty): upon a miss, minimize the time required to resolve it.
Example
Consider a load-store computer architecture where CPI = 1.0 (when all cache accesses are successfully satisfied). Load and store instructions correspond to 50% of all executed instructions. If the miss penalty is 25T and the miss rate is 10%, how much faster would the processor be if the miss rate were reduced to one half?

Solution:
CPI_A = 50% × 1T + 50% × (t_hit + p_miss × t_penalty) = 0.5 × 1T + 0.5 × (1T + 0.1 × 25T) = 0.5T + 0.5 × 3.5T = 2.25T
CPI_B = 50% × 1T + 50% × (t_hit + p_miss × t_penalty) = 0.5 × 1T + 0.5 × (1T + 0.05 × 25T) = 0.5T + 0.5 × 2.25T = 1.625T
Speedup = CPI_A / CPI_B = 2.25T / 1.625T = 1.385

Multi-Level Caches
µP → L1 Cache → L2 Cache → Primary Memory

t_access = t_hit,L1 + p_miss,L1 × t_penalty,L1
t_penalty,L1 = t_hit,L2 + p_miss,L2 × t_penalty,L2
t_access = t_hit,L1 + p_miss,L1 × (t_hit,L2 + p_miss,L2 × t_penalty,L2)
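The CPI calculation above can be checked with a short script; the function and variable names are illustrative, and the numbers are the ones from the example (50% memory instructions, 1-cycle hit, 25-cycle penalty, miss rate halved from 10% to 5%).

```python
# Reproduces the slide's CPI comparison for a load-store machine.

def cpi(frac_mem, t_hit, p_miss, t_penalty):
    """Average CPI: non-memory instructions take 1 cycle,
    memory instructions take t_hit + p_miss * t_penalty cycles."""
    return (1 - frac_mem) * 1.0 + frac_mem * (t_hit + p_miss * t_penalty)

cpi_a = cpi(0.5, 1.0, 0.10, 25.0)   # 2.25 T
cpi_b = cpi(0.5, 1.0, 0.05, 25.0)   # 1.625 T
speedup = cpi_a / cpi_b             # ~1.385
```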
Local and Global Miss Rates
Local miss rate: fraction of the accesses made to a given cache that miss in that cache.
Global miss rate: fraction of all processor accesses that miss in the whole cache system.

p_miss,global,L2 = p_miss,local,L1 × p_miss,local,L2

Local and global miss rates coincide in the L1 cache;
The local miss rate in L2 is usually high, so the global miss rate is a better measure.
Example
Consider that 1000 memory accesses give rise to 40 misses in the level 1 cache (L1) and 20 misses in the level 2 cache (L2). On average, there are 1.5 memory accesses per instruction. Also assume that the hit access times of the two caches are 1 and 10 clock cycles, respectively, and that a memory access is completed within 200 clock cycles. Compute the several miss rates and the mean memory access time.

Solution:
local miss rate L1 = global miss rate L1 = 40 / 1000 = 4%
local miss rate L2 = 20 / 40 = 50%
global miss rate L2 = 20 / 1000 = 2%

Mean memory access time
= hit time L1 + miss rate L1 × (hit time L2 + miss rate L2 × miss penalty L2)
= 1 + 4% × (10 + 50% × 200) = 1 + 4% × 110 = 5.4 clock cycles

L2 Cache Configuration
Variation of the miss rate with the L2 cache capacity (L1 with 64 kB):
Capacity of L2 greater than that of L1;
For larger L2 capacities, the global miss rate is similar to the one that would be obtained with a single (and much more expensive!) L1 cache of the same size.
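The two-level access-time formula from the Multi-Level Caches slide, applied to this example's numbers (function and variable names are illustrative):

```python
# Mean memory access time for a two-level cache hierarchy:
# t_access = t_hit,L1 + p_miss,L1 * (t_hit,L2 + p_miss,L2,local * t_mem)

def amat(t_hit_l1, p_miss_l1, t_hit_l2, p_miss_l2_local, t_mem):
    return t_hit_l1 + p_miss_l1 * (t_hit_l2 + p_miss_l2_local * t_mem)

p_l1 = 40 / 1000          # 4%: local = global for L1
p_l2_local = 20 / 40      # 50% local miss rate in L2
p_l2_global = 20 / 1000   # 2% global miss rate

t = amat(1, p_l1, 10, p_l2_local, 200)   # 5.4 clock cycles
```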
L2 Cache Configuration
Variation of the relative execution time with the L2 cache capacity and L2 hit time:
The L2 hit time is not critical;
A more complex cache can be used, to minimize the miss rate.

Coherency Between Memory and Cache
Sources of incoherence: other devices that may also change memory positions (DMA, I/O controllers, other processors, etc.). Typically, they only change the primary memory, in order not to interfere with the processor's accesses to the cache.
Coherency Solutions
Selective caches:
Only operate over restricted areas of the addressing space (defined by configuration);
Non-cached areas: input/output buffers; communication buffers between processors.
Shared caches (between all agents):
The several agents do not directly access the main memory; instead, they access the cache:
Greater contention to access the cache;
Increase of the cache miss rate.
Caches with coherency protocols:
Bus snooping: the cache controller checks all writes to primary memory and invalidates the cache positions that were modified in memory;
Reads from primary-memory positions corresponding to cache blocks with updated values imply copying those values back into memory.
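A minimal sketch of the bus-snooping invalidation rule described above: when the controller observes another agent writing a memory address, it invalidates any local line holding that address. All class and method names are hypothetical, not from the slides.

```python
# Sketch: invalidate-on-snooped-write, the first bus-snooping rule above.

class SnoopingCache:
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.lines = {}  # block number -> cached data

    def snoop_write(self, addr):
        """Another bus agent (e.g. DMA) wrote 'addr' in primary memory:
        invalidate the local copy of that block, if any."""
        self.lines.pop(addr // self.block_size, None)

    def contains(self, addr):
        return addr // self.block_size in self.lines

cache = SnoopingCache()
cache.lines[512 // 16] = b"old data"   # block holding address 512 is cached
cache.snoop_write(512)                 # a DMA write to address 512 is seen
hit_after_snoop = cache.contains(512)  # False: the stale copy was dropped
```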
Data Coherency Between Caches
Inclusion: L2 always holds all the data stored in L1:
It is only necessary to check L2 in order to invalidate both caches;
Implies the adoption of blocks with the same size, or extra hardware to search for sub-blocks.
Exclusion: a block is never simultaneously stored in both caches:
Optimization of cache memory occupation;
A miss in L1 leads to a swap of the block between L1 and L2.

Loading Policy
Blocking: may have a significant impact on the miss penalty.
Non-blocking: reduces the current miss penalty, but may have a serious impact on subsequent misses:
Early restart;
Critical word first.
Giving Priority to Read Misses Over Writes
After a read miss, instead of stalling the read operation until the write buffer writes its whole content into memory, the read is served before the next word of the write buffer.
Complication: the read has to check whether the accessed position is about to be updated by the write buffer.

Example:
SW R5,384(R0)   ; M[384] ← R5
SW R3,512(R0)   ; M[512] ← R3
LW R1,1024(R0)  ; R1 ← M[1024]
LW R2,512(R0)   ; R2 ← M[512]  (must see the value still in the write buffer!)

Victim Cache
Instead of completely discarding each block when it has to be replaced, temporarily keep it in a victim buffer. Rather than stalling on a subsequent cache miss, the contents of the buffer are checked on such a miss to see if they hold the desired data, before going to the next lower memory level.
Small cache (e.g., 4 to 16 positions);
Fully associative;
Particularly efficient for small direct-mapped caches (more than 25% reduction of the miss rate in a 4 kB cache).
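The victim-buffer lookup above can be sketched as a small fully associative FIFO of recently evicted blocks, consulted on an L1 miss before going to the next level. Names and sizes are illustrative.

```python
from collections import OrderedDict

class VictimBuffer:
    """Sketch of a victim buffer: a few fully associative entries
    holding blocks recently evicted from L1."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block tag -> block data

    def insert(self, tag, data):
        """Store a block evicted from L1; drop the oldest entry if full."""
        self.blocks[tag] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def lookup(self, tag):
        """On an L1 miss: return (and remove) the block if present,
        avoiding a trip to the next memory level."""
        return self.blocks.pop(tag, None)

vb = VictimBuffer(capacity=4)
vb.insert(0x10, "evicted block")
first = vb.lookup(0x10)    # hit in the victim buffer
second = vb.lookup(0x10)   # None: the block moved back into L1
```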
Analysis of the Miss Rate
Miss classification:
Compulsory: occur on the first access to a block (e.g., at the beginning of the program) and cannot be avoided.
Capacity: the cache cannot contain all the blocks needed during the execution of the program.
Conflict: occur due to the adopted placement strategy (direct mapped or n-way set associative).

Distribution of the Miss Rate
Minimizing each Type of Miss
Compulsory misses: increase the size of the block.
Capacity misses: increase the size of the cache.
Conflict misses: increase the associativity level.
Common objective: try not to increase t_penalty but, more importantly, not to increase t_hit!
Increase the Block Size
Takes advantage of spatial locality.
But:
The loading of the block may increase the miss penalty;
It may also increase the capacity and conflict miss rates.
Increase the Cache Capacity
Obviously, it decreases the miss rate: mainly the capacity misses, but also the conflict misses.
But:
Slower caches;
More expensive caches.
Solution: use greater caches in the upper levels. Current L2 caches have the same capacity as those that were used about 10 years ago!
Increase the Associativity Level
Reduction of the conflict misses.
But:
Slower caches;
More expensive caches.

Way Prediction and Pseudo-Associative Caches
Way prediction: extra bits are kept in the cache to predict the way of the next cache access, so it is only necessary to compare a single tag field.
Pseudo-associative caches: upon a miss, certain direct-mapped caches try a second block to find the desired address. Typically, this second block is obtained by inverting one bit of the index field.
Several possible values for the hit time:
Hit time on a correct prediction: t_hit,correct;
Hit time on an incorrect prediction: t_hit,incorrect;
Miss penalty time: t_penalty.
Objective: t_hit,correct < t_hit,incorrect < t_penalty
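The hit-time split above implies that the average hit time of a way-predicting cache depends on the prediction accuracy. A minimal model, with illustrative numbers (the accuracy and cycle counts below are assumptions, not from the slides):

```python
# Average hit time with way prediction: t_hit,correct cycles when the
# predicted way is right, t_hit,incorrect cycles when it is wrong.

def avg_hit_time(p_correct, t_correct, t_incorrect):
    return p_correct * t_correct + (1 - p_correct) * t_incorrect

# e.g. 1 cycle on a correct prediction, 3 cycles otherwise, 85% accuracy:
t_avg = avg_hit_time(0.85, 1, 3)   # 1.3 cycles on average
```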
Next class:
Code optimization:
Data access;
Program access.
Reduction of the miss penalty with parallel techniques:
Pre-fetching;
Non-blocking caches.