Improving Inclusive Cache Performance with Two-level Eviction Priority

Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng
Microprocessor Research and Development Center, Peking University, Beijing, China
{lilingda, tongdong, xiezichao, lujunlin, ...}

Abstract

Inclusive cache hierarchies are widely adopted in modern processors, since they can simplify the implementation of cache coherence. However, they sacrifice some performance to guarantee inclusion. Many recently proposed intelligent management policies improve last-level cache (LLC) performance by evicting blocks with poor locality earlier. Unfortunately, these policies are inapplicable to inclusive LLCs. In this paper, we propose the Two-level Eviction Priority (TEP) policy. Besides the eviction priority provided by the baseline replacement policy, TEP appends an additional, higher level of eviction priority to LLC blocks, which is decided at insertion time and cannot be changed during their lifetime in the LLC. When blocks with high eviction priority are no longer present in inner caches, they are preferentially evicted from the LLC. Thus, the LLC can retain more useful blocks to improve performance. TEP cooperates well with various baseline replacement policies. Our evaluation shows that TEP with NRU can improve the performance of inclusive LLCs significantly while requiring negligible extra storage. It also outperforms other recent proposals, including QBS, DIP, and DRRIP.

I. INTRODUCTION

The performance gap between processors and memory has widened for several decades, and modern processors use multi-level cache hierarchies to bridge it. An important design choice in a multi-level cache hierarchy is whether the last-level cache (LLC) should guarantee inclusion [1], [2]. (Unless stated otherwise, "cache" refers to the last-level cache, and the L2 cache is used as the LLC in this paper.) Since inclusive caches include all blocks held in inner caches, they can filter unnecessary cache coherence messages and thus simplify the implementation of the cache coherence protocol [1], [3]. Therefore, inclusive cache hierarchies are adopted by many recent desktop, embedded, and server processors [4], [5], [6].

However, compared to non-inclusive and exclusive caches, inclusive caches suffer performance loss for two reasons. One is that the effective cache space is reduced due to data duplication. The other is that, when a block is evicted from the LLC, it must also be invalidated in inner caches to guarantee inclusion. Blocks that are invalidated in inner caches to guarantee inclusion are called inclusion victims [7]. Since inclusion victims may still show good temporal locality in inner caches, evicting them early can cause extra misses.

Recently, many intelligent management policies have been proposed to improve LLC performance. By predicting the temporal locality of LLC blocks, those techniques evict blocks with poor locality earlier or bypass them from the LLC entirely [8], [9], [10], [11], [12], [13], [14]. However, in inclusive LLCs, the early eviction of blocks increases the probability that those blocks are still in inner caches, and thus causes more inclusion victims, which may show good locality in inner caches.

(This work is supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China under grant 2009ZX.)
Therefore, those intelligent policies can degrade the performance of inclusive LLCs; blocks with poor locality should not be evicted by the LLC until their locality in inner caches is exhausted.

In this paper, we propose an inclusive cache replacement policy called the Two-level Eviction Priority (TEP) policy. TEP appends an additional level of eviction priority to LLC blocks. When a block is brought into the LLC, if it is predicted to have poor temporal locality, TEP marks it with high eviction priority. On a miss, if one of the victim candidates with high eviction priority is not present in inner caches, it is preferentially chosen as the victim. Otherwise, the baseline replacement policy, which provides the second level of eviction priority, selects the victim from the remaining candidates with low eviction priority. In doing so, blocks with poor locality are evicted right after they leave the inner caches, so their lifetime in the LLC is short and no inclusion victims are evicted. On the other hand, blocks with good locality are selected for replacement only when all blocks with poor locality are still in inner caches, so the LLC can retain them longer to capture more hits. TEP can cooperate with any baseline replacement policy, and it needs only negligible modifications to the existing cache design.

We evaluate TEP with NRU, LRU, and SRRIP [9]. Our experiments show that TEP can improve the performance of inclusive LLCs significantly. In particular, TEP with NRU reduces average misses by 11.0% compared to LRU and thus achieves a geometric mean speedup of 4.3% for a 512KB L2 cache. TEP with NRU also significantly outperforms other recent proposals, including QBS [7], DIP [8], and DRRIP [9], while requiring only roughly 2KB of storage overhead.

II. MOTIVATION

To illustrate why recently proposed intelligent replacement policies cannot be applied to inclusive LLCs, we evaluate the performance of DIP and DRRIP (we use 2-bit DRRIP in the experiments because it performs better) in both non-inclusive and inclusive LLCs.

Fig. 1. Normalized speedup for various replacement policies in non-inclusive and inclusive L2 caches.

Fig. 2. Example of TEP victim selection. Step 1: check B, which is in inner caches. Step 2: check D, which is not in inner caches; D is selected as the victim.

Figure 1 shows the normalized speedup for various techniques in a 512KB L2 cache over a memory-intensive subset of the SPEC CPU2000 benchmarks (the details of our experimental configuration are described in Section IV). The speedups of DIP_non and DRRIP_non, which are achieved in non-inclusive LLCs, are normalized to that of LRU in non-inclusive LLCs, and the speedups of DIP_in and DRRIP_in are normalized to that of LRU in inclusive LLCs. For non-inclusive LLCs, DIP and DRRIP achieve a geometric mean speedup of more than 2%. For inclusive LLCs, on the other hand, they no longer improve performance, and their geometric mean speedup is slightly lower than that of LRU. For benchmarks such as mgrid and galgel, DIP and DRRIP can significantly degrade performance in inclusive LLCs.

The reason for their poor performance in inclusive LLCs is that these proposals significantly increase the number of inclusion victims. We study DIP on galgel to show why. DIP adaptively chooses the better of LRU and BIP, which inserts most incoming blocks into the LRU position to avoid thrashing when the working set is larger than the cache. When LRU is used, 0.4% of the blocks evicted by the LLC are present in inner caches and cause the eviction of inclusion victims. Under BIP, because most incoming blocks are inserted into the LRU position, they are selected for replacement on the next miss to their sets. Thus, the lifetime of those blocks in the LLC is very short, which increases the probability that they are still in inner caches when they are evicted from the LLC. As a result, the number of inclusion victims increases significantly under BIP, and 45.9% of the blocks evicted by the LLC cause the eviction of inclusion victims in inner caches. The early eviction of inclusion victims causes their next accesses to miss in both the L1 and L2 caches, which increases L1 and L2 misses by 9.2% and 36.3% respectively compared to BIP in a non-inclusive LLC. As a result, the performance of BIP is degraded in inclusive LLCs, which in turn causes the poor performance of DIP.

Consequently, modern intelligent replacement policies do not work well in inclusive LLCs, and it is necessary to design a replacement policy specifically for them. The proposed policy should achieve two design goals: it should make effective use of the cache space to improve performance, while minimizing the number of inclusion victims.

III. TWO-LEVEL EVICTION PRIORITY POLICY

A. Overview

To avoid evicting inclusion victims, blocks should not be evicted from the LLC until their temporal locality in inner caches is exhausted. To improve LLC performance, the lifetime of useless blocks in the LLC should be as short as possible. We propose the Two-level Eviction Priority (TEP) policy for inclusive cache replacement to achieve these two goals.

TEP associates each LLC block with an Eviction Priority Bit (EPB) to indicate its first level of eviction priority. When a block is brought into the cache, a temporal locality predictor is employed to predict its locality. If it is predicted to have poor locality, its EPB is set to 1 to indicate high eviction priority. Otherwise, its EPB is set to 0.
On a miss, among all victim candidates in the corresponding set, blocks whose EPBs are 1 are checked first. TEP starts the search from the first physical way and proceeds to the last physical way. For each such candidate, TEP identifies whether it is in inner caches. If it is not present in inner caches, it is selected for replacement. Otherwise, TEP checks the next candidate, until a candidate which is not in inner caches is found or all blocks whose EPBs are 1 have been checked. If all blocks whose EPBs are 1 are present in inner caches, the baseline replacement policy, which provides the second level of eviction priority, selects a victim candidate among the remaining blocks whose EPBs are 0. If that candidate is in inner caches, its replacement state is updated to the Most Recently Used (MRU) position, and the baseline replacement policy selects the next candidate to restart the process. Figure 2 shows an example of the victim selection process for an 8-way LRU-managed cache.

In doing so, blocks with poor locality (those whose EPBs are 1) are evicted from the LLC right after they are removed from inner caches. Thus, their lifetime in the LLC is shortened without evicting inclusion victims. Blocks with good locality (those whose EPBs are 0) are selected for replacement only when all blocks with high eviction priority still reside in inner caches. Thus, the LLC can retain them longer to improve performance.
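As a concrete illustration of the selection procedure described above, the following C++ sketch walks the two priority levels. It is a minimal sketch rather than the authors' implementation: the Block structure and the helpers query_inner_caches(), baseline_select(), and promote_to_mru() are hypothetical placeholders, and the per-replacement query budget introduced later in Section III-C is omitted here.

    #include <cstdint>
    #include <vector>

    struct Block {
        uint64_t tag   = 0;
        bool     valid = false;
        bool     epb   = false;   // true = predicted poor locality (high eviction priority)
    };

    // Stub: a real design would probe the L1 instruction/data cache tag arrays.
    bool query_inner_caches(uint64_t /*tag*/) { return false; }

    // Stubs for the baseline policy (e.g., NRU or LRU), assumed to consider only
    // blocks whose EPB is 0 and to maintain its own replacement state.
    int  baseline_select(const std::vector<Block>& /*set*/) { return 0; }
    void promote_to_mru(std::vector<Block>& /*set*/, int /*way*/) {}

    // Returns the way to evict within one LLC set.
    int tep_select_victim(std::vector<Block>& set) {
        // Level 1: scan EPB==1 blocks in physical-way order and evict the first
        // one that is no longer resident in the inner caches.
        for (int way = 0; way < (int)set.size(); ++way)
            if (set[way].valid && set[way].epb && !query_inner_caches(set[way].tag))
                return way;

        // Level 2: every EPB==1 block is still in an inner cache, so fall back to
        // the baseline policy; a candidate that is still in inner caches is moved
        // to the MRU position and the baseline policy retries.
        while (true) {
            int way = baseline_select(set);
            if (!query_inner_caches(set[way].tag))
                return way;
            promote_to_mru(set, way);
        }
    }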

Fig. 3. The structure of TEP: a PC-indexed Locality Prediction Table (saturating counters; a value of 0 or above indicates good locality, below 0 indicates poor locality) that is consulted and updated by a Replacement History Table holding <PC, IB tag, VB tag> entries.

TEP can cooperate with any baseline replacement policy. In our experiments, we evaluate the performance of TEP when it works with the Not Recently Used (NRU) policy, LRU, and SRRIP. TEP has two key problems: how to predict the temporal locality of blocks in the LLC, and how to identify whether a block is in inner caches. We introduce them in the next two sections respectively.

B. Temporal Locality Prediction

Figure 3 shows the structure of our temporal locality predictor. On a cache miss, TEP uses a Replacement History Table (RHT) to record a pair consisting of the incoming block (IB) and the victim block (VB) selected by the baseline replacement policy. The program counter (PC) of the instruction which accesses the IB is also recorded. The temporal locality of the IB is then determined by the reuse sequence of the IB-VB pair:

Condition 1: if the IB is accessed earlier, the IB has good locality;
Condition 2: if the VB is accessed earlier, the IB has poor locality;
Condition 3: if the IB is not reused before its eviction, the IB also has poor locality.

The first and third conditions are easy to understand, because a reused block is expected to have good locality, and a block that is never reused should have poor locality. The heuristic behind the second condition is that the VB is expected to have the worst locality among all blocks in the cache. If the VB is accessed earlier than the IB, the locality of the IB is worse than that of the VB, so the IB is considered to have poor locality.

A saturating counter table called the Locality Prediction Table (LPT) is used to learn and predict the temporal locality of blocks. The LPT learns locality based on the PC of the instruction which accesses the block, because previous work has shown that PC-based prediction is more accurate than other methods [10], [12], [15]. All counters in the LPT are initialized to 0. When the locality of an IB-VB pair is learned in the RHT, the recorded PC of that pair is used to index the corresponding counter in the LPT. If the block shows good locality, the counter is increased by 1. Otherwise, it is decreased by 1. On a miss, the incoming block consults the LPT with its PC to predict its locality. If the corresponding counter is less than 0, the block is predicted to have poor locality and its EPB is set to 1. Otherwise, its EPB is set to 0. Figure 4 shows the details of our temporal locality prediction algorithm.

Fig. 4. The algorithm of temporal locality prediction:

    On an access to block x:
        for each valid entry A in the corresponding set of RHT
            if x.tag == A.IB_tag            // Condition 1
                LPT[A.PC]++; Invalidate A;
            else if x.tag == A.VB_tag       // Condition 2
                LPT[A.PC]--; Invalidate A;
        if x misses in the cache
            y = Select_Victim_Candidate();
            for each valid entry A in the corresponding set of RHT
                if y.tag == A.IB_tag        // Condition 3
                    LPT[A.PC]--; Invalidate A;
            if RHT.Record(x) == true
                B = RHT.Select_Victim(x);   // Select an entry to record x
                B.PC = x.PC; B.IB_tag = x.tag; B.VB_tag = y.tag;
            if LPT[x.PC] < 0
                x.EPB = 1;
            else
                x.EPB = 0;

We use a 128-entry, 8-way set-associative RHT in our experiments. Besides the PC, IB tag, and VB tag, each RHT entry contains 1 bit to indicate whether it is valid and 3 bits to implement the RHT's replacement policy, which is LRU.
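For readers who prefer an executable form, the following C++ sketch mirrors the training and prediction rules of Figure 4. It is illustrative rather than the authors' code: the container layout, the signed counter range of [-4, 3] for the 3-bit counters, and the helper names are assumptions, and the RHT's set-associativity, LRU replacement, and miss sampling are omitted for brevity.

    #include <array>
    #include <cstdint>

    struct RHTEntry { bool valid = false; uint16_t pc = 0, ib_tag = 0, vb_tag = 0; };

    struct LocalityPredictor {
        std::array<int8_t, 1024>  lpt{};   // 1024 saturating counters, initialized to 0
        std::array<RHTEntry, 128> rht{};   // recorded IB-VB pairs (associativity omitted)

        static int8_t clamp3(int v) { return (int8_t)(v > 3 ? 3 : (v < -4 ? -4 : v)); }

        // Conditions 1 and 2: called on every access with the block's partial tag.
        void on_access(uint16_t tag) {
            for (auto& e : rht) {
                if (!e.valid) continue;
                if (tag == e.ib_tag)      { lpt[e.pc] = clamp3(lpt[e.pc] + 1); e.valid = false; } // IB reused first: good locality
                else if (tag == e.vb_tag) { lpt[e.pc] = clamp3(lpt[e.pc] - 1); e.valid = false; } // VB reused first: poor locality
            }
        }

        // Condition 3: the chosen victim was itself a recorded IB that was never reused.
        void on_victim_selected(uint16_t victim_tag) {
            for (auto& e : rht)
                if (e.valid && victim_tag == e.ib_tag) { lpt[e.pc] = clamp3(lpt[e.pc] - 1); e.valid = false; }
        }

        // Record a new IB-VB pair for a sampled miss and predict the incoming block's EPB.
        bool record_and_predict(uint16_t pc, uint16_t ib_tag, uint16_t vb_tag) {
            pc &= 1023;   // 10-bit PC index into the LPT
            for (auto& e : rht)
                if (!e.valid) { e = RHTEntry{true, pc, ib_tag, vb_tag}; break; }
            return lpt[pc] < 0;   // true means poor locality, so the EPB would be set to 1
        }
    };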
To reduce the hardware overhead, the RHT does not record an IB-VB pair on every miss; only 1/128 of misses are recorded. Partial tags are also used to reduce the overhead of recording the IB and VB, and the RHT keeps only the lower 16 bits of the IB and VB tags. For a 1024-entry LPT, the 11th to 2nd bits of the PC are stored in the RHT (the benchmarks are compiled to Alpha binaries, and the lowest 2 bits of the PC in Alpha instructions are always 0). The shortened 10-bit PC is delivered along with the cache request through the cache hierarchy, as in prior PC-based methods [10], [12], [16].

Many techniques can predict the temporal locality of LLC blocks [8], [9], [10], [12], [16], [17], [18], and they can all potentially be used as the temporal locality predictor for TEP. Therefore, TEP is a general framework for inclusive cache replacement.

C. Inner Cache Block Awareness

To make TEP aware of whether an LLC block resides in inner caches, we use a query-based method similar to QBS [7]. When a victim candidate is chosen, TEP sends its address to the inner caches to query whether the block is present; the inner caches look up that address and return the response. To avoid sending too many queries and inflating the bus traffic, we set a threshold on the maximum number of queries allowed in one replacement. If the number of queries in one LLC replacement has reached the threshold, the next candidate is selected as the victim without querying the inner caches. Our experiments show that a qualified victim can generally be found within 2 queries, and thus the threshold is set to 2. As stated in [7], the bandwidth requirement for queries is proportional to the number of LLC misses, which is very small. Thus, the extra traffic is negligible.
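Continuing the sketch from Section III-A (and reusing its Block type and helper stubs), the following hypothetical variant shows how such a per-replacement query budget could bound the inner-cache lookups; the threshold value of 2 follows the text above.

    constexpr int kQueryThreshold = 2;   // maximum inner-cache queries per replacement

    int tep_select_victim_bounded(std::vector<Block>& set) {
        int queries = 0;
        // Level 1: EPB==1 candidates; once the budget is spent, evict without querying.
        for (int way = 0; way < (int)set.size(); ++way) {
            if (!(set[way].valid && set[way].epb)) continue;
            if (queries >= kQueryThreshold) return way;
            ++queries;
            if (!query_inner_caches(set[way].tag)) return way;
        }
        // Level 2: baseline policy over the remaining blocks, with the same budget applied.
        while (true) {
            int way = baseline_select(set);
            if (queries >= kQueryThreshold) return way;
            ++queries;
            if (!query_inner_caches(set[way].tag)) return way;
            promote_to_mru(set, way);
        }
    }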

IV. EVALUATION

A. Experimental Methodology

We use a modified version of SimpleScalar [19] as our simulator. Table I shows its detailed configuration. It models a 2-level cache hierarchy, and the L2 cache is used as the LLC. The ratio of inner cache size to LLC size is similar to that of modern processors [4], [5]. To maintain inclusion, the L2 cache sends back-invalidation messages to both the L1 icache and dcache on a replacement. Besides misses, the simulator also allocates MSHRs for the queries sent by the L2 cache to model the extra traffic introduced by those messages.

TABLE I. PARAMETERS OF THE SIMULATOR
Fetch/Issue/Commit Width: 4
Reorder Buffer: 128-entry
Load/Store Queue: 32-entry
L1 ICache: 64B blocks, 32KB, 4-way, 1 cycle, LRU
L1 DCache: 64B blocks, 32KB, 4-way, 1 cycle, LRU, 2 ports, 8 MSHRs, 16 WriteBuffers
L2 Unified Cache: 64B blocks, 512KB, 16-way, 10 cycles, 32 MSHRs, 64 WriteBuffers
Memory Latency: 150 cycles
RHT: 128-entry, 8-way
LPT: 1024-entry, 3-bit counters

We use precompiled Alpha binaries of the SPEC CPU2000 benchmarks in our experiments, which were publicly available. A subset of memory-intensive benchmarks whose compulsory misses are less than 50% is chosen, because compulsory misses cannot be reduced by any replacement policy. SimPoint [20] is used to select a representative 250 million instructions to execute for each benchmark.

B. Results for TEP with NRU

Fig. 5. Normalized L2 cache MPKI for various techniques.

Fig. 6. Speedup for various techniques.

We first evaluate the performance of TEP when NRU is used as the baseline replacement policy. Besides TEP, we also investigate the performance of various replacement policies, including LRU in a non-inclusive cache (LRU_non), QBS, DIP, and DRRIP. In addition, we study DIP+QBS and DRRIP+QBS, which extend DIP and DRRIP with QBS respectively; DIP+QBS and DRRIP+QBS query inner caches when selecting a victim candidate.

Figure 5 shows the misses per thousand instructions (MPKI) of the L2 cache for the various techniques in inclusive LLCs, normalized to that of LRU. TEP reduces the average MPKI by 11.0% compared to LRU, while the MPKI reduction of the other techniques is less than 5%. Figure 6 shows the speedup for the various techniques. The speedup is computed by dividing the IPC of each technique by that of LRU in the inclusive cache. TEP achieves a geometric mean speedup of 4.3% over LRU. Compared with the other techniques, whose best improvement is 1.3% for DRRIP+QBS, TEP performs significantly better. Since QBS is based on LRU, its performance is lower than that of LRU_non. DIP and DRRIP suffer performance loss because they cause more inclusion victims, as discussed in Section II.
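As a side note on how such numbers are aggregated, the snippet below illustrates one conventional way to compute normalized speedup and its geometric mean; the IPC values are made-up placeholders, not measurements from the paper.

    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // Hypothetical per-benchmark IPC pairs: {technique, inclusive-LRU baseline}.
        std::vector<std::pair<double, double>> ipc = {{0.52, 0.45}, {1.10, 1.08}, {0.33, 0.31}};
        double log_sum = 0.0;
        for (const auto& p : ipc)
            log_sum += std::log(p.first / p.second);      // per-benchmark speedup
        double geomean = std::exp(log_sum / ipc.size());  // geometric mean across benchmarks
        std::printf("geometric mean speedup: %.1f%%\n", (geomean - 1.0) * 100.0);
        return 0;
    }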
DIP+QBS and DRRIP+QBS insert most blocks with poor locality into the LRU position, and those blocks are likely to still be in inner caches when they are selected for replacement. Thus, when QBS identifies that such a block is in inner caches, it is considered to have good locality and is updated to the MRU position. As a result, DIP+QBS and DRRIP+QBS behave and perform similarly to LRU, and they lose the opportunity to improve performance. That is why TEP uses an extra level of eviction priority to mark blocks with poor locality persistently: those blocks are checked repeatedly until they are no longer in inner caches, so TEP can evict them as early as possible.

Fig. 7. Results of temporal locality prediction.

Figure 7 shows the fraction of blocks which are predicted to have poor locality. For benchmarks on which TEP performs well, such as art and ammp, more than 50% of incoming blocks are predicted to have poor locality. The EPBs of those blocks are set to 1, and thus they are evicted earlier to release cache space for useful blocks.

C. Results for TEP with Other Policies

Fig. 8. Speedup for TEP with LRU and SRRIP.

Besides NRU, we also evaluate the performance of TEP with other baseline replacement policies, including LRU and SRRIP. As shown in Figure 8, the results are similar to those of TEP with NRU. TEP+LRU outperforms LRU by 4.2% and reduces average misses by 10.5%. TEP+SRRIP outperforms LRU by 3.8% and reduces average misses by 9.2%; compared to SRRIP itself, the performance gain of TEP+SRRIP is 3.8%. Our results show that TEP can cooperate with various replacement policies to improve performance. Among them, NRU is the simplest and requires the least hardware overhead. Therefore, we focus on TEP with NRU in the rest of this paper, and TEP refers to TEP with NRU from here on.

D. Sensitivity to the Cache Size

Fig. 9. Speedup for different cache sizes (256KB to 4MB).

Figure 9 shows the speedup of TEP for different L2 cache sizes. The L2 cache size is varied from 256KB to 4MB, and the associativity is fixed at 16. The speedup is normalized to that of LRU at each cache size. For small L2 caches, since the L1 to L2 cache size ratio is larger, the probability that an L2 block is in the L1 caches is higher. As a result, DIP and DRRIP cause more inclusion victims and harm performance. For large L2 caches, the performance of DIP and DRRIP improves because they suffer fewer inclusion victims. The performance of QBS is always lower than that of LRU_non. On the other hand, TEP improves the performance of inclusive caches at all cache sizes, although for small caches the performance gain is limited because there is less space to place useful blocks, and for large caches the working set is more likely to fit into the cache.

E. Sensitivity to the Size of RHT and LPT

Fig. 10. Sensitivity to the LPT size (1-bit to 5-bit counters).

We investigate the performance sensitivity of TEP to the RHT size. The results show that the performance is not sensitive to the RHT size; a 128-entry RHT performs well enough and more entries are unnecessary. Figure 10 shows the performance sensitivity to different LPT sizes with a 128-entry, 8-way RHT. We vary the LPT size from 128 to 4096 entries and the LPT counter width from 1 bit to 5 bits. The experimental results show that a 1024-entry LPT with 3-bit counters is enough.

F. Storage

For TEP, each cache block needs a 1-bit EPB. Each RHT entry contains a valid bit, 3 replacement bits, 10 bits to keep the shortened PC of the IB, a 16-bit IB tag, and a 16-bit VB tag. Each LPT saturating counter is 3 bits. Therefore, TEP consumes (1 x 8192 + (1 + 3 + 10 + 16 + 16) x 128 + 3 x 1024) bits = 2.09KB of extra storage in total, which is less than 0.5% of the total storage of a 512KB L2 cache. Since NRU needs 1KB of storage, TEP with NRU consumes roughly 3KB of storage in total. Compared with other recent proposals, TEP with NRU performs best while requiring a moderate storage overhead.
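As a quick sanity check of the storage arithmetic above, the following snippet reproduces the 2.09KB figure from the stated component sizes (the breakdown follows the text; the code itself is only illustrative).

    #include <cstdio>

    int main() {
        const int epb_bits = 1 * 8192;                       // 512KB / 64B = 8192 LLC blocks, 1-bit EPB each
        const int rht_bits = (1 + 3 + 10 + 16 + 16) * 128;   // valid + LRU + PC + IB tag + VB tag, 128 entries
        const int lpt_bits = 3 * 1024;                       // 1024 counters, 3 bits each
        const int total    = epb_bits + rht_bits + lpt_bits; // 17152 bits
        std::printf("TEP storage: %d bits = %.2f KB\n", total, total / 8.0 / 1024.0);   // ~2.09 KB
        return 0;
    }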

V. RELATED WORK

A. Inclusive Cache Management

To reduce inclusion victims, global replacement [21], [22] exposes the hits in inner caches to the LLC: on an inner cache hit, a message is sent to update the replacement state of that block in the LLC. For direct-associative network caches, Fletcher et al. propose three methods to reduce inclusion victims [23]: increasing the cache associativity, using a victim cache, and adding a snoop filter to relax the inclusion property. Jaleel et al. propose three Temporal Locality Aware policies to improve inclusive cache performance [7]: Temporal Locality Hint (TLH) conveys the temporal locality of inner caches by sending locality hints; Early Core Invalidation (ECI) invalidates inner cache blocks before they are evicted by the LLC to derive their locality; and Query Based Selection (QBS) queries inner caches to obtain the locality. However, those proposals focus only on reducing inclusion victims, and they cannot make effective use of the cache space to improve performance.

B. Non-inclusive or Exclusive Cache Management

Many intelligent cache management policies have recently been proposed to improve LLC performance, and they are all designed for non-inclusive or exclusive LLCs. DIP [8] dynamically inserts incoming blocks into the LRU position to avoid thrashing. RRIP [9] distinguishes blocks that are never reused from other blocks in order to evict them earlier. Pseudo-LIFO [24] uses a fill stack and tends to keep blocks at the bottom of the stack stable. SHiP [10] proposes a signature-based predictor for re-reference interval prediction. Dead block prediction techniques improve LLC performance by predicting the last touch of a block; based on how they predict, they are classified into trace-based [16], [25], counter-based [11], and time-based [17] schemes. The cache burst predictor [18] makes predictions for bursts of accesses to improve accuracy. Sampling dead block prediction [12] learns from a fraction of the sets for low overhead and high prediction accuracy. By not placing blocks with poor locality in the cache, bypass techniques can improve performance. LRF [15] uses both PC-based and address-based predictors for high performance. DSB [26] compares the reuse order of incoming and victim blocks to adjust the bypass probability. Gaur et al. propose an insertion and bypass policy for exclusive LLCs [13]. OBM [14] monitors the behavior of the optimal bypass to make bypass decisions. All of those techniques can potentially be extended with TEP to apply them to inclusive LLCs; we leave this for future work.

VI. CONCLUSION

The implementation of the cache coherence protocol can be simplified by using an inclusive cache hierarchy. However, performance is sacrificed to satisfy the inclusion property: when a block is evicted from the LLC, its copies in inner caches must be invalidated, which makes recently proposed intelligent cache management policies inapplicable to inclusive LLCs. In this paper, we propose the Two-level Eviction Priority (TEP) policy for inclusive LLCs. TEP marks blocks with poor temporal locality in the LLC; once those blocks are no longer resident in inner caches, they are selected for replacement immediately. As a result, TEP shortens the lifetime of useless blocks in the LLC to improve performance without evicting inclusion victims. Our evaluation shows that TEP can significantly improve the performance of inclusive LLCs. Although we only evaluate TEP in single-threaded environments, it is straightforward to apply TEP in multi-threaded environments, and we leave this as future work.

REFERENCES
[1] J.-L. Baer and W.-H. Wang, "On the inclusion properties for multi-level cache hierarchies," in ISCA-15, 1988.
[2] N. P. Jouppi and S. J. E. Wilton, "Tradeoffs in two-level on-chip caching," in ISCA-21, 1994.
[3] A. W. Wilson, Jr., "Hierarchical cache/bus architecture for shared memory multiprocessors," in ISCA-14, 1987.
[4] Intel, "Intel Core i7 processor."
[5] ARM, "Cortex-A5 processor."
[6] Oracle, "OpenSPARC T2."
[7] A. Jaleel, E. Borch, M. Bhandaru, S. Steely, and J. Emer, "Achieving non-inclusive cache performance with inclusive caches: Temporal Locality Aware (TLA) cache management policies," in MICRO-43, 2010.
[8] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, "Adaptive insertion policies for high performance caching," in ISCA-34, 2007.
[9] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, "High performance cache replacement using re-reference interval prediction (RRIP)," in ISCA-37, 2010.
[10] C. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. Steely Jr., and J. Emer, "SHiP: Signature-based hit predictor for high performance caching," in MICRO-44, 2011.
[11] M. Kharbutli and Y. Solihin, "Counter-based cache replacement and bypassing algorithms," IEEE Transactions on Computers, vol. 57, no. 4, April 2008.
[12] S. M. Khan, Y. Tian, and D. A. Jimenez, "Sampling dead block prediction for last-level caches," in MICRO-43, 2010.
[13] J. Gaur, M. Chaudhuri, and S. Subramoney, "Bypass and insertion algorithms for exclusive last-level caches," in ISCA-38, 2011.
[14] L. Li, D. Tong, Z. Xie, J. Lu, and X. Cheng, "Optimal bypass monitor for high performance last-level caches," in PACT-21, 2012.
[15] L. Xiang, T. Chen, Q. Shi, and W. Hu, "Less reused filter: improving L2 cache performance via filtering less reused lines," in ICS-23, 2009.
[16] A.-C. Lai, C. Fide, and B. Falsafi, "Dead-block prediction & dead-block correlating prefetchers," in ISCA-28, 2001.
[17] Z. Hu, S. Kaxiras, and M. Martonosi, "Timekeeping in the memory system: predicting and optimizing memory behavior," in ISCA-29, 2002.
[18] H. Liu, M. Ferdman, J. Huh, and D. Burger, "Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency," in MICRO-41, 2008.
[19] D. Burger and T. Austin, "The SimpleScalar tool set, version 2.0," ACM SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13-25, 1997.
[20] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for accurate and efficient simulation," SIGMETRICS Perform. Eval. Rev., vol. 31, June 2003.
[21] M. Zahran, "Cache replacement policy revisited," in WDDD-6.
[22] R. Garde, S. Subramaniam, and G. Loh, "Deconstructing the inefficacy of global cache replacement policies," in WDDD-7.
[23] K. Fletcher, W. Speight, and J. Bennett, "Techniques for reducing the impact of inclusion in shared network cache multiprocessors," Rice ELEC technical report, 1994.
[24] M. Chaudhuri, "Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches," in MICRO-42, 2009.
[25] S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi, "Using dead blocks as a virtual victim cache," in PACT-19, 2010.
[26] H. Gao and C. Wilkerson, "A dueling segmented LRU replacement algorithm with adaptive bypassing," in JWAC-1, 2010.


Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer

More information

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for

More information

Copyright by Dong Li 2014

Copyright by Dong Li 2014 Copyright by Dong Li 2014 The Dissertation Committee for Dong Li certifies that this is the approved version of the following dissertation: Orchestrating Thread Scheduling and Cache Management to Improve

More information

A Cache Utility Monitor for Multi-core Processor

A Cache Utility Monitor for Multi-core Processor 3rd International Conference on Wireless Communication and Sensor Network (WCSN 2016) A Cache Utility Monitor for Multi-core Juan Fang, Yan-Jin Cheng, Min Cai, Ze-Qing Chang College of Computer Science,

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Rethinking Belady s Algorithm to Accommodate Prefetching

Rethinking Belady s Algorithm to Accommodate Prefetching Rethinking Belady s Algorithm to Accommodate Prefetching Akanksha Jain Calvin Lin Department of Computer Science The University of Texas at Austin Austin, Texas 771, USA {akanksha, lin}@cs.utexas.edu Abstract

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information