Improving Inclusive Cache Performance with Two-level Eviction Priority

Lingda Li, Dong Tong, Zichao Xie, Junlin Lu, Xu Cheng
Microprocessor Research and Development Center, Peking University, Beijing, China
{lilingda, tongdong, xiezichao, lujunlin, ...}

Abstract

Inclusive cache hierarchies are widely adopted in modern processors, since they can simplify the implementation of cache coherence. However, they sacrifice some performance to guarantee inclusion. Many recently proposed intelligent management policies improve last-level cache (LLC) performance by evicting blocks with poor locality earlier. Unfortunately, these policies are inapplicable to inclusive LLCs. In this paper, we propose the Two-level Eviction Priority (TEP) policy. Besides the eviction priority provided by the baseline replacement policy, TEP appends an additional, higher level of eviction priority to LLC blocks, which is decided at insertion time and cannot be changed during their lifetime in the LLC. When blocks with high eviction priority are no longer present in inner caches, they are preferentially evicted from the LLC. Thus, the LLC can retain more useful blocks to improve performance. TEP cooperates well with various baseline replacement policies. Our evaluation shows that TEP with NRU can improve the performance of inclusive LLCs significantly while requiring negligible extra storage. It also outperforms other recent proposals, including QBS, DIP, and DRRIP.

I. INTRODUCTION

The performance gap between processors and memory has widened for several decades, and modern processors use multi-level cache hierarchies to bridge it. An important design choice in a multi-level cache hierarchy is whether the last-level cache (LLC) should guarantee inclusion [1], [2]. (Unless stated otherwise, "cache" refers to the last-level cache, and the L2 cache is used as the LLC in this paper.) Since inclusive caches include all blocks held in inner caches, they can filter unnecessary cache coherence messages and thus simplify the implementation of the cache coherence protocol [1], [3]. Therefore, inclusive cache hierarchies are adopted by many recent desktop, embedded, and server processors [4], [5], [6].

However, compared to non-inclusive and exclusive caches, inclusive caches suffer performance loss for two reasons. One is that the effective cache space is reduced due to data duplication. The other is that, when a block is evicted from the LLC, it must also be invalidated in inner caches to guarantee inclusion. Blocks that are invalidated in inner caches to guarantee inclusion are called inclusion victims [7]. Since inclusion victims may still show good temporal locality in inner caches, evicting them early can cause extra misses.

Recently, many intelligent management policies have been proposed to improve LLC performance. By predicting the temporal locality of LLC blocks, those techniques evict blocks with poor locality earlier or bypass them from the LLC entirely [8], [9], [10], [11], [12], [13], [14]. However, in inclusive LLCs, the early eviction of blocks increases the probability that those blocks are still in inner caches, and thus causes more inclusion victims, which may show good locality in inner caches.

(This work is supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China under grant 2009ZX.)
Therefore, those intelligent policies can degrade the performance of inclusive LLCs; blocks with poor locality should not be evicted by the LLC until their locality in inner caches is exhausted.

In this paper, we propose an inclusive cache replacement policy called the Two-level Eviction Priority (TEP) policy. TEP appends an additional level of eviction priority to LLC blocks. When a block is brought into the LLC, if it is predicted to have poor temporal locality, TEP marks it with high eviction priority. On a miss, if one of the victim candidates with high eviction priority is not present in inner caches, it is preferentially chosen as the victim. Otherwise, the baseline replacement policy, which provides the second level of eviction priority, selects the victim from the remaining candidates with low eviction priority. In doing so, blocks with poor locality are evicted right after they leave the inner caches, so their lifetime in the LLC is short and no inclusion victims are evicted. On the other hand, blocks with good locality are selected for replacement only when all blocks with poor locality are still in inner caches, so the LLC can retain them longer to capture more hits. TEP can cooperate with any baseline replacement policy, and it needs only negligible modifications to the existing cache design.

We evaluate TEP with NRU, LRU, and SRRIP [9]. Our experiments show that TEP can improve the performance of inclusive LLCs significantly. In particular, TEP with NRU reduces average misses by 11.0% compared to LRU and thus achieves a geometric mean speedup of 4.3% for a 512KB L2 cache. TEP with NRU also significantly outperforms other recent proposals, including QBS [7], DIP [8], and DRRIP [9], while requiring only roughly 2KB of storage overhead.

II. MOTIVATION

To illustrate why recently proposed intelligent replacement policies cannot be applied to inclusive LLCs, we evaluate the performance of DIP and DRRIP (we use 2-bit DRRIP in the experiments because it performs better) in both non-inclusive and inclusive LLCs.

Fig. 1. Normalized speedup for various replacement policies in non-inclusive and inclusive L2 caches.

Fig. 2. Example of TEP victim selection. Step 1: check B, which is in inner caches. Step 2: check D, which is not in inner caches; D is selected as the victim.

Figure 1 shows the normalized speedup for various techniques in a 512KB L2 cache over a memory-intensive subset of the SPEC CPU2000 benchmarks (the details of our experimental configuration are described in Section IV). The speedups of DIP_non and DRRIP_non, which are achieved in non-inclusive LLCs, are normalized to that of LRU in non-inclusive LLCs, and the speedups of DIP_in and DRRIP_in are normalized to that of LRU in inclusive LLCs. For non-inclusive LLCs, DIP and DRRIP achieve a geometric mean speedup of more than 2%. For inclusive LLCs, on the other hand, they no longer improve performance, and their geometric mean speedup is slightly lower than that of LRU. For benchmarks such as mgrid and galgel, DIP and DRRIP can significantly degrade performance in inclusive LLCs.

The reason for their poor performance in inclusive LLCs is that these proposals significantly increase the number of inclusion victims. We study DIP on galgel to show why. DIP adaptively chooses the better of LRU and BIP, which inserts most incoming blocks into the LRU position to avoid thrashing when the working set is larger than the cache. When LRU is used, 0.4% of the blocks evicted by the LLC are present in inner caches and cause the eviction of inclusion victims. Under BIP, because most incoming blocks are inserted into the LRU position, they are selected for replacement on the next miss to their sets. Thus, the lifetime of those blocks in the LLC is very short, which increases the probability that they are still in inner caches when they are evicted from the LLC. As a result, the number of inclusion victims increases significantly under BIP, and 45.9% of the blocks evicted by the LLC cause the eviction of inclusion victims in inner caches. The early eviction of inclusion victims causes their next accesses to miss in both the L1 and L2 caches, which increases L1 and L2 misses by 9.2% and 36.3% respectively compared to BIP in a non-inclusive LLC. As a result, the performance of BIP is degraded in inclusive LLCs, which in turn causes the poor performance of DIP.

Consequently, modern intelligent replacement policies do not work well in inclusive LLCs, and it is necessary to design a replacement policy specifically for them. The proposed policy should achieve two design goals: it should make effective use of the cache space to improve performance, while minimizing the number of inclusion victims.

III. TWO-LEVEL EVICTION PRIORITY POLICY

A. Overview

To avoid evicting inclusion victims, blocks should not be evicted from the LLC until their temporal locality in inner caches is exhausted. To improve LLC performance, the lifetime of useless blocks in the LLC should be as short as possible. We propose the Two-level Eviction Priority (TEP) policy for inclusive cache replacement to achieve these two goals.

TEP associates each LLC block with an Eviction Priority Bit (EPB) to indicate its first level of eviction priority. When a block is brought into the cache, a temporal locality predictor is employed to predict its locality. If it is predicted to have poor locality, its EPB is set to 1 to indicate high eviction priority. Otherwise, its EPB is set to 0.
On a miss, among all victim candidates in the corresponding set, blocks whose EPBs are 1 are checked first. TEP starts the search from the first physical way and proceeds to the last physical way. For each such candidate, TEP identifies whether it is in inner caches. If it is not present in inner caches, it is selected for replacement. Otherwise, TEP checks the next candidate, until a candidate which is not in inner caches is found or all blocks whose EPBs are 1 have been checked. If all blocks whose EPBs are 1 are present in inner caches, the baseline replacement policy, which provides the second level of eviction priority, selects a victim candidate among the remaining blocks whose EPBs are 0. If that candidate is in inner caches, its replacement state is updated to the Most Recently Used (MRU) position, and the baseline replacement policy selects the next candidate to restart the process. Figure 2 shows an example of the victim selection process for an 8-way LRU-managed cache.

In doing so, blocks with poor locality (those whose EPBs are 1) are evicted from the LLC right after they are removed from inner caches. Thus, their lifetime in the LLC is shortened without evicting inclusion victims. Blocks with good locality (those whose EPBs are 0) are selected for replacement only when all blocks with high eviction priority still reside in inner caches. Thus, the LLC can retain them longer to improve performance.
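As a concrete illustration of the selection procedure described above, the following C++ sketch walks the two priority levels. It is a minimal sketch rather than the authors' implementation: the Block structure and the helpers query_inner_caches(), baseline_select(), and promote_to_mru() are hypothetical placeholders, and the per-replacement query budget introduced later in Section III-C is omitted here.

    #include <cstdint>
    #include <vector>

    struct Block {
        uint64_t tag   = 0;
        bool     valid = false;
        bool     epb   = false;   // true = predicted poor locality (high eviction priority)
    };

    // Stub: a real design would probe the L1 instruction/data cache tag arrays.
    bool query_inner_caches(uint64_t /*tag*/) { return false; }

    // Stubs for the baseline policy (e.g., NRU or LRU), assumed to consider only
    // blocks whose EPB is 0 and to maintain its own replacement state.
    int  baseline_select(const std::vector<Block>& /*set*/) { return 0; }
    void promote_to_mru(std::vector<Block>& /*set*/, int /*way*/) {}

    // Returns the way to evict within one LLC set.
    int tep_select_victim(std::vector<Block>& set) {
        // Level 1: scan EPB==1 blocks in physical-way order and evict the first
        // one that is no longer resident in the inner caches.
        for (int way = 0; way < (int)set.size(); ++way)
            if (set[way].valid && set[way].epb && !query_inner_caches(set[way].tag))
                return way;

        // Level 2: every EPB==1 block is still in an inner cache, so fall back to
        // the baseline policy; a candidate that is still in inner caches is moved
        // to the MRU position and the baseline policy retries.
        while (true) {
            int way = baseline_select(set);
            if (!query_inner_caches(set[way].tag))
                return way;
            promote_to_mru(set, way);
        }
    }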

Fig. 3. The structure of TEP: a PC-indexed Locality Prediction Table (saturating counters; a value of 0 or above indicates good locality, below 0 indicates poor locality) that is consulted and updated by a Replacement History Table holding <PC, IB tag, VB tag> entries.

TEP can cooperate with any baseline replacement policy. In our experiments, we evaluate the performance of TEP when it works with the Not Recently Used (NRU) policy, LRU, and SRRIP. TEP has two key problems: how to predict the temporal locality of blocks in the LLC, and how to identify whether a block is in inner caches. We introduce them in the next two sections respectively.

B. Temporal Locality Prediction

Figure 3 shows the structure of our temporal locality predictor. On a cache miss, TEP uses a Replacement History Table (RHT) to record a pair consisting of the incoming block (IB) and the victim block (VB) selected by the baseline replacement policy. The program counter (PC) of the instruction which accesses the IB is also recorded. The temporal locality of the IB is then determined by the reuse sequence of the IB-VB pair:

Condition 1: if the IB is accessed earlier, the IB has good locality;
Condition 2: if the VB is accessed earlier, the IB has poor locality;
Condition 3: if the IB is not reused before its eviction, the IB also has poor locality.

The first and third conditions are easy to understand, because a reused block is expected to have good locality, and a block that is never reused should have poor locality. The heuristic behind the second condition is that the VB is expected to have the worst locality among all blocks in the cache. If the VB is accessed earlier than the IB, the locality of the IB is worse than that of the VB, so the IB is considered to have poor locality.

A saturating counter table called the Locality Prediction Table (LPT) is used to learn and predict the temporal locality of blocks. The LPT learns locality based on the PC of the instruction which accesses the block, because previous work has shown that PC-based prediction is more accurate than other methods [10], [12], [15]. All counters in the LPT are initialized to 0. When the locality of an IB-VB pair is learned in the RHT, the recorded PC of that pair is used to index the corresponding counter in the LPT. If the block shows good locality, the counter is increased by 1. Otherwise, it is decreased by 1. On a miss, the incoming block consults the LPT with its PC to predict its locality. If the corresponding counter is less than 0, the block is predicted to have poor locality and its EPB is set to 1. Otherwise, its EPB is set to 0. Figure 4 shows the details of our temporal locality prediction algorithm.

Fig. 4. The algorithm of temporal locality prediction:

    On an access to block x:
        for each valid entry A in the corresponding set of RHT
            if x.tag == A.IB_tag            // Condition 1
                LPT[A.PC]++; Invalidate A;
            else if x.tag == A.VB_tag       // Condition 2
                LPT[A.PC]--; Invalidate A;
        if x misses in the cache
            y = Select_Victim_Candidate();
            for each valid entry A in the corresponding set of RHT
                if y.tag == A.IB_tag        // Condition 3
                    LPT[A.PC]--; Invalidate A;
            if RHT.Record(x) == true
                B = RHT.Select_Victim(x);   // Select an entry to record x
                B.PC = x.PC; B.IB_tag = x.tag; B.VB_tag = y.tag;
            if LPT[x.PC] < 0
                x.EPB = 1;
            else
                x.EPB = 0;

We use a 128-entry, 8-way set-associative RHT in our experiments. Besides the PC, IB tag, and VB tag, each RHT entry contains 1 bit to indicate whether it is valid and 3 bits to implement the RHT's replacement policy, which is LRU.
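For readers who prefer an executable form, the following C++ sketch mirrors the training and prediction rules of Figure 4. It is illustrative rather than the authors' code: the container layout, the signed counter range of [-4, 3] for the 3-bit counters, and the helper names are assumptions, and the RHT's set-associativity, LRU replacement, and miss sampling are omitted for brevity.

    #include <array>
    #include <cstdint>

    struct RHTEntry { bool valid = false; uint16_t pc = 0, ib_tag = 0, vb_tag = 0; };

    struct LocalityPredictor {
        std::array<int8_t, 1024>  lpt{};   // 1024 saturating counters, initialized to 0
        std::array<RHTEntry, 128> rht{};   // recorded IB-VB pairs (associativity omitted)

        static int8_t clamp3(int v) { return (int8_t)(v > 3 ? 3 : (v < -4 ? -4 : v)); }

        // Conditions 1 and 2: called on every access with the block's partial tag.
        void on_access(uint16_t tag) {
            for (auto& e : rht) {
                if (!e.valid) continue;
                if (tag == e.ib_tag)      { lpt[e.pc] = clamp3(lpt[e.pc] + 1); e.valid = false; } // IB reused first: good locality
                else if (tag == e.vb_tag) { lpt[e.pc] = clamp3(lpt[e.pc] - 1); e.valid = false; } // VB reused first: poor locality
            }
        }

        // Condition 3: the chosen victim was itself a recorded IB that was never reused.
        void on_victim_selected(uint16_t victim_tag) {
            for (auto& e : rht)
                if (e.valid && victim_tag == e.ib_tag) { lpt[e.pc] = clamp3(lpt[e.pc] - 1); e.valid = false; }
        }

        // Record a new IB-VB pair for a sampled miss and predict the incoming block's EPB.
        bool record_and_predict(uint16_t pc, uint16_t ib_tag, uint16_t vb_tag) {
            pc &= 1023;   // 10-bit PC index into the LPT
            for (auto& e : rht)
                if (!e.valid) { e = RHTEntry{true, pc, ib_tag, vb_tag}; break; }
            return lpt[pc] < 0;   // true means poor locality, so the EPB would be set to 1
        }
    };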
To reduce the hardware overhead, the RHT does not record an IB-VB pair on every miss; only 1/128 of misses are recorded. Partial tags are also used to reduce the overhead of recording the IB and VB, and the RHT keeps only the lower 16 bits of the IB and VB tags. For a 1024-entry LPT, the 11th to 2nd bits of the PC are stored in the RHT (the benchmarks are compiled to Alpha binaries, and the lowest 2 bits of the PC in Alpha instructions are always 0). The shortened 10-bit PC is delivered along with the cache request through the cache hierarchy, as in prior PC-based methods [10], [12], [16].

Many techniques can predict the temporal locality of LLC blocks [8], [9], [10], [12], [16], [17], [18], and they can all potentially be used as the temporal locality predictor for TEP. Therefore, TEP is a general framework for inclusive cache replacement.

C. Inner Cache Block Awareness

To make TEP aware of whether an LLC block resides in inner caches, we use a query-based method similar to QBS [7]. When a victim candidate is chosen, TEP sends its address to the inner caches to query whether the block is present; the inner caches look up that address and return the response. To avoid sending too many queries and inflating the bus traffic, we set a threshold on the maximum number of queries allowed in one replacement. If the number of queries in one LLC replacement has reached the threshold, the next candidate is selected as the victim without querying the inner caches. Our experiments show that a qualified victim can generally be found within 2 queries, and thus the threshold is set to 2. As stated in [7], the bandwidth requirement for queries is proportional to the number of LLC misses, which is very small. Thus, the extra traffic is negligible.
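Continuing the sketch from Section III-A (and reusing its Block type and helper stubs), the following hypothetical variant shows how such a per-replacement query budget could bound the inner-cache lookups; the threshold value of 2 follows the text above.

    constexpr int kQueryThreshold = 2;   // maximum inner-cache queries per replacement

    int tep_select_victim_bounded(std::vector<Block>& set) {
        int queries = 0;
        // Level 1: EPB==1 candidates; once the budget is spent, evict without querying.
        for (int way = 0; way < (int)set.size(); ++way) {
            if (!(set[way].valid && set[way].epb)) continue;
            if (queries >= kQueryThreshold) return way;
            ++queries;
            if (!query_inner_caches(set[way].tag)) return way;
        }
        // Level 2: baseline policy over the remaining blocks, with the same budget applied.
        while (true) {
            int way = baseline_select(set);
            if (queries >= kQueryThreshold) return way;
            ++queries;
            if (!query_inner_caches(set[way].tag)) return way;
            promote_to_mru(set, way);
        }
    }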

IV. EVALUATION

A. Experimental Methodology

We use a modified version of SimpleScalar [19] as our simulator. Table I shows its detailed configuration. It models a 2-level cache hierarchy, and the L2 cache is used as the LLC. The ratio of inner cache size to LLC size is similar to that of modern processors [4], [5]. To maintain inclusion, the L2 cache sends back-invalidation messages to both the L1 icache and dcache on a replacement. Besides misses, the simulator also allocates MSHRs for the queries sent by the L2 cache to model the extra traffic introduced by those messages.

TABLE I. PARAMETERS OF THE SIMULATOR
Fetch/Issue/Commit Width: 4
Reorder Buffer: 128-entry
Load/Store Queue: 32-entry
L1 ICache: 64B blocks, 32KB, 4-way, 1 cycle, LRU
L1 DCache: 64B blocks, 32KB, 4-way, 1 cycle, LRU, 2 ports, 8 MSHRs, 16 WriteBuffers
L2 Unified Cache: 64B blocks, 512KB, 16-way, 10 cycles, 32 MSHRs, 64 WriteBuffers
Memory Latency: 150 cycles
RHT: 128-entry, 8-way
LPT: 1024-entry, 3-bit counters

We use precompiled Alpha binaries of the SPEC CPU2000 benchmarks in our experiments, which were publicly available. A subset of memory-intensive benchmarks whose compulsory misses are less than 50% is chosen, because compulsory misses cannot be reduced by any replacement policy. SimPoint [20] is used to select a representative 250 million instructions to execute for each benchmark.

B. Results for TEP with NRU

Fig. 5. Normalized L2 cache MPKI for various techniques.

Fig. 6. Speedup for various techniques.

We first evaluate the performance of TEP when NRU is used as the baseline replacement policy. Besides TEP, we also investigate the performance of various replacement policies, including LRU in a non-inclusive cache (LRU_non), QBS, DIP, and DRRIP. In addition, we study DIP+QBS and DRRIP+QBS, which extend DIP and DRRIP with QBS respectively; DIP+QBS and DRRIP+QBS query inner caches when selecting a victim candidate.

Figure 5 shows the misses per thousand instructions (MPKI) of the L2 cache for the various techniques in inclusive LLCs, normalized to that of LRU. TEP reduces the average MPKI by 11.0% compared to LRU, while the MPKI reduction of the other techniques is less than 5%. Figure 6 shows the speedup for the various techniques. The speedup is computed by dividing the IPC of each technique by that of LRU in the inclusive cache. TEP achieves a geometric mean speedup of 4.3% over LRU. Compared with the other techniques, whose best improvement is 1.3% for DRRIP+QBS, TEP performs significantly better. Since QBS is based on LRU, its performance is lower than that of LRU_non. DIP and DRRIP suffer performance loss because they cause more inclusion victims, as discussed in Section II.
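As a side note on how such numbers are aggregated, the snippet below illustrates one conventional way to compute normalized speedup and its geometric mean; the IPC values are made-up placeholders, not measurements from the paper.

    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    int main() {
        // Hypothetical per-benchmark IPC pairs: {technique, inclusive-LRU baseline}.
        std::vector<std::pair<double, double>> ipc = {{0.52, 0.45}, {1.10, 1.08}, {0.33, 0.31}};
        double log_sum = 0.0;
        for (const auto& p : ipc)
            log_sum += std::log(p.first / p.second);      // per-benchmark speedup
        double geomean = std::exp(log_sum / ipc.size());  // geometric mean across benchmarks
        std::printf("geometric mean speedup: %.1f%%\n", (geomean - 1.0) * 100.0);
        return 0;
    }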
DIP+QBS and DRRIP+QBS insert most blocks with poor locality into the LRU position, and those blocks are likely to still be in inner caches when they are selected for replacement. Thus, when QBS identifies that such a block is in inner caches, it is considered to have good locality and is updated to the MRU position. As a result, DIP+QBS and DRRIP+QBS behave and perform similarly to LRU, and they lose the opportunity to improve performance. That is why TEP uses an extra level of eviction priority to mark blocks with poor locality persistently: those blocks are checked repeatedly until they are no longer in inner caches, so TEP can evict them as early as possible.

Fig. 7. Results of temporal locality prediction.

Figure 7 shows the fraction of blocks which are predicted to have poor locality. For benchmarks on which TEP performs well, such as art and ammp, more than 50% of incoming blocks are predicted to have poor locality. The EPBs of those blocks are set to 1, and thus they are evicted earlier to release cache space for useful blocks.

C. Results for TEP with Other Policies

Fig. 8. Speedup for TEP with LRU and SRRIP.

Besides NRU, we also evaluate the performance of TEP with other baseline replacement policies, including LRU and SRRIP. As shown in Figure 8, the results are similar to those of TEP with NRU. TEP+LRU outperforms LRU by 4.2% and reduces average misses by 10.5%. TEP+SRRIP outperforms LRU by 3.8% and reduces average misses by 9.2%; compared to SRRIP itself, the performance gain of TEP+SRRIP is 3.8%. Our results show that TEP can cooperate with various replacement policies to improve performance. Among them, NRU is the simplest and requires the least hardware overhead. Therefore, we focus on TEP with NRU in the rest of this paper, and TEP refers to TEP with NRU from here on.

D. Sensitivity to the Cache Size

Fig. 9. Speedup for different cache sizes (256KB to 4MB).

Figure 9 shows the speedup of TEP for different L2 cache sizes. The L2 cache size is varied from 256KB to 4MB, and the associativity is fixed at 16. The speedup is normalized to that of LRU at each cache size. For small L2 caches, since the L1 to L2 cache size ratio is larger, the probability that an L2 block is in the L1 caches is higher. As a result, DIP and DRRIP cause more inclusion victims and harm performance. For large L2 caches, the performance of DIP and DRRIP improves because they suffer fewer inclusion victims. The performance of QBS is always lower than that of LRU_non. On the other hand, TEP improves the performance of inclusive caches at all cache sizes, although for small caches the performance gain is limited because there is less space to place useful blocks, and for large caches the working set is more likely to fit into the cache.

E. Sensitivity to the Size of RHT and LPT

Fig. 10. Sensitivity to the LPT size (1-bit to 5-bit counters).

We investigate the performance sensitivity of TEP to the RHT size. The results show that the performance is not sensitive to the RHT size; a 128-entry RHT performs well enough and more entries are unnecessary. Figure 10 shows the performance sensitivity to different LPT sizes with a 128-entry, 8-way RHT. We vary the LPT size from 128 to 4096 entries and the LPT counter width from 1 bit to 5 bits. The experimental results show that a 1024-entry LPT with 3-bit counters is enough.

F. Storage

For TEP, each cache block needs a 1-bit EPB. Each RHT entry contains a valid bit, 3 replacement bits, 10 bits to keep the shortened PC of the IB, a 16-bit IB tag, and a 16-bit VB tag. Each LPT saturating counter is 3 bits. Therefore, TEP consumes (1 x 8192 + (1 + 3 + 10 + 16 + 16) x 128 + 3 x 1024) bits = 2.09KB of extra storage in total, which is less than 0.5% of the total storage of a 512KB L2 cache. Since NRU needs 1KB of storage, TEP with NRU consumes roughly 3KB of storage in total. Compared with other recent proposals, TEP with NRU performs best while requiring a moderate storage overhead.
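As a quick sanity check of the storage arithmetic above, the following snippet reproduces the 2.09KB figure from the stated component sizes (the breakdown follows the text; the code itself is only illustrative).

    #include <cstdio>

    int main() {
        const int epb_bits = 1 * 8192;                       // 512KB / 64B = 8192 LLC blocks, 1-bit EPB each
        const int rht_bits = (1 + 3 + 10 + 16 + 16) * 128;   // valid + LRU + PC + IB tag + VB tag, 128 entries
        const int lpt_bits = 3 * 1024;                       // 1024 counters, 3 bits each
        const int total    = epb_bits + rht_bits + lpt_bits; // 17152 bits
        std::printf("TEP storage: %d bits = %.2f KB\n", total, total / 8.0 / 1024.0);   // ~2.09 KB
        return 0;
    }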

V. RELATED WORK

A. Inclusive Cache Management

To reduce inclusion victims, global replacement [21], [22] exposes the hits in inner caches to the LLC: on an inner cache hit, a message is sent to update the replacement state of that block in the LLC. For direct-associative network caches, Fletcher et al. propose three methods to reduce inclusion victims [23]: increasing the cache associativity, using a victim cache, and adding a snoop filter to relax the inclusion property. Jaleel et al. propose three Temporal Locality Aware policies to improve inclusive cache performance [7]: Temporal Locality Hint (TLH) conveys the temporal locality of inner caches by sending locality hints; Early Core Invalidation (ECI) invalidates inner cache blocks before they are evicted by the LLC to derive their locality; and Query Based Selection (QBS) queries inner caches to obtain the locality. However, those proposals focus only on reducing inclusion victims, and they cannot make effective use of the cache space to improve performance.

B. Non-inclusive or Exclusive Cache Management

Many intelligent cache management policies have recently been proposed to improve LLC performance, and they are all designed for non-inclusive or exclusive LLCs. DIP [8] dynamically inserts incoming blocks into the LRU position to avoid thrashing. RRIP [9] distinguishes blocks that are never reused from other blocks in order to evict them earlier. Pseudo-LIFO [24] uses a fill stack and tends to keep blocks at the bottom of the stack stable. SHiP [10] proposes a signature-based predictor for re-reference interval prediction. Dead block prediction techniques improve LLC performance by predicting the last touch of a block; based on how they predict, they are classified into trace-based [16], [25], counter-based [11], and time-based [17] schemes. The cache burst predictor [18] makes predictions for bursts of accesses to improve accuracy. Sampling dead block prediction [12] learns from a fraction of the sets for low overhead and high prediction accuracy. By not placing blocks with poor locality in the cache, bypass techniques can improve performance. LRF [15] uses both PC-based and address-based predictors for high performance. DSB [26] compares the reuse order of incoming and victim blocks to adjust the bypass probability. Gaur et al. propose an insertion and bypass policy for exclusive LLCs [13]. OBM [14] monitors the behavior of the optimal bypass to make bypass decisions. All of those techniques can potentially be extended with TEP to apply them to inclusive LLCs; we leave this for future work.

VI. CONCLUSION

The implementation of the cache coherence protocol can be simplified by using an inclusive cache hierarchy. However, performance is sacrificed to satisfy the inclusion property: when a block is evicted from the LLC, its copies in inner caches must be invalidated, which makes recently proposed intelligent cache management policies inapplicable to inclusive LLCs. In this paper, we propose the Two-level Eviction Priority (TEP) policy for inclusive LLCs. TEP marks blocks with poor temporal locality in the LLC; once those blocks are no longer resident in inner caches, they are selected for replacement immediately. As a result, TEP shortens the lifetime of useless blocks in the LLC to improve performance without evicting inclusion victims. Our evaluation shows that TEP can significantly improve the performance of inclusive LLCs. Although we only evaluate TEP in single-threaded environments, it is straightforward to apply TEP in multi-threaded environments, and we leave this as future work.

REFERENCES
[1] J.-L. Baer and W.-H. Wang, "On the inclusion properties for multi-level cache hierarchies," in ISCA-15, 1988.
[2] N. P. Jouppi and S. J. E. Wilton, "Tradeoffs in two-level on-chip caching," in ISCA-21, 1994.
[3] A. W. Wilson, Jr., "Hierarchical cache/bus architecture for shared memory multiprocessors," in ISCA-14, 1987.
[4] Intel, "Intel Core i7 processor."
[5] ARM, "Cortex-A5 processor."
[6] Oracle, "OpenSPARC T2."
[7] A. Jaleel, E. Borch, M. Bhandaru, S. Steely, and J. Emer, "Achieving non-inclusive cache performance with inclusive caches: Temporal Locality Aware (TLA) cache management policies," in MICRO-43, 2010.
[8] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, "Adaptive insertion policies for high performance caching," in ISCA-34, 2007.
[9] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, "High performance cache replacement using re-reference interval prediction (RRIP)," in ISCA-37, 2010.
[10] C. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. Steely Jr., and J. Emer, "SHiP: Signature-based hit predictor for high performance caching," in MICRO-44, 2011.
[11] M. Kharbutli and Y. Solihin, "Counter-based cache replacement and bypassing algorithms," IEEE Transactions on Computers, vol. 57, no. 4, April 2008.
[12] S. M. Khan, Y. Tian, and D. A. Jimenez, "Sampling dead block prediction for last-level caches," in MICRO-43, 2010.
[13] J. Gaur, M. Chaudhuri, and S. Subramoney, "Bypass and insertion algorithms for exclusive last-level caches," in ISCA-38, 2011.
[14] L. Li, D. Tong, Z. Xie, J. Lu, and X. Cheng, "Optimal bypass monitor for high performance last-level caches," in PACT-21, 2012.
[15] L. Xiang, T. Chen, Q. Shi, and W. Hu, "Less reused filter: improving L2 cache performance via filtering less reused lines," in ICS-23, 2009.
[16] A.-C. Lai, C. Fide, and B. Falsafi, "Dead-block prediction & dead-block correlating prefetchers," in ISCA-28, 2001.
[17] Z. Hu, S. Kaxiras, and M. Martonosi, "Timekeeping in the memory system: predicting and optimizing memory behavior," in ISCA-29, 2002.
[18] H. Liu, M. Ferdman, J. Huh, and D. Burger, "Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency," in MICRO-41, 2008.
[19] D. Burger and T. Austin, "The SimpleScalar tool set, version 2.0," ACM SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13-25, 1997.
[20] E. Perelman, G. Hamerly, M. Van Biesbrouck, T. Sherwood, and B. Calder, "Using SimPoint for accurate and efficient simulation," SIGMETRICS Perform. Eval. Rev., vol. 31, June 2003.
[21] M. Zahran, "Cache replacement policy revisited," in WDDD-6.
[22] R. Garde, S. Subramaniam, and G. Loh, "Deconstructing the inefficacy of global cache replacement policies," in WDDD-7.
[23] K. Fletcher, W. Speight, and J. Bennett, "Techniques for reducing the impact of inclusion in shared network cache multiprocessors," Rice ELEC technical report, 1994.
[24] M. Chaudhuri, "Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches," in MICRO-42, 2009.
[25] S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi, "Using dead blocks as a virtual victim cache," in PACT-19, 2010.
[26] H. Gao and C. Wilkerson, "A dueling segmented LRU replacement algorithm with adaptive bypassing," in JWAC-1, 2010.


Caches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer

More information

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture

Announcements. ! Previous lecture. Caches. Inf3 Computer Architecture Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for

More information

Copyright by Dong Li 2014

Copyright by Dong Li 2014 Copyright by Dong Li 2014 The Dissertation Committee for Dong Li certifies that this is the approved version of the following dissertation: Orchestrating Thread Scheduling and Cache Management to Improve

More information

A Cache Utility Monitor for Multi-core Processor

A Cache Utility Monitor for Multi-core Processor 3rd International Conference on Wireless Communication and Sensor Network (WCSN 2016) A Cache Utility Monitor for Multi-core Juan Fang, Yan-Jin Cheng, Min Cai, Ze-Qing Chang College of Computer Science,

More information

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Module 5: MIPS R10000: A Case Study Lecture 9: MIPS R10000: A Case Study MIPS R A case study in modern microarchitecture. Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R10000 A case study in modern microarchitecture Overview Stage 1: Fetch Stage 2: Decode/Rename Branch prediction Branch

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Rethinking Belady s Algorithm to Accommodate Prefetching

Rethinking Belady s Algorithm to Accommodate Prefetching Rethinking Belady s Algorithm to Accommodate Prefetching Akanksha Jain Calvin Lin Department of Computer Science The University of Texas at Austin Austin, Texas 771, USA {akanksha, lin}@cs.utexas.edu Abstract

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information