Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review


Bijay K. Paikaray, Dept. of CSE, CUTM, Bhubaneswar, India
Debabala Swain, Dept. of CSE, CUTM, Bhubaneswar, India

Abstract: Current-day processors employ a multi-level cache hierarchy with one or two levels of private caches and a shared last-level cache (LLC). An efficient cache replacement policy at the LLC is essential for reducing off-chip memory transfers as well as contention for memory bandwidth. Cache replacement techniques designed for inclusive LLCs may not be efficient for a multi-level cache, since the LLC can be shared by a large number of simultaneously running applications with varying access behavior. One application may dominate another by flooding the cache with requests and evicting the other application's useful data. From a performance point of view, an exclusive LLC makes the replacement policy more demanding than an inclusive LLC. This paper analyzes some of the existing replacement techniques for the LLC together with an assessment of their performance.

Keywords: Cache Performance, Multi-Level Cache, Last Level Cache (LLC).

1. Introduction

A cache memory is mainly designed to alleviate the speed gap between the processor and main memory, its effectiveness being determined by its access time and miss rate. Caching was initially performed by a single cache on a uniprocessor, but as processor applications and their competition for memory grew, multi-level caches and multi-core processors became necessary in computer systems. The first level (L1 cache) is closest to the processor and is accessed directly, while the second or third level (L2 or L3 cache), referred to as the LLC (Last-Level Cache), is designed to hide the long miss penalty of accessing main memory, which takes hundreds of processor cycles in current microprocessors. System performance strongly depends on the performance of the cache hierarchy. Thus, much research has been devoted to improving cache performance, although it usually focuses on a single level of the hierarchy (i.e., L1, L2, or L3). Techniques like load bypassing, way prediction, and prefetching have been widely investigated and implemented in many commercial products. Although these techniques have been successfully applied in typical monolithic processors, the performance of the LLC in particular remains a major design issue in current processors. Nowadays, large LLCs are designed to hold more data and reduce capacity misses, with sizes ranging from several hundred KB up to several MB [3][4]. To keep the number of conflict misses low, current LLCs implement a large number of ways (e.g., 16 or 32). Most caches exploit temporal locality by implementing the Least Recently Used (LRU) replacement algorithm. This algorithm acts as a stack that places the most recently used block at the top and the least recently used block at the bottom; the bottom block is the one evicted on a replacement.
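
A minimal sketch of this stack behavior for a single cache set is shown below (illustrative Python, not taken from any of the surveyed papers; the class and method names are hypothetical).

```python
# Minimal sketch of LRU replacement for one cache set (illustrative only).
class LRUSet:
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.stack = []  # index 0 = MRU (top), last index = LRU (bottom)

    def access(self, block):
        """Reference a block: move it to the MRU position, evicting the LRU
        block if the set is full and the block is not already present."""
        victim = None
        if block in self.stack:
            self.stack.remove(block)      # hit: pull the block out of the stack
        elif len(self.stack) == self.num_ways:
            victim = self.stack.pop()     # miss in a full set: evict the bottom (LRU) block
        self.stack.insert(0, block)       # place the referenced block on top (MRU)
        return victim

# Example: a 4-way set; accessing E evicts A, the least recently used block.
s = LRUSet(4)
for b in ["A", "B", "C", "D", "E"]:
    evicted = s.access(b)
print(evicted)  # -> "A"
```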

2. Literature Review

This section reviews different cache eviction policies that have been proposed for both unicore and multicore processors. System performance strongly depends on cache hierarchy performance, yet most proposals focus on a single level of the hierarchy (i.e., L1, L2, or L3). Techniques like load bypassing, way prediction, and prefetching have been widely investigated and implemented in many commercial products. Although these techniques have been successfully applied in typical monolithic processors, the pressure on the memory controller is much higher in multicore and manycore systems than in monolithic processors. Therefore, the performance of the cache hierarchy in general, and of the LLC in particular, is a major design concern in current processors. Caches at the first level (i.e., L1) tend to be very small compared to the other levels, so it is important to handle the available space efficiently. Some schemes use information about the past behavior of a given block, i.e., reuse information, to improve cache performance. The NTS approach by Tyson et al. [5] and the MAT proposed by Johnson et al. [6] are examples: the former marks a block as cacheable based on its reuse information, while the latter classifies blocks as temporal or non-temporal based on their reuse information during their past lifetime. Rivers et al. [7] propose

to exploit reuse information based on both the effective address of the referenced data and the program counter of the load instruction. In [8], Lin and Reinhardt proposed a hardware approach that predicts when to evict a block before it reaches the bottom of the LRU stack. Their first approach, sequence-based prediction, records and predicts the sequence of memory events leading up to the last touch of a block. The second, time-based approach tracks a line's timing to predict when its last touch will likely occur. In [9], Kharbutli et al. proposed counter-based L2 cache replacement, which likewise predicts when to evict a block before it reaches the bottom of the stack. Two approaches are presented for this prediction: the Access Interval Predictor (AIP) and the Live-time Predictor (LvP) [10]. The AIP scheme uses counters to track the number of accesses to the same set during a given access interval of a cache line; once a line's counter reaches a threshold value, that line can be selected for replacement. LvP instead counts the number of accesses to each individual line rather than to the set.
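
The counter-based idea behind AIP can be sketched as follows (illustrative Python; the single fixed threshold and the data-structure names are assumptions made for brevity, not details of the actual proposal).

```python
# Sketch in the spirit of access-interval prediction (illustrative only):
# each line counts accesses to its set since it was last touched; once the
# count exceeds a threshold, the line is predicted dead and preferred as victim.
THRESHOLD = 8  # assumed fixed threshold for this sketch

class Line:
    def __init__(self, tag):
        self.tag = tag
        self.interval = 0  # accesses to this set since this line was last touched

def access_set(lines, tag):
    for line in lines:
        if line.tag == tag:
            line.interval = 0      # the referenced line restarts its access interval
        else:
            line.interval += 1     # every other line ages by one set access

def pick_victim(lines):
    # Prefer a line whose access interval exceeded the threshold (predicted dead);
    # otherwise fall back to the line with the largest interval.
    dead = [l for l in lines if l.interval >= THRESHOLD]
    pool = dead if dead else lines
    return max(pool, key=lambda l: l.interval)

# Example: after a burst of accesses to A, line D has gone longest untouched.
lines = [Line(t) for t in ("A", "B", "C", "D")]
for ref in ["A", "A", "B", "A", "C", "A", "A", "A", "A", "A"]:
    access_set(lines, ref)
print(pick_victim(lines).tag)  # -> "D"
```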

3. Recently Proposed Schemes

This section presents research works related to placement and replacement schemes at the Last-Level Cache (LLC), along with their performance analysis.

3.1. Replacement Schemes in Unicore Processors

The counter-based replacement technique [9] for a unicore LLC predicts the access interval using a counter for each cache line. All counters in a set are incremented on an access, and a cache line whose counter value exceeds a given threshold is selected as the victim. The use of reuse information during victim selection for a unicore LLC has been exploited in [11], which predicts the reuse distance and uses the predicted values for cache eviction. On a cache miss, if the predicted reuse distance of the memory reference is higher than the reuse distance seen by all cache lines in the set, the requested data is sent directly to the processor without being stored in the cache. Otherwise, the cache line with the highest reuse distance is replaced with the requested data. In the Adaptive Weight Ranking Policy [12], the referenced blocks are ranked using three factors, F_i, R_i, and N, where F_i is the frequency index holding the access frequency of each block i in the buffer, R_i is the recency index recording the clock of the last access to the block, and N is the total number of accesses made. The weighting value of block i is then computed as a function of these three factors.
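
As an illustration of ranking blocks by a weight derived from F_i, R_i, and N, a minimal sketch follows (hypothetical Python; the particular combination of the three factors used here is an assumption and is not the actual AWRP formula from [12]).

```python
# Illustrative ranking of blocks by a weight built from frequency (F_i) and
# recency (R_i) over N total accesses. The combination below is an assumed
# placeholder, NOT the actual AWRP weighting formula from [12].
def weight(F_i, R_i, N):
    return F_i + R_i / N   # assumed: favor frequently and recently used blocks

def choose_victim(blocks, N):
    # blocks: dict mapping block id -> (F_i, R_i); evict the lowest-weight block
    return min(blocks, key=lambda b: weight(*blocks[b], N))

victim = choose_victim({"A": (5, 90), "B": (1, 10), "C": (3, 70)}, N=100)
print(victim)  # -> "B": the least frequently and least recently used block
```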

3.2 Replacement Schemes in Multicore Processors

3.2.1 MRUT-based Prediction and Replacement [10]

MRU-based algorithms built on the concepts of the live and dead times of a block have been widely used in cache research [13]. The generation time of a block is the time elapsed from when the block is fetched into the cache until it is replaced, and it can be divided into live and dead times. The live time is the time from when the block is fetched until its last access before replacement, and the dead time is the time from that last access until eviction. Figure 1 illustrates the concept of an MRU-Tour (MRUT) in the context of the live time of cache block A, assuming the LRU replacement policy.

Figure 1. Generation time of cache block A [10].

Initially, block A is allocated at time t1 in the MRU position at the top of the LRU stack. The block keeps this position while it is being accessed, and leaves it when block B is accessed; at this point we say that block A has finished its first MRUT. After block B is accessed, block A is referenced again, returning to the MRU position and starting a second MRUT. At time t2, block A finishes its third MRUT, which is the last MRU-Tour of this block before it leaves the cache at time t3. A block may thus experience a single MRUT or multiple MRUTs before being evicted, and MRUT-based replacement algorithms exploit this distinction.
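
A rough sketch of how the number of MRU-Tours of each block in a set could be counted under LRU is given below (illustrative Python; the replacement decisions of the actual MRUT-based algorithms [10] are not modeled here).

```python
# Sketch: count MRU-Tours (MRUTs) per block in one LRU-managed set (illustrative).
# A block starts a new tour each time it (re)enters the MRU position.
def count_mruts(trace, num_ways=4):
    stack, tours = [], {}
    for block in trace:
        already_mru = stack and stack[0] == block
        if block in stack:
            stack.remove(block)
        elif len(stack) == num_ways:
            stack.pop()                    # LRU eviction on a miss in a full set
        stack.insert(0, block)             # the block moves to the MRU position
        if not already_mru:
            tours[block] = tours.get(block, 0) + 1   # a new MRU-Tour begins
    return tours

# Example in the spirit of Figure 1: block A returns to the MRU position several
# times, so it accumulates multiple MRUTs before eviction.
print(count_mruts(["A", "A", "B", "A", "C", "A", "D", "E"]))
# -> {'A': 3, 'B': 1, 'C': 1, 'D': 1, 'E': 1}
```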

Figure 2 depicts the results for the SPEC2000 benchmark suite [14] with a 1 MB, 16-way L2 cache under the LRU algorithm.

Figure 2. Number of replacements split into single and multiple MRUTs [10].

3.2.2 The Application-Aware Cache Replacement [1]

This policy prevents a low access rate application from being victimized by a high access rate application. To make the policy access rate aware, separate local eviction priority chains are maintained for the different cores. The ACR technique updates the eviction priority of only those cache lines that belong to the referencing application and maintains a separate eviction priority chain for each application. It dynamically keeps track of the maximum lifetime of cache lines in the shared LLC for each concurrently running application, which helps in efficient utilization of the cache space. Experimental evaluation of the ACR technique for 2-core and 4-core systems using the SPEC CPU 2000 and 2006 benchmark suites shows significant speed-up over the least recently used and thread-aware dynamic re-reference interval prediction techniques. Each application has different reuse patterns, which can change during the various stages of its execution.
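
The core bookkeeping of ACR, separate eviction priority chains per application, can be sketched as follows (illustrative Python; ACR's lifetime-based victim selection [1] is not reproduced, and a simple placeholder heuristic is used instead).

```python
# Sketch: per-application recency chains for a shared LLC set (illustrative).
# A reference only reorders the chain of the application that issued it, so a
# high-access-rate application cannot age the lines of a low-access-rate one.
from collections import defaultdict

class SharedSet:
    def __init__(self):
        self.chains = defaultdict(list)  # application id -> blocks, MRU first

    def access(self, app, block):
        chain = self.chains[app]
        if block in chain:
            chain.remove(block)
        chain.insert(0, block)           # only this application's chain is updated

    def victim(self):
        # Placeholder policy: evict the LRU block of the application currently
        # holding the most lines in the set. The real ACR decision uses
        # per-application maximum-lifetime tracking, not modeled here.
        app = max(self.chains, key=lambda a: len(self.chains[a]))
        return app, self.chains[app].pop()
```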

SPEC benchmark   APKI      Miss Rate (%)     SPEC benchmark    APKI    Miss Rate (%)
164.gzip         1.22      17.08             429.mcf           64.47   90.91
168.wupwise      3.01      99.13             435.gromacs       1.72    19.59
171.swim         22.89     99.98             437.leslie3d      9.15    82.64
172.mgrid        12.32     64.95             444.namd          0.68    98.68
173.applu        20.16     99.92             450.soplex        2.94    35.67
175.vpr          11.78     27.48             454.calculix      0.91    62.92
177.mesa         0.72      91.53             456.hmmer         2.14    71.36
178.galgel       14.09     43.91             458.sjeng         0.37    79.98
179.art          129.64    78.81             459.GemsFDTD      0.006   70.98
186.crafty       0.58      9.65              462.libquantum    6.72    99.64
193.fma3d        0.00051   100               464.h264ref       0.88    10.41
300.twolf        15.24     32.37             470.lbm           32.07   99.99
401.bzip2        5.18      43.57

Table 1. LLC statistics (APKI and miss rate) for SPEC CPU 2000 and 2006 benchmarks [14].

Figure 3. Comparative performance analysis between ACR and IA-DRRIP for different workloads.

4. Conclusion

In this paper we presented different cache management mechanisms that determine the cache placement and replacement policies on a per-block basis. The key idea is to predict a missed block's temporal locality before inserting it into the cache and to choose the appropriate insertion policy for that block. Our future work includes developing and analyzing other temporal locality prediction schemes and investigating the interaction of such mechanisms with prefetching. We also plan a practical, low-overhead implementation of such a scheme, evaluated through simulation with different benchmarks, to demonstrate significant performance gains compared to other similar approaches on a wide variety of workloads and system configurations. It will be tested for both unicore and multicore processors.

References

[1] Tripti S. Warrier, B. Anupama, and Madhu Mutyam, An Application-Aware Cache Replacement Policy for Last-Level Caches, ARCS 2013, LNCS 7767, pp. 207-219, 2013.
[2] M. Chaudhuri and S. Subramoney, Bypass and Insertion Algorithms for Exclusive Last-Level Caches, in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), pp. 81-92, 2011.
[3] B. Stackhouse et al., A 65 nm 2-Billion Transistor Quad-Core Itanium Processor, IEEE Journal of Solid-State Circuits, 44(1):18-31, 2009.
[4] R. Kalla et al., Power7: IBM's Next-Generation Server Processor, IEEE Micro, 30:7-15, 2010.

[5] G. Tyson et al., A Modified Approach to Data Cache Management, in Proceedings of the 28th Annual International Symposium on Microarchitecture, pp. 93-103, 1995.
[6] T. L. Johnson et al., Run-Time Cache Bypassing, IEEE Transactions on Computers, 48:1338-1354, December 1999.
[7] J. A. Rivers et al., Utilizing Reuse Information in Data Cache Management, in Proceedings of the 12th International Conference on Supercomputing, pp. 449-456, 1998.
[8] W.-F. Lin and S. Reinhardt, Predicting Last-Touch References under Optimal Replacement, Technical Report CSE-TR-447-02, University of Michigan, 2002.
[9] M. Kharbutli and Y. Solihin, Counter-Based Cache Replacement and Bypassing Algorithms, IEEE Transactions on Computers, 57:433-447, 2008.
[10] Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro López, and José Duato, MRU-Tour-based Replacement Algorithms for Last-Level Caches, in Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing, 2011.
[11] G. Keramidas, P. Petoumenos, and S. Kaxiras, Cache Replacement Based on Reuse-Distance Prediction, in Proceedings of the International Conference on Computer Design, 2007.
[12] D. Swain, B. Paikaray, and D. Swain, AWRP: Adaptive Weight Ranking Policy for Improving Cache Performance, Journal of Computing, volume 3, issue 2, February 2011, ISSN 2151-9617.
[13] D. Wood, M. D. Hill, and R. E. Kessler, A Model for Estimating Trace-Sample Miss Ratios, in Proceedings of the 1991 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 79-89, 1991.
[14] Standard Performance Evaluation Corporation, available online at http://www.spec.org/cpu2000.