Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Bushra Ahsan and Mohamed Zahran
Dept. of Electrical Engineering, City University of New York
ahsan bushra@yahoo.com, mzahran@ccny.cuny.edu

Abstract

One of the major challenges that the NoC and on-board interconnect must face in current and future multicore chips is the skyrocketing off-chip bandwidth requirement. As the number of cores increases, the demand on the off-chip bus, memory ports, and chip pins increases, leading to bus congestion, processor stalls, and hence performance loss. Off-chip bandwidth is generated by the on-chip cache hierarchy (cache misses and cache writebacks). This paper studies the interaction among off-chip bandwidth requirement, cache performance, and overall system performance in multicore systems. The traffic from the chip to the memory is due to the writes sent from the last-level on-chip cache to the memory, or to the following level of external cache, whenever a block is replaced from the cache. We relax some constraints of the well-known LRU replacement policy; we call the result the Modified Least Recently Used (MLRU) policy, and it substantially reduces the traffic from the chip to the memory system. Simulations using the MLRU policy show a considerable decrease in writebacks, by more than 90% in some cases, with minimal performance impact. They also reveal some interesting interactions among cache performance, overall system performance, and off-chip writeback traffic.

1 Introduction

The gap between processor speed and memory speed continues to widen [1]. As the number of cores on a chip increases, off-chip bandwidth is becoming a big wall, and there is ever-increasing contention for the off-chip bus. The off-chip bandwidth is divided between traffic to memory and traffic from memory. The traffic from memory is mainly due to cache misses. The traffic from the chip to memory occurs when blocks evicted from the last-level cache (LLC) are sent down to the memory. In this paper we try to decrease the traffic from the chip, which occurs whenever there is a replacement in the LLC. We try to decrease the bandwidth requirement by modifying the replacement policy of the LLC to reduce the number of writes to the memory. The technique we use is the Modified LRU (MLRU). In the new policy, we select the first non-dirty block from LRU to LRU-m as the victim block (with m between 1 and the cache associativity) to be replaced. If no such victim is found, then the LRU block is used as the victim and is written down if dirty. This results in fewer writebacks. However, we might overwrite some blocks earlier than the regular replacement scheme would, which may result in more cache misses. We simulate the Splash-2 benchmarks to test our replacement policy. The result is a large reduction in writeback traffic. We expected some performance loss, since we replace non-LRU blocks, but the simulations revealed otherwise. Moreover, the MLRU technique requires very little hardware overhead. The rest of the paper is organized as follows. The off-chip bandwidth problem is briefly summarized in Section 2. We present our technique, the Modified LRU, in Section 3. The experimental setup, methodology, and results are given in Section 4. We conclude in Section 5.

2 Off-Chip Bandwidth Problem

On-chip speeds have grown consistently over the past several decades, creating a corresponding demand for high off-chip bandwidth. The off-chip bandwidth is shared between traffic from the LLC to memory and traffic from memory to the LLC.
The traffic from memory to the LLC occurs whenever there is a cache miss in the LLC and the corresponding block has to be fetched from memory. Reducing these misses reduces the number of blocks to be fetched and thus saves traffic coming into the chip; [5] and [6] present similar work that targets bandwidth optimization by reducing misses. The second type of traffic is that from the LLC to the memory. It is due to the writes sent to the memory whenever a dirty block in the LLC is evicted. The block chosen for eviction depends on the replacement policy of the LLC, and the victim block is then written to the memory. We aim to decrease the traffic arising from writing dirty blocks back to the memory when they are evicted from the LLC.
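Stated as a simple identity (our notation, not the paper's), with traffic counted in cache-block transfers (each bus access moves one block; see Section 4.2):

$$T_{\text{off-chip}} \;=\; \underbrace{N_{\text{miss}}}_{\text{memory} \to \text{LLC}} \;+\; \underbrace{N_{\text{wb}}}_{\text{LLC} \to \text{memory}}$$

Prior work such as [5, 6] attacks the N_miss term; MLRU attacks N_wb.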

We modify the existing technique, LRU, to reduce the number of writebacks of dirty blocks to the memory. There is a large body of research on modifying current replacement policies. Although LRU remains the most popular, other schemes such as pseudo-LRU have been proposed to reduce the hardware complexity of LRU. For example, the Modified Pseudo-LRU (MPLRU) [2] decreases cache misses by exploiting the second-chance concept of the PLRU scheme. The more the off-chip bandwidth demand increases, the higher the chance of stalls due to contention on buses and memory ports. Such stalls reduce the pressure on the NoC, but they also hurt overall performance. We are therefore looking for a scheme that relieves the pressure on the NoC without hurting performance and without increasing the pressure on the off-chip interconnect.

3 Modified LRU: Decreasing Writebacks

Whenever a new block is to be placed in the cache and the set is full, an older block is evicted, and it is written down to the memory if dirty. Under LRU, one of the most widely used replacement policies together with its many variations, the victim is the least recently used block. We modified this technique to reduce writebacks to the memory and observed the effect on performance. Our technique works as follows. Suppose we have an n-way associative cache. The blocks are ordered from position 1 to n, with the most recently used block at position 1 and the least recently used at position n. In regular LRU, whenever a new block has to be placed in the cache, the block at position n (the LRU block) is evicted. In MLRU, we choose the victim block from LRU to LRU-m: the first non-dirty block in that range is overwritten with the new block instead of the LRU block being evicted. This saves bandwidth by retaining dirty blocks in the cache as long as possible. If, however, all the blocks from LRU to LRU-m are dirty, the LRU block is evicted as in the regular replacement policy. For example, if m = 3, the technique looks for the first non-dirty block among the 3 blocks starting from the LRU position and overwrites it. We expect some performance loss due to victimizing non-LRU blocks. The aim of this paper is to study the effect of trading bandwidth for performance. The victim-selection step is sketched below.
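As a concrete illustration, here is a minimal C sketch of the selection step, assuming each set keeps its ways ordered by recency (index 0 = MRU, index n-1 = LRU, matching the paper's positions 1..n); the type and function names are ours, not taken from the paper or from SESC.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool valid;
    bool dirty;
    /* tag, data, ... omitted */
} cache_line_t;

/* MLRU victim selection for one set of an n-way cache.
 * ways[] is ordered by recency: ways[0] is MRU, ways[n-1] is LRU.
 * Scan the m least recently used ways, starting at the LRU end,
 * and pick the first non-dirty block; if all m candidates are
 * dirty, fall back to the plain LRU victim (ways[n-1]). */
static size_t pick_victim(const cache_line_t *ways, size_t n, size_t m)
{
    for (size_t i = 0; i < m && i < n; i++) {
        size_t idx = n - 1 - i;          /* LRU, LRU-1, ..., LRU-(m-1) */
        if (!ways[idx].dirty)
            return idx;                  /* clean victim: no writeback */
    }
    return n - 1;                        /* all candidates dirty: evict LRU */
}

With m = 1 the function degenerates to plain LRU. Choosing a clean victim lets the incoming block overwrite it in place, so no writeback is issued; only the all-dirty fallback can generate writeback traffic, which is how the policy trades extra misses for bandwidth.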
Table 1: Parameters for the L1 and L2 Caches

      Block Size   Associativity   Cache Size
L1    64B          2-way           32KB
L2    64B          8-way           1MB

Table 2: Percentage of Total Traffic Change for MLRU (negative numbers show a traffic increase)

Cholesky     24.15%
fft          23.78%
Radix         8.255%
Radiosity     2.234%
Raytrace     -2.25%
fmm          -7.27%
Barnes      -27.1%

4 Experimental Evaluation

In this section we present the results of our simulations to assess the efficiency of MLRU. First we present the experimental setup, then the results and their implications.

4.1 Experimental Methodology

We use the SESC simulator [4] to model our system and simulate the Splash-2 benchmarks to compare MLRU with LRU. Out of the Splash-2 suite, four benchmarks (lu, ocean, water-nsquared, water-spatial) have little or no writebacks, so we do not present their results; we examine the effect of MLRU on the rest. Each benchmark is simulated for LRU and for MLRU-m, where m is varied from 2 to the associativity of the cache (m = 1 is LRU itself). We simulate an 8-way set-associative cache. We take the number of writebacks, the number of read misses, and the total number of cycles until completion as our bandwidth, cache performance, and system performance measures, respectively. Note that we use read misses as the cache performance measure because a write miss is treated as a read miss followed by a write hit. Table 1 summarizes the cache system. We simulate 4 cores, each with private L1 data and instruction caches, sharing a non-inclusive L2. These counters can also be combined into a single traffic figure, as sketched below.
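To make the measures concrete, here is a small C sketch of how one might derive them from raw per-run counters; the struct and function names are ours (hypothetical), and the traffic definition follows Section 4.2, where each bus access transfers one cache block.

#include <stdio.h>

/* Raw per-run counters, e.g. as reported by a simulator (names are ours). */
typedef struct {
    long writebacks;   /* bandwidth measure: LLC -> memory block transfers   */
    long read_misses;  /* cache performance (a write miss counts as a read
                          miss followed by a write hit)                      */
    long cycles;       /* system performance: total cycles to completion     */
} run_stats_t;

/* Total off-chip traffic in block transfers: every bus access sends or
 * receives one cache block, so traffic = writebacks + read misses. */
static long total_traffic(run_stats_t s) {
    return s.writebacks + s.read_misses;
}

/* Percentage decrease of a counter under MLRU relative to LRU; negative
 * values indicate an increase (as in Tables 2, 4, and 5). Assumes lru > 0. */
static double pct_decrease(long lru, long mlru) {
    return 100.0 * (double)(lru - mlru) / (double)lru;
}

int main(void) {
    /* Illustrative numbers only, not measurements from the paper. */
    run_stats_t lru  = { 100000, 400000, 90000000 };
    run_stats_t mlru = {   6100, 336000, 79000000 };
    printf("writeback decrease: %.1f%%\n",
           pct_decrease(lru.writebacks, mlru.writebacks));
    printf("traffic decrease:   %.1f%%\n",
           pct_decrease(total_traffic(lru), total_traffic(mlru)));
    return 0;
}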

4.2 Experimental Results

In the first set of experiments we analyze the effect of our scheme on the number of writebacks. All of the benchmarks show a decrease in writebacks. This is because we keep dirty blocks in the cache longer and overwrite clean blocks instead, which reduces the number of writebacks to the memory.

[Figure 1: Comparison of Cycles and Total Traffic with Increasing M for the FFT Benchmark]
[Figure 2: Comparison of Cycles and Total Traffic with Increasing M for the Cholesky Benchmark]
[Figure 3: Comparison of Cycles and Total Traffic with Increasing M for the Radix Benchmark]
[Figure 4: Comparison of Cycles and Total Traffic with Increasing M for the Barnes Benchmark]

[Figure 5: Comparison of Cycles and Total Traffic with Increasing M for the FMM Benchmark]
[Figure 6: Comparison of Cycles and Total Traffic with Increasing M for the Raytrace Benchmark]

Table 3: Percentage of Writeback Decrease

Cholesky     36.8%
fft          93.9%
Radix        40.2%
Radiosity     4.73%
Raytrace    100%
fmm           5.9%
Barnes       29.30%

Table 4: Percentage Decrease in Read Misses (negative numbers show an increase in misses)

Cholesky     14%
fft          16%
Radix         2.4%
Radiosity     0%
Raytrace     -2.4%
fmm         -14.8%
Barnes      -56.9%

Table 5: Percentage Decrease in Total Cycles (negative numbers show an increase in cycles)

Cholesky      7.23%
fft          12.15%
Radix         0%
Radiosity     0.3%
Raytrace     -2.4%
fmm          -0.3%
Barnes       -7.5%

The maximum reduction in writebacks is shown by Raytrace, whose writebacks are reduced to zero. This means that most of the writebacks in that application were for local variables that are no longer needed or for register spills; that is, they are dead values. Cholesky and FFT show decreases of 36.8% and 93.9%, respectively. Table 3 summarizes the writebacks for all the benchmarks. Results shown in this and the following tables are for m = 8, to show the behavior of our technique in its extreme case. Next we examine the effect of our scheme on the number of misses. We expect the misses to increase to some extent because we victimize non-LRU blocks: we may overwrite blocks needed by the application that, under the regular LRU scheme, would still be in the cache. Hence we expect to see some increase in misses.

However, for most benchmarks the increase in misses is less than 15%, except for Barnes, which has a 56.9% increase. The high increase in misses for Barnes means that it is LRU-friendly and has high temporal locality. For FFT and Cholesky we even see a decrease in the number of misses, by 16% and 14% respectively. This means that LRU is not always the best replacement policy, and it is sometimes better to violate LRU to gain a reduction in off-chip bandwidth. Table 4 summarizes the misses for all the benchmarks. We measure the total traffic as the total number of times the bus is accessed; each time the bus is accessed, a cache block is sent or received, so the total traffic is the combined sum of the writebacks and the misses. The overall effect on the traffic is shown in Table 2. The total traffic for Cholesky and FFT decreases by 24.15% and 23.78% respectively, since both the misses and the writebacks decrease for these benchmarks. We then examine the effect on overall system performance in terms of the total number of cycles for the whole execution. For three benchmarks (FFT, Cholesky, and Radix) we see an increase in performance, since the number of cycles has decreased. This is because we decreased the delay caused by bus contention and memory port contention. For the benchmarks that have an increase in misses we expect some performance loss. Only three benchmarks show a decrease in performance; two of them see only small cycle increases (2.4% and 0.3%), while Barnes is the worst at 7.5%. This shows how severe a limitation off-chip bandwidth is. Figures 1-6 show the effect of changing the value of m on the total performance (in cycles) and the total traffic of each benchmark. Although we gain by reducing traffic, we trade off some performance.

5 Conclusions

Off-chip bandwidth continues to challenge microarchitects. As the number of cores on a chip increases, the off-chip traffic will continue to increase. In this paper we show that we can trade some performance loss for bandwidth savings. Our simulations show that the technique offers significant reductions in writebacks, of up to 93.9%, with a performance loss of no more than 7.5%. In some cases a performance gain of up to 12.15% was observed. This means that we can trade some cache performance loss for a decrease in off-chip bandwidth, which can lead to an overall system performance enhancement due to decreased off-chip contention on buses and memory. There are still several open questions that we are currently exploring. When is it beneficial to violate the LRU policy? Can we perform this violation dynamically? How does it work in a multiprogramming environment? How does the bandwidth-performance tradeoff change as we increase the number of cores?

References

[1] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth limitations of future microprocessors. In ISCA '96: Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 78-89, New York, NY, USA, 1996. ACM.

[2] Hassan Ghasemzadeh, Sepideh Mazrouee, and Mohammad Reza Kakoee. Modified pseudo-LRU replacement algorithm. In ECBS '06: Proceedings of the 13th Annual IEEE International Symposium and Workshop on Engineering of Computer Based Systems, pages 368-376, Washington, DC, USA, 2006. IEEE Computer Society.

[3] Hsien-Hsin S. Lee, Gary S. Tyson, and Matthew K. Farrens. Eager writeback: a technique for improving bandwidth utilization. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 11-21, 2000.

[4] P. M. Ortego and P. Sack. SESC: SuperESCalar Simulator. Dec. 2004.

[5] Thomas F. Wenisch, Stephen Somogyi, Nikolaos Hardavellas, Jangwoo Kim, Anastassia Ailamaki, and Babak Falsafi. Temporal streaming of shared memory. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 222-233, Washington, DC, USA, 2005. IEEE Computer Society.

[6] Michael Zhang and Krste Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. SIGARCH Comput. Archit. News, 33(2):336-345, 2005.