Effective Prefetching for Multicore/Multiprocessor Systems


Suchita Pati and Pratyush Mahapatra

Abstract

Prefetching has been widely used by designers to hide memory access latency for applications with predictable memory access patterns. But in the age of multi-cores, using only per-core information to prefetch data into the cache hierarchy could lead to unnecessary cache pollution. Moreover, the additional coherence traffic and frequent operations such as downgrades can further reduce the overall system throughput. Additionally, in multi-threaded applications with synchronization, performance is determined by a few critical threads. Prefetching aimed at a non-critical thread could throttle critical thread progress and thus impact the system as a whole. In this project, we propose revisiting the prefetcher design questions of what, when, where and how to prefetch for multiprocessor architectures, which we believe will be extremely important for producer-consumer applications. Through our proposal targeting coherence invalidations due to prefetchers, we were able to significantly reduce coherence messages by augmenting the coherence protocol. We found that it did not impact performance of the prefetcher while also saving power. In our second proposal, on Global Prefetch Control, we characterized the usefulness of local prefetchers using parameters like Accuracy, Cache Pollution and Thread Criticality. Combining information from all these parameters, a global decision can be made regarding when to and when not to throttle the local prefetchers.

I. INTRODUCTION

With the end of Dennard scaling, we have shifted from building faster, beefier cores to having multiple smaller cores working cooperatively together. While there has been a shift towards a multi-core era, many artifacts from the single-core era still remain and are still being used as they were earlier.
In our project, we wanted to revisit this and find out whether architecture techniques borrowed from the single-core era are still relevant in a multi-core scenario. This work focuses on prefetchers and their adoption in a multicore/multiprocessor system.

We have focused on two problems with prefetchers in multicore systems. First is the problem of additional coherence messages introduced due to prefetching. In a single-core system, we prefetch aggressively since the data would either be in the private caches or in the main memory. However, in multicore systems, we also need to consider the possibility of data being present in other cores' caches. In this work we carry out studies to answer the question of whether it is worth fetching data into the private cache while inducing additional coherence messages, or whether there are other optimizations. We propose augmenting the MESI coherency protocol to ignore prefetch requests for lines which are already present in the cache hierarchy.

Author affiliation: Department of Computer Sciences, University of Wisconsin-Madison

Second is the problem of cache pollution induced at the shared last-level cache (LLC) by individual local prefetchers. While cache pollution due to prefetching is a problem faced in single-core systems as well, it takes on a completely new dimension in multicore systems, with many more permutations and combinations to be explored. We propose the idea of a Global Prefetch Control which takes into consideration cache pollution at the LLC, prefetch accuracy and thread criticality to identify the usefulness of the individual prefetchers, and throttles them if they are found useless.

The paper is organized as follows. In Section II we briefly describe some related work that has been done in this area. In Section III we introduce the simulator and the system configuration used for our experiments. We proceed to show experimental evidence on cache pollution and additional coherence messages introduced by prefetchers in a multicore system in Section IV.
In Section V we introduce our proposals for improving prefetch efficiency without impacting performance in a multicore system. In Section VI we propose the idea of Global Prefetch Control. In Section VII we present our results, Section VIII lists a few ideas for future work, and we conclude in Section IX.

II. RELATED WORK

There has been previous work on inter-core prefetching. We list a few examples here. Inter-core prefetching [2] uses the idea of helper threads to speed up compute. It uses idle threads to prefetch and uses thread migration to switch to the thread with the prefetched data. That paper also mentions the problem of cache invalidations due to prefetching but does not propose a solution for it. [3] also uses helper threads to prefetch useful data. [4] uses a spare core to prefetch data by executing the nth instance of all non-control-flow instructions. [5] considers the coherence traffic caused by invalidations and addresses it by identifying unshared regions of memory and prefetching from them. [6] proposes new metrics to identify critical threads and new hardware additions which allow critical threads to be identified at runtime. [1] uses inbuilt counters to predict critical threads. The idea of global prefetch control was first introduced in [9], which proposes controlling local prefetchers in a multiprogrammed environment using local accuracy, global pollution and memory contention. Ours is the first work that proposes a global control that also includes the aspect of thread criticality while prefetching, making it a truly global and application-performance-sensitive technique.

TABLE I
SYSTEM CONFIGURATION

Core                 Out-of-Order
L1 I Cache           32 KB, 4-way Set Associative
L1 D Cache           32 KB, 2-way Set Associative
L2 Cache             256 KB, 16-way Set Associative
L3 Cache (Shared)    32 MB, 16-way Set Associative
Coherency Protocol   MESI Directory Based
Prefetchers          2 L2 Stream Prefetchers for Instr. and Data

Fig. 1. % of prefetches in L3 leading to invalidations for a dual core system

III. ANALYSIS TOOLS AND SYSTEM CONFIGURATION

A. SIMULATOR

For our evaluation, we used ZSIM [7], which is a fast x86-64 simulator developed by Daniel Sanchez et al. It is a Pin-based simulator which focuses on simulating memory hierarchies and large, heterogeneous systems. ZSIM allowed us to scale to large core counts, and it also came with an inbuilt stream prefetcher which we could use out of the box.

B. BENCHMARKS

We used the PARSEC multithreaded benchmarks [8]. We ran Blackscholes, Dedup, Facesim, Ferret, Raytrace and Swaptions. These benchmarks were chosen since they have a mix of high and low sharing. We also enabled the MAGIC OPS hooks in PARSEC to simulate only the Region of Interest of each benchmark.

C. SYSTEM CONFIGURATION

We wanted to ensure our system mimics a modern processor, and hence we set up our configuration to be similar to that of Intel's Broadwell server. The configuration is described in Table I.

IV. INITIAL STUDIES

Coherence Traffic Due to Prefetchers: In this study, we analyzed the additional coherence traffic induced by prefetchers in a multicore system. We are particularly interested in GETS prefetch requests which lead to a downgrade in other caches. That scenario could potentially lead to further coherence traffic: the cache whose data was downgraded by the prefetch request may again ask for the data in Exclusive state, thus invalidating the prefetched copy. The prefetch would then be wasted, and it would also have led to additional coherence traffic while delaying execution in one of the cores.
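The wasted-prefetch scenario described above can be traced with a toy model. The following is a minimal sketch, not ZSIM code: the `Directory` class, its one-message-per-request and one-message-per-downgrade/invalidation accounting, and the `run` helper are all illustrative assumptions made for this example. It tracks a single cache line under MESI for two cores and counts coherence messages with and without an intervening prefetch.

```python
from enum import Enum


class State(Enum):
    I = "Invalid"
    S = "Shared"
    E = "Exclusive"
    M = "Modified"


class Directory:
    """Toy directory for ONE cache line (hypothetical, simplified accounting)."""

    def __init__(self, num_cores):
        self.state = [State.I] * num_cores
        self.messages = 0  # coherence messages seen so far

    def gets(self, core):
        """Read request; a prefetch is modeled as a GETS."""
        if self.state[core] is not State.I:
            return  # local hit, no coherence traffic
        self.messages += 1  # request to the directory
        for other, st in enumerate(self.state):
            if other != core and st in (State.M, State.E):
                self.state[other] = State.S  # owner is downgraded
                self.messages += 1
        shared = any(st is not State.I
                     for o, st in enumerate(self.state) if o != core)
        self.state[core] = State.S if shared else State.E

    def getx(self, core):
        """Write request; needs exclusive ownership."""
        if self.state[core] in (State.M, State.E):
            self.state[core] = State.M  # silent upgrade, no traffic
            return
        self.messages += 1  # request to the directory
        for other, st in enumerate(self.state):
            if other != core and st is not State.I:
                self.state[other] = State.I  # other copies are invalidated
                self.messages += 1
        self.state[core] = State.M


def run(with_prefetch):
    d = Directory(num_cores=2)
    d.getx(0)          # core 0 writes the line: I -> M
    if with_prefetch:
        d.gets(1)      # core 1's prefetcher fetches the same line
    d.getx(0)          # core 0 writes again
    return d.messages, d.state


print(run(with_prefetch=False)[0])  # 1
print(run(with_prefetch=True)[0])   # 5
```

Under these assumptions the prefetch adds four messages and core 1's prefetched copy ends up Invalid before it is ever used, which is the wasted-prefetch case the study measures.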
Figure 1 shows that a significant % of prefetch accesses in L3 lead to downgrades of cache lines in other cores. Figure 2 shows the prefetch accesses leading to downgrades as a % of all L3 accesses combined. The number stays significant, especially for high-sharing workloads like dedup and ferret. We decided to explore the impact on systems with varying numbers of cores.

Fig. 2. % of prefetches over all accesses to L3, leading to invalidations for a dual core system

Figure 3 shows that the problem exists even in systems with multiple cores. Moreover, we find that the number of coherence invalidations due to prefetches becomes a significant factor even for benchmarks with low sharing.

Fig. 3. % of prefetches over all accesses to L3, leading to invalidations for multicore systems

Cache Pollution Due to Prefetchers: We next study the problem of cache pollution by analyzing the performance of benchmarks with L2 prefetchers turned on and off. The study was done over many PARSEC benchmarks, but we chose to focus on ferret, as its per-thread IPC degraded with increasing threads/cores, implying high memory access contention. As shown in Figure 4, the per-thread IPC of ferret does not improve with prefetchers turned on, implying there is no performance gain from using prefetchers. However, from Figures 5 and 6, it can be seen that both L3 cache misses and accesses increase when prefetchers are turned on. This implies that even though the prefetchers are fetching more lines from L3 and memory, they are either evicting useful cache lines or simply having their lines evicted by other cores' prefetches, leading to larger miss rates and no performance gain. Given this observation, we were motivated to control local prefetchers using global information, i.e., cache pollution at L3, relative accuracy and thread criticality, not only to reduce this interference but also to prefetch data that is useful for improving overall application performance.

Fig. 4. IPC of ferret with increasing thread/core count with prefetchers turned off (left) and on (right)

Fig. 5. L3 Misses of ferret with increasing thread/core count with prefetchers turned off (left) and on (right)

Fig. 6. L3 Accesses of ferret with increasing thread/core count with prefetchers turned off (left) and on (right)

V. COHERENCE TRAFFIC REDUCTION FOR PREFETCHERS

Our initial studies clearly indicate that coherence downgrades due to prefetch requests are significant and should be looked at in greater detail. As systems scale and incorporate more and more cores, we find that the number of such invalidations increases further. The problem could worsen in systems with a high number of cores, since the cache-to-cache latency will be much higher due to a much more complex interconnect, and the wasted transfers also cost a large amount of energy. We propose two solutions to counter this problem:

A. No Inter Core Cache Line Fetching

Ignore a GETS request generated by a prefetcher if the data is present in the cache hierarchy.

B. No Modified/Exclusive Lines Fetching

Ignore a GETS request generated by a prefetcher if the data is present in Exclusive or Modified state in the cache hierarchy.

In both of the above proposals, we augment the cache coherence protocol. In the first proposal, we disallow any prefetch data transfer when the data is already present in the cache hierarchy. Prefetches are only allowed from memory. This helps avoid the additional cache invalidations and downgrades. To implement this proposal, we track all coherence requests sent to L3 and add a special flag that allows us to identify requests originating from prefetchers. If we find that the requested data is not in Invalid state, i.e., the data is present in one of the other caches in Shared, Modified or Exclusive state, we drop the coherence request and send a nack to the requesting core.

In the second proposal, we follow a similar approach to the first but optimize it further. Since prefetches are GETS requests, we can safely allow data that is present in Shared state to be sent to the prefetcher without downgrading permissions in other cores.
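The two filtering proposals described above can be sketched as a single decision at the directory. This is a minimal illustration under stated assumptions, not the authors' ZSIM modification: the names `Policy`, `should_nack_prefetch` and `handle_request`, and the representation of remote copies as a list of MESI states, are hypothetical.

```python
from enum import Enum


class State(Enum):
    I = "Invalid"
    S = "Shared"
    E = "Exclusive"
    M = "Modified"


class Policy(Enum):
    BASELINE = 0                # unmodified MESI: prefetches are always served
    NO_INTER_CORE = 1           # proposal A
    NO_MODIFIED_EXCLUSIVE = 2   # proposal B


def should_nack_prefetch(policy, remote_states):
    """Decide whether a prefetch GETS is dropped and nacked.

    remote_states: MESI state of the line in each of the OTHER caches
    (an assumed view of what the directory knows about the line).
    """
    if policy is Policy.BASELINE:
        return False
    if policy is Policy.NO_INTER_CORE:
        # Proposal A: any valid copy elsewhere blocks the prefetch.
        return any(st is not State.I for st in remote_states)
    # Proposal B: only copies that would need a downgrade block it;
    # Shared copies can be supplied without changing permissions.
    return any(st in (State.M, State.E) for st in remote_states)


def handle_request(is_prefetch, policy, remote_states):
    """The prefetch flag on the request selects the filtered path."""
    if is_prefetch and should_nack_prefetch(policy, remote_states):
        return "NACK"
    return "SERVE"  # demand requests are never filtered


print(handle_request(True, Policy.NO_INTER_CORE, [State.S]))          # NACK
print(handle_request(True, Policy.NO_MODIFIED_EXCLUSIVE, [State.S]))  # SERVE
print(handle_request(True, Policy.NO_MODIFIED_EXCLUSIVE, [State.M]))  # NACK
```

The only difference between the two policies in this sketch is the treatment of Shared copies, which matches the observation that a GETS for a Shared line requires no downgrade.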
The implementation of the second proposal is similar to that of the first.

VI. GLOBAL PREFETCH CONTROL

In order to reduce the pollution caused by multiple local prefetchers at the last-level cache, and to provide the illusion of a global prefetch unit whose objective is to improve overall application performance instead of local core performance, we put forth the idea of a Global Prefetch Throttler. This is a unit next to the LLC which checks each individual prefetcher's pollution at the LLC, its accuracy, and the criticality of the thread running on the core it is prefetching for, and throttles local prefetchers which aren't fetching useful data for the application. The Global Prefetch Throttler periodically measures the following:

- Prefetcher's Accuracy: Percentage of prefetches which were indeed accessed by the core.
- Prefetcher's Pollution: Number of useful cache lines evicted by the prefetcher. If a demand access arrives for a line (demand-fetched or previously prefetched) that was evicted by a prefetch, before the prefetched line itself is accessed, the prefetcher is penalized for pollution.
- Thread's Criticality: Whether the thread running on the core is in a critical section or is a trailing barrier thread.

Global Control Rules

We implement the global throttler to check the three parameters mentioned above for all the local prefetchers at certain intervals (on every 1000 L3 misses) and to control global prefetching by throttling the local prefetchers based on the rules listed in Figure 11. The thresholds for pollution and accuracy are calculated at each interval from the median of the pollution and accuracy values of all the local prefetchers, instead of being fixed to certain values. Of all the rules in the table, the two highlighted rows are the worst and best cases of prefetcher performance. The worst case is when a less critical thread with low accuracy is causing high pollution at the LLC, in which case its prefetcher is throttled. The best case is when a critical thread with high accuracy is prefetching without causing any pollution at the LLC.
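The interval check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text gives only the worst-case and best-case rows of Figure 11, so every other combination is returned as "unspecified" rather than guessed. The names `PrefetcherSample` and `decide`, the tie-breaking at the median, and the reading of "without causing any pollution" as "at or below the median" are hypothetical choices for this example.

```python
from dataclasses import dataclass
from statistics import median

INTERVAL_L3_MISSES = 1000  # check period stated in the text


@dataclass
class PrefetcherSample:
    """Counters for one local prefetcher over one interval (assumed layout)."""
    issued: int      # prefetches issued
    used: int        # prefetched lines later accessed by the core
    pollution: int   # penalties for evicting lines that were demanded again
    critical: bool   # thread in a critical section or trailing at a barrier

    @property
    def accuracy(self):
        return self.used / self.issued if self.issued else 0.0


def decide(samples):
    """Return one decision per prefetcher for the current interval."""
    # Thresholds are recomputed every interval from the medians.
    acc_threshold = median([s.accuracy for s in samples])
    pol_threshold = median([s.pollution for s in samples])
    decisions = []
    for s in samples:
        high_accuracy = s.accuracy >= acc_threshold   # tie-break is an assumption
        high_pollution = s.pollution > pol_threshold  # tie-break is an assumption
        if not s.critical and not high_accuracy and high_pollution:
            decisions.append("throttle")     # worst case from the text
        elif s.critical and high_accuracy and not high_pollution:
            decisions.append("keep")         # best case from the text
        else:
            decisions.append("unspecified")  # remaining rows of Figure 11
    return decisions


samples = [
    PrefetcherSample(issued=100, used=90, pollution=2, critical=True),
    PrefetcherSample(issued=100, used=20, pollution=40, critical=False),
    PrefetcherSample(issued=100, used=50, pollution=10, critical=False),
]
print(decide(samples))  # ['keep', 'throttle', 'unspecified']
```

Because the thresholds are medians over the current set of prefetchers, the classification is relative: a prefetcher is only "low accuracy" or "high pollution" in comparison with its peers during the same interval.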

VII. EVALUATION

A. RESULTS: COHERENCE TRAFFIC REDUCTION

Our results show that we were able to reduce the prefetch accesses served by L3 (Figure 7 and Figure 8). This is because we ignore prefetch requests to lines already present in the cache hierarchy, which leads to a reduction in coherence traffic. In addition, we should note that as we scale cores, the number of prefetch requests sent to L3 increases exponentially. In such cases our proposal bears major fruit, since we find that most of the requests are for lines already present in the cache hierarchy, and we thus end up significantly reducing coherence traffic.

Fig. 7. Number of Prefetches served by L3 for a dual core and a 32 core system. Comparing Normal Approach and Optimization 1

Fig. 8. Number of Prefetches served by L3 for a dual core and a 32 core system. Comparing both optimizations

While we didn't find any discernible impact on IPC due to our proposal, we found an improvement in L3 hit rates for some benchmarks (Figure 9 and Figure 10). We believe that, due to the streamlining of the prefetcher, we have been able to reduce coherence pollution by a small amount, leading to higher L3 hit rates in some scenarios. Understanding the exact nature of these results, and whether we could leverage them further, is part of our future work.

Fig. 9. Comparing L3 Hit Rates of Normal Approach and Optimization 1

B. RESULTS: GLOBAL PREFETCH CONTROL

The implementation of the Global Prefetch Throttler is ongoing work. We have been able to implement a unit which tracks every prefetcher's accuracy and pollution at the LLC; however, identification of critical threads and throttling of local prefetchers are yet to be implemented. There were several deadlock and simulation slowdown issues with ZSIM which we have overcome, and the implementation looks feasible now.

VIII. FUTURE WORK

There is scope for additional work in this domain to further streamline prefetching in multicore systems.
Here are a few items that we intend to look at in the future:

- Augmenting the coherence protocol further by setting a hint bit when a prefetch request is made for a line in Exclusive or Modified state. The cache line is eventually sent to the requesting prefetcher on an invalidation, on a downgrade, or when a demand request is made for it, whichever occurs first.
- Augmenting the cache and interconnect model to analyze cache-to-cache transfer latency, and using a more accurate memory simulator to simulate memory contention.
- For global throttling, considering memory contention as a fourth parameter, since aggressive prefetching can impact demand accesses at high memory contention.
- Evaluating the proposal using different types of prefetchers.

Fig. 10. Comparing L3 Hit Rates of both optimizations

IX. CONCLUSIONS

In this project, we put forth two proposals to improve prefetching in a multicore system. We first looked at the impact of prefetchers on additional coherence traffic and on invalidations/downgrades of shared cache lines. We proposed augmenting the coherence protocol to ignore prefetch requests for cache lines in Exclusive and Modified state. This allowed us to reduce coherence traffic by a significant amount while also having the unintended effect of improving L3 hit rates in some scenarios. We also propose Global Prefetch Control which, by taking a holistic view of the system state (pollution) and requirements (criticality), we believe can significantly improve overall system and application performance while also saving power wasted on useless prefetches.

Fig. 11. Global Prefetch Control Rules

REFERENCES

[1] Bhattacharjee, Abhishek, and Margaret Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM.
[2] Kamruzzaman, Md, Steven Swanson, and Dean M. Tullsen. Inter-core prefetching for multicore processors using migrating helper threads. ACM SIGPLAN Notices 46.3 (2011).
[3] Kim, Dongkeun, et al. Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors. Code Generation and Optimization, CGO International Symposium on. IEEE, 2004.
[4] Ganusov, Ilya, and Martin Burtscher. Future execution: A hardware prefetching technique for chip multiprocessors. Parallel Architectures and Compilation Techniques, PACT International Conference on. IEEE.
[5] Cantin, Jason F., Mikko H. Lipasti, and James E. Smith. Stealth prefetching. ACM SIGOPS Operating Systems Review. Vol. 40. No. 5. ACM.
[6] Du Bois, Kristof, et al. Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior. ACM SIGARCH Computer Architecture News 41.3 (2013).
[7] Sanchez, Daniel, and Christos Kozyrakis. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer Architecture News. Vol. 41. No. 3. ACM.
[8] Bienia, Christian, et al. The PARSEC benchmark suite: Characterization and architectural implications. Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM.
[9] Lee, J., Lakshminarayana, N. B., Kim, H., & Vuduc, R. (2010, December). Many-thread aware prefetching mechanisms for GPGPU applications. In Microarchitecture (MICRO), Annual IEEE/ACM International Symposium on. IEEE.


More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

Performance and Power Solutions for Caches Using 8T SRAM Cells

Performance and Power Solutions for Caches Using 8T SRAM Cells Performance and Power Solutions for Caches Using 8T SRAM Cells Mostafa Farahani Amirali Baniasadi Department of Electrical and Computer Engineering University of Victoria, BC, Canada {mostafa, amirali}@ece.uvic.ca

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems : A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University

More information

CS 838 Chip Multiprocessor Prefetching

CS 838 Chip Multiprocessor Prefetching CS 838 Chip Multiprocessor Prefetching Kyle Nesbit and Nick Lindberg Department of Electrical and Computer Engineering University of Wisconsin Madison 1. Introduction Over the past two decades, advances

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Techniques to Reduce Cache Misses Victim caches Better replacement policies pseudo-lru, NRU Prefetching, cache

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

Locality-Aware Data Replication in the Last-Level Cache

Locality-Aware Data Replication in the Last-Level Cache Locality-Aware Data Replication in the Last-Level Cache George Kurian, Srinivas Devadas Massachusetts Institute of Technology Cambridge, MA USA {gkurian, devadas}@csail.mit.edu Omer Khan University of

More information

Lock Elision and Transactional Memory Predictor in Hardware. William Galliher, Liang Zhang, Kai Zhao. University of Wisconsin Madison

Lock Elision and Transactional Memory Predictor in Hardware. William Galliher, Liang Zhang, Kai Zhao. University of Wisconsin Madison Lock Elision and Transactional Memory Predictor in Hardware William Galliher, Liang Zhang, Kai Zhao University of Wisconsin Madison Email: {galliher, lzhang432, kzhao32}@wisc.edu ABSTRACT Shared data structure

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking Bekim Cilku, Daniel Prokesch, Peter Puschner Institute of Computer Engineering Vienna University of Technology

More information

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies

More information

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation!

Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation! Xiangyao Yu 1, Christopher Hughes 2, Nadathur Satish 2, Onur Mutlu 3, Srinivas Devadas 1 1 MIT 2 Intel Labs 3 ETH Zürich 1 High-Bandwidth

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013

Computer Systems Research in the Post-Dennard Scaling Era. Emilio G. Cota Candidacy Exam April 30, 2013 Computer Systems Research in the Post-Dennard Scaling Era Emilio G. Cota Candidacy Exam April 30, 2013 Intel 4004, 1971 1 core, no cache 23K 10um transistors Intel Nehalem EX, 2009 8c, 24MB cache 2.3B

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

740: Computer Architecture, Fall 2013 SOLUTIONS TO Midterm I

740: Computer Architecture, Fall 2013 SOLUTIONS TO Midterm I Instructions: Full Name: Andrew ID (print clearly!): 740: Computer Architecture, Fall 2013 SOLUTIONS TO Midterm I October 23, 2013 Make sure that your exam has 15 pages and is not missing any sheets, then

More information

Dynamic Performance Tuning for Speculative Threads

Dynamic Performance Tuning for Speculative Threads Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Advanced Caches. ECE/CS 752 Fall 2017

Advanced Caches. ECE/CS 752 Fall 2017 Advanced Caches ECE/CS 752 Fall 2017 Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen and Mark Hill Updated by Mikko Lipasti Read on your own: Review:

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information

SCALING HARDWARE AND SOFTWARE

SCALING HARDWARE AND SOFTWARE SCALING HARDWARE AND SOFTWARE FOR THOUSAND-CORE SYSTEMS Daniel Sanchez Electrical Engineering Stanford University Multicore Scalability 1.E+06 10 6 1.E+05 10 5 1.E+04 10 4 1.E+03 10 3 1.E+02 10 2 1.E+01

More information

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications

More information

ichat: Inter-Cache Hardware-Assistant Data Transfer for Heterogeneous Chip Multiprocessors

ichat: Inter-Cache Hardware-Assistant Data Transfer for Heterogeneous Chip Multiprocessors 2014 9th IEEE International Conference on Networking, Architecture, and Storage ichat: Inter-Cache Hardware-Assistant Data Transfer for Heterogeneous Chip Multiprocessors Junli Gu #1, Bradford M. Beckmann

More information

Cache Coherence Protocols for Chip Multiprocessors - I

Cache Coherence Protocols for Chip Multiprocessors - I Cache Coherence Protocols for Chip Multiprocessors - I John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 5 6 September 2016 Context Thus far chip multiprocessors

More information

anced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer

anced computer architecture CONTENTS AND THE TASK OF THE COMPUTER DESIGNER The Task of the Computer Designer Contents advanced anced computer architecture i FOR m.tech (jntu - hyderabad & kakinada) i year i semester (COMMON TO ECE, DECE, DECS, VLSI & EMBEDDED SYSTEMS) CONTENTS UNIT - I [CH. H. - 1] ] [FUNDAMENTALS

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip Codesign for Tiled Manycore Systems Mingyu Wang and Zhaolin Li Institute of Microelectronics, Tsinghua University, Beijing 100084,

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Improving Cloud Application Performance with Simulation-Guided CPU State Management

Improving Cloud Application Performance with Simulation-Guided CPU State Management Improving Cloud Application Performance with Simulation-Guided CPU State Management Mathias Gottschlag, Frank Bellosa April 23, 2017 KARLSRUHE INSTITUTE OF TECHNOLOGY (KIT) - OPERATING SYSTEMS GROUP KIT

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

SCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS

SCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS SCHEDULING ALGORITHMS FOR MULTICORE SYSTEMS BASED ON APPLICATION CHARACTERISTICS 1 JUNG KYU PARK, 2* JAEHO KIM, 3 HEUNG SEOK JEON 1 Department of Digital Media Design and Applications, Seoul Women s University,

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

SEESAW: Set Enhanced Superpage Aware caching

SEESAW: Set Enhanced Superpage Aware caching SEESAW: Set Enhanced Superpage Aware caching http://synergy.ece.gatech.edu/ Set Associativity Mayank Parasar, Abhishek Bhattacharjee Ω, Tushar Krishna School of Electrical and Computer Engineering Georgia

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Module 10: "Design of Shared Memory Multiprocessors" Lecture 20: "Performance of Coherence Protocols" MOESI protocol.

Module 10: Design of Shared Memory Multiprocessors Lecture 20: Performance of Coherence Protocols MOESI protocol. MOESI protocol Dragon protocol State transition Dragon example Design issues General issues Evaluating protocols Protocol optimizations Cache size Cache line size Impact on bus traffic Large cache line

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

arxiv: v1 [cs.ar] 13 Aug 2017

arxiv: v1 [cs.ar] 13 Aug 2017 Sensitivity Analysis of Core Specialization s Prathmesh Kallurkar Microarchitecture Research Lab Intel Corporation e-mail: prathmesh.kallurkar@intel.com Smruti R. Sarangi Department of Computer Science

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Performance Balancing: Software-based On-chip Memory Management for Effective CMP Executions

Performance Balancing: Software-based On-chip Memory Management for Effective CMP Executions Performance Balancing: Software-based On-chip Memory Management for Effective CMP Executions Naoto Fukumoto, Kenichi Imazato, Koji Inoue, Kazuaki Murakami Department of Advanced Information Technology,

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

Towards Fair and Efficient SMP Virtual Machine Scheduling

Towards Fair and Efficient SMP Virtual Machine Scheduling Towards Fair and Efficient SMP Virtual Machine Scheduling Jia Rao and Xiaobo Zhou University of Colorado, Colorado Springs http://cs.uccs.edu/~jrao/ Executive Summary Problem: unfairness and inefficiency

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

Atomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols. Dana Vantrease, Mikko Lipasti, Nathan Binkert

Atomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols. Dana Vantrease, Mikko Lipasti, Nathan Binkert Atomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols Dana Vantrease, Mikko Lipasti, Nathan Binkert 1 Executive Summary Problem: Cache coherence races make protocols complicated

More information

Module 14: "Directory-based Cache Coherence" Lecture 31: "Managing Directory Overhead" Directory-based Cache Coherence: Replacement of S blocks

Module 14: Directory-based Cache Coherence Lecture 31: Managing Directory Overhead Directory-based Cache Coherence: Replacement of S blocks Directory-based Cache Coherence: Replacement of S blocks Serialization VN deadlock Starvation Overflow schemes Sparse directory Remote access cache COMA Latency tolerance Page migration Queue lock in hardware

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information