Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
Nikos Hardavellas, Michael Ferdman, Babak Falsafi, Anastasia Ailamaki
Carnegie Mellon and EPFL

Data Placement in Distributed Caches
- [Figure label: cache slice]
- Data placement determines performance
- Goal: place data on chip close to where they are used

Prior Work
- Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA
- ...but they suffer from shortcomings:
  - complex, high-latency lookup/coherence
  - don't scale
  - lower effective cache capacity
  - optimize only for a subset of accesses
- We need: a simple, scalable mechanism for fast access to all data

Our Proposal: Reactive NUCA
- Cache accesses can be classified at run-time
  - Each class is amenable to different placement
- Per-class block placement
  - Simple, scalable, transparent
  - No need for HW coherence mechanisms at LLC
  - Avg. speedup of 6% & 14% over shared & private
- Up to 32% speedup
- Within 5% on avg. of the ideal cache organization
- Rotational Interleaving
  - Data replication and fast single-probe lookup

Outline
- Introduction
- Access Classification and Block Placement
- Reactive NUCA Mechanisms
- Evaluation
- Conclusion

Terminology: Data Types
- Private: a single core reads or writes the data
- Shared Read-Only: several cores read the data
- Shared Read-Write: several cores read and write the data

Conventional Multicore Caches
Shared:
- Address-interleave blocks
- Pro: high effective capacity
- Con: slow access
Private:
- Each block cached locally
- Pro: fast access (local)
- Con: low capacity (replicas)
- Coherence: via indirection (distributed directory)
We want: high capacity (shared) + fast access (private)

Where to Place the Data?
- Close to where they are used!
- Accessed by a single core: migrate locally
- Accessed by many cores: replicate (?)
  - If read-only, replication is OK
  - If read-write, coherence is a problem
  - Low reuse: evenly distribute across sharers
- [Figure: placement by access type (read-only vs. read-write) and number of sharers; regions labeled migrate, share, replicate]

Methodology
Flexus: full-system cycle-accurate timing simulation
Workloads:
- OLTP: TPC-C WH, on IBM DB2 v8 and Oracle 10g
- DSS: TPC-H Qry 6, 8, 13, on IBM DB2 v8
- SPECweb99 on Apache 2.0
- Multiprogrammed: Spec2K
- Scientific: em3d
Model parameters:
- Tiled, LLC = [illegible]
- Server/Scientific wrkld.: 16 cores, 1MB/core
- Multi-programmed wrkld.: 8 cores, 3MB/core
- OoO, 2GHz, 96-entry ROB
- Folded 2D-torus: 2-cycle router, 1-cycle link
- 45ns memory

Cache Access Classification Example
- Each bubble: cache blocks shared by x cores
- Size of bubble proportional to % accesses
- y axis: % blocks in bubble that are read-write
- [Figure: bubble chart of % read-write blocks in bubble vs. number of sharers; series: Instructions, Data-Private, Data-Shared]

Cache Access Clustering
- [Figure: two bubble charts of % read-write blocks in bubble vs. number of sharers; series: Instructions, Data-Private, Data-Shared; panels: Server Apps, Scientific/MP Apps; regions labeled migrate locally, share (addr-interleave), replicate]
- Accesses naturally form 3 clusters

Instruction Replication
- Instruction working set too large for one cache slice
- Distribute in cluster of neighbors, replicate across

Rotational Interleaving
- [Figure labels: RID, log2(k)]
- Size-4 clusters: local slice + 3 neighbors
- Fast access (nearest-neighbor, simple lookup)
- Balance access latency with capacity constraints
- Equal capacity pressure at overlapped slices

Rotational Interleaving (lookup example)
- [Figure labels: RID, log2(k)]
- PC: 0xfa480
- Destination = (Addr + RID + 1) & (n - 1)

Rotational Interleaving (replication)
- Each slice caches the same blocks on behalf of any cluster
- Fast access (nearest-neighbor, simple lookup)
- Balance access latency with capacity constraints
- Equal capacity pressure at overlapped slices
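
The lookup formula from the rotational-interleaving slide, read here as Destination = (Addr + RID + 1) & (n - 1), can be tried out with a small sketch. It assumes the cluster size n is a power of two and that `addr_bits` holds the log2(n) interleaving bits of the block address. The function name and the interpretation of the result as a cluster-relative index are illustrative assumptions; the slides do not say which neighbor each index denotes.

```python
def destination_index(addr_bits: int, rid: int, n: int = 4) -> int:
    """Return the cluster-relative index of the slice that holds a block.

    Evaluates (Addr + RID + 1) & (n - 1) as read from the slide.
    addr_bits: the log2(n) interleaving bits of the block address (assumed).
    rid: rotational ID of the requesting core's slice, in [0, n).
    n: cluster size; must be a power of two so the mask acts as "mod n".
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("cluster size must be a power of two")
    return (addr_bits + rid + 1) & (n - 1)


if __name__ == "__main__":
    # For a fixed RID, the n possible interleaving values select n distinct
    # cluster members, so each block has exactly one home per cluster and a
    # single probe is enough to find it.
    for rid in range(4):
        print(rid, [destination_index(a, rid) for a in range(4)])
```

Because the mapping is a rotation of the interleaving bits, changing the RID only shifts which member serves which address bits; it never makes two address values collide on the same member.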

Classification Mechanisms
- Instruction classification: all accesses from L1-I
- Per-page classification for data: at TLB miss
- Utilize OS page table & TLB for bookkeeping info
- Page classification is accurate (<0.5% error)
- On 1st access: Core i issues Ld A, takes a TLB miss; OS records A: Private to i
- On access by another core: Core j issues Ld A, takes a TLB miss; OS changes A from Private to i to Shared

Coherence: No Need for HW Mechanisms at LLC
- Reactive NUCA placement guarantee: each R/W datum in a unique & known location
- Shared data: addr-interleave
- Private data: local slice
- Fast access, eliminates HW overhead
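
The placement guarantee (private data in the local slice, shared data address-interleaved) amounts to a function from page class to a single home slice. The sketch below is a simplified illustration: `NUM_SLICES`, `BLOCK_BYTES`, and `home_slice` are hypothetical names and values, the interleaving uses the low bits of the block address rather than any specific bit field, and instruction placement is left out.

```python
NUM_SLICES = 16   # assumption: one LLC slice per tile, 16 tiles
BLOCK_BYTES = 64  # assumption: block size is not stated on the slides


def home_slice(phys_addr: int, page_class: str, requester: int) -> int:
    """Return the only slice that may hold a data block.

    page_class is "P" (private) or "S" (shared), as recorded per page.
    """
    if page_class == "P":
        # Private data: always the requester's local slice.
        return requester
    if page_class == "S":
        # Shared data: address-interleaved across all slices, so every
        # core computes the same home without consulting a directory.
        return (phys_addr // BLOCK_BYTES) % NUM_SLICES
    raise ValueError(f"unknown page class: {page_class!r}")
```

Since every read-write block maps to exactly one slice that any core can compute locally, no second copy exists at the LLC that would need to be kept coherent.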

Evaluation
- [Figure: speedup over Private for ASR (A), Shared (S), R-NUCA (R), Ideal (I); workloads: OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, OLTP Oracle, MIX; grouped into private-averse and shared-averse workloads]
- Delivers robust performance across workloads
- Shared: same for Web, DSS; 17% for OLTP, MIX
- Private: 17% for OLTP, Web, DSS; same for MIX

Conclusions
Reactive NUCA: near-optimal block placement and replication in distributed caches
- Cache accesses can be classified at run-time
  - Each class is amenable to different placement
- Reactive NUCA: placement of each class
  - Simple, scalable, low-overhead, transparent
  - Obviates HW coherence mechanisms for LLC
- Rotational Interleaving
  - Replication + fast lookup (neighbors, single probe)
- Robust performance across server workloads
- Near-optimal placement (-5% avg. from ideal)

Questions?
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University
Flexus full-system simulator available at [URL not captured]

Backup Slides: Classification and Lookup

Classification Granularity: OS Page
- Per-block classification: high area/latency overhead
- Per-page classification (utilize OS page table)
  - Core accesses it for every access anyway (TLB)
  - Page classification is accurate (<0.5% error)
- TLB entry: P/S (1 bit), vpage, ppage
- Page table entry: P/S/I (2 bits), id (log(n) bits), vpage, ppage
- Page granularity allows simple + practical HW

Data Class Bookkeeping
- Private data: place in local slice
  - Page table entry: P, id, vpage, ppage
  - TLB entry: P, vpage, ppage
- Shared data: place in aggregate (addr-interleave)
  - Page table entry: S, id, vpage, ppage
  - TLB entry: S, vpage, ppage
  - Physical addr.: tag | id | cache index | offset

Data Classification and Lookup (first access)
- Core i: Ld A, TLB miss
- OS: allocate A; page table entry: P, i, vpage, ppage

Data Classification and Lookup (access by another core)
- Core j: Ld A, TLB miss
- OS: inval A in TLBi; core i: evict A
- Page table entry changes from P, i to S, x (vpage, ppage)
- OS: reply A
- [Figure: cores i, j, k and the OS]
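
The two steps above (first access makes a page private; an access by another core reclassifies it as shared) can be modeled as a small state machine. This is a toy sketch under stated assumptions, not the OS mechanism itself: `PageEntry`, `PageClassifier`, and `on_tlb_miss` are hypothetical names, and the TLB invalidation and eviction are only recorded in a log rather than performed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PageEntry:
    page_class: str              # "P" (private) or "S" (shared)
    owner: Optional[int] = None  # owning core while the page is private


class PageClassifier:
    """Toy model of OS-side page classification performed on TLB misses."""

    def __init__(self) -> None:
        self.table: Dict[int, PageEntry] = {}
        self.log: List[str] = []  # records the reclassification steps taken

    def on_tlb_miss(self, core: int, vpage: int) -> str:
        entry = self.table.get(vpage)
        if entry is None:
            # First access: the page becomes private to the accessing core.
            self.table[vpage] = PageEntry("P", core)
        elif entry.page_class == "P" and entry.owner != core:
            # Access by another core: invalidate the old owner's TLB entry,
            # evict the page's blocks from its slice, then mark it shared.
            self.log.append(f"inval TLB of core {entry.owner}, page {vpage:#x}")
            self.log.append(f"evict page {vpage:#x} from slice {entry.owner}")
            entry.page_class, entry.owner = "S", None
        # Private page re-missed by its owner, or already shared: no change.
        return self.table[vpage].page_class
```

In this model a page moves from private to shared at most once, so the invalidate-and-evict cost is paid only on the first access by a second core.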

Data Classification and Lookup (after reclassification)
- [Figure: cores i, j, k and the OS; page table entry: S, x, vpage, ppage]
- Fast & simple lookup for data

Misclassifications at Page Granularity
- [Figure: two bar charts of total accesses per workload (OLTP DB2, OLTP Oracle, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, MIX). Panel "Accesses from pages with multiple access types": One Class, Instructions+Data, Private+Shared Data. Panel "Access misclassifications": Private Data as Shared, Correct]
- A page may service multiple access types
- But one type always dominates accesses
- Classification at page granularity is accurate

Backup Slides: Placement

Private Data Placement
- [Figure: CDF of total accesses vs. private data (KB) for OLTP DB2, OLTP Oracle, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, MIX]
- Spill to neighbors if working set too large? NO!!! Each core runs similar threads
- Store in local slice (like in private cache)

Shared Data Placement
- [Figure: CDF of total accesses vs. shared data (KB), per workload; bar chart of total accesses by reuse (1st access, 2nd access, 3rd-4th access, 5th-8th access, 9+ access), per workload]
- Read-write + large working set + low reuse
  - Unlikely to be in local slice for reuse
- Also, next sharer is random [WMPI 04]
- Address-interleave in aggregate (like shared cache)

Instruction Placement
- [Figure: CDF of total accesses vs. instructions (KB), per workload; bar chart of total accesses by reuse (1st access, 2nd access, 3rd-4th access, 5th-8th access, 9+ access), per workload]
- Working set too large for one slice
  - Slices store private & shared data too!
  - Sufficient capacity with 4 slices
- Share in clusters of neighbors, replicate across

Backup Slides: Detailed Evaluation

Cache Accesses Breakdown
- Instr. + shared-rw dominate server workloads
- Private dominate scientific/mix

CPI Breakdown

Impact of Eliminating Coherence

Impact of Private Allocation

Impact of Instruction Replication

Instruction Clustering

Off-chip Misses
- [Figure: normalized CPI split into off-chip atomic, off-chip load, off-chip instructions, for Private (P), ASR (A), Shared (S), R-NUCA (R); workloads: OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, OLTP Oracle, MIX; grouped into private-averse and shared-averse workloads]

Evaluation: Speedup over Ideal
- [Figure: speedup over Ideal (I) for Private (P), ASR (A), Shared (S), R-NUCA (R); workloads: OLTP DB2, Apache, DSS Qry6, DSS Qry8, DSS Qry13, em3d, OLTP Oracle, MIX; grouped into private-averse and shared-averse workloads]
- Near-optimal placement: -5% on avg. from ideal
Nexus: New pproach to Replication in Distributed Shared Caches Po-n Tsai MIT CSIL poantsai@csail.mit.edu Nathan Beckmann CMU SCS beckmann@cs.cmu.edu Daniel Sanchez MIT CSIL sanchez@csail.mit.edu bstract
More informationVariability in Architectural Simulations of Multi-threaded
Variability in Architectural Simulations of Multi-threaded threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison {alaa,david}@cs.wisc.edu http://www.cs.wisc.edu/multifacet
More informationExam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence
Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,
More informationEfficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories
Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu, Norm Jouppi* Carnegie Mellon University * Hewlett-Packard
More informationV. Primary & Secondary Memory!
V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)
More informationBetter than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin Sangyeun Cho Department of Computer Science University of Pittsburgh jinlei,cho@cs.pitt.edu Abstract Private
More informationLecture 7: Implementing Cache Coherence. Topics: implementation details
Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationVirtualization and memory hierarchy
Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 24 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 2: More Multiprocessors Computation Taxonomy SISD SIMD MISD MIMD ILP Vectors, MM-ISAs Shared Memory
More informationProtoFlex Tutorial: Full-System MP Simulations Using FPGAs
rotoflex Tutorial: Full-System M Simulations Using FGAs Eric S. Chung, Michael apamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai ROTOFLEX Computer Architecture Lab at Our work in this
More informationFILTERING DIRECTORY LOOKUPS IN CMPS
Departamento de Informática e Ingeniería de Sistemas Memoria de Tesis Doctoral FILTERING DIRECTORY LOOKUPS IN CMPS Autor: Ana Bosque Arbiol Directores: Pablo Ibáñez José M. Llabería Víctor Viñals Zaragoza
More informationLecture #15: Translation, protection, sharing
Lecture #15: Translation, protection, sharing Review -- 1 min Goals of virtual memory: protection relocation sharing illusion of infinite memory minimal overhead o space o time Last time: we ended with
More informationChip-Multithreading Systems Need A New Operating Systems Scheduler
Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems
More information