Stash Directory: A Scalable Directory for Many-Core Coherence. Socrates Demetriades and Sangyeun Cho
1 Stash Directory: A Scalable Directory for Many-Core Coherence. Socrates Demetriades and Sangyeun Cho. 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, February 2014.
2 Directory. Chip multiprocessors with many cores. Coherence is needed across private caches. Directory for scalable coherence solutions. [Figure: shared LLC and directory above the private L2 caches of Core 0 through Core N.]
3 Directory: Energy vs. Area. Duplicate-Tags [e.g., Piranha, Niagara T2]. Sparse [e.g., AMD Opteron]. [Figure: directory above the L2 caches of Core 0 through Core N.]
4 Directory: Energy vs. Area. Duplicate-Tags [e.g., Piranha, Niagara T2]: energy inefficient, because the number of ways to search is (L2 associativity x N cores). Sparse [e.g., AMD Opteron]. [Figure: directory holding the L2 tags of Core 0 through Core N.]
5 Directory: Energy vs. Area. Duplicate-Tags: energy inefficient. Sparse [e.g., AMD Opteron]: area inefficient. How big is enough? 2x-4x over-provisioning. [Figure: directory holding the L2 tags of Core 0 through Core N.]
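The trade-off on the two slides above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the example numbers (8-way L2, 16 cores, 4096 L2 blocks per core) are assumptions chosen for the example, not values taken from the slides.

```python
def duplicate_tag_ways(l2_assoc: int, n_cores: int) -> int:
    """Ways a duplicate-tag directory must search per lookup: L2 assoc. x N cores."""
    return l2_assoc * n_cores


def sparse_entries(l2_blocks_per_core: int, n_cores: int, provisioning: float) -> int:
    """Entries in a sparse directory provisioned at `provisioning` x the
    total number of privately cached blocks."""
    return int(l2_blocks_per_core * n_cores * provisioning)


# Hypothetical machine, used only to illustrate the formulas.
L2_ASSOC, N_CORES, L2_BLOCKS = 8, 16, 4096

print(duplicate_tag_ways(L2_ASSOC, N_CORES))        # ways searched per lookup
print(sparse_entries(L2_BLOCKS, N_CORES, 2.0))      # conventional 2x sparse
print(sparse_entries(L2_BLOCKS, N_CORES, 0.25))     # 1/4x provisioning
```

The first number grows with core count (the energy problem of duplicate tags); the second grows with the over-provisioning factor (the area problem of sparse directories).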
6 Sparse-based Directories: Area Efficiency. Conventional sparse (2-4x) [Gupta:isca90, Conway:micro10]. Clever hashing (1.5x) [Ferdman:hpca11, Sanchez:hpca12]. Coarse-grain set indexing [Alisafaee:micro12]. Disabling coherence for private pages [Cuesta:isca11]. How big is enough? Stash Directory.
7 Stash Directory. Stash is allowed to not track all cached tags. Entries that track private blocks can be silently removed from the directory. The LLC and the coherence protocol are involved to discover unregistered blocks. Contribution: Power efficient (low associativity). Space efficient (as small as 0.25x provisioning size without performance impact). Transparent (no OS support, simple design). Scalable (largely independent of core count).
8 Outline: Introduction; Directory-Induced Invalidations; Stash Directory; Evaluation; Conclusion.
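9 Directory-Induced Invalidations. An L2 miss inserts a new directory entry; the insertion evicts an existing entry; the directory then forces invalidation of the block that the evicted entry was tracking. [Figure: directory above Core 0 through Core N.]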
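The sequence on this slide can be sketched as a toy model of one set of a conventional sparse directory. This is a minimal illustration, assuming LRU replacement and a plain set of sharer ids per entry; the class and method names are hypothetical and do not come from the talk.

```python
from collections import OrderedDict


class SparseDirectorySet:
    """One set of a conventional sparse directory with LRU replacement."""

    def __init__(self, ways: int):
        self.ways = ways
        self.entries = OrderedDict()  # block address -> set of sharer core ids

    def access(self, addr: int, core: int) -> list:
        """Register `core` as a sharer of `addr` after an L2 miss.

        Returns the (victim_addr, core) invalidations the directory must
        send because a valid entry was evicted to make room.
        """
        if addr in self.entries:
            self.entries[addr].add(core)
            self.entries.move_to_end(addr)  # now most recently used
            return []
        invalidations = []
        if len(self.entries) >= self.ways:
            victim, sharers = self.entries.popitem(last=False)  # evict LRU entry
            invalidations = [(victim, c) for c in sorted(sharers)]
        self.entries[addr] = {core}
        return invalidations


d = SparseDirectorySet(ways=2)
d.access(0xA, 0)
d.access(0xB, 1)
print(d.access(0xC, 2))  # the insertion evicts 0xA and invalidates it in core 0
```

Note that the invalidated block may still be hot in its private cache; the directory evicts it only because its own set is full.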
10 Conflict Rate. [Figure: directory conflict rate and cache miss rate versus directory size. Benchmark: fluidanimate.]
11 Forcing Invalidation on Private Blocks. Inserting a directory entry causes an eviction that targets a hot private block. [Figure: directory above Core 0 through Core N.]
12 (1) Invalidation of Hot Blocks. [Figure: MRU-to-LRU orderings; directory eviction; Core 0 through Core N.]
13 (2) Causing (Unnecessary) Additional Misses. [Figure: MRU-to-LRU orderings; L2 miss; Core 0 through Core N.]
14 (3) Polluting the Directory Set. [Figure: MRU-to-LRU orderings; L2 miss; Core 0 through Core N.]
15 Forcing Invalidation on Private Blocks. PARSEC and SPLASH-2 workloads, 1/4x directory size provisioning. On average: 72% of directory-induced invalidations target private blocks; 80% of invalidated blocks will be reloaded, causing misses.
16 Outline: Introduction; Directory-Induced Invalidations; Stash Directory; Evaluation; Conclusion.
17 Stash Directory: Overview. The directory knows whether an entry is tracking a private block. If an evicted entry is private, do not enforce invalidation. Private blocks remain hidden from the directory. The LLC and the coherence protocol are involved to discover hidden blocks if necessary.
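18 Stash Directory: Silent Eviction. When a directory entry that tracks a private block (P) is evicted, mark the block as stash-hidden in the shared LLC and do not enforce invalidation in the L2 cache. [Figure: shared LLC and directory above Core 0 through Core N.]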
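19 Stash Directory: Handling False Misses. An L2 miss to a block that is marked stash-hidden in the shared LLC results in a (false) directory miss. [Figure: shared LLC above Core 0 through Core N.]
20 Stash Directory: Handling False Misses. Once the hidden copy is found, the block is unmarked in the shared LLC. [Figure: shared LLC above Core 0 through Core N.]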
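The silent-eviction and false-miss steps on the slides above can be sketched as a small behavioral model. It is a sketch under stated assumptions, not the authors' implementation: "private" is taken to mean an entry with exactly one sharer, discovery of a hidden block is modeled as querying every private cache (the slides say only that the LLC and the coherence protocol are involved), private-cache replacement is not modeled, and all names are hypothetical.

```python
from collections import OrderedDict


class StashModel:
    """Toy model of one stash directory set plus stash-hidden marks in the LLC."""

    def __init__(self, ways: int, n_cores: int):
        self.ways = ways
        self.directory = OrderedDict()                 # addr -> set of sharers (LRU order)
        self.hidden = set()                            # LLC blocks marked stash-hidden
        self.caches = [set() for _ in range(n_cores)]  # private L2 contents per core

    def _evict(self) -> list:
        victim, sharers = self.directory.popitem(last=False)  # LRU entry
        if len(sharers) == 1:
            self.hidden.add(victim)       # silent eviction: the cached copy survives
            return []
        for c in sharers:                 # shared block: invalidate as usual
            self.caches[c].discard(victim)
        return [(victim, c) for c in sorted(sharers)]

    def l2_miss(self, addr: int, core: int):
        """Handle an L2 miss; returns (invalidations, was_false_directory_miss)."""
        invalidations, false_miss = [], False
        if addr in self.directory:
            sharers = self.directory[addr]
            self.directory.move_to_end(addr)
        else:
            sharers = set()
            if addr in self.hidden:
                # False directory miss: find the hidden copy, then unmark the block.
                false_miss = True
                self.hidden.discard(addr)
                sharers = {c for c, blocks in enumerate(self.caches) if addr in blocks}
            if len(self.directory) >= self.ways:
                invalidations = self._evict()
            self.directory[addr] = sharers
        sharers.add(core)
        self.caches[core].add(addr)
        return invalidations, false_miss


m = StashModel(ways=2, n_cores=4)
m.l2_miss(0xA, 0)
m.l2_miss(0xB, 1)
print(m.l2_miss(0xC, 2))   # 0xA is evicted silently: no invalidations
print(m.l2_miss(0xA, 3))   # false directory miss: core 0's hidden copy is found
```

In this model the private copy of 0xA stays in core 0's cache across the directory eviction, which is the behavior that avoids the reload misses described earlier in the talk.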
21 Outline: Introduction; Directory-Induced Invalidations; Stash Directory; Evaluation; Conclusion.
22 Evaluation Methodology. Workloads: multithreaded benchmarks from SPLASH-2 and PARSEC 2.1. Traces: x86 traces generated using PIN and fed into a cycle-detailed cache/NoC model; 1-IPC in-order core model. Simulated machine configuration: 16-core tile-based CMP; distributed shared LLC, 16MB; L1/L2 private caches, inclusive; 4-way/8-way; 4x4 mesh NoC; distributed directory (same associativity as L2), varying size.
23 Comparison Schemes. 1. Sparse: conventional sparse directory. 2. PDC: Deactivating Coherence for Private blocks [Cuesta:isca11]. Coarse-grain classification of blocks into private/shared (page granularity). On a miss to a private block, do not invoke the coherence protocol, so private blocks are not tracked by the directory. Recovery mechanism when a page goes from private to shared. OS-supported technique. All schemes use the same sharer-vector encoding. All schemes use the same associativity (same as L2).
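The page-granularity classification that PDC relies on can be illustrated with a toy classifier. This is only a sketch of the idea as summarized on the slide: the 4 KB page size, the first-touch ownership rule, and every name below are assumptions, and the recovery actions that the real OS-supported scheme performs on a private-to-shared transition are reduced to a comment.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
SHARED = -1        # sentinel; real core ids are non-negative


class PageClassifier:
    """Classify pages as private to one core or shared (hypothetical helper)."""

    def __init__(self):
        self.owner = {}  # page number -> owning core id, or SHARED

    def needs_coherence(self, addr: int, core: int) -> bool:
        """Return True if a miss to `addr` by `core` must invoke the coherence protocol."""
        page = addr // PAGE_SIZE
        state = self.owner.get(page)
        if state is None:
            self.owner[page] = core    # first touch: page is private to this core
            return False
        if state == core:
            return False               # still private: the directory is bypassed
        if state != SHARED:
            # Private -> shared transition; the recovery mechanism would run here.
            self.owner[page] = SHARED
        return True


pc = PageClassifier()
print(pc.needs_coherence(0x1000, 0))  # first touch by core 0
print(pc.needs_coherence(0x1080, 1))  # second core touches the same page
```

Because the decision is made per page, a single shared block makes every block in its page tracked, which is the coarse-grain limitation the slide alludes to.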
24 Cache Size vs. Miss Rate: ocean. [Figure: % miss rate change for Sparse, PDC, and Stash versus directory provisioning ratio, down to 1/16x.]
25 Cache Size vs. Miss Rate: bodytrack. [Figure: % miss rate change for Sparse, PDC, and Stash versus directory provisioning ratio, down to 1/16x.]
26 Cache Size vs. Miss Rate: canneal. [Figure: % miss rate change for Sparse, PDC, and Stash versus directory provisioning ratio, down to 1/16x.]
27 Cache Performance. [Figure: normalized miss rate for directory sizes 2x, 1x, 1/2x, and 1/4x.]
28 Cache Performance. [Figure: normalized miss rate for directory sizes 2x, 1x, 1/2x, and 1/4x, and for Stash at 1/4x.] For a 1/4x directory size, Stash improves execution time by 16% on average. Similar in performance to Sparse-2x, while being 8 times smaller. False misses are few (<6% of directory misses).
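The "8 times smaller" figure follows directly from the two provisioning ratios being compared (2x for Sparse, 1/4x for Stash). A short check, with a hypothetical helper name:

```python
from fractions import Fraction


def size_ratio(baseline, candidate) -> Fraction:
    """How many times larger the baseline directory is than the candidate."""
    return Fraction(baseline) / Fraction(candidate)


print(size_ratio(2, Fraction(1, 4)))  # Sparse-2x relative to Stash at 1/4x
```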
29 Scalability. Area: can Stash remain small (1/4x)? Bandwidth: can Stash remain bandwidth efficient? [Figure: normalized miss rate and normalized bandwidth versus core count for Sparse, PDC, and Stash.]
30 Conclusion. Stash inherits the power efficiency of sparse directories. Reduces the directory size requirements significantly. Provides a transparent optimization, independent of system software, core type, and core count. Leverages a shared, on-chip last-level cache.
31 Thank you for your attention!