Predictable Cache Coherence for Multi- Core Real-Time Systems
|
|
- Jodie Harper
- 5 years ago
- Views:
Transcription
1 Predictable Cache Coherence for Multi- Core Real-Time Systems Mohamed Hassan, Anirudh M. Kaushik and Hiren Patel RTAS 2017
2 Motivation: Data sharing in multi-core real-time systems Core 1 RT tasks L1 D/I $ P1 Private data Interconnect LLC/Main memory S1 Shared data PAGE 2
3 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] T1 T2 T3 P1 P2 S1 S2 P1 P2 S3 S4 Simpler timing analysis Hardware support Long execution time PAGE 3
4 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] Task scheduling on shared data [ECRTS 10, RTSS 16] Scheduler T1 T2 T3 T1 T2 T3 P1 P2 P1 S1 P2 S1 P1 S3 S1 P1 S3 S2 P2 S4 S2 P2 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 4
5 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] Task scheduling on shared data [ECRTS 10, RTSS 16] Scheduler Software cache coherence [SAMOS 14] T1 T2 T3 P1 P2 T1 P1 T2 S1 T3 P2 T1 P1 T2 S1 /* Access T3 to * private data */ P2 /* Access to * shared data */ S1 P1 S3 S1 P1 S3 S1 P1 S3 S2 P2 S4 S2 P2 S4 S2 P2 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 5 Private cache hits on shared data Hardware support Source code changes
6 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] S1 S2 P1 P2 S3 S4 Task scheduling on shared data [ECRTS 10, RTSS 16] S1 S2 P1 P2 S3 S4 Software cache coherence [SAMOS 14] Scheduler /* Access to T1 T2 T3 T1 T2 T3 T1 T2 T3 * private data */ P1Disabling P2 simultaneous P1 S1 cached P2 shared data P1 S1for timing P2 /* Access to analysis at the cost of lower performance and OS and * shared data */ application changes. S1 S2 P1 P2 S3 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 6 Private cache hits on shared data Hardware support Source code changes
7 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] S1 Task scheduling on shared data [ECRTS 10, RTSS 16] Software cache coherence [SAMOS 14] Scheduler T1 T2 T3 T1 T2This work: T3 T1 T2 /* Access to T3 * private data */ Allowing P1 predictable P2 P1simultaneous S1 P2 cached P1 S1 shared P2 data /* Access to accesses through hardware cache coherence, and provide * shared data */ significant performance improvements with no S2 P1 P2 S3 S4 changes to S1OS P1and S3 applications. S1 P1 S2 P2 S4 S2 P2 S3 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 7 Private cache hits on shared data Hardware support Source code changes
8 Outline Overview of hardware cache coherence Challenges of applying conventional hardware cache coherence on RT multi-core systems Our solution: Predictable Modified-Shared-Invalid (PMSI) cache coherence protocol Latency analysis of PMSI Results Conclusion PAGE 8
9 Overview of hardware cache coherence Modified-Shared-Invalid protocol Core 1 I M OwnGETM OwnUPG OtherGETS S OwnGETS or OtherGETS GETS: Memory load GETM/UPG: Memory store PUTM: Replacement PAGE 9
10 Applying hardware cache coherence to multi-core RT systems Modified-Shared-Invalid protocol Core 1 I L1 D/I $ L1 D/I $ Real-time interconnect + LLC/Main memory M OwnGETM OwnUPG OtherGETS S OwnGETS or OtherGETS PAGE 10
11 Applying hardware cache coherence to multi-core RT systems Modified-Shared-Invalid protocol Core 1 I Multi-core L1 D/I $ RT system L1 D/I $ with predictable arbitration to shared components + + Real-time interconnect Well studied cache coherence protocol = OwnUPG M S Predictable latencies for shared data accesses? LLC/Main memory OwnGETM OtherGETS OwnGETS or OtherGETS PAGE 11
12 Applying hardware cache coherence to multi-core RT systems Modified-Shared-Invalid protocol Core 1 I Multi-core L1 D/I $ RT system L1 D/I $ with predictable arbitration to shared components + + Real-time interconnect Well studied cache coherence protocol = OwnUPG M S Predictable latencies for shared data accesses? LLC $/Main memory OwnGETM OtherGETS OwnGETS or OtherGETS PAGE 12
13 Sources of unpredictable memory access latencies due to hardware cache coherence 1. Inter-core coherence interference on same cache line 2. Inter-core coherence interference on different cache lines 3. Inter-core coherence interference due to write hits 4. Intra-core coherence interference PAGE 13
14 System model TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M A:- I A:- I A:- I Real-time interconnect TDM arbitration A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs PAGE 14
15 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core Event Read A Core 1 Core 3 PWB A:100 M A A:- I A:- I * A:- I : GetS A A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs * Denotes waiting for data PAGE 15
16 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B : GetS B A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs PAGE 16
17 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S A:- I A:- * A:- I Core Event Read A Read B Waiting for data A:50 M B:42 S C:31 S * Denotes waiting for data PAGE 17
18 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B Waiting for data Read C A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs PAGE 18
19 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B Waiting for data Read C A:50 M B:42 S C:31 S Unbounded latency for * Denotes waiting for data PAGE 19
20 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B Waiting for data Read C : GetS B/GetS C A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs * Denotes waiting for data PAGE 20
21 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Solution Core1: 1 Architectural modification Core 3 A:100 M Predictable A:- I A:- B:42 S arbitration * A:- I Core Event Read A Read B Waiting for data Read C WB A Read B : GetS B/GetS C A:50 M B:42 S C:31 S PAGE 21
22 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Solution Core1: 1 Architectural modification Core 3 A:100 M Predictable A:- I A:- B:42 S arbitration * A:- I Core Event Read A Read B Waiting for data Read C WB A Read B : GetS B/GetS C A:50 M B:42 S C:31 S PAGE 22
23 Intra-core coherence interference Solution Core1: 1 Architectural modification Core 3 A:100 M B:42 S TDM schedule Core 3 Core 1 Core 3 Predictable A:- I arbitration A:- X A:- I Core Event Read A Read B Waiting for data Read C WB A Read B : GetS B/GetS C A:50 M C:31 S B:42 S Solution 2: Coherence protocol modifications Need to maintain state to indicate s pending WB of A PAGE 23
24 Sources of unpredictable memory access latencies due to cache coherence 1. Inter-core coherence interference on same cache line ü Solution: Architectural modification to shared LLC/memory 2. Inter-core coherence interference on different cache lines ü Architectural modification + Coherence modifications = Predictable Modified-Shared-Invalidate cache coherence protocol (PMSI) Solution: Architectural modification to the core + Coherence modifications 3. Inter-core coherence interference due to write hits ü Predictable latencies for simultaneous shared data access ü ü No application/os changes 4. Intra-core coherence üinterference Significant performance benefits Solution: Architectural modification to the core + Coherence modifications ü Solution: Architectural modification to the core + Coherence modifications PAGE 24
25 PMSI cache coherence protocol: Protocol modifications State Core events Bus events Load Store Replacement OwnData OwnUpg OwnPutM OtherGetS OtherGetM I Issue GetS/IS d Issue GetM/IM d X X X X N/A N/A N/A N/A S Hit Issue Upg/SM w X N/A X X N/A I I X M Hit Hit Issue PutM/MI wb X X X Issue PutM/MS wb Issue PutM/MI wb X X IS d X X X Read/S X X N/A IS d I IS d I N/A IM d X X X Write/S X X IM d S IM d I X N/A IM d I X X X Write/MI wb X X N/A X X N/A IS d I X X X Read/I X X N/A X X N/A IM d S X X X Write/MS wb X X N/A X X N/A SM w X X X N/A Store/ M X N/A I I X MI wb Hit Hit N/A X X Send data/i N/A N/A X X MS wb Hit Hit MI wb X X Send data/s N/A MI wb X X Other Upg Other PutM PAGE 25
26 PMSI cache coherence protocol: Protocol modifications State Core events Bus events Load Store Replacement OwnData OwnUpg OwnPutM OtherGetS OtherGetM I Issue GetS/IS d Issue GetM/IM d X X X X N/A N/A N/A N/A S Hit Issue Upg/SM w X N/A X X N/A I I X M Hit Hit Issue PutM/MI wb X X X Issue PutM/MS wb Issue PutM/MI wb X X IS d X X X Read/S X X N/A IS d I IS d I N/A IM d X X X Write/S X X IM d S IM d I X N/A IM d I X X X Write/MI wb X X N/A X X N/A IS d I X X X Read/I X X N/A X X N/A IM d S X X X Write/MS wb X X N/A X X N/A SM w X X X N/A Store/ M X N/A I I X MI wb Hit Hit N/A X X Send data/i N/A N/A X X MS wb Hit Hit MI wb X X Send data/s N/A MI wb X X Other Upg Other PutM PAGE 26
27 Intra-core coherence interference A:100 TDM schedule Core 3 Core 1 Core 3 TDM Core 1 Core 3 PR PWB M A:- I A:- MS wb Read B A I * State Core Bus events Event Read A A:- IOtherGets OtherGetM M Issue PutM/MS wb Issue PutM/MI wb : GetS A A:50 M B:42 S C:31 S PR LUT Address Core ID Request A 2 GetS PAGE 27
28 Intra-core coherence interference TDM Core 1 Core 3 A:100 MSwb S TDM schedule Core 3 Core 1 Core 3 WB slot PR PWB State Read BA:- AI A:- IS d A:- I MS wb Bus events OwnPutM Send data/s Core Event Read A WB A : Send A A: M S B:42 S C:31 S PR LUT Address Core ID Request A 2 GetS PAGE 28
29 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 S A:- I A:100 * S A:- I Core Event Read A WB A Complete read A Data response A:100 A:100 S B:42 S C:31 S PR LUT Address Core ID Request A 2 GetS PAGE 29
30 Intra-core coherence interference TDM Core 1 Core 3 A:100 S TDM schedule Core 3 Core 1 Core 3 B:42 I S PR PWB Request slot Read BA:- I A:100 S A:- I Read C Core Event Read A WB A Complete read A Read B : GetS B A:100 S B:42 S C:31 S PAGE 30
31 Latency analysis of PMSI WCLarb N: Number of cores Core 3 Core 1 Core 3 TDM schedule Core 1 Core 3 Pending WB Pending request A:- IM d I A:- IM d I A:- IM d A:50 M MI wb L acc Real-time interconnect TDM arbitration TDM Address Core ID Request A 0 GetM A 1 GetM LLC $/Main memory WCL intercoh WCL intracoh A 2 GetM PAGE 31
32 Latency analysis of PMSI PAGE 32 WCL of memory request = L acc + WCL arb ( N)+ WCL intracoh ( N) + WCL intercoh ( N 2 )
33 Results Cycle accurate Gem5 simulator 2, 4, 8 cores, 16KB direct mapped private L1-D/I caches In-order cores, 2GHz operating frequency TDM bus arbitration with slot width = 50 cycles Benchmarks: SPLASH-2, synthetic benchmarks to stress worst-case scenarios Simulation framework available at PAGE 33
34 Results: Observed worst case latency per request PMSI Unpredictable 2 Unpredictable 3 Unpredictable 4 PAGE 34
35 Results: Observed worst case latency per request Latency bound PMSI Unpredictable 2 Unpredictable 3 Unpredictable 4 PAGE 35
36 Results: Observed worst case latency per request Latency bound PMSI Unpredictable 2 Unpredictable 3 Unpredictable 4 PAGE 36
37 Results: Slowdown relative to MESI cache coherence Uncache all Single core Uncache shared PMSI MSI MESI PAGE 37
38 Results: Slowdown relative to MESI cache coherence Uncache all Single core Uncache shared PMSI MSI MESI PAGE 38
39 Conclusion Enumerate sources of unpredictability when applying conventional hardware cache coherence on multi-core real-time systems Propose PMSI cache coherence that allows predictable simultaneous shared data accesses with no changes to application/os Hardware overhead of PMSI is ~ 128 bytes WCL of memory access with PMSI square of number of cores ( N 2 ) PMSI outperforms prior techniques for handling shared data accesses (upto 4x improvement over uncache-shared), and suffers on average 46% slowdown compared to conventional MESI cache coherence protocol PAGE 39
40 THANK YOU & QUESTIONS? PAGE 40
Presented by Anh Nguyen
Presented by Anh Nguyen u When a cache has a block in state M or E and receives a GetS from another core, if using the MSI protocol or the MESI protocol, the cache must u change the block state from M
More informationWMC. MPSoCs for Mixed-Criticality Systems: Challenges and Opportunities. Mohamed Hassan
WMC MPSoCs for Mixed-Criticality Systems: Challenges and Opportunities Mohamed Hassan 1 IBM s Acorn Smart Phones Automotive 1943 1989 2010s 1981 2000s Now-Near Colossus NEC s UltaLite Wearables IoT/Smart
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationCase Study 1: Single-Chip Multicore Multiprocessor
44 Solutions to Case Studies and Exercises Chapter 5 Solutions Case Study 1: Single-Chip Multicore Multiprocessor 5.1 a. P0: read 120 P0.B0: (S, 120, 0020) returns 0020 b. P0: write 120 80 P0.B0: (M, 120,
More informationDeterministic Memory Abstraction and Supporting Multicore System Architecture
Deterministic Memory Abstraction and Supporting Multicore System Architecture Farzad Farshchi $, Prathap Kumar Valsan^, Renato Mancuso *, Heechul Yun $ $ University of Kansas, ^ Intel, * Boston University
More informationManaging Memory for Timing Predictability. Rodolfo Pellizzoni
Managing Memory for Timing Predictability Rodolfo Pellizzoni Thanks This work would not have been possible without the following students and collaborators Zheng Pei Wu*, Yogen Krish Heechul Yun* Renato
More informationVariability Windows for Predictable DDR Controllers, A Technical Report
Variability Windows for Predictable DDR Controllers, A Technical Report MOHAMED HASSAN 1 INTRODUCTION In this technical report, we detail the derivation of the variability window for the eight predictable
More informationShared Cache Aware Task Mapping for WCRT Minimization
Shared Cache Aware Task Mapping for WCRT Minimization Huping Ding & Tulika Mitra School of Computing, National University of Singapore Yun Liang Center for Energy-efficient Computing and Applications,
More informationReal-Time Cache Management for Multi-Core Virtualization
Real-Time Cache Management for Multi-Core Virtualization Hyoseung Kim 1,2 Raj Rajkumar 2 1 University of Riverside, California 2 Carnegie Mellon University Benefits of Multi-Core Processors Consolidation
More informationWorst Case Analysis of DRAM Latency in Multi-Requestor Systems. Zheng Pei Wu Yogen Krish Rodolfo Pellizzoni
orst Case Analysis of DAM Latency in Multi-equestor Systems Zheng Pei u Yogen Krish odolfo Pellizzoni Multi-equestor Systems CPU CPU CPU Inter-connect DAM DMA I/O 1/26 Multi-equestor Systems CPU CPU CPU
More informationTaming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems
Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems
More informationLecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization
Lecture 25: Multiprocessors Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 Snooping-Based Protocols Three states for a block: invalid,
More informationAtomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols. Dana Vantrease, Mikko Lipasti, Nathan Binkert
Atomic Coherence: Leveraging Nanophotonics to Build Race-Free Cache Coherence Protocols Dana Vantrease, Mikko Lipasti, Nathan Binkert 1 Executive Summary Problem: Cache coherence races make protocols complicated
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P
More informationInterconnect Routing
Interconnect Routing store-and-forward routing switch buffers entire message before passing it on latency = [(message length / bandwidth) + fixed overhead] * # hops wormhole routing pipeline message through
More informationOverview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware
Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and
More informationOverview: Shared Memory Hardware
Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing
More informationECE/CS 757: Homework 1
ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)
More informationWCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache
WCET Analysis of Parallel Benchmarks using On-Demand Coherent Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier,
More informationLecture 7: Implementing Cache Coherence. Topics: implementation details
Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,
More informationExploration of Cache Coherent CPU- FPGA Heterogeneous System
Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)
1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory
More informationDepartment of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.
Department of Computer Science, Institute for System Architecture, Operating Systems Group Real-Time Systems '08 / '09 Hardware Marcus Völp Outlook Hardware is Source of Unpredictability Caches Pipeline
More informationA Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs. Marco Bekooij & Frank Ophelders
A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs Marco Bekooij & Frank Ophelders Outline Context What is cache coherence Addressed challenge Short overview of related work Related
More informationOptimizing Replication, Communication, and Capacity Allocation in CMPs
Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important
More informationLecture 25: Multiprocessors
Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed
More informationMemory Hierarchy in a Multiprocessor
EEC 581 Computer Architecture Multiprocessor and Coherence Department of Electrical Engineering and Computer Science Cleveland State University Hierarchy in a Multiprocessor Shared cache Fully-connected
More informationSIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core
SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core Sebastian Hahn and Jan Reineke RTSS, Nashville December, 2018 saarland university computer science SIC: Provably Timing-Predictable
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604
More informationScalable Cache Coherence
Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More information-- the Timing Problem & Possible Solutions
ARTIST Summer School in Europe 2010 Autrans (near Grenoble), France September 5-10, 2010 Towards Real-Time Applications on Multicore -- the Timing Problem & Possible Solutions Wang Yi Uppsala University,
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationEECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun
EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,
More informationAuthor's personal copy
J. Parallel Distrib. Comput. 68 (2008) 1413 1424 Contents lists available at ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc Two proposals for the inclusion of
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationIntroducing Multi-core Computing / Hyperthreading
Introducing Multi-core Computing / Hyperthreading Clock Frequency with Time 3/9/2017 2 Why multi-core/hyperthreading? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits:
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationComputer parallelism Flynn s categories
04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories
More informationShared Memory SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB
Shared SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB 1 Review: Snoopy Cache Protocol Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationPerformance study example ( 5.3) Performance study example
erformance study example ( 5.3) Coherence misses: - True sharing misses - Write to a shared block - ead an invalid block - False sharing misses - ead an unmodified word in an invalidated block CI for commercial
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationProblem Points 1 /20 2 /20 3 /20 4 /20 5 /20 Total /100
EE382: Processor Design Final Examination March 20, 1998 Please do not open the exam book or begin work on the exam until instructed to do so. You have a total of 3 hours to complete this exam. You will
More informationCS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5)
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 1 (Chapter 5) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived
More informationA Comparison of Capacity Management Schemes for Shared CMP Caches
A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip
More informationCS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017
CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationCS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017
CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More information12 Cache-Organization 1
12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty
More informationMeet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors
Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Sandro Bartolini* Department of Information Engineering, University of Siena, Italy bartolini@dii.unisi.it
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationWorst Case Analysis of DRAM Latency in Hard Real Time Systems
Worst Case Analysis of DRAM Latency in Hard Real Time Systems by Zheng Pei Wu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationFoundations of Computer Systems
18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:
More informationLecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations
Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in
More informationImproving Real-Time Performance on Multicore Platforms Using MemGuard
Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study
More informationCANDY: Enabling Coherent DRAM Caches for Multi-Node Systems
CANDY: Enabling Coherent DRAM Caches for Multi-Node Systems Chiachen Chou Aamer Jaleel Moinuddin K. Qureshi School of Electrical and Computer Engineering Georgia Institute of Technology {cc.chou, moin}@ece.gatech.edu
More informationSpeculative Versioning Cache: Unifying Speculation and Coherence
Speculative Versioning Cache: Unifying Speculation and Coherence Sridhar Gopal T.N. Vijaykumar, Jim Smith, Guri Sohi Multiscalar Project Computer Sciences Department University of Wisconsin, Madison Electrical
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationCache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem
Cache Coherence Bryan Mills, PhD Slides provided by Rami Melhem Cache coherence Programmers have no control over caches and when they get updated. x = 2; /* initially */ y0 eventually ends up = 2 y1 eventually
More informationECE 551 System on Chip Design
ECE 551 System on Chip Design Introducing Bus Communications Garrett S. Rose Fall 2018 Emerging Applications Requirements Data Flow vs. Processing µp µp Mem Bus DRAMC Core 2 Core N Main Bus µp Core 1 SoCs
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationMultiprocessor Cache Coherency. What is Cache Coherence?
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationCSC526: Parallel Processing Fall 2016
CSC526: Parallel Processing Fall 2016 WEEK 5: Caches in Multiprocessor Systems * Addressing * Cache Performance * Writing Policy * Cache Coherence (CC) Problem * Snoopy Bus Protocols PART 1: HARDWARE Dr.
More informationCache Coherence. Introduction to High Performance Computing Systems (CS1645) Esteban Meneses. Spring, 2014
Cache Coherence Introduction to High Performance Computing Systems (CS1645) Esteban Meneses Spring, 2014 Supercomputer Galore Starting around 1983, the number of companies building supercomputers exploded:
More informationCPE 631 Advanced Computer Systems Architecture: Homework #3,#4
Issued: 04/09/2007 Due: 04/18/2007 CPE 631 Advanced Computer Systems Architecture: Homework #3,#4 Name: Q1 Q2 Q3 Q4 Q5 Total 20 20 20 60 40 160 Question #1: (20 points) 1.1 (20 points) Consider the following
More informationCaches in Real-Time Systems. Instruction Cache vs. Data Cache
Caches in Real-Time Systems [Xavier Vera, Bjorn Lisper, Jingling Xue, Data Caches in Multitasking Hard Real- Time Systems, RTSS 2003.] Schedulability Analysis WCET Simple Platforms WCMP (memory performance)
More informationSIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto
SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES University of Toronto Interaction of Coherence and Network 2 Cache coherence protocol drives network-on-chip traffic Scalable coherence protocols
More informationProcessor Architecture
Processor Architecture Shared Memory Multiprocessors M. Schölzel The Coherence Problem s may contain local copies of the same memory address without proper coordination they work independently on their
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationDesign and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256
Design and Analysis of Time-Critical Systems Timing Predictability and Analyzability + Case Studies: PTARM and Kalray MPPA-256 Jan Reineke @ saarland university computer science ACACES Summer School 2017
More informationSecure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks
: Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia
More informationLecture 11: Snooping Cache Coherence: Part II. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Lecture 11: Snooping Cache Coherence: Part II CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 2 due tonight 11:59 PM - Recall 3-late day policy Assignment
More informationMultiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems
Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers
William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved
More informationReplacement policies for shared caches on symmetric multicores : a programmer-centric point of view
1 Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view Pierre Michaud INRIA HiPEAC 11, January 26, 2011 2 Outline Self-performance contract Proposition for
More informationESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence
Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy
More informationManaging Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems
Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems Zimeng Zhou, Lei Ju, Zhiping Jia, Xin Li School of Computer Science and Technology Shandong University, China Outline
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationAlternate IPC Mechanisms
Alternate IPC Mechanisms A Comparison of Their Use Within An ORB Framework Chuck Abbott Objective Interface Systems Overview Rationale Goals Introduction Analysis Conclusions 2 2 1 Rationale All IPC Mechanisms
More informationLow-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM
Low-Cost Inter-Linked ubarrays (LIA) Enabling Fast Inter-ubarray Data Movement in DRAM Kevin Chang rashant Nair, Donghyuk Lee, augata Ghose, Moinuddin Qureshi, and Onur Mutlu roblem: Inefficient Bulk Data
More informationUniversity of Toronto Faculty of Applied Science and Engineering
Print: First Name:............ Solutions............ Last Name:............................. Student Number:............................................... University of Toronto Faculty of Applied Science
More informationA Server-based Approach for Predictable GPU Access Control
A Server-based Approach for Predictable GPU Access Control Hyoseung Kim * Pratyush Patel Shige Wang Raj Rajkumar * University of California, Riverside Carnegie Mellon University General Motors R&D Benefits
More informationCOMP Parallel Computing. CC-NUMA (1) CC-NUMA implementation
COP 633 - Parallel Computing Lecture 10 September 27, 2018 CC-NUA (1) CC-NUA implementation Reading for next time emory consistency models tutorial (sections 1-6, pp 1-17) COP 633 - Prins CC-NUA (1) Topics
More informationShared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network
Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache
More informationMemory Access Pattern-Aware DRAM Performance Model for Multi-core Systems
Memory Access Pattern-Aware DRAM Performance Model for Multi-core Systems ISPASS 2011 Hyojin Choi *, Jongbok Lee +, and Wonyong Sung * hjchoi@dsp.snu.ac.kr, jblee@hansung.ac.kr, wysung@snu.ac.kr * Seoul
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationSpecial Course on Computer Architecture
Special Course on Computer Architecture #9 Simulation of Multi-Processors Hiroki Matsutani and Hideharu Amano Outline: Simulation of Multi-Processors Background [10min] Recent multi-core and many-core
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationCaches in Real-Time Systems. Instruction Cache vs. Data Cache
Caches in Real-Time Systems [Xavier Vera, Bjorn Lisper, Jingling Xue, Data Caches in Multitasking Hard Real- Time Systems, RTSS 2003.] Schedulability Analysis WCET Simple Platforms WCMP (memory performance)
More informationFlynn s Classification
Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:
More informationSingle Chip Heterogeneous Multiprocessor Design
Single Chip Heterogeneous Multiprocessor Design JoAnn M. Paul July 7, 2004 Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 The Cell Phone, Circa 2010 Cell
More informationHandout 3 Multiprocessor and thread level parallelism
Handout 3 Multiprocessor and thread level parallelism Outline Review MP Motivation SISD v SIMD (SIMT) v MIMD Centralized vs Distributed Memory MESI and Directory Cache Coherency Synchronization and Relaxed
More informationDynamic Flow Regulation for IP Integration on Network-on-Chip
Dynamic Flow Regulation for IP Integration on Network-on-Chip Zhonghai Lu and Yi Wang Dept. of Electronic Systems KTH Royal Institute of Technology Stockholm, Sweden Agenda The IP integration problem Why
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationCEC 450 Real-Time Systems
CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation
More informationLecture 24: Virtual Memory, Multiprocessors
Lecture 24: Virtual Memory, Multiprocessors Today s topics: Virtual memory Multiprocessors, cache coherence 1 Virtual Memory Processes deal with virtual memory they have the illusion that a very large
More information