RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors
Transcription
1 RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors
Nima Honarmand and Josep Torrellas
University of Illinois at Urbana-Champaign
2 RnR: Record and Deterministic Replay
Recreate the execution of a (parallel) program
- Record: capture nondeterministic events in a log
- Replay: use the log to recreate the exact same execution
- Use cases: Debugging, Security, Fault Tolerance, ...
Sources of non-determinism
- Program inputs: captured using operating system support
- Memory access interleaving: Memory Race Recording (MRR); important in multiprocessor systems; captured using hardware support
My focus: hardware-based memory race recording in relaxed-consistency multiprocessors
Nima Honarmand RelaxReplay: Record and Replay for Relaxed Slide 2
3 RelaxReplay Contributions
1) A hardware-based scheme for memory race recording in multiprocessors with relaxed memory models
2) Design independent of memory model details: relies only on conventional cache coherence
3) Compact log format: log size comparable to previous, less general, schemes
4) Log format enables efficient replay
4 Memory Race Recording (MRR) in Hardware
Observation: shared-memory communications manifest as cache coherence transactions, so MRR can be done by monitoring those transactions
Chunk-Based Recording
1) Dynamically break each processor's execution into chunks of instructions. Chunk: the work done between consecutive communications with other processors
2) Order chunks across processors such that the chunk containing the source of a dependence is ordered before the chunk containing the destination
3) Replay chunks in the recorded order to reproduce the dependences
[Figure: a processor's timeline split into chunks by communication events]
5 Chunk-Based Recording Example
[Figure: thread interleaving order of Core 1 (LD A, LD B, LD B) and Core 2 (ST B, ST A), with invalidations and a read request exchanged between the cores; each logged chunk carries TID, CNT and TS fields]
- Each processor keeps a Read Set and a Write Set
- Read and Write Sets are cleared after chunk creation
- # of chunks << # of dependences
- Chunk content is logged efficiently as <# of insts>, giving a compact log
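The recording scheme on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the hardware design: the `ChunkRecorder` and `LogicalClock` classes, the shared counter used for timestamps, and the addresses are all hypothetical, and memory accesses are assumed to execute in program order (the sequentially consistent case that chunk-based recording relies on).

```python
from dataclasses import dataclass, field


class LogicalClock:
    """Hypothetical global counter used to timestamp chunks (an assumption)."""

    def __init__(self):
        self.now = 0

    def tick(self):
        self.now += 1
        return self.now


@dataclass
class ChunkRecorder:
    tid: int
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)
    count: int = 0
    log: list = field(default_factory=list)  # entries: (ts, tid, cnt)

    def access(self, addr, is_write):
        """Local memory access, executed in program order."""
        (self.write_set if is_write else self.read_set).add(addr)
        self.count += 1

    def end_chunk(self, clock):
        """Log the current chunk as <# of insts> and clear both sets."""
        if self.count:
            self.log.append((clock.tick(), self.tid, self.count))
        self.read_set.clear()
        self.write_set.clear()
        self.count = 0

    def snoop(self, addr, is_write, clock):
        """Incoming coherence request from another core."""
        conflict = addr in self.write_set or (is_write and addr in self.read_set)
        if conflict:
            self.end_chunk(clock)  # the source of the dependence is ordered first
        return conflict


clock = LogicalClock()
core1, core2 = ChunkRecorder(tid=1), ChunkRecorder(tid=2)
A, B = 0x100, 0x200  # illustrative addresses

core1.access(A, is_write=False)              # core 1: LD A
core1.access(B, is_write=False)              # core 1: LD B
core1.snoop(B, is_write=True, clock=clock)   # core 2's ST B invalidates B
core2.access(B, is_write=True)               # core 2: ST B
core1.snoop(A, is_write=True, clock=clock)   # core 2's ST A: sets already cleared
core2.access(A, is_write=True)               # core 2: ST A
core2.snoop(B, is_write=False, clock=clock)  # core 1's read request for B
core1.access(B, is_write=False)              # core 1: LD B
core1.end_chunk(clock)
core2.end_chunk(clock)
```

Five memory accesses and three dependences end up as three log entries, each holding only a timestamp, a thread ID and an instruction count.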
6 Problem: Access Re-ordering in Relaxed Models
- Chunk-based recording assumes in-program-order execution of memory accesses (Sequential Consistency)
- Real processors re-order memory accesses
- Recording: the number of instructions is not enough to capture the work done between consecutive communications
- Previous work supports Total Store Ordering (TSO)
- Problem is open for more aggressive relaxed models (ARM, IBM, ...)
How to efficiently record heavily re-ordered accesses?
7 Notion of Interval in RelaxReplay
[Figure: timelines of Core 1, Core 2 and Core 3 divided into intervals with timestamps TS=1 to TS=6]
- Interval: period of time between consecutive communications with other processors
- Replaces the notion of chunk
- Instructions in an interval are not in program order
How to compactly log the content of an interval?
8 Key Idea: Putting Instructions Back in Program Order
[Figure: inst 1 to inst 6 over time across Interval 1 and Interval 2, shown before and after being put back in program order]
- Record an alternative order of execution with equivalent behavior and the same set of inter-thread dependences
- The vast majority of instructions are in program order: compact log
- Some instructions cannot be put in program order: augment the log with additional information for correct replay
9 Design: Decouple Instruction Execution and Logging
- Deferred logging: instructions go through a logging step, in program order; this detects instructions that can be put back in program order
- Perform (P) Event: when the instruction performs in memory
- Counting (C) Event: when the instruction is added to the log
  - At the head of the Shadow ROB
  - The instruction should already be performed
  - Happens in program order
- For each instruction, try to move P to C
- Log an interval in terms of its Counting (C) events
[Figure: the ROB and the Shadow ROB with their heads; entries are marked Performed, Not performed, Not Retired, and Retired but Not Counted]
10 Moving P Events to C Events
[Figure: inst 1, inst 2 and inst 3 over time with their P and C events]
Case 1 (very common): no conflicting coherence request between P and C
- Safe to move P to C
- The P events are put back in program order: an in-order instruction
- Log a sequence of such instructions as <# of insts>
11 Moving P Events to C Events
Case 2 (very rare): conflicting request between P and C
- Cannot move P to C
- Reordered instruction: special log entry
[Figure: a conflicting request arrives between an instruction's P event in Interval 10 and its C event in Interval 10+n]
- Reordered loads: log the value returned by the load; during replay, read the value from the log (instead of memory)
- Reordered stores: log the value, address and ID of interval
12 Detecting Conflicting Requests
- Naïve solution: associatively search in-flight instructions; high hardware overhead
- Basic solution (BASE): reordered if P and C are in different intervals, otherwise in-order
- Optimized solution (OPT): Snoop Table
  - Table of counters
  - When a request is snooped, index its address into the table and increment the counter
  - Reordered if the snoop counts differ at P and C
[Figure: a line address indexing into the Snoop Table]
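The two detection rules can be written out as a small Python sketch. The table size, the 64-byte line size and the modulo index function are assumptions for illustration (the slide does not give them), and counter overflow is not modelled.

```python
class SnoopTable:
    """Table of counters indexed by cache-line address (sizes are assumed)."""

    def __init__(self, num_entries=64, line_bytes=64):
        self.counters = [0] * num_entries
        self.line_bytes = line_bytes

    def _index(self, addr):
        # Hypothetical hash: line address modulo the table size.
        return (addr // self.line_bytes) % len(self.counters)

    def on_snoop(self, addr):
        """Called for every snooped coherence request."""
        self.counters[self._index(addr)] += 1

    def read(self, addr):
        """Snoop count currently associated with this address."""
        return self.counters[self._index(addr)]


def reordered_base(perform_interval, count_interval):
    """BASE: reordered if P and C fall in different intervals."""
    return perform_interval != count_interval


def reordered_opt(count_at_perform, count_at_counting):
    """OPT: reordered only if the snoop count changed between P and C."""
    return count_at_perform != count_at_counting


table = SnoopTable()
load_addr = 0x1000

at_perform = table.read(load_addr)   # snapshot taken at the P event
table.on_snoop(0x2040)               # unrelated line, different table entry
unrelated = reordered_opt(at_perform, table.read(load_addr))

table.on_snoop(load_addr)            # request to the same line as the load
conflicting = reordered_opt(at_perform, table.read(load_addr))

aliased = SnoopTable()
snap = aliased.read(0x1000)
aliased.on_snoop(0x0)                # different line that maps to the same entry
false_positive = reordered_opt(snap, aliased.read(0x1000))
```

In this sketch the unrelated snoop leaves the load in-order, the snoop to the same line marks it reordered, and the aliasing case shows that a shared counter can only err on the conservative side: an instruction may be logged as reordered unnecessarily, but a real conflict is not missed.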
13 Evaluation Setup
- Processor: 8 cores, 4-way out-of-order, 2 GHz
- Memory subsystem: private L1, shared L2, ring interconnect
- Memory model: Release Consistency
- Benchmarks: 10 SPLASH-2 programs
- Record: max interval sizes of 4K insts and INF
- 4 configs: 4K-BASE, 4K-OPT, INF-BASE and INF-OPT
14 Number of Reordered Accesses
[Figure: % of memory instructions that are Reordered Loads and Reordered Stores, for 4K-BASE, 4K-OPT, INF-BASE and INF-OPT]
- Less than 0.04% of memory instructions are re-ordered with OPT, regardless of max interval size
- Compact log
15 Log Generation Rate
[Figure: log size in bits per 1K instructions for 4K-BASE, 4K-OPT, INF-BASE and INF-OPT]
- Small log size: 48 MB/sec (4K-OPT) and 25 MB/sec (INF-OPT) on 8 processors
- Comparable to (1-4x) previous techniques
- Very small compared to existing memory bandwidth
16 Also in the Paper
- Can combine RelaxReplay with any existing chunk-based recording scheme; in fact, I presented RelaxReplay on top of QuickRec [ISCA '13]
- Details of the replay algorithm
- More performance numbers: replay performance, scalability analysis (different numbers of processors)
17 Conclusions
- RelaxReplay: memory race recording for relaxed memory models
- Design independent of the memory model: supports Intel, ARM, IBM, Tile, etc.
- Can be combined with existing chunk-based recorders
- Compact log format that enables efficient replay
- In practice, log size comparable to the ideal case of no reordering
18 THANK YOU!
19 Replaying RelaxReplay Logs
Before replaying an interval, wait for its predecessors
while (interval has entries) {
    ent ← next entry of interval
    if (ent is Block) {
        execute next ent.size instructions
    } else if (ent is ReorderedLoad) {
        read the load value from the log
    } else if (ent is ReorderedStore) {
        read value and address from the log
        execute the store
    }
}
Signal the successors of interval
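As a rough, runnable counterpart to the pseudocode, the sketch below replays intervals sequentially in timestamp order, which stands in for "wait for predecessors" and "signal successors". The entry classes, the `trace` list that records what a real replayer would do, and the dictionary used as memory are all hypothetical. The slide does not spell out how the interval ID stored with a reordered store is used, so the sketch only carries it along.

```python
from dataclasses import dataclass


@dataclass
class Block:
    size: int            # number of in-order instructions to execute


@dataclass
class ReorderedLoad:
    value: int           # value the load returned during recording


@dataclass
class ReorderedStore:
    addr: int
    value: int
    interval_id: int     # logged with the store; not interpreted here


def replay_interval(entries, memory, trace):
    """Consume the log entries of one interval, in order."""
    for ent in entries:
        if isinstance(ent, Block):
            # A real replayer would run ent.size instructions natively.
            trace.append(("execute", ent.size))
        elif isinstance(ent, ReorderedLoad):
            # The load gets its value from the log instead of memory.
            trace.append(("load_from_log", ent.value))
        elif isinstance(ent, ReorderedStore):
            # The store is emulated from the logged address and value.
            memory[ent.addr] = ent.value
            trace.append(("store", ent.addr, ent.value))
        else:
            raise TypeError(f"unknown log entry: {ent!r}")


def replay_all(intervals, memory):
    """Sequential replay: lower timestamps first (assumes unique timestamps)."""
    trace = []
    for ts in sorted(intervals):
        replay_interval(intervals[ts], memory, trace)
    return trace


memory = {}
intervals = {
    2: [Block(3), ReorderedStore(addr=0x40, value=7, interval_id=1)],
    1: [Block(2), ReorderedLoad(value=5)],
}
trace = replay_all(intervals, memory)
```

Sorting by timestamp is the simplest ordering that respects the recorded dependences; a parallel replayer would instead block each interval only on its actual predecessors.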
20 Replay Overhead
Sequential replay using the OS to enforce interval ordering and emulate re-ordered instructions
[Figure: normalized execution time, split into Application Cycles and Overhead Cycles, for 4K-BASE, 4K-OPT, INF-BASE and INF-OPT]
Efficient replay: 46% (4K-OPT) and 18% (INF-OPT) overhead cycles
21 Example
[Figure: timeline spanning Interval = 10 and Interval = 15 with instructions i1, i2, LD, i4, i5, ST, i7, i8 and their C events; the LD and ST are marked ReorderedLoad? and ReorderedStore?]
Base Log for Interval 15: Block(Size=2) ReorderedLoad(Val) Block(Size=2) ReorderedStore(Addr,Val,Interval 10) Block(Size=2)
Opt Log for Interval 15: Block(Size=8)
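The two logs in this example can be reproduced with a short sketch of the counting step. It rests on assumptions the slide does not state outright: both the LD and the ST are taken to have performed in Interval 10 and to be counted in Interval 15, and no conflicting request is assumed to have been snooped in between, which would explain why the Opt log is a single block. The function and variable names are hypothetical.

```python
COUNT_INTERVAL = 15  # interval in which all eight instructions are counted

# (kind, interval in which the access performed); None for non-memory insts.
INSTS = [
    ("other", None), ("other", None), ("load", 10),
    ("other", None), ("other", None), ("store", 10),
    ("other", None), ("other", None),
]


def base_reordered(kind, perform_interval, conflicted):
    """BASE rule: a memory access whose P and C intervals differ."""
    return kind != "other" and perform_interval != COUNT_INTERVAL


def opt_reordered(kind, perform_interval, conflicted):
    """OPT rule: a memory access that saw a conflicting snoop between P and C."""
    return kind != "other" and conflicted


def build_log(insts, is_reordered, conflicts=()):
    """Walk instructions in program order and merge in-order runs into blocks."""
    log, run = [], 0
    for pos, (kind, perform_interval) in enumerate(insts):
        if not is_reordered(kind, perform_interval, pos in conflicts):
            run += 1
            continue
        if run:
            log.append(f"Block(Size={run})")
            run = 0
        if kind == "load":
            log.append("ReorderedLoad(Val)")
        else:
            log.append(f"ReorderedStore(Addr,Val,Interval {perform_interval})")
    if run:
        log.append(f"Block(Size={run})")
    return log


base_log = build_log(INSTS, base_reordered)
opt_log = build_log(INSTS, opt_reordered)  # no conflicting snoops assumed
```

Passing `conflicts={2}` to the OPT variant would mark the load as reordered again, which is the rare case the special log entries exist for.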
More informationLecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks
Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationComputation Abstractions. Processes vs. Threads. So, What Is a Thread? CMSC 433 Programming Language Technologies and Paradigms Spring 2007
CMSC 433 Programming Language Technologies and Paradigms Spring 2007 Threads and Synchronization May 8, 2007 Computation Abstractions t1 t1 t4 t2 t1 t2 t5 t3 p1 p2 p3 p4 CPU 1 CPU 2 A computer Processes
More informationInstruction Level Parallelism (ILP)
1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into
More informationBeyond Sequential Consistency: Relaxed Memory Models
1 Beyond Sequential Consistency: Relaxed Memory Models Computer Science and Artificial Intelligence Lab M.I.T. Based on the material prepared by and Krste Asanovic 2 Beyond Sequential Consistency: Relaxed
More information740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University
740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess
More informationCoherence and Consistency
Coherence and Consistency 30 The Meaning of Programs An ISA is a programming language To be useful, programs written in it must have meaning or semantics Any sequence of instructions must have a meaning.
More informationAtomic SC for Simple In-order Processors
Atomic SC for Simple In-order Processors Dibakar Gope and Mikko H. Lipasti University of Wisconsin - Madison gope@wisc.edu, mikko@engr.wisc.edu Abstract Sequential consistency is arguably the most intuitive
More informationChecker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India
Advanced Department of Computer Science Indian Institute of Technology New Delhi, India Outline Introduction Advanced 1 Introduction 2 Checker Pipeline Checking Mechanism 3 Advanced Core Checker L1 Failure
More informationMultiprocessor Interconnection Networks
Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical
More informationPost-Silicon Validation of Multiprocessor Memory Consistency
TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS 1 Post-Silicon Validation of Multiprocessor Memory Consistency Biruk W. Mammo, Student Member,, Valeria Bertacco, Senior Member,,
More informationData/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)
Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP) Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software
More information