DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently
|
|
- Agatha Stone
- 5 years ago
- Views:
Transcription
1 : Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign *Department of Computer Science and Engineering University of Washington
2 Motivation 2 Debugging parallel applications on a multicore is challenging: Bugs can be very timing dependent Their effects might manifest after many instructions Need effective debugging techniques for parallel applications on MPs One such technique: HW-assisted deterministic replay
3 HW-Assisted Deterministic Replay 3 Phase I: Initial execution Bacon and Goldstein. WPDD 91 HW records into a log the order of dependences between threads This log captures the interleaving of threads Phase II: Replay Re-run the parallel application Enforce the dependence orders stored in the log
4 Contributions 4 Propose : a new scheme for hw-assisted deterministic replay of parallel programs based on chunk-based execution Records initial execution at production-run speeds Requires only small log (0.6% of current schemes) Replays at high-speed Has multiple modes of execution with different speed/logging requirements
5 Outline 5 Background on replay schemes : a Chunk-based replay scheme Advanced Execution Replay Evaluation delorean
6 Point-to-Point Schemes: FDR and RTR 6 Record dependences between individual instructions of different threads Optimizations to reduce log size P1 FDR (ISCA 03) RTR (ASPLOS 06) P2 Transitive reduction n Wb Regulated transitive reduction Rb m Use Sequential Consistency (SC) RTR proposes an algorithm to support TSO: P2 s Log P1 n m Record value of loads that violate SC
7 Vector-based Schemes: Strata Stratum: vector of as many counters as processors: Count # of memory operations executed Log stores sequence of strata P1 P2 P3 7 Strata (ASPLOS 06) Stratum Wa Rd Wc Rb Ra It also uses SC Log size is similar to RTR s Log
8 Episode-Based Schemes: ReRun 8 ReRun (ISCA 08)
9 Outline 9 Background on replay schemes : a Chunk-based replay scheme Advanced Execution Replay Evaluation delorean
10 Chunk-Based Execution 10 P0 P1 P0 P1 P0 P1 Mem st X st Y chunk ld T ld Z st W chunk st X st Y commit ld T ld Z st W commit st X st Y commit collision ld X ld Z ld X re-execute ld X ld Z ld X P0 P1 P0 Group consecutive dynamic instructions Execute chunks atomically Execute chunks in isolation Limit possible interleavings Memory ops can be reordered and overlapped within and across chunks Memory access interleaving happens at chunk boundaries Chunk-based systems: TCC (ISCA 04), IT (ICPS 05), BulkSC (ISCA 07)
11 A Great Substrate for MP Replay 11 Overlapping/reordering enables high-speed execution and replay Performance similar to a system with Release Consistency (RC) Limited memory interleaving minimizes logging requirements No need to record individual dependences
12 12 Chunk-Based Replay: Replaying a chunk-based system means: Generate same chunks during initial execution and replay And commit them in the same order s Log stores total order of chunk commits and their size P0 st X st Y commit st A st X ld Z st Z st T commit ld A ld B P1 ld C commit ld Y st B st A commit s log is very small: Each entry is short: (ProcID, ChunkSize) Log updated infrequently Aggressive designs can make it even smaller Log P0 2 P1 3 P1 3 P0 5
13 13 Basic Operation Proc 0 Proc 1 Chunk A Commit Request Arbiter Commit Request Chunk B Ok Ok A Size CS Log 1 ProcID = 1 B Size CS Log ProcID = 0 0 PI Log Log implemented as two different structures: Chunk Size (CS) Log Processor Interleaving (PI) Log
14 Outline 14 Background on replay schemes : a Chunk-Based replay scheme Advanced Execution Replay Evaluation delorean
15 Reducing s Log 15 PI Log P0 CS Log P1 CS Log 1 Use large chunks Fixed-size chunking Predefined interleaving 0 1 1
16 Reducing s Log 16 1 Use large chunks + Fewer entries in log More collisions and cache overflows 2 Fixed-size chunking + No Almost CS log no Need to handle cache overflows 3 Predefined interleaving + No PI log Performance degradation
17 Execution Modes 17 Nano Pico 1 Use large chunks 1 Use large chunks 2 Fixed-size chunking (tiny CS log) 2 Fixed-size chunking (tiny CS log) 3 Predefined interleaving (no PI Log)
18 Outline 18 Background on replay schemes : Chunk-Based replay scheme Advanced Execution Replay Evaluation delorean
19 Overall System 19 Proc 0 Network Proc 1 Chunk A DMA Arbiter Chunk B CS Log Interrupt Log I/O Log CS Log Interrupt Log I/O Log DMA Log PI Log Interrupt, I/O and DMA logs also appear in all past replay schemes
20 Replay 20 Virtual machine monitor loads initial system checkpoint Interrupt Log provides initial-execution interrupts I/O Log provides initial-execution I/O inputs Processors execute normally (they execute in parallel): Use CS Log to generate chunks with exceptional sizes (from cache overflows) Arbiter uses PI Log to enforce initial-execution commit order (or predefined order in Pico)
21 Also in the paper 21 Rules for deterministic chunking How to deal with interrupts, traps, cache overflows, I/O ops Scalability of Pico A detailed view of s architecture Proof of why s replay is deterministic Combination with other HW replay schemes
22 Outline 22 Background on replay schemes : Chunk-Based replay scheme Advanced Execution Replay Evaluation delorean
23 Evaluation Setup 23 Chunk-based simulator supports both execution and replay 8-processor CMP 1000, 2000, 3000 instructions per chunk PI Log: 4-bit entries CS Log: 32-bit entries Applications: SPLASH, SPECjbb 2000 and SPECweb 2005
24 Log Space Requirements 24 Log size (bits/core/kilo-inst) CS log (Compressed) PI log (Compressed) Instructions per chunk Nano Average RTR compressed log size (estimated) Log size (bits/core/kilo-inst) CS log (Compressed) Pico Average RTR compressed log size (estimated) Instructions per chunk Nano s log is 16% of RTR s (2000-instruction configuration) With one additional optimization, Nano s log is 7.5% of RTR s Picos s log is 0.6% of RTR s (1000-instruction configuration)
25 If We Were To Record a Day of Execution GB/Core/Day FDR RTR Nano Pico Execution and replay of long periods of time is feasible
26 Performance 26 RC Nano Nano Pico Replay SC Pico Replay Speedup R O RP S R 0 Initial and Execution Replay Execution Initial execution: Nano almost as fast as RC, Pico faster than SC Replay: comparable to initial execution speed
27 Conclusions 27 is a new approach to HW replay of parallel applications Advantages over traditional HW replay systems: Executes at high speed (like most aggressive consistency models) Replays at comparable speed Summarizes execution into a very small log (0.6% of current schemes) Great help for programmers that need to debug parallel codes
28 Future Work 28 Adapt to conventional multiprocessors without chunk-based execution Take conventional replay schemes, use very aggressive SC implementations, and compare to Combine the best aspects of the different recording approaches (, RTR and Strata)
29 : Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign *Department of Computer Science and Engineering University of Washington
The Bulk Multicore Architecture for Programmability
The Bulk Multicore Architecture for Programmability Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Acknowledgments Key contributors: Luis Ceze Calin
More informationRelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors
RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors Nima Honarmand and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/ 1 RnR: Record and Deterministic
More informationLessons Learned During the Development of the CapoOne Deterministic Multiprocessor Replay System
Lessons Learned During the Development of the CapoOne Deterministic Multiprocessor System Pablo Montesinos, Matthew Hicks, Wonsun Ahn, Samuel T. King and Josep Torrellas Department of Computer Science
More informationScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Envionment
ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Envionment, Wonsun Ahn, Josep Torrellas University of Illinois University of Illinois http://iacoma.cs.uiuc.edu/ Motivation Architectures
More informationCapo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay Pablo Montesinos Matthew Hicks Samuel T. King Josep Torrellas University of Illinois at Urbana-Champaign {pmontesi,
More informationDMP Deterministic Shared Memory Multiprocessing
DMP Deterministic Shared Memory Multiprocessing University of Washington Joe Devietti, Brandon Lucia, Luis Ceze, Mark Oskin A multithreaded voting machine 2 thread 0 thread 1 while (more_votes) { load
More informationA Survey in Deterministic Replaying Approaches in Multiprocessors
International Journal of Electrical & Computer Sciences IJECS-IJENS Vol:10 No:04 23 A Survey in Deterministic Replaying Approaches in Multiprocessors Ahmad Heydari, Saeed Azimi Abstract most multithread
More informationSamsara: Efficient Deterministic Replay in Multiprocessor. Environments with Hardware Virtualization Extensions
Samsara: Efficient Deterministic Replay in Multiprocessor Environments with Hardware Virtualization Extensions Shiru Ren, Le Tan, Chunqi Li, Zhen Xiao, and Weijia Song June 24, 2016 Table of Contents 1
More informationFlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes
FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes Rishi Agarwal and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu
More informationRainbow: Efficient Memory Dependence Recording with High Replay Parallelism for Relaxed Memory Model
Rainbow: Efficient Memory Dependence Recording with High Replay Parallelism for Relaxed Memory Model Xuehai Qian, He Huang, Benjamin Sahelices, and Depei Qian University of Illinois Urbana-Champaign AMD
More informationDMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING
... DMP: DETERMINISTIC SHARED- MEMORY MULTIPROCESSING... SHARED-MEMORY MULTICORE AND MULTIPROCESSOR SYSTEMS ARE NONDETERMINISTIC, WHICH FRUSTRATES DEBUGGING AND COMPLICATES TESTING OF MULTITHREADED CODE,
More informationBulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment
BulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment, Benjamin Sahelices Josep Torrellas, Depei Qian University of Illinois University of Illinois http://iacoma.cs.uiuc.edu/
More informationExplicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic!
Explicitly Parallel Programming with Shared Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin University of Washington Parallel Programming is Hard
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More informationDOI: /
DOI:10.1145/1610252.1610271 Easing the programmer s burden does not compromise system performance or increase the complexity of hardware implementation. BY JOSEP TORRELLAS, LUIS CEZE, JAMES TUCK, CALIN
More informationDeterministic Shared Memory Multiprocessing
Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing
More informationSpeculative Synchronization
Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization
More informationPrototyping Architectural Support for Program Rollback Using FPGAs
Prototyping Architectural Support for Program Rollback Using FPGAs Radu Teodorescu and Josep Torrellas http://iacoma.cs.uiuc.edu University of Illinois at Urbana-Champaign Motivation Problem: Software
More informationTransactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech
Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency
More informationVulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically
Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December
More informationCoreRacer: A Practical Memory Race Recorder for Multicore x86 TSO Processors
CoreRacer: A Practical Memory Race Recorder for Multicore x86 TSO Processors Gilles Pokam gilles.a.pokam@intel.com Ali-Reza Adl-Tabatabai ali-reza.adltabatabai@intel.com Cristiano Pereira cristiano.l.pereira@intel.com
More informationThe Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas
The Design Complexity of Program Undo Support in a General Purpose Processor Radu Teodorescu and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Processor with program
More informationHardware Support for Software Debugging
Hardware Support for Software Debugging Mohammad Amin Alipour Benjamin Depew Department of Computer Science Michigan Technological University Report Documentation Page Form Approved OMB No. 0704-0188 Public
More informationUniversal Parallel Computing Research Center at Illinois
Universal Parallel Computing Research Center at Illinois Making parallel programming synonymous with programming Marc Snir 08-09 The UPCRC@ Illinois Team BACKGROUND 3 Moore s Law Pre 2004 Number of transistors
More informationSpeculative Locks. Dept. of Computer Science
Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency
More informationSpeculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois
Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow
More informationHardware and Software Approaches for Deterministic Multi-processor Replay of Concurrent Programs
Hardware and Software Approaches for Deterministic Multi-processor Replay of Concurrent Programs Gilles Pokam, Intel Labs, Intel Corporation Cristiano Pereira, Software and Services Group, Intel Corporation
More informationChasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems Matthew D. Sinclair *, Johnathan Alsop^, Sarita V. Adve + * University of Wisconsin-Madison ^ AMD Research + University
More informationToday s Outline: Shared Memory Review. Shared Memory & Concurrency. Concurrency v. Parallelism. Thread-Level Parallelism. CS758: Multicore Programming
CS758: Multicore Programming Today s Outline: Shared Memory Review Shared Memory & Concurrency Introduction to Shared Memory Thread-Level Parallelism Shared Memory Prof. David A. Wood University of Wisconsin-Madison
More informationDo you have to reproduce the bug on the first replay attempt?
Do you have to reproduce the bug on the first replay attempt? PRES: Probabilistic Replay with Execution Sketching on Multiprocessors Soyeon Park, Yuanyuan Zhou University of California, San Diego Weiwei
More informationPhoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware
Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware Smruti R. Sarangi Abhishek Tiwari Josep Torrellas University of Illinois at Urbana-Champaign Can a Processor
More informationCS533: Speculative Parallelization (Thread-Level Speculation)
CS533: Speculative Parallelization (Thread-Level Speculation) Josep Torrellas University of Illinois in Urbana-Champaign March 5, 2015 Josep Torrellas (UIUC) CS533: Lecture 14 March 5, 2015 1 / 21 Concepts
More informationExplicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic!
Explicitly Parallel Programming with Shared-Memory is Insane: At Least Make it Deterministic! Joe Devietti, Brandon Lucia, Luis Ceze and Mark Oskin Department of Computer Science and Engineering University
More information740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University
740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess
More informationTSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model
TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model Sudheendra Hangal, Durgam Vahia, Chaiyasit Manovit, Joseph Lu and Sridhar Narayanan tsotool@sun.com ISCA-2004 Sun Microsystems
More informationRelaxed Memory Consistency
Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit
More informationAtom-Aid: Detecting and Surviving Atomicity Violations
Atom-Aid: Detecting and Surviving Atomicity Violations Abstract Writing shared-memory parallel programs is an error-prone process. Atomicity violations are especially challenging concurrency errors. They
More informationECMon: Exposing Cache Events for Monitoring
ECMon: Exposing Cache Events for Monitoring Vijay Nagarajan Rajiv Gupta University of California at Riverside, CSE Department, Riverside, CA 92521 {vijay,gupta}@cs.ucr.edu ABSTRACT The advent of multicores
More informationQuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs
QuickRec: Prototyping an Intel Architecture Extension for Record and Replay of Multithreaded Programs Intel: Gilles Pokam, Klaus Danne, Cristiano Pereira, Rolf Kassa, Tim Kranich, Shiliang Hu, Justin Gottschlich
More informationEliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors
Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization li for Multiprocessors Marcelo Cintra and Josep Torrellas University of Edinburgh http://www.dcs.ed.ac.uk/home/mc
More informationChasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve University of Illinois @ Urbana-Champaign hetero@cs.illinois.edu
More informationDesigning Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve
Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important
More informationAdvanced Memory Management
Advanced Memory Management Main Points Applications of memory management What can we do with ability to trap on memory references to individual pages? File systems and persistent storage Goals Abstractions
More informationDynamically Detecting and Tolerating IF-Condition Data Races
Dynamically Detecting and Tolerating IF-Condition Data Races Shanxiang Qi (Google), Abdullah Muzahid (University of San Antonio), Wonsun Ahn, Josep Torrellas University of Illinois at Urbana-Champaign
More informationData Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun
Data Speculation Support for a Chip Multiprocessor Lance Hammond, Mark Willey, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu A Chip Multiprocessor Implementation
More informationThe Design Complexity of Program Undo Support in a General-Purpose Processor
The Design Complexity of Program Undo Support in a General-Purpose Processor Radu Teodorescu and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu
More informationChapter 5. Thread-Level Parallelism
Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated
More informationImproving the Practicality of Transactional Memory
Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter
More informationTransactions & Update Correctness. April 11, 2018
Transactions & Update Correctness April 11, 2018 Correctness Correctness Data Correctness (Constraints) Query Correctness (Plan Rewrites) Correctness Data Correctness (Constraints) Query Correctness (Plan
More informationRECORD AND DETERMINISTIC REPLAY OF PARALLEL PROGRAMS ON MULTIPROCESSORS NIMA HONARMAND DISSERTATION
2014 Nima Honarmand RECORD AND DETERMINISTIC REPLAY OF PARALLEL PROGRAMS ON MULTIPROCESSORS BY NIMA HONARMAND DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor
More informationReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard
More informationMotivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency
Shared Memory Consistency Models Authors : Sarita.V.Adve and Kourosh Gharachorloo Presented by Arrvindh Shriraman Motivations Programmer is required to reason about consistency to ensure data race conditions
More informationDeAliaser: Alias Speculation Using Atomic Region Support
DeAliaser: Alias Speculation Using Atomic Region Support Wonsun Ahn*, Yuelu Duan, Josep Torrellas University of Illinois at Urbana Champaign http://iacoma.cs.illinois.edu Memory Aliasing Prevents Good
More informationLecture 13: Consistency Models. Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models
Lecture 13: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models 1 Coherence Vs. Consistency Recall that coherence guarantees
More informationMultiprocessor Synchronization
Multiprocessor Systems Memory Consistency In addition, read Doeppner, 5.1 and 5.2 (Much material in this section has been freely borrowed from Gernot Heiser at UNSW and from Kevin Elphinstone) MP Memory
More informationLight64: Ligh support for data ra. Darko Marinov, Josep Torrellas. a.cs.uiuc.edu
: Ligh htweight hardware support for data ra ce detection ec during systematic testing Adrian Nistor, Darko Marinov, Josep Torrellas University of Illinois, Urbana Champaign http://iacoma a.cs.uiuc.edu
More informationThe Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun
The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu
More informationLamport Clocks: Verifying A Directory Cache-Coherence Protocol. Computer Sciences Department
Lamport Clocks: Verifying A Directory Cache-Coherence Protocol * Manoj Plakal, Daniel J. Sorin, Anne E. Condon, Mark D. Hill Computer Sciences Department University of Wisconsin-Madison {plakal,sorin,condon,markhill}@cs.wisc.edu
More informationLecture 8: Transactional Memory TCC. Topics: lazy implementation (TCC)
Lecture 8: Transactional Memory TCC Topics: lazy implementation (TCC) 1 Other Issues Nesting: when one transaction calls another flat nesting: collapse all nested transactions into one large transaction
More informationComputer Architecture
18-447 Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 29: Consistency & Coherence Lecture 20: Consistency and Coherence Bo Wu Prof. Onur Mutlu Colorado Carnegie School Mellon University
More informationTSO_ATOMICITY: Efficient Hardware Primitive for TSO- Preserving Region Optimizations
TSO_ATOMICITY: Efficient Hardware Primitive for TSO- Preserving Region Optimizations Cheng Wang Youfeng Wu Programming Systems Lab Microprocessor and Programming Research / Intel Labs {cheng.c.wang, youfeng.wu}@intel.com
More informationCMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today
More informationLecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations
Lecture 12: TM, Consistency Models Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations 1 Paper on TM Pathologies (ISCA 08) LL: lazy versioning, lazy conflict detection, committing
More informationModule 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency
Memory Consistency Models Memory consistency SC SC in MIPS R10000 Relaxed models Total store ordering PC and PSO TSO, PC, PSO Weak ordering (WO) [From Chapters 9 and 11 of Culler, Singh, Gupta] [Additional
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationNOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.
Memory Consistency Model Background for Debate on Memory Consistency Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley for a SAS specifies constraints on the order in which
More informationCOMP Parallel Computing. CC-NUMA (2) Memory Consistency
COMP 633 - Parallel Computing Lecture 11 September 26, 2017 Memory Consistency Reading Patterson & Hennesey, Computer Architecture (2 nd Ed.) secn 8.6 a condensed treatment of consistency models Coherence
More informationImplicit Transactional Memory in Chip Multiprocessors
Implicit Transactional Memory in Chip Multiprocessors Marco Galluzzi 1, Enrique Vallejo 2, Adrián Cristal 1, Fernando Vallejo 2, Ramón Beivide 2, Per Stenström 3, James E. Smith 4 and Mateo Valero 1 1
More informationTransactional Memory
Transactional Memory Architectural Support for Practical Parallel Programming The TCC Research Group Computer Systems Lab Stanford University http://tcc.stanford.edu TCC Overview - January 2007 The Era
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 19 Processor Design Overview Special features in microprocessors provide support for parallel processing Already discussed bus snooping Memory latency becoming
More informationThe Stanford Hydra CMP. Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun
The Stanford Hydra CMP Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mark Willey, Michael Chen, Maciek Kozyrczak*, and Kunle Olukotun Computer Systems Laboratory Stanford University http://www-hydra.stanford.edu
More informationDatabases - Transactions II. (GF Royle, N Spadaccini ) Databases - Transactions II 1 / 22
Databases - Transactions II (GF Royle, N Spadaccini 2006-2010) Databases - Transactions II 1 / 22 This lecture This lecture discusses how a DBMS schedules interleaved transactions to avoid the anomalies
More informationLogTM: Log-Based Transactional Memory
LogTM: Log-Based Transactional Memory Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, & David A. Wood 12th International Symposium on High Performance Computer Architecture () 26 Mulitfacet
More informationLecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory
Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section
More informationReEnact: Using Thread-Level Speculation Mechanisms to Debug Data Races in Multithreaded Codes
Appears in the Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA-30), June 2003. ReEnact: Using Thread-Level Speculation Mechanisms to Debug Data Races in Multithreaded
More informationMEMORY ORDERING: A VALUE-BASED APPROACH
MEMORY ORDERING: A VALUE-BASED APPROACH VALUE-BASED REPLAY ELIMINATES THE NEED FOR CONTENT-ADDRESSABLE MEMORIES IN THE LOAD QUEUE, REMOVING ONE BARRIER TO SCALABLE OUT- OF-ORDER INSTRUCTION WINDOWS. INSTEAD,
More informationChapter 3 (Cont III): Exploiting ILP with Software Approaches. Copyright Josep Torrellas 1999, 2001, 2002,
Chapter 3 (Cont III): Exploiting ILP with Software Approaches Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Exposing ILP (3.2) Want to find sequences of unrelated instructions that can be overlapped
More informationLecture 6: Lazy Transactional Memory. Topics: TM semantics and implementation details of lazy TM
Lecture 6: Lazy Transactional Memory Topics: TM semantics and implementation details of lazy TM 1 Transactions Access to shared variables is encapsulated within transactions the system gives the illusion
More informationProduction-Run Software Failure Diagnosis via Hardware Performance Counters. Joy Arulraj, Po-Chun Chang, Guoliang Jin and Shan Lu
Production-Run Software Failure Diagnosis via Hardware Performance Counters Joy Arulraj, Po-Chun Chang, Guoliang Jin and Shan Lu Motivation Software inevitably fails on production machines These failures
More informationL1 Data Cache Decomposition for Energy Efficiency
L1 Data Cache Decomposition for Energy Efficiency Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/flexram Objective Reduce
More informationDefining a High-Level Programming Model for Emerging NVRAM Technologies
Defining a High-Level Programming Model for Emerging NVRAM Technologies Thomas Shull, Jian Huang, Josep Torrellas University of Illinois at Urbana-Champaign September 13, 2018 Shull et al. Defining a High-Level
More informationPage 1. Outline. Coherence vs. Consistency. Why Consistency is Important
Outline ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Memory Consistency Models Copyright 2006 Daniel J. Sorin Duke University Slides are derived from work by Sarita
More informationDeterministic Process Groups in
Deterministic Process Groups in Tom Bergan Nicholas Hunt, Luis Ceze, Steven D. Gribble University of Washington A Nondeterministic Program global x=0 Thread 1 Thread 2 t := x x := t + 1 t := x x := t +
More informationImplementing Kilo-Instruction Multiprocessors
Implementing Kilo-Instruction Multiprocessors Enrique Vallejo 1, Marco Galluzzi 2, Adrián Cristal 2, Fernando Vallejo 1, Ramón Beivide 1, Per Stenström 3, James E. Smith 4 and Mateo Valero 2 1 Grupo de
More informationTransactional Memory Coherence and Consistency
Transactional emory Coherence and Consistency all transactions, all the time Lance Hammond, Vicky Wong, ike Chen, rian D. Carlstrom, ohn D. Davis, en Hertzberg, anohar K. Prabhu, Honggo Wijaya, Christos
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationCS533 Concepts of Operating Systems. Jonathan Walpole
CS533 Concepts of Operating Systems Jonathan Walpole Shared Memory Consistency Models: A Tutorial Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect
More informationA Regulated Transitive Reduction (RTR) for Longer Memory Race Recording
Appears in Twelfth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII) San Jose, California, USA, October 21-25, 2006. A Regulated Transitive
More informationToken Coherence. Milo M. K. Martin Dissertation Defense
Token Coherence Milo M. K. Martin Dissertation Defense Wisconsin Multifacet Project http://www.cs.wisc.edu/multifacet/ University of Wisconsin Madison (C) 2003 Milo Martin Overview Technology and software
More informationS = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)
1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:
More informationLecture: Consistency Models, TM
Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency
More informationRecord and Replay of Multithreaded Applications
Record and Replay of Multithreaded Applications Aliaksandra Sankova aliaksandra.sankova@ist.utl.pt Instituto Superior Técnico (Advisor: Professor Luís Rodrigues) Abstract. With the advent of multi-core
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationLecture 17: Transactional Memories I
Lecture 17: Transactional Memories I Papers: A Scalable Non-Blocking Approach to Transactional Memory, HPCA 07, Stanford The Common Case Transactional Behavior of Multi-threaded Programs, HPCA 06, Stanford
More informationParallel Computer Architecture Spring Memory Consistency. Nikos Bellas
Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency
More informationWhat are Transactions? Transaction Management: Introduction (Chap. 16) Major Example: the web app. Concurrent Execution. Web app in execution (CS636)
What are Transactions? Transaction Management: Introduction (Chap. 16) CS634 Class 14, Mar. 23, 2016 So far, we looked at individual queries; in practice, a task consists of a sequence of actions E.g.,
More informationData Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard
Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard Consider: a = b + c; d = e - f; Assume loads have a latency of one clock cycle:
More informationVirtues and Obstacles of Hardware-assisted Multi-processor Execution Replay
Virtues and Obstacles of Hardware-assisted Multi-processor Execution Replay Cristiano Pereira, Gilles Pokam, Klaus Danne, Ramesh Devarajan, and Ali-Reza Adl-Tabatabai Intel Corporation {cristiano.l.pereira,
More information