CASPAR: BREAKING SERIALIZATION IN LOCK-FREE MULTICORE SYNCHRONIZATION
1 CASPAR: BREAKING SERIALIZATION IN LOCK-FREE MULTICORE SYNCHRONIZATION. Tanmay Gangwani, Adam Morrison*, Josep Torrellas. University of Illinois, Urbana-Champaign; Technion Israel Institute of Technology*.
2 Concurrent Data Structures. [Figure: a core performs push/pop on the shared stack-top.]
3 Concurrent Data Structures: Lock-based Synchronization. Sequence per core: Lock Stack, access stack-top, Unlock Stack. [Figure: timeline of cores taking turns on stack-top.]
4 Concurrent Data Structures: Lock-free Synchronization. 1. Read stack-top. 2. Read-modify-write stack-top with CAS(A, X). [Figure: timeline of cores; stack-top changes from A to X.]
5 CAS Failures: adding a node to a shared stack. void push() { Node *new_top = malloc(); while (true) { old_top = stack-top; new_top->next = old_top; if (cas(&stack-top, old_top, new_top)) return; } } [Figure: over time, cores A and B each allocate a node, new_top A and new_top B.]
6 CAS Failures: adding a node to a shared stack. (Same push() code as slide 5.) [Figure: both cores read the same top: old_top A = stack-top, old_top B = stack-top.]
7 CAS Failures: adding a node to a shared stack. (Same push() code as slide 5.) [Figure: new_top A and new_top B are both linked in front of the same stack-top.]
8 CAS Failures: adding a node to a shared stack. (Same push() code as slide 5.) CAS FAILURE. [Figure: one core's CAS installs its node as the new stack-top; the other core's CAS fails because stack-top no longer equals its old_top.]
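The retry loop on slides 5 to 8 can be written out as compilable C. The following is a minimal sketch using C11 atomics, not the authors' code: the `value` field, the `cas_failures` counter, and the helpers `stack_depth` and `top_value` are assumptions added only to make the behaviour observable, and `stack_top` stands in for the slides' `stack-top`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Node {
    int value;              /* payload; an assumption, the slides show none */
    struct Node *next;
} Node;

/* Shared stack top; NULL means the stack is empty. */
static _Atomic(Node *) stack_top = NULL;

/* Count of failed CAS attempts (hypothetical, for observation only). */
static atomic_ulong cas_failures = 0;

/* Push a value; returns false only if allocation fails. */
bool push(int value) {
    Node *new_top = malloc(sizeof *new_top);
    if (new_top == NULL) return false;
    new_top->value = value;

    Node *old_top = atomic_load(&stack_top);
    for (;;) {
        new_top->next = old_top;
        /* On failure, old_top is reloaded with the current stack_top,
         * so the next iteration retries against the fresh value. */
        if (atomic_compare_exchange_weak(&stack_top, &old_top, new_top))
            return true;
        atomic_fetch_add(&cas_failures, 1);
    }
}

/* Single-threaded helper: number of nodes currently on the stack. */
size_t stack_depth(void) {
    size_t n = 0;
    for (Node *p = atomic_load(&stack_top); p != NULL; p = p->next) n++;
    return n;
}

/* Single-threaded helper: value at the top; the stack must be non-empty. */
int top_value(void) {
    Node *top = atomic_load(&stack_top);
    assert(top != NULL);
    return top->value;
}
```

Calling `push` from several threads at once makes `cas_failures` grow with contention, which is the effect slide 8 labels CAS FAILURE. The weak compare-exchange is used because it already sits in a retry loop; it may also fail spuriously on some architectures.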
9 Load-to-CAS Atomicity: adding a node to a shared stack. void push() { Node *new_top = malloc(); while (true) { /* ATOMIC from here to the CAS */ old_top = stack-top; new_top->next = old_top; if (cas(&stack-top, old_top, new_top)) return; } } The load-to-CAS sequence is made atomic with a cache line lock: NO CAS FAILURES.
10 Scourge of Serialization. Phases per core: Compute new, then an atomic region of Load old and CAS(old, new). [Figure: timeline of cores A, B, C; all compute new, then cores stall while waiting for their turn to CAS.]
11 Contribution. CASPAR reduces stalls in lock-free programs using two new techniques: ✓ Eager Forwarding ✓ Group Validation. [Figure: same timeline as slide 10, with cores stalling before their CAS.]
12 CASPAR Fights Serialization. Baseline (prior work): Hardware Queueing, in which the directory chains cores in hardware. Contributions: Contribution I, Eager Forwarding: the directory facilitates passing of speculative values between cores. Contribution II, Parallel Validation: the directory facilitates group validation of multiple cores. Properties: uses a cache line bit for locking; dynamic detection of contended CAS (no source code modifications); deadlock-free.
13 Serialization in Hardware Queueing. Phases per core: Compute new, then an atomic region of Load old and CAS(old, new). Serial passing of cache line between cores. [Figure: cores A to F each issue ld old; the directory queues entries A, B, C, D, E, F.]
14 Serialization in Hardware Queueing. Serial passing of cache line between cores. [Figure: the cache line (Line) starts moving along the queued cores A to F.]
15 Serialization in Hardware Queueing. Serial passing of cache line between cores. [Figure: the Line moves from core to core while the directory queue drains one entry at a time.]
16 Contribution I: Eager Forwarding, Insight. Adding a node to a shared stack: void push() { Node *new_top = malloc(); while (true) { /* Atomic from here to the CAS */ old_top = stack; new_top->next = old_top; if (cas(&stack, old_top, new_top)) return; } } Insight: new_top is generated early on, independent of old_top. [Figure: timeline of cores A, B, C; all compute new, then cores stall while waiting for their turn to CAS.]
17 CASPAR Eager Forwarding. Adding a node to a shared stack. (Same push() code as slide 16.) new_top is generated early on, independent of old_top. Out-of-window speculative execution after register checkpointing. [Figure: timeline of cores A, B, C; new A and new B are forwarded after Compute new, and each CAS is followed by Validation.]
18 CASPAR Eager Forwarding. Phases per core: Compute new, then an atomic region of Load old and CAS(old, new). [Figure: cores A to F each issue ld old; the directory queues entries A, B, C, D, E, F.]
19 CASPAR Eager Forwarding. [Figure: the cache line (Line) is with the queued cores A to F.]
20 CASPAR Eager Forwarding. Parallel passing of speculative values between cores. [Figure: five cores are marked Spec.; values new A to new E are passed between cores, and the directory records new A to new F next to queue entries A to F.]
21 CASPAR Eager Forwarding. Serial Validation. [Figure: the Line passes from core to core for validation while the directory queue and its recorded new values drain one entry at a time.]
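The forwarding and serial validation steps on slides 18 to 21 can be modelled in software. The sketch below is only an illustration under assumptions, not the hardware protocol: the `QueuedCore` struct and both function names are hypothetical, values are plain integers, and a squashed core is assumed to keep the same `new_value` when it re-executes (as in the stack example, where new_top does not depend on old_top).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One core waiting in the directory queue (hypothetical software model). */
typedef struct {
    int new_value;  /* value the core will CAS in, computed early */
    int spec_old;   /* speculative "old" value received by forwarding */
    bool squashed;  /* set when validation finds spec_old was wrong */
} QueuedCore;

/* Eager forwarding: every core receives its predecessor's new value at
 * once. The head of the queue (index 0) reads the real line value. */
void forward_values(QueuedCore *q, size_t n, int line) {
    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        q[i].spec_old = (i == 0) ? line : q[i - 1].new_value;
        q[i].squashed = false;
    }
}

/* Serial validation: the line visits the cores in queue order. A core
 * whose speculative old value matches the line commits directly;
 * otherwise it is squashed and re-executes with the real value first.
 * Returns the number of squashed cores; *line ends at the final value. */
size_t validate_serially(QueuedCore *q, size_t n, int *line) {
    assert(line != NULL && (q != NULL || n == 0));
    size_t squashes = 0;
    for (size_t i = 0; i < n; i++) {
        if (q[i].spec_old != *line) {
            q[i].squashed = true;
            q[i].spec_old = *line;  /* re-execute with the real old value */
            squashes++;
        }
        *line = q[i].new_value;     /* CAS(spec_old, new_value) succeeds */
    }
    return squashes;
}
```

With three queued cores whose new values are 11, 12, 13 and a line that starts at 10, forwarding gives speculative old values 10, 11, 12 and validation squashes nobody. If the line changes to some other value before validation begins, only the head core is squashed in this model, because its successors speculated on its new value rather than on the line.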
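22 Limitations of Eager Forwarding. Long speculation: speculative execution is not possible beyond a certain point, so the core stalls until Validation. [Figure: timeline of cores A, B, C; new A and new B are forwarded after Compute new, each CAS is followed by Validation, and a core that starts another Compute new stalls at the point where it can speculate no further.]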
23 Contribution II: Parallel Validation. Two-phase commit (2PC) protocol to perform parallel validation. Idea: validate in the directory without ever sending the full line to the core. [Figure: timeline of cores A, B, C and the Directory; new A and new B are forwarded, and the Validation steps after each CAS overlap.]
24 CASPAR Parallel Validation. Phases per core: Compute new, then an atomic region of Load old and CAS(old, new). Parallel passing of speculative values between cores. [Figure: cores A to F each issue ld old; the directory queues entries A, B, C, D, E, F.]
25 CASPAR Parallel Validation. Parallel passing of speculative values between cores. [Figure: five cores are marked Spec.; values new A to new E are passed between cores, and the directory records a new value next to each queue entry.]
26 CASPAR Parallel Validation. Parallel passing of speculative values between cores. [Figure: as on slide 25, with the Line checked against the forwarded value: Line == new A?]
27 CASPAR Parallel Validation. Parallel passing of speculative values between cores, followed by Parallel Validation. [Figure: five cores are marked Spec.; the directory holds the queue entries and their new values.]
28 CASPAR Parallel Validation. Parallel Validation, phase 1: the directory sends Prep messages in parallel: Prep, new B; Prep, new C; Prep, new D; Prep, new E; Prep, new F.
29 CASPAR Parallel Validation. Parallel Validation, replies: the cores answer Ack, Ack, Ack, Nack, Ack.
30 CASPAR Parallel Validation. Parallel Validation, phase 2: the directory sends Commit, Commit, Commit, Resume. Ack'ing cores halt; Resume restarts execution. [Figure: entries E and F, with new E and new F, remain in the directory.]
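One reading of slides 28 to 30 is that the directory commits the longest run of leading Acks as a group and leaves everything from the first Nack onward in the queue. The sketch below models only that decision step and is an assumption-laden illustration, not the protocol from the paper: the `Outcome` enum, the `group_commit` function, and the integer values are hypothetical, and the votes are taken as input because the slides do not say why a core sends a Nack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-core result of one validation round (hypothetical names). */
typedef enum {
    OUTCOME_COMMIT,   /* core acked, and every core ahead of it acked */
    OUTCOME_RESUME,   /* core acked and halted, but a core ahead nacked */
    OUTCOME_PENDING   /* core nacked: it stays queued */
} Outcome;

/* Phase 2 of the two-phase group commit, given the phase 1 votes.
 * acks[i] is core i's reply to its Prep message, in queue order.
 * The leading run of Acks commits together; *line takes the new value
 * of the last committed core. Returns the number of committed cores. */
size_t group_commit(const bool *acks, const int *new_values, size_t n,
                    Outcome *outcomes, int *line) {
    assert(acks != NULL && new_values != NULL);
    assert(outcomes != NULL && line != NULL);
    size_t committed = 0;
    bool prefix_intact = true;  /* no Nack seen so far */
    for (size_t i = 0; i < n; i++) {
        if (!acks[i]) {
            prefix_intact = false;
            outcomes[i] = OUTCOME_PENDING;
        } else if (prefix_intact) {
            outcomes[i] = OUTCOME_COMMIT;
            *line = new_values[i];
            committed++;
        } else {
            outcomes[i] = OUTCOME_RESUME;
        }
    }
    return committed;
}
```

Feeding in the votes from slide 29 (Ack, Ack, Ack, Nack, Ack) commits the first three cores, leaves the fourth pending, and resumes the fifth, which matches the three Commit messages and one Resume shown on slide 30 under this reading.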
31 Outcome: ✓ Fully Parallel Data Transfer ✓ Fully Parallel De-queue
32 Evaluation. Sniper simulator, 64 cores, OOO. Kernels: FIFO, LIFO, LIFO-push-only, Mandelbrot, Larson. Applications: Galois graph analytics suite (3x2), Barcelona OpenMP Tasks Suite (1x2). [Plot: axis labels 100%, 30%, 10%, 5%; configuration L.]
33 Evaluation. [Plot: axis labels 100%, 30%, 10%, 5%; configurations L, LF.]
34 Evaluation. [Plot: axis labels 100%, 30%, 10%, 5%; configurations L, LF, Q.]
35 Evaluation. [Plot: axis labels 100%, 30%, 10%, 5%; configurations L, LF, Q, EF.]
36 Evaluation. Independent Set (IS) gains from Eager Forwarding, since early data transfer exposes work that can be done speculatively and in parallel. [Plot: axis labels 100%, 30%, 10%, 5%; configurations L, LF, Q, EF.]
37 Evaluation. Connected Components (CC) and Delaunay Triangulation (DT) gain from Parallel Validation, as it removes stalls in EF using group commit. [Plot: axis labels 100%, 30%, 10%, 5%; configurations L, LF, Q, EF, C; annotation 2.5x.]
38 Also in the paper: ✓ Information on proposed hardware changes ✓ Handling memory consistency issues ✓ Detailed working and correctness of Eager Forwarding and 2PC protocols ✓ Scalability plots, cycle-level breakdown for applications ✓ Supporting other primitives (e.g. LL/SC)
39 Conclusions. CASPAR enables true parallelism in codes using lock-free synchronization: - Eager Forwarding: parallel transfer of data - Parallel Validation: parallel de-queue. No source code modifications. Future Work: can be used to break conflict serialization in TM designs.
40 CASPAR: BREAKING SERIALIZATION IN LOCK-FREE MULTICORE SYNCHRONIZATION. Tanmay Gangwani, Adam Morrison*, Josep Torrellas. University of Illinois, Urbana-Champaign; Technion Israel Institute of Technology*.
41 CASPAR BACKUP
42 Maintaining Consistency. Core A: ld old; new A; Z = 1; CAS(old, new). Core B: ld old {new value forwarded to core B}; ... = Z (read 0); Validation. [Figure: timeline of both cores.]
43 Maintaining Consistency. (Same scenario as slide 42.) Coherence mechanism: squash and roll-back of the read ... = Z (read 0). Forwarding causes no consistency violations.
44 Evaluation (kernels). CAS throughput improvements: - EF over Base: 53% - C over Base: 83% - EF over Queue: 10% - C over Queue: 32%
45 Evaluation (scalability). [Plots: LIFO-push-only, Mandelbrot, LIFO.] CASPAR greatly improves scalability with number of cores.
46 CASPAR in Entirety. Effective Serialization: Module I, Hardware Queueing: the directory chains cores in hardware. Breaking Serialization: Module II, Eager Forwarding: the directory facilitates passing of speculative values between cores (Passing Speculative Value). Module III, Two-Phase Group Commit: the directory facilitates collective validation of multiple cores (Group Validation). [Figure: cores A, B, C shown for each module, with the Directory.]
) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Transactions - Definition A transaction is a sequence of data operations with the following properties: * A Atomic All
More informationCOREMU: a Portable and Scalable Parallel Full-system Emulator
COREMU: a Portable and Scalable Parallel Full-system Emulator Haibo Chen Parallel Processing Institute Fudan University http://ppi.fudan.edu.cn/haibo_chen Full-System Emulator Useful tool for multicore
More informationMultiversion Schemes to achieve Snapshot Isolation Timestamped Based Protocols
Multiversion Schemes to achieve Snapshot Isolation Timestamped Based Protocols Database System Concepts 5 th Ed. Silberschatz, Korth and Sudarshan, 2005 See www.db-book.com for conditions on re-use Each
More informationTHE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017
THE FUTURE OF GPU DATA MANAGEMENT Michael Wolfe, May 9, 2017 CPU CACHE Hardware managed What data to cache? Where to store the cached data? What data to evict when the cache fills up? When to store data
More informationMarwan Burelle. Parallel and Concurrent Programming
marwan.burelle@lse.epita.fr http://wiki-prog.infoprepa.epita.fr Outline 1 2 3 OpenMP Tell Me More (Go, OpenCL,... ) Overview 1 Sharing Data First, always try to apply the following mantra: Don t share
More informationFine-grained synchronization & lock-free data structures
Lecture 19: Fine-grained synchronization & lock-free data structures Parallel Computer Architecture and Programming Redo Exam statistics Example: a sorted linked list struct Node { int value; Node* next;
More informationCS5460/6460: Operating Systems. Lecture 11: Locking. Anton Burtsev February, 2014
CS5460/6460: Operating Systems Lecture 11: Locking Anton Burtsev February, 2014 Race conditions Disk driver maintains a list of outstanding requests Each process can add requests to the list 1 struct list
More informationTransactional Memory. Lecture 18: Parallel Computer Architecture and Programming CMU /15-618, Spring 2017
Lecture 18: Transactional Memory Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Credit: many slides in today s talk are borrowed from Professor Christos Kozyrakis (Stanford
More informationNikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Early Experiences on Accelerating Dijkstra s Algorithm Using Transactional Memory Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Computing Systems Laboratory School of Electrical
More informationMcRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime
McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime B. Saha, A-R. Adl- Tabatabai, R. Hudson, C.C. Minh, B. Hertzberg PPoPP 2006 Introductory TM Sales Pitch Two legs
More informationVulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically
Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December
More informationRelaxing Concurrency Control in Transactional Memory. Utku Aydonat
Relaxing Concurrency Control in Transactional Memory by Utku Aydonat A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of The Edward S. Rogers
More informationLog-Based Transactional Memory
Log-Based Transactional Memory Kevin E. Moore University of Wisconsin-Madison Motivation Chip-multiprocessors/Multi-core/Many-core are here Intel has 1 projects in the works that contain four or more computing
More informationChí Cao Minh 28 May 2008
Chí Cao Minh 28 May 2008 Uniprocessor systems hitting limits Design complexity overwhelming Power consumption increasing dramatically Instruction-level parallelism exhausted Solution is multiprocessor
More informationNON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1
NON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1 Alberto Ros Universidad de Murcia October 17th, 2017 1 A. Ros, T. E. Carlson, M. Alipour, and S. Kaxiras, "Non-Speculative Load-Load Reordering in TSO". ISCA,
More informationCurrent Topics in OS Research. So, what s hot?
Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general
More informationLecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations
Lecture 12: TM, Consistency Models Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations 1 Paper on TM Pathologies (ISCA 08) LL: lazy versioning, lazy conflict detection, committing
More informationAchieving Out-of-Order Performance with Almost In-Order Complexity
Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost
More information