CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION

Size: px
Start display at page:

Download "CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION"

Transcription

1 CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION Tanmay Gangwani, Adam Morrison *, Josep Torrellas University of Illinois, Urbana- Champaign Technion Israel InsBtute of Technology *

2 Concurrent Data Structures push/pop Core stack- top

3 Concurrent Data Structures push/pop Core stack- top Lock- based SynchronizaJon Core Lock Stack Core Core Core stack- top Time stack- top Unlock Stack

4 Concurrent Data Structures push/pop Core stack- top Lock- free SynchronizaJon X Core 1. read stack- top 2. read- modify- write stack- top CAS (A, X) stack- top stack- top A X A Time Core Core Core

5 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } new_top A new_top B

6 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } new_top A old_top A = stack- top new_top B old_top B = stack- top

7 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } new_top A old_top A = stack- top stack- top new_top A new_top B old_top B = stack- top new_top B

8 CAS Failures Adding node to shared stack Core A stack- top Core B Time void push () { Node *new_top = malloc(); while(true) { old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) return; } } CAS FAILURE new_top A old_top A = stack- top new_top A stack- top stack- top new_top A new_top B old_top B = stack- top new_top B new_top B

9 Load- to- CAS Atomicity Adding node to shared stack void push () { Node *new_top = malloc(); while(true) { ATOMIC old_top = stack- top; new_top- >next = old_top; if(cas(&stack- top, old_top, new_top)) } } return; Cache Line Lock NO CAS FAILURES

10 Scourge of SerializaJon Atomic Compute new Load old CAS(old, new) Core A Compute new Core B Compute new Core C Compute new new top old top new old CAS Stall Stall Time CAS CAS

11 ContribuJon Atomic new Compute new Load old CAS(old, new) CASPAR reduces stalls in lock- free programs using two new top top techniques : old new ü Eager old Forwarding ü Group ValidaUon Core A Compute new CAS Core B Compute new CAS Stall Core C Compute new Stall Time CAS

12 CASPAR Fights SerializaJon Baseline (Prior work) ContribuBons Hardware Queueing Directory chains cores in hardware ContribuUon I : Eager Forwarding Directory facilitates passing of specula9ve values between cores ContribuUon II : Parallel ValidaUon Directory facilitates group- valida9on of mul9ple cores q Uses cache line bit for locking q Dynamic detecbon of contended CAS No source code modificabons q Deadlock- free

13 SerializaJon in Hardware Queueing Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E ld old A ld old B ld old C ld old D ld old E ld old F D C Serial passing of cache line between cores B A Directory

14 SerializaJon in Hardware Queueing Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E D Serial passing of cache line between cores Line C B A Directory

15 SerializaJon in Hardware Queueing Atomic Compute new Load old CAS(old, new) Serial passing of cache line between cores Core A Core B Core C Core D Core E Line Core F F E F D E F D C E F D B C E F D A B C E F Directory

16 ContribuJon I : Eager Forwarding - Insight Adding node to shared stack Core A Core B Core C void push () { Node *new_top = malloc(); while(true) { Compute new Compute new Compute new Atomic old_top = stack; new_top- >next = old_top; if(cas(&stack, old_top, new_top)) CAS Stall Stall Time } } return; new_top generated early on, independent of old_top CAS CAS

17 CASPAR Eager Forwarding Adding node to shared stack Core A Core B Core C void push () { Node *new_top = malloc(); while(true) { Compute new new A Compute new new B Compute new Atomic old_top = stack; new_top- >next = old_top; if(cas(&stack, old_top, new_top)) } } return; new_top generated early on, independent of old_top CAS ValidaUon CAS ValidaUon CAS Time Out- of- window SpeculaBve ExecuBon azer register check- poinbng

18 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E ld old A ld old B ld old C ld old D ld olde ld old F D C B A Directory

19 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E D Line C B A Directory

20 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E new A new B new C new D new E Core F F E D C B A new F new E new D new C new B new A Directory

21 CASPAR Eager Forwarding Atomic Compute new Load old CAS(old, new) Serial ValidaBon Core A Core B Core C Core D Core E Line Spec. Core F F E F D E F D C E F D B C E F A D B C E F new F new E F new D E F new C D E F new B D C E F new A D B C E F Directory

22 LimitaJons of Eager Forwarding Core A Core B Core C Compute new Compute new Compute new new A new B Long speculabon Time CAS CAS CAS ValidaUon Compute new SpeculaBve execubon not possible beyond this point ValidaUon Stall

23 ContribuJon II : Parallel ValidaJon Core A Compute new Core B Compute new Core C Compute new TWO- PHASE COMMIT (2PC) new A new B PROTOCOL TO PERFORM PARALLEL VALIDATION Time CAS CAS CAS ValidaUon ValidaUon Idea: Validate in the directory without ever sending the full line to the core Directory

24 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Core A Core B Core C Core D Core E Core F F E Parallel passing of speculabve values between cores ld old A ld old B ld old C ld old D ld olde ld old F D C B A Directory

25 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E new A new B new C new D new E Core F F E D C B F A E new F new E new D new C new B F new A E Directory

26 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Line Line == new A?? Core F F E D C B F A E new F new E new D new C new B F new A E Directory

27 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Parallel ValidaBon Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Core F F F E D E D C C B F B A E new F new F E new D E new D C new C B F new B A E Directory

28 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Parallel ValidaBon Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Prep, new B Prep, new C Prep, new D Prep, new E Core F Prep, new F F F E D E D C C B F B A E new F new F E new D E new D C new C B F new B A E Directory

29 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Parallel passing of speculabve values between cores Parallel ValidaBon Spec. Spec. Spec. Spec. Spec. Core A Core B Core C Core D Core E Ack Ack Ack Nack Ack Core F F F E D E D C C B F B A E new F new F E new D E new D C new C B F new B A E Directory

30 CASPAR Parallel ValidaJon Atomic Compute new Load old CAS(old, new) Spec. Core A Core B Core C Core D Core E Spec. Core F Parallel passing of speculabve values between cores Commit Commit Commit Resume Ack ing cores halt. Resume re- starts execubon Parallel ValidaBon F E new F new E Directory

31 Outcome ü Fully Parallel Data Transfer ü Fully Parallel De- queue

32 EvaluaJon Sniper simulator, 64 cores, OOO Kernels : FIFO, LIFO, LIFO- push- only, Mandelbrot, Larson ApplicaUons : Galois graph analybcs suite (3x2), Barcelona OpenMP Tasks Suite (1x2) 100% 100% 30% 30% 10% 10% 5% 5% L

33 EvaluaJon 100% 100% 30% 30% 10% 10% 5% 5% L LF

34 EvaluaJon 100% 100% 30% 30% 10% 10% 5% 5% L LF Q

35 EvaluaJon 100% 100% 30% 30% 10% 10% 5% 5% L LF Q EF

36 EvaluaJon Independent Set (IS) gains from Eager Forwarding since early data transfer exposes work that could be done speculabvely and in parallel 100% 100% 30% 30% 10% 10% 5% 5% L LF Q EF

37 EvaluaJon Connected Components (CC) and Delaunay TriangulaBon (DT) gain from Parallel ValidaBon as it removes stalls in EF using group commit 100% 100% 30% 30% 10% 10% 5% 5% 2.5x L LF Q EF C

38 Also in the paper ü InformaBon on proposed hardware changes ü Handling memory consistency issues ü Detailed working and correctness of Eager Forwarding and 2PC protocols ü Scalability plots, cycle- level breakdown for applicabons ü SupporBng other primibves (e.g. LL/SC)

39 Conclusions CASPAR enables true parallelism in codes using lock- free synchronizabon - Eager Forwarding : Parallel transfer of data - Parallel ValidaBon : Parallel de- queue No source code modificabons Future Work - Can be used to break conflict serializabon in TM designs

40 CASPAR: BREAKING SERIALIZATION IN LOCK- FREE MULTICORE SYNCHRONIZATION Tanmay Gangwani, Adam Morrison *, Josep Torrellas University of Illinois, Urbana- Champaign Technion Israel InsBtute of Technology *

41 CASPAR BACKUP

42 Maintaining Consistency Core A Core B ld old new A ld old {new value forwarded to core B} Z = 1 CAS(old, new).. = Z (read 0) Time ValidaUon

43 Maintaining Consistency Core A Core B ld old new A ld old {new value forwarded to core B} Z = 1 CAS(old, new) Coherence mechanism Forwarding causes no consistency violauons.. = Z (read 0) Squash and roll- back Time

44 EvaluaJon (kernels) CAS throughput improvements - EF over Base : 53% - C over Base : 83% - EF over Queue : 10% - C over Queue : 32%

45 EvaluaJon (scalability) LIFO- push- only Mandelbrot LIFO CASPAR greatly improves scalability with number of cores

46 CASPAR in EnJrety EffecBve SerializaBon Module I : Hardware Queueing Directory chains cores in hardware Core A Core B Core C Module II : Eager Forwarding Directory facilitates passing of specula9ve values between cores Core A Core B Core C Breaking SerializaBon Module III : Two- Phase Group Commit Passing SpeculaBve Value Directory facilitates collec9ve- valida9on of mul9ple cores Directory Core A Core B Core C Group ValidaBon

c 2016 Tanmay Gangwani

c 2016 Tanmay Gangwani c 2016 Tanmay Gangwani BREAKING SERIALIZATION IN LOCK-FREE MULTICORE SYNCHRONIZATION BY TANMAY GANGWANI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in

More information

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization ASPAR: Breaking Serialization in ock-free Multicore Synchronization Tanmay Gangwani University of Illinois at Urbana hampaign gangwan2@illinois.edu Adam Morrison Technion Israel Institute of Technology

More information

Speculative Locks. Dept. of Computer Science

Speculative Locks. Dept. of Computer Science Speculative Locks José éf. Martínez and djosep Torrellas Dept. of Computer Science University it of Illinois i at Urbana-Champaign Motivation Lock granularity a trade-off: Fine grain greater concurrency

More information

ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Envionment

ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Envionment ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Envionment, Wonsun Ahn, Josep Torrellas University of Illinois University of Illinois http://iacoma.cs.uiuc.edu/ Motivation Architectures

More information

FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes

FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes FlexBulk: Intelligently Forming Atomic Blocks in Blocked-Execution Multiprocessors to Minimize Squashes Rishi Agarwal and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois

Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications. University of Illinois Speculative Synchronization: Applying Thread Level Speculation to Parallel Applications José éf. Martínez * and Josep Torrellas University of Illinois ASPLOS 2002 * Now at Cornell University Overview Allow

More information

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations

Lecture 21: Transactional Memory. Topics: Hardware TM basics, different implementations Lecture 21: Transactional Memory Topics: Hardware TM basics, different implementations 1 Transactions New paradigm to simplify programming instead of lock-unlock, use transaction begin-end locks are blocking,

More information

BulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment

BulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment BulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment, Benjamin Sahelices Josep Torrellas, Depei Qian University of Illinois University of Illinois http://iacoma.cs.uiuc.edu/

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

Speculative Synchronization

Speculative Synchronization Speculative Synchronization José F. Martínez Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/martinez Problem 1: Conservative Parallelization No parallelization

More information

Lecture: Transactional Memory. Topics: TM implementations

Lecture: Transactional Memory. Topics: TM implementations Lecture: Transactional Memory Topics: TM implementations 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock 2 Design Space Data Versioning

More information

The Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas

The Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas The Design Complexity of Program Undo Support in a General Purpose Processor Radu Teodorescu and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Processor with program

More information

Lecture 10: TM Implementations. Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation

Lecture 10: TM Implementations. Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation Lecture 10: TM Implementations Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation 1 Eager Overview Topics: Logs Log optimization Conflict examples Handling deadlocks Sticky scenarios

More information

CS 134: Operating Systems

CS 134: Operating Systems CS 134: Operating Systems Locks and Low-Level Synchronization CS 134: Operating Systems Locks and Low-Level Synchronization 1 / 25 2 / 25 Overview Overview Overview Low-Level Synchronization Non-Blocking

More information

Kernel Scalability. Adam Belay

Kernel Scalability. Adam Belay Kernel Scalability Adam Belay Motivation Modern CPUs are predominantly multicore Applications rely heavily on kernel for networking, filesystem, etc. If kernel can t scale across many

More information

Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA

Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA Message Passing Improvements to Shared Address Space Thread Synchronization Techniques DAN STAFFORD, ROBERT RELYEA Agenda Background Motivation Remote Memory Request Shared Address Synchronization Remote

More information

Understanding Hardware Transactional Memory

Understanding Hardware Transactional Memory Understanding Hardware Transactional Memory Gil Tene, CTO & co-founder, Azul Systems @giltene 2015 Azul Systems, Inc. Agenda Brief introduction What is Hardware Transactional Memory (HTM)? Cache coherence

More information

CS377P Programming for Performance Multicore Performance Synchronization

CS377P Programming for Performance Multicore Performance Synchronization CS377P Programming for Performance Multicore Performance Synchronization Sreepathi Pai UTCS October 21, 2015 Outline 1 Synchronization Primitives 2 Blocking, Lock-free and Wait-free Algorithms 3 Transactional

More information

DeAliaser: Alias Speculation Using Atomic Region Support

DeAliaser: Alias Speculation Using Atomic Region Support DeAliaser: Alias Speculation Using Atomic Region Support Wonsun Ahn*, Yuelu Duan, Josep Torrellas University of Illinois at Urbana Champaign http://iacoma.cs.illinois.edu Memory Aliasing Prevents Good

More information

CS533: Speculative Parallelization (Thread-Level Speculation)

CS533: Speculative Parallelization (Thread-Level Speculation) CS533: Speculative Parallelization (Thread-Level Speculation) Josep Torrellas University of Illinois in Urbana-Champaign March 5, 2015 Josep Torrellas (UIUC) CS533: Lecture 14 March 5, 2015 1 / 21 Concepts

More information

CS 241 Honors Concurrent Data Structures

CS 241 Honors Concurrent Data Structures CS 241 Honors Concurrent Data Structures Bhuvan Venkatesh University of Illinois Urbana Champaign March 27, 2018 CS 241 Course Staff (UIUC) Lock Free Data Structures March 27, 2018 1 / 43 What to go over

More information

OpenMP and more Deadlock 2/16/18

OpenMP and more Deadlock 2/16/18 OpenMP and more Deadlock 2/16/18 Administrivia HW due Tuesday Cache simulator (direct-mapped and FIFO) Steps to using threads for parallelism Move code for thread into a function Create a struct to hold

More information

LogTM: Log-Based Transactional Memory

LogTM: Log-Based Transactional Memory LogTM: Log-Based Transactional Memory Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, & David A. Wood 12th International Symposium on High Performance Computer Architecture () 26 Mulitfacet

More information

SPECULATIVE SYNCHRONIZATION: PROGRAMMABILITY AND PERFORMANCE FOR PARALLEL CODES

SPECULATIVE SYNCHRONIZATION: PROGRAMMABILITY AND PERFORMANCE FOR PARALLEL CODES SPECULATIVE SYNCHRONIZATION: PROGRAMMABILITY AND PERFORMANCE FOR PARALLEL CODES PROPER SYNCHRONIZATION IS VITAL TO ENSURING THAT PARALLEL APPLICATIONS EXECUTE CORRECTLY. A COMMON PRACTICE IS TO PLACE SYNCHRONIZATION

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Transactions - Definition A transaction is a sequence of data operations with the following properties: * A Atomic All

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6) Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,

More information

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory

Lecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section

More information

CMP-3440 Database Systems

CMP-3440 Database Systems CMP-3440 Database Systems Concurrency Control with Locking, Serializability, Deadlocks, Database Recovery Management Lecture 10 zain 1 Basic Recovery Facilities Backup Facilities: provides periodic backup

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

EECS 570 Final Exam - SOLUTIONS Winter 2015

EECS 570 Final Exam - SOLUTIONS Winter 2015 EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32

More information

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas

Cache Coherence (II) Instructor: Josep Torrellas CS533. Copyright Josep Torrellas Cache Coherence (II) Instructor: Josep Torrellas CS533 Copyright Josep Torrellas 2003 1 Sparse Directories Since total # of cache blocks in machine is much less than total # of memory blocks, most directory

More information

Scalable Locking. Adam Belay

Scalable Locking. Adam Belay Scalable Locking Adam Belay Problem: Locks can ruin performance 12 finds/sec 9 6 Locking overhead dominates 3 0 0 6 12 18 24 30 36 42 48 Cores Problem: Locks can ruin performance the locks

More information

TECHNIQUES TO DETECT AND AVERT ADVANCED SOFTWARE CONCURRENCY BUGS SHANXIANG QI DISSERTATION

TECHNIQUES TO DETECT AND AVERT ADVANCED SOFTWARE CONCURRENCY BUGS SHANXIANG QI DISSERTATION 2013 Shanxiang Qi TECHNIQUES TO DETECT AND AVERT ADVANCED SOFTWARE CONCURRENCY BUGS BY SHANXIANG QI DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

More information

Prototyping Architectural Support for Program Rollback Using FPGAs

Prototyping Architectural Support for Program Rollback Using FPGAs Prototyping Architectural Support for Program Rollback Using FPGAs Radu Teodorescu and Josep Torrellas http://iacoma.cs.uiuc.edu University of Illinois at Urbana-Champaign Motivation Problem: Software

More information

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks

Lecture: Transactional Memory, Networks. Topics: TM implementations, on-chip networks Lecture: Transactional Memory, Networks Topics: TM implementations, on-chip networks 1 Summary of TM Benefits As easy to program as coarse-grain locks Performance similar to fine-grain locks Avoids deadlock

More information

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

The need for atomicity This code sequence illustrates the need for atomicity. Explain. Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock

More information

SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS. Donald Nguyen, Keshav Pingali

SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS. Donald Nguyen, Keshav Pingali SYNTHESIZING CONCURRENT SCHEDULERS FOR IRREGULAR ALGORITHMS Donald Nguyen, Keshav Pingali TERMINOLOGY irregular algorithm opposite: dense arrays work on pointered data structures like graphs shape of the

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Goal A Distributed Transaction We want a transaction that involves multiple nodes Review of transactions and their properties

More information

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II. Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 18-742 Fall 2012 Parallel Computer Architecture Lecture 16: Speculation II Prof. Onur Mutlu Carnegie Mellon University 10/12/2012 Past Due: Review Assignments Was Due: Tuesday, October 9, 11:59pm. Sohi

More information

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based?

Agenda. Designing Transactional Memory Systems. Why not obstruction-free? Why lock-based? Agenda Designing Transactional Memory Systems Part III: Lock-based STMs Pascal Felber University of Neuchatel Pascal.Felber@unine.ch Part I: Introduction Part II: Obstruction-free STMs Part III: Lock-based

More information

Lecture: Consistency Models, TM

Lecture: Consistency Models, TM Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) No class on Monday (please watch TM videos) Wednesday: TM wrap-up, interconnection networks 1 Coherence Vs. Consistency

More information

EazyHTM: Eager-Lazy Hardware Transactional Memory

EazyHTM: Eager-Lazy Hardware Transactional Memory EazyHTM: Eager-Lazy Hardware Transactional Memory Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero Barcelona Supercomputing Center,

More information

Improving the Practicality of Transactional Memory

Improving the Practicality of Transactional Memory Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

Advance Operating Systems (CS202) Locks Discussion

Advance Operating Systems (CS202) Locks Discussion Advance Operating Systems (CS202) Locks Discussion Threads Locks Spin Locks Array-based Locks MCS Locks Sequential Locks Road Map Threads Global variables and static objects are shared Stored in the static

More information

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review

CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages

More information

DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently

DeLorean: Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently : Recording and Deterministically Replaying Shared Memory Multiprocessor Execution Efficiently, Luis Ceze* and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

More information

Lecture 25: Multiprocessors

Lecture 25: Multiprocessors Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed

More information

30. Parallel Programming IV

30. Parallel Programming IV 924 30. Parallel Programming IV Futures, Read-Modify-Write Instructions, Atomic Variables, Idea of lock-free programming [C++ Futures: Williams, Kap. 4.2.1-4.2.3] [C++ Atomic: Williams, Kap. 5.2.1-5.2.4,

More information

Potential violations of Serializability: Example 1

Potential violations of Serializability: Example 1 CSCE 6610:Advanced Computer Architecture Review New Amdahl s law A possible idea for a term project Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep same

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

A Qualitative Survey of Modern Software Transactional Memory Systems

A Qualitative Survey of Modern Software Transactional Memory Systems A Qualitative Survey of Modern Software Transactional Memory Systems Virendra J. Marathe and Michael L. Scott TR 839 Department of Computer Science University of Rochester Rochester, NY 14627-0226 {vmarathe,

More information

Fine-grained synchronization & lock-free programming

Fine-grained synchronization & lock-free programming Lecture 17: Fine-grained synchronization & lock-free programming Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Minnie the Moocher Robbie Williams (Swings Both Ways)

More information

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

Lock Oscillation: Boosting the Performance of Concurrent Data Structures Lock Oscillation: Boosting the Performance of Concurrent Data Structures Panagiota Fatourou FORTH ICS & University of Crete Nikolaos D. Kallimanis FORTH ICS The Multicore Era The dominance of Multicore

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Goal A Distributed Transaction We want a transaction that involves multiple nodes Review of transactions and their properties

More information

Design of Concurrent and Distributed Data Structures

Design of Concurrent and Distributed Data Structures METIS Spring School, Agadir, Morocco, May 2015 Design of Concurrent and Distributed Data Structures Christoph Kirsch University of Salzburg Joint work with M. Dodds, A. Haas, T.A. Henzinger, A. Holzer,

More information

The Design Complexity of Program Undo Support in a General-Purpose Processor

The Design Complexity of Program Undo Support in a General-Purpose Processor The Design Complexity of Program Undo Support in a General-Purpose Processor Radu Teodorescu and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

More information

ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment

ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment calableulk: calable Cache Coherence for Atomic locks in a Lazy Environment Xuehai ian, Wonsun Ahn, and Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Abstract Recently-proposed

More information

Sequential Consistency & TSO. Subtitle

Sequential Consistency & TSO. Subtitle Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to

More information

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 20: Transactional Memory. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 20: Transactional Memory Parallel Computer Architecture and Programming Slide credit Many of the slides in today s talk are borrowed from Professor Christos Kozyrakis (Stanford University) Raising

More information

Marwan Burelle. Parallel and Concurrent Programming

Marwan Burelle.   Parallel and Concurrent Programming marwan.burelle@lse.epita.fr http://wiki-prog.infoprepa.epita.fr Outline 1 2 Solutions Overview 1 Sharing Data First, always try to apply the following mantra: Don t share data! When non-scalar data are

More information

INTERFACING REAL-TIME AUDIO AND FILE I/O

INTERFACING REAL-TIME AUDIO AND FILE I/O INTERFACING REAL-TIME AUDIO AND FILE I/O Ross Bencina portaudio.com rossbencina.com rossb@audiomulch.com @RossBencina Example source code: https://github.com/rossbencina/realtimefilestreaming Q: How to

More information

A Case for Using Value Prediction to Improve Performance of Transactional Memory

A Case for Using Value Prediction to Improve Performance of Transactional Memory A Case for Using Value Prediction to Improve Performance of Transactional Memory Salil Pant Center for Efficient, Scalable and Reliable Computing North Carolina State University smpant@ncsu.edu Gregory

More information

Shared Memory Multiprocessors

Shared Memory Multiprocessors Shared Memory Multiprocessors Jesús Labarta Index 1 Shared Memory architectures............... Memory Interconnect Cache Processor Concepts? Memory Time 2 Concepts? Memory Load/store (@) Containers Time

More information

6 Transactional Memory. Robert Mullins

6 Transactional Memory. Robert Mullins 6 Transactional Memory ( MPhil Chip Multiprocessors (ACS Robert Mullins Overview Limitations of lock-based programming Transactional memory Programming with TM ( STM ) Software TM ( HTM ) Hardware TM 2

More information

CS 450 Exam 2 Mon. 11/7/2016

CS 450 Exam 2 Mon. 11/7/2016 CS 450 Exam 2 Mon. 11/7/2016 Name: Rules and Hints You may use one handwritten 8.5 11 cheat sheet (front and back). This is the only additional resource you may consult during this exam. No calculators.

More information

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93

Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93 Transactional Memory: Architectural Support for Lock-Free Data Structures Maurice Herlihy and J. Eliot B. Moss ISCA 93 What are lock-free data structures A shared data structure is lock-free if its operations

More information

Lecture 10: Avoiding Locks

Lecture 10: Avoiding Locks Lecture 10: Avoiding Locks CSC 469H1F Fall 2006 Angela Demke Brown (with thanks to Paul McKenney) Locking: A necessary evil? Locks are an easy to understand solution to critical section problem Protect

More information

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)

More information

Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors

Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors Wshp. on Memory Performance Issues, Intl. Symp. on Computer Architecture, June 2001. Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors José F. Martínez and

More information

CIS Operating Systems Synchronization based on Busy Waiting. Professor Qiang Zeng Spring 2018

CIS Operating Systems Synchronization based on Busy Waiting. Professor Qiang Zeng Spring 2018 CIS 3207 - Operating Systems Synchronization based on Busy Waiting Professor Qiang Zeng Spring 2018 Previous class IPC for passing data Pipe FIFO Message Queue Shared Memory Compare these IPCs for data

More information

6.852: Distributed Algorithms Fall, Class 20

6.852: Distributed Algorithms Fall, Class 20 6.852: Distributed Algorithms Fall, 2009 Class 20 Today s plan z z z Transactional Memory Reading: Herlihy-Shavit, Chapter 18 Guerraoui, Kapalka, Chapters 1-4 Next: z z z Asynchronous networks vs asynchronous

More information

SELECTED TOPICS IN COHERENCE AND CONSISTENCY

SELECTED TOPICS IN COHERENCE AND CONSISTENCY SELECTED TOPICS IN COHERENCE AND CONSISTENCY Michel Dubois Ming-Hsieh Department of Electrical Engineering University of Southern California Los Angeles, CA90089-2562 dubois@usc.edu INTRODUCTION IN CHIP

More information

POSH: A TLS Compiler that Exploits Program Structure

POSH: A TLS Compiler that Exploits Program Structure POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

Lecture 8: Eager Transactional Memory. Topics: implementation details of eager TM, various TM pathologies

Lecture 8: Eager Transactional Memory. Topics: implementation details of eager TM, various TM pathologies Lecture 8: Eager Transactional Memory Topics: implementation details of eager TM, various TM pathologies 1 Eager Overview Topics: Logs Log optimization Conflict examples Handling deadlocks Sticky scenarios

More information

Memory Consistency Models: Convergence At Last!

Memory Consistency Models: Convergence At Last! Memory Consistency Models: Convergence At Last! Sarita Adve Department of Computer Science University of Illinois at Urbana-Champaign sadve@cs.uiuc.edu Acks: Co-authors: Mark Hill, Kourosh Gharachorloo,

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Linked Lists: The Role of Locking. Erez Petrank Technion

Linked Lists: The Role of Locking. Erez Petrank Technion Linked Lists: The Role of Locking Erez Petrank Technion Why Data Structures? Concurrent Data Structures are building blocks Used as libraries Construction principles apply broadly This Lecture Designing

More information

Commit-On-Sharing High Level Design

Commit-On-Sharing High Level Design Commit-On-Sharing High Level Design Mike Pershin, Alexander Zam Zarochentsev 8 August 2008 1 Introduction 1.1 Definitions VBR version-based recovery transaction an atomic operation which reads and possibly

More information

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) ) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons) Transactions - Definition A transaction is a sequence of data operations with the following properties: * A Atomic All

More information

COREMU: a Portable and Scalable Parallel Full-system Emulator

COREMU: a Portable and Scalable Parallel Full-system Emulator COREMU: a Portable and Scalable Parallel Full-system Emulator Haibo Chen Parallel Processing Institute Fudan University http://ppi.fudan.edu.cn/haibo_chen Full-System Emulator Useful tool for multicore

More information

Multiversion Schemes to achieve Snapshot Isolation Timestamped Based Protocols

Multiversion Schemes to achieve Snapshot Isolation Timestamped Based Protocols Multiversion Schemes to achieve Snapshot Isolation Timestamped Based Protocols Database System Concepts 5 th Ed. Silberschatz, Korth and Sudarshan, 2005 See www.db-book.com for conditions on re-use Each

More information

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017

THE FUTURE OF GPU DATA MANAGEMENT. Michael Wolfe, May 9, 2017 THE FUTURE OF GPU DATA MANAGEMENT Michael Wolfe, May 9, 2017 CPU CACHE Hardware managed What data to cache? Where to store the cached data? What data to evict when the cache fills up? When to store data

More information

Marwan Burelle. Parallel and Concurrent Programming

Marwan Burelle.  Parallel and Concurrent Programming marwan.burelle@lse.epita.fr http://wiki-prog.infoprepa.epita.fr Outline 1 2 3 OpenMP Tell Me More (Go, OpenCL,... ) Overview 1 Sharing Data First, always try to apply the following mantra: Don t share

More information

Fine-grained synchronization & lock-free data structures

Fine-grained synchronization & lock-free data structures Lecture 19: Fine-grained synchronization & lock-free data structures Parallel Computer Architecture and Programming Redo Exam statistics Example: a sorted linked list struct Node { int value; Node* next;

More information

CS5460/6460: Operating Systems. Lecture 11: Locking. Anton Burtsev February, 2014

CS5460/6460: Operating Systems. Lecture 11: Locking. Anton Burtsev February, 2014 CS5460/6460: Operating Systems Lecture 11: Locking Anton Burtsev February, 2014 Race conditions Disk driver maintains a list of outstanding requests Each process can add requests to the list 1 struct list

More information

Transactional Memory. Lecture 18: Parallel Computer Architecture and Programming CMU /15-618, Spring 2017

Transactional Memory. Lecture 18: Parallel Computer Architecture and Programming CMU /15-618, Spring 2017 Lecture 18: Transactional Memory Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Credit: many slides in today s talk are borrowed from Professor Christos Kozyrakis (Stanford

More information

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Early Experiences on Accelerating Dijkstra s Algorithm Using Transactional Memory Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris Computing Systems Laboratory School of Electrical

More information

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime

McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime McRT-STM: A High Performance Software Transactional Memory System for a Multi- Core Runtime B. Saha, A-R. Adl- Tabatabai, R. Hudson, C.C. Minh, B. Hertzberg PPoPP 2006 Introductory TM Sales Pitch Two legs

More information

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December

More information

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat

Relaxing Concurrency Control in Transactional Memory. Utku Aydonat Relaxing Concurrency Control in Transactional Memory by Utku Aydonat A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of The Edward S. Rogers

More information

Log-Based Transactional Memory

Log-Based Transactional Memory Log-Based Transactional Memory Kevin E. Moore University of Wisconsin-Madison Motivation Chip-multiprocessors/Multi-core/Many-core are here Intel has 1 projects in the works that contain four or more computing

More information

Chí Cao Minh 28 May 2008

Chí Cao Minh 28 May 2008 Chí Cao Minh 28 May 2008 Uniprocessor systems hitting limits Design complexity overwhelming Power consumption increasing dramatically Instruction-level parallelism exhausted Solution is multiprocessor

More information

NON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1

NON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1 NON-SPECULATIVE LOAD LOAD REORDERING IN TSO 1 Alberto Ros Universidad de Murcia October 17th, 2017 1 A. Ros, T. E. Carlson, M. Alipour, and S. Kaxiras, "Non-Speculative Load-Load Reordering in TSO". ISCA,

More information

Current Topics in OS Research. So, what s hot?

Current Topics in OS Research. So, what s hot? Current Topics in OS Research COMP7840 OSDI Current OS Research 0 So, what s hot? Operating systems have been around for a long time in many forms for different types of devices It is normally general

More information

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations

Lecture 12: TM, Consistency Models. Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations Lecture 12: TM, Consistency Models Topics: TM pathologies, sequential consistency, hw and hw/sw optimizations 1 Paper on TM Pathologies (ISCA 08) LL: lazy versioning, lazy conflict detection, committing

More information

Achieving Out-of-Order Performance with Almost In-Order Complexity

Achieving Out-of-Order Performance with Almost In-Order Complexity Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost

More information