CS377P Programming for Performance Multicore Performance Synchronization

CS377P Programming for Performance Multicore Performance Synchronization Sreepathi Pai UTCS October 21, 2015

Outline 1 Synchronization Primitives 2 Blocking, Lock-free and Wait-free Algorithms 3 Transactional Memory


Embarrassingly Parallel Programs What are the characteristics of programs that scale linearly?

Embarrassingly Parallel Programs No serial portion, i.e. no communication and no synchronization.

Critical Sections Why should critical sections be short? [A critical section is a region of code that may be executed by at most one thread at a time.]

Locks

    tail_lock.lock()      // returns only when lock is obtained
    tail = tail + 1
    list[tail] = newdata
    tail_lock.unlock()

Can this happen?

    T0        T1
    a = -5    a = 10

A later read of a returns 10.

Implementation of Locks All of the algorithms below require only plain read/write instructions:
  - Peterson's Algorithm (for n = 2 threads)
  - Filter Algorithm (for n > 2 threads)
  - Lamport's Bakery Algorithm

Limitations:
  - for n threads, these algorithms require n memory locations
  - between a write and a subsequent read, another thread may have changed the values

Atomic Read-Modify-Write Instructions Combine a read, modify and write into a single atomic action.
Unconditional:
    type __sync_fetch_and_add(type *ptr, type value, ...)
Conditional:
    bool __sync_bool_compare_and_swap(type *ptr, type oldval, type newval, ...)
    type __sync_val_compare_and_swap(type *ptr, type oldval, type newval, ...)
See the GCC documentation; the __sync functions have been superseded by the __atomic functions.

AtomicCAS (Generic Compare-and-Swap) atomic_cas(ptr, old, new) writes new to *ptr if *ptr contains old, and returns the value previously in *ptr. It is the only atomic primitive really required; for example, atomic_add can be built from it:

    atomic_add(ptr, addend) {
        do {
            old = *ptr;
        } while (atomic_cas(ptr, old, old + addend) != old);
    }

Locks that spin (busy-waiting locks) Locks are initialized to UNLOCKED.

    lock(l):
        while (atomic_cas(l, UNLOCKED, LOCKED) != UNLOCKED)
            ;
    unlock(l):
        l = UNLOCKED;

This is a poor design. Why? Suitable only for very short lock holds; use random backoff otherwise (e.g. sleep or the PAUSE instruction).

Locks that yield during spinning Locks are initialized to UNLOCKED.

    lock(l):
        while (atomic_cas(l, UNLOCKED, LOCKED) != UNLOCKED) {
            sched_yield();  // relinquish CPU
        }

Performance tradeoffs of spin locks

    Operation   Atomics
    Lock        unbounded
    Unlock      0

Remember: every atomic must be processed serially!

An alternative lock: the ticket lock Each lock has a ticket counter and a now-serving counter, both initialized to 0.

    lock(l):
        // atomic_add returns the previous value
        my_ticket = atomic_add(l.ticket, 1);
        while (l.serving != my_ticket)
            ;
    unlock(l):
        l.serving += 1;  // could also be an atomic_add

Performance tradeoffs of ticket locks

    Operation   Atomics   Reads/Writes
    Lock        1         1
    Unlock      0         1

Variations on ticket locks are used as high-performance locks today. See Mellor-Crummey and Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors".

Barriers Synchronize a group of threads: threads wait until all have reached the barrier.
Creation (by one thread):
    pthread_barrier_init(&barrier, NULL, nthreads)
Waiting:
    pthread_barrier_wait(&barrier)

Outline 1 Synchronization Primitives 2 Blocking, Lock-free and Wait-free Algorithms 3 Transactional Memory

Parallel data structures Data structures that support concurrent access. Examples: lists, queues, deques, trees, stacks, etc.

Ensuring Consistency (Safe Access) Assume the structure is a graph and we need to modify values on its nodes concurrently. Locking strategies:
  - one lock for the entire graph (coarse-grain)
  - one lock for every node (fine-grain)
  - if the value fits in an atomic-width word, use an atomic operation on the single value

Blocking Use locks to ensure mutual exclusion; the holder of a lock blocks the progress of other threads. Concerns: progress? fairness? scalability? priority inversion? deadlocks?

Lock-free Guaranteed system-wide progress. Non-blocking: threads do not block each other. Individual threads may not make progress, but at least one thread will. Uses atomic compare-and-swap (or other atomic functions) instead of locks.

Wait-free Guaranteed per-thread progress (and therefore non-blocking). Every thread takes a bounded number of steps to complete its operation. Universal constructions exist that turn any serial algorithm into a wait-free one, but will it perform?

Performance of concurrent data structures Measured as throughput (operations/time). Caution: performance can differ based on conditions, unloaded (low contention) vs. loaded (high contention), and behaviour can be non-linear.

Collapse of Ticket Locks in the Linux kernel Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich, Non-scalable Locks are Dangerous

Lock Performance Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich, Non-scalable Locks are Dangerous

Inside a parallel processor What parallel data structures does a processor use?

Summary Parallel programs require parallel data structures. High-performance parallel data structures require hardware support. Active area of research (not mine!). See Herlihy and Shavit, The Art of Multiprocessor Programming, or take a parallel programming course.

Outline 1 Synchronization Primitives 2 Blocking, Lock-free and Wait-free Algorithms 3 Transactional Memory

The Promise of Transactional Memory

    transaction {
        tail += 1;
        list[tail] = data;
    }

Wrap critical sections with transaction markers. Transactions succeed when no conflicts are detected; conflicts cause transactions to fail. Policy differs on who fails and what happens on a failure.

Implementation (High-level) Track reads and writes:
  - inside transactions only (weak atomicity)
  - everywhere (strong atomicity)
A conflict occurs when reads and writes are shared between transactions; these may not correspond to programmer-level reads/writes.
Eager conflict detection: every read and write is checked for conflicts; the transaction aborts immediately on conflict.
Lazy conflict detection: conflicts are checked when the transaction ends.
An abort path may be provided, taken when a transaction fails.

Actual Implementations Deferred to lecture on cache coherence

Conclusion Synchronization limits scalability, so we need scalable synchronization primitives. Hardware implementation can affect scalability (more in the next two lectures). Will transactional memory help?