CS758 Multicore Programming (Wood): Synchronization



Shared Memory Primitives

Create thread
- Ask the operating system to create a new thread
- The thread starts at a specified function (via a function pointer)

Memory regions
- Shared: globals and heap
- Per-thread: stack

Primitive memory operations
- Loads & stores; word-sized operations are atomic (if aligned)
- Various atomic memory instructions: swap, compare-and-swap, test-and-set, atomic add, etc. These perform a load/store pair that is guaranteed not to be interrupted.

Thread Creation

- Varies from operating system to operating system
- POSIX threads (Pthreads) is a widely supported standard (C/C++)
- There is a lot of other stuff in Pthreads we're not using. Why? It was really designed for single-core, OS-style concurrency.

pthread_create(id, attr, start_func, arg)
- id is a pointer to a pthread_t object
- We're going to ignore attr
- start_func is a function pointer
- arg is a void*; it can pass a pointer to anything (ah, C...)

Thread Creation Code Example I

Caveat: C-style pseudocode; examples like this won't work without modification.

pthread_create(id, attr, start_func, arg)
- start_func is a function pointer
- arg is a void*; it can pass a pointer to anything (ah, C...)

    void* my_start(void *ptr) {
        printf("hello world\n");
        return NULL;  // terminates thread
    }

    int main() {
        pthread_t id;
        int error = pthread_create(&id, NULL, my_start, NULL);
        if (error) { /* handle error */ }
        pthread_join(id, NULL);
        return 0;
    }

Thread Creation Code Example II

    void* my_start(void *ptr) {
        int* tid_ptr = (int *) ptr;  // cast from void*
        printf("hello world: %d\n", *tid_ptr);
        return NULL;  // terminates thread
    }

    void spawn(int num_threads) {
        pthread_t* id = new pthread_t[num_threads];
        int* tid = new int[num_threads];
        for (int i = 1; i < num_threads; i++) {
            tid[i] = i;
            int error = pthread_create(&id[i], NULL, my_start, &tid[i]);
            if (error) { /* handle error */ }
        }
        tid[0] = 0;
        my_start(&tid[0]);  // run thread zero's work on this thread
        for (int i = 1; i < num_threads; i++) {
            pthread_join(id[i], NULL);
        }
    }

Compare and Swap (CAS)

Atomic compare and swap (CAS)
- The basic, universal atomic operation: it can be used to build all the other variants of atomic operations
- Supported as an instruction on both x86 and SPARC

    compare_and_swap(address, test_value, new_value):
        Load [address] -> old_value
        if (old_value == test_value):
            Store [address] <- new_value
        return old_value

Can be included in C/C++ code using inline assembly, wrapped as a utility function.

Inline Assembly for Atomic Operations

x86 inline assembly for CAS, from Intel's TBB source code (machine/linux_intel64.h):

    static inline int64 compare_and_swap(volatile void *ptr,
                                         int64 test_value,
                                         int64 new_value)
    {
        int64 result;
        asm volatile ("lock\ncmpxchgq %2,%1"
                      : "=a"(result), "=m"(*(int64 *)ptr)
                      : "q"(new_value), "0"(test_value), "m"(*(int64 *)ptr)
                      : "memory");
        return result;
    }

Black magic. The volatile keyword disables compiler memory optimizations.
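An aside not on the original slides: with modern compilers the inline assembly is unnecessary. C11's <stdatomic.h> (or the GCC/Clang __atomic builtins) expresses the same operation portably; a minimal sketch:

    #include <stdatomic.h>
    #include <stdint.h>

    // Same contract as the TBB routine above: returns the old value.
    static inline int64_t compare_and_swap(_Atomic int64_t *ptr,
                                           int64_t test_value,
                                           int64_t new_value)
    {
        int64_t expected = test_value;
        // On x86-64 this compiles down to the same lock cmpxchgq.
        atomic_compare_exchange_strong(ptr, &expected, new_value);
        return expected;  // holds the old value whether or not the swap happened
    }

The fetch-and-add on the next slide is likewise a single call: atomic_fetch_add(ptr, value).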

Fetch-and-Add (Atomic Add)

Another useful atomic operation:

    atomic_add(address, value):
        Load [address] -> temp
        temp2 = temp + value
        Store [address] <- temp2

Some ISAs support this as a primitive operation (x86). Or it can be synthesized out of CAS:

    int atomic_add(int* ptr, int value) {
        while (true) {
            int original_value = *ptr;
            int new_value = original_value + value;
            int old = compare_and_swap(ptr, original_value, new_value);
            if (old == original_value) {
                return old;  // swap succeeded
            }
        }
    }

Thread Local Storage (TLS)

Sometimes a non-shared global variable is useful: a per-thread copy of a variable.

Manual approach:
- Definition: int accumulator[NUM_THREADS];
- Use: accumulator[thread_id]++;
- Limited to NUM_THREADS, requires passing thread_id around, and suffers from false sharing!

Compiler supported:
- Definition: __thread int accumulator = 0;
- Use: accumulator++;
- Implemented as a per-thread globals space, accessed efficiently via the %gs segment register on x86-64
- More info: http://people.redhat.com/drepper/tls.pdf
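A minimal compilable illustration of the compiler-supported approach (my sketch, not the slides'; GCC and Clang spell the keyword __thread, standard C11 spells it _Thread_local):

    #include <pthread.h>
    #include <stdio.h>

    __thread int accumulator = 0;   // one copy per thread; no false sharing

    void* worker(void *arg) {
        for (int i = 0; i < 1000; i++)
            accumulator++;          // no thread_id indexing needed
        printf("my accumulator: %d\n", accumulator);
        return NULL;
    }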

Simple Parallel Work Decomposition

Static Work Distribution

Sequential code:

    for (int i = 0; i < SIZE; i++):
        calculate(i, ...)

Parallel code, for each thread:

    void each_thread(int thread_id):
        segment_size = SIZE / number_of_threads
        assert(SIZE % number_of_threads == 0)
        my_start = thread_id * segment_size
        my_end = my_start + segment_size
        for (int i = my_start; i < my_end; i++):
            calculate(i, ...)

Hey, it's a parallel program!

Dynamic Work Distribution

Sequential code:

    for (int i = 0; i < SIZE; i++):
        calculate(i, ...)

Parallel code, for each thread:

    int counter = 0  // global variable

    void each_thread(int thread_id):
        while (true):
            int i = atomic_add(&counter, 1)
            if (i >= SIZE):
                return
            calculate(i, ...)

Dynamic load balancing, but high overhead.

Coarse-Grain Dynamic Work Distribution

Parallel code, for each thread:

    int num_segments = SIZE / GRAIN_SIZE
    int counter = 0  // global variable

    void each_thread(int thread_id):
        while (true):
            int i = atomic_add(&counter, 1)
            if (i >= num_segments):
                return
            my_start = i * GRAIN_SIZE
            my_end = my_start + GRAIN_SIZE
            for (int j = my_start; j < my_end; j++):
                calculate(j, ...)

Dynamic load balancing with lower (adjustable) overhead.
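A compilable C11 rendering of this pseudocode (a sketch: calculate() is the assumed per-element work function, and SIZE is assumed to be a multiple of GRAIN_SIZE):

    #include <stdatomic.h>

    #define SIZE       (1 << 20)   // assumed problem size
    #define GRAIN_SIZE 1024        // assumed (tunable) grain

    extern void calculate(int j);  // the per-element work function

    static atomic_int counter = 0; // shared segment counter
    static const int num_segments = SIZE / GRAIN_SIZE;

    void each_thread(void) {
        while (1) {
            int i = atomic_fetch_add(&counter, 1);  // claim the next segment
            if (i >= num_segments)
                return;
            int my_start = i * GRAIN_SIZE;
            int my_end   = my_start + GRAIN_SIZE;
            for (int j = my_start; j < my_end; j++)
                calculate(j);
        }
    }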

Barriers

Common Parallel Idiom: Barriers

Physics simulation computation:
- Divide each timestep's computation into N independent pieces
- Each timestep: compute independently, then synchronize

Example: each thread executes:

    segment_size = total_particles / number_of_threads
    my_start_particle = thread_id * segment_size
    my_end_particle = my_start_particle + segment_size - 1
    for (t = 0; t < stop_time; t += delta):
        calculate_forces(t, my_start_particle, my_end_particle)
        barrier()
        update_locations(t, my_start_particle, my_end_particle)
        barrier()

Barrier? All threads wait until all threads have reached it.
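POSIX supplies this primitive directly as pthread_barrier_t (on systems implementing the optional barriers API). A sketch of the timestep loop above; calculate_forces and update_locations are the slide's assumed work functions:

    #include <pthread.h>

    extern void calculate_forces(double t, int start, int end);   // assumed
    extern void update_locations(double t, int start, int end);   // assumed

    pthread_barrier_t barrier;
    // once, before spawning: pthread_barrier_init(&barrier, NULL, number_of_threads);

    void timestep_loop(int my_start, int my_end, double delta, double stop_time) {
        for (double t = 0; t < stop_time; t += delta) {
            calculate_forces(t, my_start, my_end);
            pthread_barrier_wait(&barrier);   // everyone finishes forces...
            update_locations(t, my_start, my_end);
            pthread_barrier_wait(&barrier);   // ...before anyone starts the next step
        }
    }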

Example: Barrier-Based Merge Sort

[Figure: four threads (t0-t3) merge in steps: Step 1, barrier, Step 2, barrier, Step 3. Later steps use fewer threads: algorithmic load imbalance.]

Global Synchronization Barrier

At a barrier: all threads wait until all other threads have reached it.

Strawman implementation (wrong!):

    global (shared) count : integer := P

    procedure central_barrier
        if fetch_and_decrement(&count) == 1
            count := P
        else
            repeat until count == P

What is wrong with the above code?

Sense-Reversing Barrier

Correct barrier implementation:

    global (shared) count : integer := P
    global (shared) sense : Boolean := true
    local (private) local_sense : Boolean := true

    procedure central_barrier
        // each processor toggles its own sense
        local_sense := !local_sense
        if fetch_and_decrement(&count) == 1
            count := P
            // last processor toggles global sense
            sense := local_sense
        else
            repeat until sense == local_sense

The single counter makes this a centralized barrier.
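A C11 sketch of the same barrier (my rendering, assuming a fixed thread count P and GCC/Clang's __thread):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 4                                   // number of threads (assumed fixed)

    static atomic_int  count = P;                 // shared arrival counter
    static atomic_bool sense = true;              // global sense
    static __thread bool local_sense = true;      // each thread's private sense

    void central_barrier(void) {
        local_sense = !local_sense;               // toggle my own sense
        if (atomic_fetch_sub(&count, 1) == 1) {   // old value 1: I am last to arrive
            atomic_store(&count, P);              // reset for the next barrier episode
            atomic_store(&sense, local_sense);    // toggle global sense: release
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                                 // spin until released
        }
    }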

Other Barrier Implementations

Problem with the centralized barrier:
- All processors must increment the counter
- Each read/modify/write is a serialized coherence action, and each one is a cache miss
- O(n) if threads arrive simultaneously; slow for lots of processors

Combining tree barrier:
- Build a log_k(n)-height tree of counters (one per cache block)
- Each thread coordinates with k other threads (grouped by thread id)
- The last of the k processors coordinates with the next higher node in the tree
- Since many coordination addresses are used, misses are not serialized: O(log n) in the best case

Static and more dynamic variants:
- Tree-based arrival, tree-based or centralized release
- Hardware support is also possible (e.g., Cray T3E)

Barrier Performance (from 1991)

[Figure: barrier performance measurements from Mellor-Crummey & Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS, 1991.]

Locks

Common Parallel Idiom: Locking

Protecting a shared data structure. Example: a parallel tree walk that applies f() to each node. A global set object is initialized with a pointer to the root of the tree. Each thread runs:

    while (true):
        node* next = set.pop()
        if next == NULL:
            return  // terminate thread
        func(next->value)  // computationally intense function
        if (next->right != NULL):
            set.insert(next->right)
        if (next->left != NULL):
            set.insert(next->left)

How do we protect the set data structure? Also, to perform well, which element should pop() remove at each step?

Common Parallel Idiom: Locking

Parallel tree walk, applying f() to each node; the global set object is initialized with a pointer to the root of the tree. Each thread runs:

    while (true):
        acquire(set.lock_ptr)
        node* next = set.pop()
        release(set.lock_ptr)
        if next == NULL:
            return  // terminate thread
        func(next->value)  // computationally intense
        acquire(set.lock_ptr)
        if (next->right != NULL):
            set.insert(next->right)
        if (next->left != NULL):
            set.insert(next->left)
        release(set.lock_ptr)

Put lock/unlock into the pop() method? Into the insert() method?
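One concrete answer to the closing question, using a POSIX mutex (a sketch; set_pop and set_insert are assumed C bindings for the slide's set object):

    #include <pthread.h>

    typedef struct node node;          // the tree node type from the slide
    extern node* set_pop(void);        // assumed binding for set.pop()
    extern void  set_insert(node *n);  // assumed binding for set.insert()

    static pthread_mutex_t set_lock = PTHREAD_MUTEX_INITIALIZER;

    node* locked_pop(void) {           // the "lock inside pop()" option
        pthread_mutex_lock(&set_lock);
        node *next = set_pop();
        pthread_mutex_unlock(&set_lock);
        return next;
    }

    void locked_insert(node *n) {      // the "lock inside insert()" option
        pthread_mutex_lock(&set_lock);
        set_insert(n);
        pthread_mutex_unlock(&set_lock);
    }

Folding the lock into the methods keeps callers simple, at the cost of acquiring the lock twice when inserting both children.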

Lock-Based Mutual Exclusion

[Figure: timeline of four threads (t0-t3); each thread in turn holds the lock while the others wait, serializing the critical sections.]

Simple Boolean Spin Locks

Simplest lock:
- A single variable with two states: locked, unlocked
- When unlocked: atomically transition from unlocked to locked
- When locked: keep checking (spin) until the lock becomes unlocked

Busy waiting versus blocking:
- On a multicore, busy wait for the other thread to release the lock. That is likely to happen soon, assuming critical sections are small, and there is likely nothing better for the processor to do anyway.
- On a single processor, if trying to acquire a held lock, block. The only sensible option is to tell the OS to context switch; the OS knows not to reschedule the thread until the lock is released.
- Blocking has high overhead (an OS call). IMHO, it rarely makes sense in multicore (parallel) programs.

Test-and-Set Spin Lock (T&S)

Lock is acquire, unlock is release:

    acquire(lock_ptr):
        while (true):
            // Perform test-and-set
            old = compare_and_swap(lock_ptr, UNLOCKED, LOCKED)
            if (old == UNLOCKED):
                break  // lock acquired!
            // keep spinning, back to top of while loop

    release(lock_ptr):
        Store [lock_ptr] <- UNLOCKED

Performance problem: CAS is both a read and a write, so spinning causes lots of invalidations.

Test-and-Test-and-Set Spin Lock (TTS)

    acquire(lock_ptr):
        while (true):
            // Perform test
            Load [lock_ptr] -> original_value
            if (original_value == UNLOCKED):
                // Perform test-and-set
                old = compare_and_swap(lock_ptr, UNLOCKED, LOCKED)
                if (old == UNLOCKED):
                    break  // lock acquired!
            // keep spinning, back to top of while loop

    release(lock_ptr):
        Store [lock_ptr] <- UNLOCKED

Now spinning is read-only, on a locally cached copy.
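A compilable C11 version of the TTS lock (a sketch; the memory orders follow the usual acquire/release discipline):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_int state; } spinlock_t;   // 0 = UNLOCKED, 1 = LOCKED

    void tts_acquire(spinlock_t *lock) {
        while (true) {
            // Test: read-only spin on the locally cached copy
            while (atomic_load_explicit(&lock->state, memory_order_relaxed) != 0)
                ;  // spin
            // Test-and-set: attempt the atomic transition
            int expected = 0;
            if (atomic_compare_exchange_strong_explicit(&lock->state, &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;  // lock acquired!
        }
    }

    void tts_release(spinlock_t *lock) {
        atomic_store_explicit(&lock->state, 0, memory_order_release);
    }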

TTS Lock Performance Issues

Performance issues remain. Every time the lock is released, all the spinning processors load it and likely try to CAS the block, causing a storm of coherence traffic that clogs things up badly.

One solution: backoff
- Instead of spinning constantly, check less frequently
- Exponential backoff works well in practice

Another problem with spinning: processors can spin really fast and starve threads on the same core!
- Solution: x86 adds a PAUSE instruction, which tells the processor to suspend the thread for a short time

Also: (un)fairness.
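A sketch combining both fixes, exponential backoff plus PAUSE (spinlock_t as in the TTS sketch above; try_acquire is a hypothetical one-shot attempt):

    #include <emmintrin.h>   // _mm_pause(), x86 SSE2
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_int state; } spinlock_t;   // as in the TTS sketch

    static bool try_acquire(spinlock_t *lock) {        // one test-and-test-and-set attempt
        int expected = 0;
        return atomic_load(&lock->state) == 0 &&
               atomic_compare_exchange_strong(&lock->state, &expected, 1);
    }

    void acquire_backoff(spinlock_t *lock) {
        int backoff = 1;
        while (!try_acquire(lock)) {
            for (int i = 0; i < backoff; i++)
                _mm_pause();          // x86 PAUSE: be polite to SMT siblings
            if (backoff < 1024)
                backoff *= 2;         // exponential backoff, capped
        }
    }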

Ticket Locks

To ensure fairness and reduce coherence storms, locks have two counters: next_ticket and now_serving (like a deli counter).

    acquire(lock_ptr):
        my_ticket = fetch_and_increment(lock_ptr->next_ticket)
        while (lock_ptr->now_serving != my_ticket):
            ;  // spin

    release(lock_ptr):
        lock_ptr->now_serving = lock_ptr->now_serving + 1
        // (just a normal store, not an atomic operation; why?)

Summary of operation: to get in line to acquire the lock, atomically increment next_ticket; then spin on now_serving.

Ticket Lock Properties

Less of a thundering-herd coherence storm problem:
- To acquire, waiters only need to read the new value of now_serving
- No CAS on the critical path of lock handoff; just a non-atomic store
- FIFO order (fair). Good, but only if the OS hasn't swapped out any threads!

Padding: allocate now_serving and next_ticket on different cache blocks; two locations reduce interference:

    struct {
        int now_serving;
        char pad[60];
        int next_ticket;
    }

Proportional backoff: estimate the wait time as (my_ticket - now_serving) * average hold time.
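A C11 sketch of a padded ticket lock; _Alignas(64) plays the role of the pad field above (unsigned counters so wraparound is harmless):

    #include <stdatomic.h>

    typedef struct {
        _Alignas(64) atomic_uint next_ticket;   // the two counters live on
        _Alignas(64) atomic_uint now_serving;   // separate cache blocks
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);  // get in line
        while (atomic_load(&l->now_serving) != my_ticket)
            ;  // spin
    }

    void ticket_release(ticket_lock_t *l) {
        // only the holder writes now_serving, so no atomic RMW is needed
        atomic_store(&l->now_serving, atomic_load(&l->now_serving) + 1);
    }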

Array-Based Queue Locks

Why not give each waiter its own location to spin on? Avoid coherence storms altogether!

Idea: a slot array of size N, where each slot is either "go ahead" or "must wait".
- Initialize the first slot to go-ahead, all others to must-wait
- Pad to one slot per cache block
- Keep a next-slot counter (similar to the next_ticket counter)

Acquire: get in line
- my_slot = (atomic increment of the next-slot counter) mod N
- Spin while slots[my_slot] contains must-wait
- Then reset slots[my_slot] to must-wait (for later reuse)

Release: unblock the next in line
- Set slots[(my_slot + 1) mod N] to go-ahead
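A C11 sketch of this array-based lock under the stated assumptions (N fixed in advance, one padded slot per cache block):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 128   // max number of threads, fixed when the lock is created

    typedef struct {
        struct { _Alignas(64) atomic_bool go_ahead; } slots[N];  // one per cache block
        _Alignas(64) atomic_uint next_slot;
    } array_lock_t;
    // initialize: slots[0].go_ahead = true, all other slots false, next_slot = 0

    unsigned array_acquire(array_lock_t *l) {
        unsigned my_slot = atomic_fetch_add(&l->next_slot, 1) % N;  // get in line
        while (!atomic_load(&l->slots[my_slot].go_ahead))
            ;  // spin on a dedicated location
        atomic_store(&l->slots[my_slot].go_ahead, false);  // reset for later reuse
        return my_slot;                                    // caller passes this to release
    }

    void array_release(array_lock_t *l, unsigned my_slot) {
        atomic_store(&l->slots[(my_slot + 1) % N].go_ahead, true);  // unblock next in line
    }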

Array-Based Queue Locks

Variants: Anderson 1990, Graunke and Thakkar 1990.

Desirable properties:
- Threads spin on a dedicated location: just two coherence misses per handoff
- Traffic independent of the number of waiters
- FIFO & fair (same as the ticket lock)

Undesirable properties:
- Higher uncontended overhead than a TTS lock
- O(N) storage for each lock; 128 threads at 64B padding is 8KB per lock!
- What if N isn't known at the start?

List-based locks address the O(N) storage problem. Several variants of list-based locks: MCS 1991, CLH 1993/1994.

List-Based Queue Locks (CLH Lock)

Linked-list node:
- prev: previous node pointer
- must_wait: Boolean

A lock L is a pointer to a linked-list node. Each thread has a local pointer I to a node.

    Acquire(L):
        I->must_wait = true
        I->prev = fetch_and_store(L, I)
        pred = I->prev
        while (pred->must_wait):
            ;  // spin

    Release(L):
        pred = I->prev
        I->must_wait = false  // wake up next waiter, if any
        I = pred              // take pred's node
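A C11 sketch of this CLH lock, following the slide's node layout (the prev pointer lives in the node; each thread keeps a thread-local pointer I to its own node):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct clh_node {
        struct clh_node *prev;
        atomic_bool must_wait;
    } clh_node;

    typedef struct { _Atomic(clh_node *) tail; } clh_lock;
    // init: tail points at a dummy node whose must_wait is false

    static __thread clh_node *I;   // per-thread node, allocated once per thread

    void clh_acquire(clh_lock *L) {
        atomic_store(&I->must_wait, true);
        I->prev = atomic_exchange(&L->tail, I);   // fetch_and_store
        clh_node *pred = I->prev;
        while (atomic_load(&pred->must_wait))
            ;  // spin on the predecessor's flag
    }

    void clh_release(void) {
        clh_node *pred = I->prev;
        atomic_store(&I->must_wait, false);       // wake up the next waiter, if any
        I = pred;                                 // recycle the predecessor's node
    }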

Queue Locks Example

Trace of three threads on the CLH lock from the previous slide. Each column shows a node's must_wait flag (the lock's initial dummy node, then each thread's node I1, I2, I3):

    Time  Event            Dummy  I1     I2     I3     Status
    0     (initial)        False  False  False  False  lock free
    1     t1: Acquire(L)   False  True   False  False  t1 has lock
    2     t2: Acquire(L)   False  True   True   False  t1 has lock; t2 spins
    3     t3: Acquire(L)   False  True   True   True   t1 has lock; t2, t3 spin
    4     t1: Release(L)   False  False  True   True   t2 has lock; t3 spins
    5     t2: Release(L)   False  False  False  True   t3 has lock
    6     t3: Release(L)   False  False  False  False  lock free

Lock Review & Performance

- Test-and-set
- Test-and-test-and-set, with or without exponential backoff
- Ticket lock
- Array-based queue lock (Anderson)
- List-based queue lock (MCS)

Lock Performance

[Figure: lock performance measurements from Mellor-Crummey & Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", ACM TOCS, 1991.]

Other Lock Concerns

Supporting bool trylock(timeout):
- Attempt to get the lock, give up after a time
- Easy for simple TTS locks; harder for ticket and queue-based locks

Reader/writer locks:
- lock_for_read(lock_ptr), lock_for_write(lock_ptr)
- Many readers can hold the lock at the same time; only one writer (with no readers)
- Reader/writer locks sound great! Two problems: acquiring a read lock still requires a read-modify-write of shared data, and acquiring a read/write lock is slower than a simple TTS lock

Other options: topology/hierarchy-aware locks, adaptive hybrid TTS/queue locks, biased locks, etc.
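POSIX exposes reader/writer locks directly; a minimal usage sketch:

    #include <pthread.h>

    pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

    void reader(void) {
        pthread_rwlock_rdlock(&rwlock);   // many readers may hold this at once
        // ... read the shared data ...
        pthread_rwlock_unlock(&rwlock);
    }

    void writer(void) {
        pthread_rwlock_wrlock(&rwlock);   // exclusive: no readers, no other writer
        // ... modify the shared data ...
        pthread_rwlock_unlock(&rwlock);
    }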