Concurrent Preliminaries


Concurrent Preliminaries
Sagi Katorza, Tel Aviv University, 09/12/2014

Outline
- Hardware infrastructure
- Hardware primitives
- Mutual exclusion
- Work sharing and termination detection
- Concurrent data structures
- Transactional memory

Basic concepts
- Processor: a unit of hardware that can execute instructions.
- Multiprocessor: a computer that provides more than one processor.
- Multicore (CMP): a multiprocessor on a single integrated circuit chip.
- Hyperthreading (SMT): a processor that supports more than one logical processor using a single execution pipeline.

Threads
A thread is a sequential program: an execution of a piece of software. A thread can be running (also called scheduled), ready to run, or blocked awaiting some condition. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources.

Scheduler
An operating system component that chooses which threads to schedule onto which processors. The scheduler is concerned mainly with:
- Throughput
- Latency
- Fairness (according to each process's priority)
- Processor affinity (in multiprocessor systems)

Interconnect
What distinguishes a multiprocessor from the general case of cluster, cloud, or distributed computing? Shared memory accessible to each of the processors. This access is mediated by an interconnection network of some kind. The simplest interconnect is a single shared bus, through which all messages pass between processors and memory.

Cache
A cache is a faster memory that stores copies of the data from frequently used main memory locations. Caches are critical to modern high-speed processors. There are two basic cache writing policies: write-through and write-back.

Cache
- Cache hit: an access where the data is found in the cache.
- Cache miss: an access where it is not, which requires accessing the next level of the memory hierarchy.
- Cache levels: in CMPs it is typical for some processors to share some higher levels of cache.

The Cache Coherence Problem
Assume just single-level caches and main memory. A processor writes to a location in its cache. Other caches may hold shared copies of that line, and these will now be out of date. Updating main memory alone is not enough (write-through does not fix the other caches).

The Cache Coherence Problem
Multiple copies of a block can easily become inconsistent. [Diagram: after a write, P1's cache holds A = 7 while P2's cache and main memory still hold A = 5.]

MESI Protocol
Any cache line can be in one of 4 states (2 bits):
- Modified: the cache line has been modified and differs from main memory; it is the only cached copy.
- Exclusive: the cache line is the same as main memory and is the only cached copy.
- Shared: same as main memory, but copies may exist in other caches.
- Invalid: the line's data is not valid (as in a simple cache).

MESI Protocol
Each CPU (cache system) snoops for write activity concerning the data addresses it has cached. This assumes a global bus structure: all communication can be seen by all.

[Diagram: P1 loads w0 and P3 loads w2; each BusRd finds no other cached copy, so both lines enter the Exclusive state.]

[Diagram: P1 stores to w0, moving its line from Exclusive to Modified with no bus traffic; P2 stores to w2, issuing BusRdX, which invalidates P3's copy of w2 and leaves P2's line Modified.]

[Diagram: P2 stores to w2 (write hit, line stays Modified); P3 loads w0, so P1 flushes the modified w0 and both copies of w0 become Shared.]

[Diagram: P3 stores to w0; P1's Shared copy of w0 is invalidated and P3's line becomes Modified.]

MESI Protocol
[State diagram: transitions between Invalid, Shared, Exclusive and Modified, driven by read hits, read misses (shared/exclusive), write hits and write misses, with the Mem Read, Invalidate and RWITM (read-with-intent-to-modify) bus transactions.]

Hardware infrastructure summary: basic concepts, interconnect, cache, the cache coherence problem, the MESI protocol.

Hardware primitives

Spin lock
A spin lock is a lock that causes a thread trying to acquire it to simply wait in a loop ("spin"), repeatedly checking whether the lock is available. Implementing spin locks correctly is difficult because one must take into account the possibility of simultaneous access to the lock.

Implementing a spin lock using the TestAndSet primitive:

lock:   while (TestAndSet(lock) == 1) ;   // spin
unlock: lock = 0;

function TestAndSet(lock) {    // executed atomically by the hardware
    initial = lock
    lock = 1
    return initial
}

Implementing a spin lock using the TestAndTestAndSet pattern:

procedure TestAndTestAndSet(lock) {
    while (lock == 1 or TestAndSet(lock) == 1)
        ;   // spin on the (cached) value first
}

lock:   TestAndTestAndSet(lock);
unlock: lock = 0;

Performance [graph comparing the spin-lock variants; not transcribed]

Why spin-on-read is slow
- The lock is released, so all cached copies are invalidated.
- Reads on all processors miss in the cache, causing bus contention.
- Many processors attempt to set the lock using TAS.
- All but one attempt fails, causing each losing CPU to revert to reading.
- That first read misses in the cache again.
- By the time all this is over, the lock has been freed again!

Exponential Backoff
Give competing threads a chance to finish by waiting a random time before retrying, which prevents contention.
- Good: easy to implement; beats the TTAS lock.
- Bad: parameters must be chosen carefully; not portable across platforms.
[Graph: acquisition time vs. number of threads for the TTAS lock and the backoff lock.]

Deadlock
Thread 1: lock(M1); lock(M2);
Thread 2: lock(M2); lock(M1);
Thread 1 is waiting for the resource (M2) locked by Thread 2, and Thread 2 is waiting for the resource (M1) locked by Thread 1.

Compare And Swap (CAS)
Compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that location to a specified new value.

Compare, then CompareAndSwap
It is common to examine the current state and then try to advance it atomically: read the current value, compute the desired new value, and attempt a CAS, retrying if another thread changed the state in the meantime.

ABA Problem
The ABA problem occurs during synchronization when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed". In fact, another thread may have changed the value to B and back to A in between.

ABA Problem [example diagram not transcribed]

ABA Solution [diagrams not transcribed]

Hardware primitives summary: spin locks, the TestAndSet primitive, the CompareAndSwap primitive, the ABA problem, the LL/SC primitives.

Progress guarantees
- Wait-freedom: all processes make progress in a finite number of their individual steps. Avoids deadlock and starvation; the strongest guarantee, but difficult to provide in practice.
- Lock-freedom: at least one process makes progress in a finite number of steps. Avoids deadlock but not starvation.
- Obstruction-freedom: at least one process makes progress in a finite number of its own steps in the absence of contention. Typically combined with exponential back-off or contention management.


Mutual exclusion
One of the most basic problems in concurrent computing: we want to guarantee that at most one thread at a time can be executing certain code, called a critical section.

Critical Section
RequestCS(int processId)
    read x into a register
    increment the register
    write the register value back to x
ReleaseCS(int processId)

Problem requirements
- Mutual exclusion: no more than one process in the critical section at a time.
- Progress: if processes are waiting, some process can enter the critical section.
- No starvation: eventually every requesting process can enter the critical section.

Mutual Exclusion Problem
Context: shared-memory systems, multiprocessor computers, multi-threaded programs. Shared variable x, initial value 0. Program: x = x + 1.
A correct interleaving:
    Thread 0: read x into a register (value read: 0)
    Thread 0: increment the register (1)
    Thread 0: write the register value back to x (1)
    Thread 1: read x into a register (value read: 1)
    Thread 1: increment the register (2)
    Thread 1: write the register value back to x (2)

Mutual Exclusion Problem
A racy interleaving, where both threads read 0 and the final value is 1 instead of 2:
    Thread 0: read x into a register (value read: 0)
    Thread 1: read x into a register (value read: 0)
    Thread 0: increment the register (1)
    Thread 1: increment the register (1)
    Thread 0: write the register value back to x (1)
    Thread 1: write the register value back to x (1)

Solutions
- Software solutions: Peterson's algorithm, the Bakery algorithm.
- Hardware solutions: disabling interrupts to prevent context switches; special machine-level instructions.

Peterson's Algorithm [algorithm listing not transcribed]

Mutual exclusion summary: the mutual exclusion problem, critical sections, problem requirements, Peterson's algorithm.


Work sharing and termination detection
It is common in parallel or concurrent collection algorithms to need a way to detect termination of a parallel algorithm. Detecting termination without work sharing is simple. There are two methods for work sharing:
- Work pushing
- Work pulling (stealing)

Work sharing and termination detection
Work movement must be atomic. A simple solution is a single shared counter of work units, updated atomically. A better solution is a background thread whose only job is detection.

Shared memory termination [algorithm listings not transcribed]

Work stealing termination [algorithm listing not transcribed]

Improvements
It is straightforward to refine these detection algorithms so that they wait on a single variable anyidle until a scan might be useful, as shown in the next algorithm. Likewise, in the work-stealing case there is a similar refinement in which workers wait on a single anylarge flag.

Delaying scans until useful [algorithm listing not transcribed]

Delaying idle workers [algorithm listing not transcribed]

Symmetric termination detection
The algorithms presented so far assume a separate detection thread. It is tempting to use the idle threads themselves to check termination, as shown in the next algorithm.

Symmetric termination detection
The problem is that this algorithm does not work. Scenario:
- Thread A starts detection.
- A sees that thread B has extra work.
- A gives up on detection, and may be just about to set its busy flag.
- In the meantime, B finishes all of its work, enters detection, sees that all threads appear to be done, and declares termination.
A simple way to fix this is to apply mutual exclusion to detection, as shown in the next algorithm.

Symmetric termination detection repaired [algorithm listing not transcribed]

Rendezvous barriers
A synchronization mechanism that requires all threads to reach a given point before any proceeds. In GC, this is typically the point of termination of a phase of collection. Assuming no work sharing, this can use a simplified version of termination detection with a counter. Since a collector is usually invoked more than once as a program runs, these counters must be reset as the algorithm starts.

Rendezvous barriers [algorithm listing not transcribed]

Work sharing and termination detection summary: shared memory termination, work stealing termination, improvements, symmetric termination detection, rendezvous barriers.


Concurrent data structures
There are four main strategies:
- Coarse-grained locking: one 'large' lock is applied to the whole data structure.
- Fine-grained locking: an operation locks individual elements of a larger data structure, such as the individual nodes of a linked list or tree.
- Lazy update: make some change that logically accomplishes the operation, but may need further steps to complete it and bring the data structure into a normalized form.
- Non-blocking: lock-free strategies that avoid locking altogether and rely on atomic update primitives to accomplish changes to the state of the data structure.

Lock-free linked-list stack [algorithm listing not transcribed]

Fine-grained linked-list queue [algorithm listing not transcribed]

Work stealing lock-free deque
A deque is a double-ended queue. The local worker thread can push and pop work items, while other threads can remove (steal) items. Pushing a value is simple and involves no synchronization. Popping checks whether it is trying to take the last remaining value; if so, it may be racing with a non-local remover, and both threads will try to update tail.

Work stealing lock-free deque [algorithm listings not transcribed]

Concurrent data structures summary: lock-free linked-list stack, fine-grained linked-list queue, work-stealing lock-free deque.


Transactional memory
Borrowed from databases. Definition: a transaction is a finite sequence of machine instructions, executed by a single process, that satisfies the following properties:
- Atomicity: all effects (writes) of a transaction appear, or none do.
- Isolation: no other thread can perceive an intermediate state of a transaction, only a state before or a state after the transaction.

Using transactions
What is a transaction? A sequence of instructions that is guaranteed to execute and complete only as an atomic unit:
    BeginTransaction
        Inst #1
        Inst #2
        Inst #3
    EndTransaction

Lock Freedom
Why are locks bad? Common problems with conventional locking mechanisms in concurrent systems:
- Priority inversion: a low-priority process is preempted while holding a lock needed by a high-priority process.
- Convoying: when a process holding a lock is de-scheduled (e.g. on a page fault or an expired quantum), there is no forward progress for other processes capable of running.
- Deadlock: processes attempt to lock the same set of objects in different orders (these can be programmer bugs).

Atomicity can be achieved either by buffering or by undoing:
- Buffering: accumulate writes in some kind of scratch memory separate from the memory locations written, and update those memory locations only if the transaction commits (a hardware implementation can use the cache).
- Undoing: works the converse way, using undo logs of old values; on abort, the log is used to restore the old state.

Conflict detection
- Eager: checks each new access against the currently running transactions to see if it conflicts.
- Lazy: does the checks when a transaction attempts to commit.

Interface
- TStart(): begins a transaction.
- TCommit(): attempts to commit; returns a boolean indicating success.
- TAbort(): sometimes useful to request an abort programmatically.
- TLoad(addr): reads addr and adds that address to the transaction's read set.
- TStore(addr, value): writes value and adds the address to the transaction's write set.

Transactional memory version of linked-list queue [algorithm listing not transcribed]

Transactional memory with garbage collection
Interference scenario: the mutator is attempting a long transaction and collector actions conflict with it:
- The mutator transaction may be continually aborted by actions of the collector.
- The collector may effectively block for a long time.
This is particularly severe in hardware implementations, where the scenario can happen just because mutator and collector touched the same memory word.

Transactional memory with garbage collection
Another issue to consider:
- Assume we use the undoing approach, with undo logs of old values to support aborting transactions.
- The log may contain the only reference to an object.
- Such referent objects must be considered still to be reachable.
- Thus, transaction logs need to be included in the root set for collection.

Transactional memory summary: properties, transactions, interface, the transactional memory version of the linked-list queue, transactional memory with garbage collection.

Summary
We have explored the basics of the concurrency solutions in:
- Hardware: infrastructure, primitives
- Software
- Both: mutual exclusion, data structures, transactional memory

Discussion
- The ABA problem without the presence of a garbage collector
- Applications of the deque data structure