Lecture 25: Synchronization primitives
Computer Architecture and Systems Programming (252-0061-00)
Timothy Roscoe, Systems Group, Department of Computer Science, ETH Zürich
Herbstsemester 2012

Last time: Symmetric multiprocessing (SMP)
(Diagram: CPU 0, CPU 1, CPU 2, and CPU 3 sharing a single RAM over a bus.)
- SMP only works because of caches!
- Shared memory rapidly becomes the bottleneck.

Last time: Coherency and consistency
1. Coherency: the values in the caches all match each other; processors all see a coherent view of memory.
2. Consistency: the order in which changes to memory are seen by different processors.

Last time: Sequential consistency
1. Operations from a processor appear (to all others) in program order.
2. Every processor's visibility order is the same interleaving of all the program orders.
Requirements:
- Each processor issues memory operations in program order.
- RAM totally orders all operations.
- Memory operations are atomic.

Last time: Cache coherence protocols
Terminology:
- PrRd: processor read
- PrWr: processor write
- BusRd: bus read
- BusRdX: bus read exclusive
- BusWr: bus write

MESI protocol
(State diagram: processor-initiated and snoop-initiated transitions between the Invalid, Shared, Exclusive, and Modified states. For example: a PrRd in Invalid issues a BusRd and moves to Shared or Exclusive depending on whether the line is shared; a PrWr issues a BusRdX and moves to Modified; a snooped BusRd on a dirty line triggers a write-back and signals HIT; a snooped BusRdX causes the line to be discarded or written back and invalidated; PrRd and PrWr hits in Exclusive or Modified need no bus transaction.)

Other consistency models
(Table: for Alpha, PA-RISC, Power, x86_32, x86_64, IA-64, and zSeries, which reorderings the architecture permits: reads after reads, reads after writes, writes after reads, writes after writes, dependent reads, instruction fetch after write. The individual table entries are not reproduced in this transcription.)
- The Icache is incoherent: self-modifying code requires explicit Icache flushes.
- A read of a value can be seen before the read of the address of that value! (dependent-read reordering)
- Not shown: SPARC, which supports 3 different memory models.
- Portable languages like Java must define their own memory model, and enforce it!
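To make the gap between sequential consistency and the weaker models above concrete, here is the classic store-buffering litmus test, written as a small C sketch of my own (not from the lecture) using POSIX threads. Under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, because some interleaving of the two program orders must run one of the stores before both loads; on real x86 hardware (and under most of the models in the table) that outcome can be observed on some runs, because each CPU's store may sit in its store buffer while the other CPU loads the old value.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared variables; volatile only stops the compiler from caching them
     * in registers, it does NOT restore sequential consistency in hardware. */
    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *thread0(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
    void *thread1(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

    int main(void)
    {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* Under sequential consistency, at least one of r1, r2 is 1.
         * On x86 (store buffers), r1 == 0 && r2 == 0 can show up on some runs. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }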

Barriers and fences
- General rule: the weaker the consistency model, the faster/cheaper it is in hardware.
- We've seen that visibility order is:
  - essential for the correct functioning of some algorithms;
  - difficult to guarantee with many compilers and memory models.
- The solution is to use barriers (also called fences):
  - Compiler barriers: prevent the compiler from reordering statements.
  - Memory barriers: prevent the CPU from reordering instructions.

Compiler barriers
- Prevent the compiler from reordering visible loads and stores.
- The compiler may still reorder (private) register accesses.
- Typically provided as compiler intrinsics:
  - GCC: asm volatile ("" ::: "memory");
  - Intel ECC: memory_barrier()
  - Microsoft Visual C & C++: _ReadWriteBarrier()

Memory barriers on x86
- MFENCE instruction: prevents the CPU from reordering any loads or stores past it.
- (Diagram: loads and stores may be reordered among themselves on either side of the MFENCE, but no load or store may cross the fence.)
- Also: LFENCE (orders loads) and SFENCE (orders stores).

Synchronization
Two ways to synchronize:
1. Atomic operations on shared memory
   - e.g. test-and-set, compare-and-swap
   - these also have ordering constraints, specified in the memory model
2. Interprocessor interrupts (IPIs)
   - invoke an interrupt handler on a remote CPU
   - very slow (500+ cycles on Intel), often avoided except in the OS
   - used for different purposes (e.g. locks vs. asynchronous notification)
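Returning to the barriers above, here is a minimal GCC-on-x86 sketch of how they are typically wrapped and used (the macro and function names are mine, not from the lecture; portable code would use C11 atomics instead). It shows the common publish/consume pattern where the fences make the intended ordering explicit.

    /* Compiler barrier: emits no instruction, but GCC may not move
     * memory accesses across it (the slide's GCC intrinsic). */
    #define compiler_barrier() asm volatile("" ::: "memory")

    /* Hardware barrier on x86: MFENCE orders all earlier loads and stores
     * before all later ones (and also acts as a compiler barrier). */
    #define memory_fence()     asm volatile("mfence" ::: "memory")

    int data;
    int ready;

    void publisher(void)
    {
        data = 42;        /* write the payload...                        */
        memory_fence();   /* ...and make it globally visible...          */
        ready = 1;        /* ...before raising the flag                  */
    }

    int consumer(void)
    {
        while (ready == 0)
            compiler_barrier();   /* force 'ready' to be re-read each time  */
        memory_fence();           /* order the flag read before the data read */
        return data;              /* with these fences, on x86 this sees 42 */
    }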

Test-And-Set (TAS)
One of the simplest non-trivial atomic operations:
1. Read a memory location's value into a register.
2. Write 1 into the location.
- A read-modify-write cycle is required: the memory bus must be locked during the instruction.
- Can also appear as a register: reading returns the value and sets it to 1; writing zero resets it.
- Not to be confused with TAZ.

Using Test-And-Set
Acquire a mutex with TAS:

    void acquire(int *lock) {
        while (TAS(lock) == 1)
            ;
    }

This is a spinlock: keep trying in a tight loop. Often fastest if the lock is not held for long.
Release is simple:

    void release(int *lock) {
        *lock = 0;
    }

Performance of TAS-based spinlocks
How bad can this be? It turns out that TAS can be expensive:
- Memory must be locked while a long operation occurs.
- Must do a read followed by a write while no one else can access memory.
- If spinning: slows things down.
Can we make it faster?

Test-And-Test-And-Set
Replace most of the RMW cycles with simple reads:

    void acquire(int *lock) {
        do {
            while (*lock == 1)
                ;
        } while (TAS(lock) == 1);
    }

Think about the cache traffic:
- Reads hit in the spinner's cache.
- The write due to release invalidates the cache line.
- The spinner's load then misses to main memory, returns 0, and triggers a further RMW cycle.
- That TAS is highly likely to succeed (unless there is contention).

Beware of Test-And-Test-And-Set!
You may be tempted to use this pattern in your regular code: test whether a value has changed outside a lock, and if so, acquire the lock and check again.
Don't. It doesn't work. You need to understand:
- whether the compiler will reorder reads and writes;
- whether the processor will reorder reads and writes;
- whether the memory consistency model will change your code's semantics.
In Java, this is almost bound to happen.

Test-And-Test-And-Set in systems code
Why does it work in systems code?
- TAS is a hardware instruction.
- TAS is serializing: the processor won't reorder instructions past a TAS.
- The compiler won't either (we hope). Or, at least, the compiler and processor are told not to:
  - memory barriers or fences;
  - use of the volatile keyword.
- Even in an OS, the code depends on the processor architecture (memory consistency model).
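For comparison, here is a sketch of the same test-and-test-and-set spinlock written with C11 atomics instead of the bare TAS() pseudo-function (my own code, not the lecture's). The acquire ordering on the successful exchange and the release ordering on the unlock supply exactly the barriers the slides warn about.

    #include <stdatomic.h>

    /* 0 = free, 1 = held; initialise locked to 0. */
    typedef struct { atomic_int locked; } spinlock_t;

    static void spin_acquire(spinlock_t *l)
    {
        for (;;) {
            /* Test: spin on plain loads so the spinner hits in its own cache. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed) == 1)
                ;
            /* Test-and-set: atomically exchange in a 1; acquire ordering keeps
             * the critical section from moving above the lock acquisition. */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;
        }
    }

    static void spin_release(spinlock_t *l)
    {
        /* Release ordering publishes the critical section's writes before
         * the lock is seen as free by other spinners. */
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }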

Other primitives: Fetch-And-Add
Atomically:
- add to a memory location;
- read the previous value.
Allows sequencing of operations: the event counts and sequencers model. Contrast with mutexes and condition variables.

Mutual exclusion with Fetch-And-Add
Atomically add to a memory location and read the previous value:

    struct seq_t {
        int sequencer;    // initially 0
        int event_count;  // initially 0
    };

    void acquire(seq_t *s) {
        int ticket = FAA(&s->sequencer, 1);
        while (s->event_count != ticket)
            ;
    }

    void release(seq_t *s) {
        FAA(&s->event_count, 1);
    }

Compare-and-Swap
CAS(location, old, new) atomically:
1. Load location into value.
2. If value == old, then store new to location.
3. Return value.
Interesting features:
- Theoretically more powerful than TAS, FAA, etc.: can implement almost all wait-free data structures.
- Requires bus locking, or similar, in the memory system.
- Use of CAS for mutual exclusion (a sketch follows below).

CAS for lock-free update
- CAS is generally used in lock-free data structures: concurrent updates do not require locks.
- Such data structures are >> 1 word of memory.
- General pattern: read-copy-update.
  - Readers all read the same data structure.
  - Writers take a copy, modify it, then write back the copy.
  - The old version is deleted when all the readers are finished.
  - CAS is used to change the pointer to the new version, as long as nothing else has changed it!
(Diagram: a global pointer to a data structure with ref count 3; CPU 0 is a reader, CPU 1 a writer.)
1. The reader follows the global pointer and increments the ref count.
2. The writer also reads the data structure and increments the ref count.
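Before the lock-free-update diagram continues on the next page, here is a plausible sketch of the "Use of CAS for mutual exclusion" point above (my own code, not transcribed from the lecture): acquiring the lock succeeds only if the lock word is atomically changed from 0 to 1, which is exactly the CAS(location, old, new) operation just defined.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* 0 = free, 1 = held; initialise locked to 0. */
    typedef struct { atomic_int locked; } cas_lock_t;

    static bool try_acquire(cas_lock_t *l)
    {
        int expected = 0;
        /* CAS(&locked, old = 0, new = 1): succeeds only if the lock was free;
         * returns whether the swap actually happened. */
        return atomic_compare_exchange_strong_explicit(
                   &l->locked, &expected, 1,
                   memory_order_acquire, memory_order_relaxed);
    }

    static void cas_acquire(cas_lock_t *l)
    {
        while (!try_acquire(l))
            ;                     /* spin until the CAS from 0 to 1 succeeds */
    }

    static void cas_release(cas_lock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }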

CAS for lock-free update (continued)
(Two further diagram snapshots, showing the full sequence of the read-copy-update example:)
1. The reader follows the global pointer and increments the ref count.
2. The writer also reads the data structure and increments the ref count.
3. The writer copies the data, decrements the ref count on the original, and updates its copy.
4. The writer uses CAS to swap the global pointer over to the updated copy.
5. The reader finishes and decrements the ref count on the old version; it reaches 0, and CPU 0 frees the old data.
6. The writer finishes and decrements the ref count on the new copy; it is now 1.

The ABA problem
CAS has a problem:
- It reports when a single location is different.
- It does not report when the location has been written (with the same value).
This leads to the ABA problem:
1. CPU A reads the value as x.
2. CPU B writes y to the value.
3. CPU B writes x to the value.
4. CPU A reads the value as x and concludes nothing has changed.
Many problems can result: e.g., what if the value is a software stack pointer?

Solving the ABA problem
Basic problem: the value used for the CAS comparison has not changed, but the data has. CAS doesn't say whether a write has occurred, only whether the value has changed.
Solution: ensure the value always changes!
- Split the value into:
  - the original value, and
  - a monotonically increasing counter.
- CAS both halves in a single instruction.

Double Compare-And-Swap (CAS2 or DCAS)
CAS2(loc1, old1, new1, loc2, old2, new2) atomically:
1. Compares two memory locations, each against its own old value.
2. If both match, each is updated with a different new value.
3. If not, the existing values in the locations are loaded.
- Rarely implemented: MC680x0.
- Was claimed to be more useful than plain CAS.
- Shown recently to be not so useful: everything can be done fast with CAS, if you're slightly clever.

What about x86?
A bewildering array of options!
- XCHG: atomic exchange of a register and a memory location (equivalent to Test-And-Set).
- LOCK prefix (e.g. LOCK ADD, LOCK BTS, ...): executes the instruction with the bus locked.
- LOCK XADD: atomic fetch-and-add.
- CMPXCHG: 32-bit compare-and-swap.
- CMPXCHG8B: 64-bit compare-and-swap. Useful for solving the ABA problem; not the same as CAS2!
- CMPXCHG16B: 128-bit compare-and-swap (x86_64 only!).
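As a concrete illustration of the value-plus-counter fix (my own sketch, not from the lecture): pack a 32-bit value and a 32-bit update counter into one 64-bit word and CAS the pair. With GCC this 64-bit CAS typically compiles to a LOCK CMPXCHG on x86_64, or CMPXCHG8B on 32-bit x86, which is the use the slide mentions.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* One 64-bit word holding the value (low half) and an update counter
     * (high half), so the word changes on every update even if the value
     * itself repeats. Illustrative only, not a library API. */
    typedef _Atomic uint64_t versioned_t;

    #define PACK(value, counter)  (((uint64_t)(counter) << 32) | (uint32_t)(value))
    #define VALUE_OF(word)        ((uint32_t)(word))
    #define COUNTER_OF(word)      ((uint32_t)((word) >> 32))

    /* CAS on the value that also fails if *any* intervening update happened,
     * because every update bumps the counter half of the word. */
    bool versioned_cas(versioned_t *v, uint32_t old_val, uint32_t new_val)
    {
        uint64_t expected = atomic_load_explicit(v, memory_order_acquire);
        if (VALUE_OF(expected) != old_val)
            return false;
        uint64_t desired = PACK(new_val, COUNTER_OF(expected) + 1);
        return atomic_compare_exchange_strong_explicit(
                   v, &expected, desired,
                   memory_order_acq_rel, memory_order_acquire);
    }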

Load-Locked / Store-Conditional
CAS requires bus locking for the read-modify-write:
- complicates the memory bus logic;
- may be a memory bottleneck;
- very CISC-y: not a good match for load-store RISC.
Alternative: load-locked / store-conditional (also called load-linked, or load-and-reserve).
- Implemented in MIPS, PowerPC, ARM, Alpha, ...
- Theoretically as powerful as CAS.
- Provides lock-free atomic read-modify-write, factored into two instructions.

LL/SC usage
Atomic read-modify-write with multiple instructions:
1. Load-locked <location> into a register.
2. Modify the value in the register.
3. Store-conditional the register back to <location>.
4. Test for SC success; retry if not.
The store-conditional will probably succeed if no updates to the location occur in the meantime.
Note: this does not suffer from the ABA problem!

Implementation of LL/SC
Remember one address, and re-use cache coherence:
- The processor remembers the physical address of the locked load.
- It snoops on cache invalidates for that cache line; the SC fails if an invalidate has been seen.
Limitations:
- Cache-line granularity: danger of false sharing.
- Context switches, interrupts, etc. produce false positives.
- Referred to by theorists as "weak LL/SC".

Further reading
(Book cover shown on the slide:) the definitive work on shared-memory synchronization and lock-free or wait-free algorithms.
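To show the four-step usage pattern in code (my own sketch, not from the lecture), here is an atomic increment written with C11's weak compare-exchange; on ARM or POWER a compiler typically emits it as a load-locked / store-conditional pair, and the "weak" variant is allowed to fail spuriously, just like a real SC after a context switch or a conflicting access to the line, which is why the loop retries.

    #include <stdatomic.h>

    /* Atomic increment via the LL/SC usage pattern:
     *   1. load the current value, 2. modify it in a register,
     *   3. try to store it back conditionally, 4. retry if that failed. */
    int atomic_increment(atomic_int *location)
    {
        int old = atomic_load_explicit(location, memory_order_relaxed);  /* 1 */
        int next;
        do {
            next = old + 1;                                              /* 2 */
        } while (!atomic_compare_exchange_weak_explicit(
                     location, &old, next,                               /* 3 */
                     memory_order_acq_rel, memory_order_relaxed));       /* 4 */
        return old;   /* previous value, as a fetch-and-add would return */
    }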

Performance limits of SMP
- Cache-coherent SMP still has memory as a bottleneck.
- All accesses to main memory stall the processor.
  - Limits scalability: remember Amdahl's law?
- MOESI allows reads to be serviced from another cache, but the cache itself can also be slow.
- Memory stalls halt the processor, and also other processors accessing memory.
- And the more processors, the more often this will happen.

Simultaneous multithreading
Can we do anything useful when the processor is waiting for memory (or another cache)?
- Most functional units (ALU, etc.) are idle.
- Many instructions don't require the memory unit.
- But ILP is limited: we can't execute many other instructions in the thread because of data dependencies.
- So what about instructions in other threads?
  - Multiple fetch/decode units and register sets.
  - Reuse the superscalar functional units.
(Diagram: a conventional superscalar pipeline. Instruction control: fetch control, instruction decode, retirement unit, register file, branch prediction, instruction cache. Execution: integer/branch, general integer, FP add, FP mult/div, load, and store functional units with the data cache; operations, addresses, data, and register updates flow between the two halves.)

SMT (or Hyperthreading)
- Label instructions in hardware with a thread ID.
- A logical extension of superscalar techniques: now there are multiple independent instruction streams.
- Fine-grained multithreading: select from the threads on a per-instruction basis.
- Coarse-grained multithreading: switch between threads on a memory stall.
- A multithreaded CPU appears to the OS as multiple CPUs!

Is it worth it?
Well...
- It can be slower than a single thread!
  - It might get no advantage from memory accesses.
  - Multiple threads compete for the cache.
- The advantage is limited (10-20% is typical).
- BUT: it's cheap (in transistors).
- It depends on the workload:
  - Not necessarily good for scientific computing.
  - Good for lots of memory-intensive requests, e.g. web servers!
  - This is the target application domain for Sun Niagara, Rock, etc.

Tomorrow
- Non-Uniform Memory Access (NUMA)
- Multicore processors
- Performance implications