Lock-Free, Wait-Free and Multi-core Programming. Roger Deran boilerbay.com Fast, Efficient Concurrent Maps AirMap



1 Lock-Free, Wait-Free and Multi-core Programming
  Roger Deran, boilerbay.com
  Fast, Efficient Concurrent Maps: AirMap

2 Lock-Free and Wait-Free Data Structures: Overview
  - The Java Maps that use Lock-Free techniques
  - Graphical performance of Map data structures
  - Consensus number concept
  - The Ubiquitous CAS primitive
  - Implementing AtomicInteger using CAS
  - Implementing Java ConcurrentSkipListMap
  - Volatile variables: vital but confusing
  - Memory Barriers: so esoteric, we buy mutually free beer

3 Lock-Free and Wait-Free Data Structures
  - For multiple threads sharing data
  - Fast
  - Extreme concurrency with many cores active
  - Extreme performance: no expensive wait queues
  - Extremely low latency (wait-free)
  - Constructed from very powerful, simple primitives
  - Algorithms are difficult, so usually use canned ones
  - Active research on these precious techniques

4 Lock-Free and Wait-Free Data Structures
  - Can implement fast locks with wait queues
      Mutexes, RW locks, Semaphores, Condition Variables
  - Can implement fast Atomics
      Integers, Longs, Booleans, References
  - Can implement multi-core data structures
      HashMaps or Sets, Tree Maps or Sets, Queues, Lists, Stacks

5 Lock-Free and Wait-Free Data Structures
  Lock-Free
  - Not fair between threads
  - Always has a retry loop
  - Guarantees progress of some thread, but not which one
  - Not a spin lock! Spins can almost stall the whole system
  Wait-Free beats Lock-Free
  - Fair between threads
  - Every thread is guaranteed to make progress in finite time
  - Rely on GC for unique ids, can generate much garbage
  - More difficult in C, C++, boost::lockfree (the ABA problem)
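
The retry-loop pattern can be sketched with a small lock-free stack. This is the classic Treiber design, not code from the talk; the class name `TreiberStack` is chosen here for illustration. It assumes a garbage-collected runtime, which is what keeps a popped node from being recycled while another thread still compares against it (the ABA problem mentioned above).

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative lock-free stack (Treiber design); not taken from the slides.
class TreiberStack<T> {
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T item) {
        for (;;) {                                   // lock-free retry loop
            Node<T> oldHead = head.get();
            Node<T> newHead = new Node<>(item, oldHead);
            if (head.compareAndSet(oldHead, newHead)) return;
            // CAS failed: some other thread made progress, so try again.
        }
    }

    T pop() {
        for (;;) {
            Node<T> oldHead = head.get();
            if (oldHead == null) return null;        // empty stack
            if (head.compareAndSet(oldHead, oldHead.next)) return oldHead.item;
        }
    }

    public static void main(String[] args) {
        TreiberStack<String> stack = new TreiberStack<>();
        stack.push("a");
        stack.push("b");
        System.out.println(stack.pop()); // prints "b"
        System.out.println(stack.pop()); // prints "a"
    }
}
```

A failed CAS always means another thread succeeded, so some thread makes progress, but an unlucky thread may retry indefinitely: lock-free, not wait-free.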

6 The standard Java Map Classes
  The Concurrent* are Lock-Free and AirMap is Mostly Lock-Free

7 Map Feature Comparison
  [Table: HashMap, TreeMap, ConcurrentHashMap, ConcurrentSkipListMap and AirMap compared on put/get/remove, ordered access, thread safe, most memory efficient, fastest multicore access]

8 Lock-Free Map Random Cumulative Put
  Decreasing exponential speed with Map size

9 Lock-Free Map Concurrent Random Put

10 Map Concurrent Random Access
  Mixed: 4 threads put, 4 threads get

11 Lock-Free 8-Thread Remove Speed
  JVM size versus time shows GC efficiency

12 Lock-Free One-Thread Iterator Speed

13 Lock-Free One-Thread Iterator Speed
  Log scale shows the entire spectrum

14 Map Entry size vs Map size
  Size of basic Key/Value entry in bytes given log Map size

15 Consensus Number
  - Any given concurrency primitive has one
  - How many threads can be synchronized, i.e. can agree on a single value without waiting?
  - Consensus 1: Surprisingly, memory is weak
      Atomic read or write to memory. Dekker's Algorithm
  - Consensus 2: Another surprise: many are weak
      Queues, test-and-set, swap, getAndAdd, stacks
  - Consensus infinity: A few vital, powerful primitives
      Augmented queue, like socket poll
      Compare And Set: CAS-type instruction
      Load-Link and Store-Conditional instruction pair
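
To make "consensus infinity" concrete, here is a minimal sketch of a one-shot consensus object built on CAS. The names `CasConsensus`, `decide` and `ConsensusDemo` are hypothetical, chosen for this example. Any number of threads can propose a value; the first successful CAS fixes the decision and every caller returns it after a bounded number of steps, with no retry loop.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

class ConsensusDemo {
    // Starts `threads` proposers; each proposes its own id.
    // Returns the value each thread decided on.
    static Integer[] runProposers(int threads) {
        CasConsensus<Integer> consensus = new CasConsensus<>();
        Integer[] decided = new Integer[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> decided[id] = consensus.decide(id));
            workers[i].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return decided;
    }

    public static void main(String[] args) {
        // All entries are equal; which id wins varies from run to run.
        System.out.println(Arrays.toString(runProposers(8)));
    }
}

// Hypothetical one-shot consensus object for any number of threads.
final class CasConsensus<T> {
    // null means "no decision yet", so proposals must be non-null.
    private final AtomicReference<T> decision = new AtomicReference<>(null);

    T decide(T proposal) {
        if (proposal == null) throw new NullPointerException("proposal");
        // Only the first CAS from null can succeed. The result is ignored
        // because winners and losers alike read the decided value below.
        decision.compareAndSet(null, proposal);
        return decision.get();
    }
}
```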

16 The Ubiquitous CAS: Compare and Set
  - Atomic, infinite consensus number
  - Pseudo-code, normally one instruction:

      boolean compareAndSet(ValueType *p, ValueType expectedValue, ValueType newValue) {
          // the whole body executes atomically
          if (*p != expectedValue) return false;
          *p = newValue;
          return true;
      }

  - Java implementation invokes secret native code:

      class AtomicInteger {
          public final boolean compareAndSet(int expect, int update) {
              return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
          }
      }

17 The Ubiquitous CAS: Compare and Set
  - Definition: Atomically change a given memory location to a given new value if it has a given expected value, and return true iff the change took place.
  - Consensus infinity is expensive. The memory bus (on recent cores, typically just the affected cache line) is locked for all cores: slow
  - x86, x64 instruction (with lock prefix byte for SMP):

      LOCK; CMPXCHG ptr, expected, new

  - Can implement primitives with lower consensus numbers, like AtomicInteger.getAndIncrement()
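
The definition can be checked directly with `java.util.concurrent.atomic.AtomicInteger`. A minimal example (the class name `CasSemanticsDemo` is just for illustration): the first CAS sees the expected value and succeeds, the second uses a stale expected value and leaves the location untouched.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasSemanticsDemo {
    // Returns "<first CAS result> <second CAS result> <final value>".
    static String demo() {
        AtomicInteger cell = new AtomicInteger(5);
        boolean first = cell.compareAndSet(5, 6);   // expected value matches: 5 becomes 6
        boolean second = cell.compareAndSet(5, 7);  // stale expected value: no change
        return first + " " + second + " " + cell.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true false 6"
    }
}
```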

18 AtomicInteger, from Java library source code: lock-free (has retry loop)

      /**
       * Atomically increments by one the current value.
       *
       * @return the previous value
       */
      public final int getAndIncrement() {
          for (;;) {
              int current = get();
              int next = current + 1;
              if (compareAndSet(current, next))
                  return current;
          }
      }
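
The same read, compute, CAS, retry shape works for any read-modify-write operation. As a sketch, here is an atomic "maximum" built on `compareAndSet`; the helper names `getAndMax` and `raceForMax` are hypothetical and not part of the Java library.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasMaxDemo {
    // Atomically raises target to at least candidate; returns the previous value.
    // Hypothetical helper that follows the getAndIncrement pattern.
    static int getAndMax(AtomicInteger target, int candidate) {
        for (;;) {
            int current = target.get();
            int next = Math.max(current, candidate);
            if (target.compareAndSet(current, next))
                return current;
        }
    }

    // Several threads race to record the largest value any of them has seen.
    static int raceForMax(int threads, int perThread) {
        AtomicInteger max = new AtomicInteger(Integer.MIN_VALUE);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) getAndMax(max, base + i);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println(raceForMax(4, 1000)); // prints 3999
    }
}
```

Since Java 8, `AtomicInteger.accumulateAndGet(x, Math::max)` packages an equivalent loop.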

19 ConcurrentSkipListMap: leaf node structure, from Java source code

      static final class Node<K,V> {
          final K key;
          volatile Object value;
          volatile Node<K,V> next;
      }

20 ConcurrentSkipListMap, from Java source code comments

      * Here's the sequence of events for a deletion of node n with
      * predecessor b and successor f, initially:
      *
      *   ... b ------> n -----> f ...
      *
      * 1. CAS n's value field from non-null to null.
      *    From this point on, no public operations encountering
      *    the node consider this mapping to exist. However, other
      *    ongoing insertions and deletions might still modify
      *    n's next pointer.

21 ConcurrentSkipListMap, from Java source code comments

      * 2. CAS n's next pointer to point to a new marker node.
      *    From this point on, no other nodes can be appended to n.
      *    which avoids deletion errors in CAS-based linked lists.
      *
      *   ... b ------> n -----> marker ------> f ...
      *

22 ConcurrentSkipListMap, from Java source code comments

      * 3. CAS b's next pointer over both n and its marker.
      *    From this point on, no new traversals will encounter n,
      *    and it can eventually be GCed.
      *
      *   ... b ----------------------------------> f ...
      *
      * A failure at step 1 leads to simple retry due to a lost race
      * with another operation. Steps 2-3 can fail because some other
      * thread noticed during a traversal a node with null value and
      * helped out by marking and/or unlinking. This helping-out
      * ensures that no thread can become stuck waiting for progress of
      * the deleting thread. The use of marker nodes slightly
      * complicates helping-out code because traversals must track
      * consistent reads of up to four nodes (b, n, marker, f), not
      * just (b, n, f), although the next field of a marker is
      * immutable, and once a next field is CAS'ed to point to a
      * marker, it never again changes, so this requires less care.
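
The three CAS steps can be walked through on a hand-built list b, n, f. This is a heavily simplified sketch, not the library code: `AtomicReference` fields stand in for the volatile fields, a null key marks a marker node (an assumption of this example), and the traversal, retry and helping-out logic described in the comments is left out.

```java
import java.util.concurrent.atomic.AtomicReference;

class MarkerDeleteDemo {
    // Simplified node; a null key identifies a marker in this sketch.
    static final class Node {
        final String key;
        final AtomicReference<Object> value;
        final AtomicReference<Node> next;

        Node(String key, Object value, Node next) {
            this.key = key;
            this.value = new AtomicReference<>(value);
            this.next = new AtomicReference<>(next);
        }

        boolean isMarker() { return key == null; }
    }

    // Deletes n, whose predecessor is b. Returns true if this call won step 1.
    static boolean delete(Node b, Node n) {
        // Step 1: CAS n's value from non-null to null (logical deletion).
        Object v = n.value.get();
        if (v == null || !n.value.compareAndSet(v, null)) {
            return false; // already deleted or lost a race; the real code retries
        }
        // Step 2: CAS n's next pointer to a new marker that points at f.
        Node f = n.next.get();
        if (f == null || !f.isMarker()) {
            n.next.compareAndSet(f, new Node(null, "marker", f));
        }
        // Step 3: CAS b's next pointer over both n and its marker.
        Node marker = n.next.get();
        if (marker != null && marker.isMarker()) {
            b.next.compareAndSet(n, marker.next.get());
        }
        return true;
    }

    public static void main(String[] args) {
        Node f = new Node("c", "3", null);
        Node n = new Node("b", "2", f);
        Node b = new Node("a", "1", n);
        System.out.println(delete(b, n));      // prints "true"
        System.out.println(b.next.get().key);  // prints "c"
    }
}
```

In this sketch a failed CAS in step 2 or 3 is simply ignored; in the real class a failure there means another thread already helped, and traversals finish the unlinking.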

23 Volatile Variables: vital, little understood
  - We consider Java volatile here
  - Necessary for inter-thread visibility (also in C#)

      import java.util.ArrayList;

      class MyClass {
          // hypothetical array length, for illustration only
          final int size = 16;
          // only the writing thread is guaranteed to see updates to this
          int i;
          // updates to vi can be seen by any thread
          volatile int vi;
          // Java array elements are not volatile! Only the reference is
          volatile int[] va = new int[size];
          // only the reference is volatile
          volatile ArrayList val = new ArrayList();
          // synchronized makes the loads and stores inside visible
          // to other threads that lock the same object
          public synchronized void set(int newi) { i = newi; }
      }
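
Since array elements cannot be declared volatile, one option in Java is `java.util.concurrent.atomic.AtomicIntegerArray`, whose `get` and `set` give each element volatile read and write semantics. A minimal sketch (the class and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class VolatileArrayDemo {
    // A writer thread stores into one element; the calling thread polls it.
    static int publishAndRead() {
        AtomicIntegerArray slots = new AtomicIntegerArray(4); // all elements start at 0
        Thread writer = new Thread(() -> slots.set(2, 42));   // volatile-style element write
        writer.start();
        while (slots.get(2) == 0) {   // volatile-style element read
            Thread.yield();
        }
        return slots.get(2);
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead()); // prints 42
    }
}
```

With a plain `volatile int[]`, the polling loop would have no such guarantee, because only the array reference is volatile.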

24 Volatile Variables: vital, little understood
  - Some architectures re-order loads/stores to memory!
  - Volatile: as if no change to the code, but slower
  - Ensures loads and stores reach memory for inter-thread visibility (except in C, C++, where it's only for I/O)
  - Locks and synchronized blocks do too, but they are slower and not lock-free
  - Not atomic! myVolatile++ by two threads may lose a count. Use AtomicInteger instead
  - Generally much faster than CAS, atomics, locks. Very fast, or free (on x86, a load is free in hardware)
  - Consensus number 1
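
The lost-count problem is easy to provoke. In this sketch (names are illustrative) several threads increment both a volatile int and an AtomicInteger the same number of times; the atomic total is always exact, while the volatile total may come up short because `++` is a separate load and store.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LostUpdateDemo {
    static volatile int volatileCount;
    static final AtomicInteger atomicCount = new AtomicInteger();

    // Returns {volatile total, atomic total} after all threads finish.
    static int[] run(int threads, int perThread) {
        volatileCount = 0;
        atomicCount.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    volatileCount++;               // load, add, store: not atomic
                    atomicCount.incrementAndGet(); // CAS-based: atomic
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new IllegalStateException(e); }
        }
        return new int[] { volatileCount, atomicCount.get() };
    }

    public static void main(String[] args) {
        int[] totals = run(4, 100_000);
        // The atomic total is 400000; the volatile total is usually smaller.
        System.out.println("volatile=" + totals[0] + " atomic=" + totals[1]);
    }
}
```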

25 Volatile Reordering
  - Some architectures re-order loads/stores to memory!
  - No reordering of volatile loads/stores to memory: program order is followed
      By the ahead-of-time compiler (javac)
      By the just-in-time compiler (the JVM)
      By the core (all necessary implied or explicit memory barriers)
  - Mixed with non-volatiles:
      Non-volatile loads and stores can mix together in any way
      Earlier non-volatile ops can float below a volatile load ("acquire"), but later ones cannot float above it
      Later non-volatile ops can float above a volatile store ("release"), but earlier ones cannot float below it
  - (Doug Lea, Java)
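
These rules are what make the common "write the data, then set a volatile flag" idiom safe. A minimal sketch, with illustrative names: the plain store to `payload` cannot float below the volatile store to `ready`, and the plain load of `payload` cannot float above the volatile load of `ready`, so a reader that sees the flag also sees the data.

```java
class PublicationDemo {
    static int payload;             // deliberately non-volatile
    static volatile boolean ready;  // volatile flag

    static int publishAndConsume() {
        payload = 0;
        ready = false;
        Thread writer = new Thread(() -> {
            payload = 42;   // ordinary store: stays above the volatile store
            ready = true;   // volatile store ("release")
        });
        writer.start();
        while (!ready) {    // volatile load ("acquire")
            Thread.yield();
        }
        return payload;     // ordinary load: stays below the volatile load
    }

    public static void main(String[] args) {
        System.out.println(publishAndConsume()); // prints 42
    }
}
```

If `ready` were not volatile, the reader could spin forever or return 0.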

26 Java Volatile Variables: vital, little understood
  - Broken before Java 1.5 fixed the memory model
  - Array elements are always non-volatile
  - Primitives or references can be volatile
  - Defined by the happens-before binary relation. It seems almost nobody understands it this way

27 C# Volatile Variables: vital, little understood
  - volatile variables like Java
  - System.Threading.Volatile.Read(ref var)
  - System.Threading.Volatile.Write(ref var, value)
  - A volatile variable load/store just implies a volatile read/write as above
  - Can be applied to array elements, unlike Java

28 C, C++ Volatile Variables: vital, little understood
  - C, C++ already have a volatile keyword, like const (dangerous)
  - gcc: c++11:
  - Forces access to occur, in program order
  - Intended only for memory-mapped I/O!
  - Not for threading, but may work, e.g. in Microsoft, probably gcc
  - No re-ordering control of non-volatiles at all!
  - Illegal in Linux kernel! Use kernel native barriers and locks
  - Full sw-only barrier: asm volatile("" ::: "memory");
  - Full sw/hw barrier: gcc and above: __sync_synchronize()
  - Volatile load/store: asm volatile(hw_specific_instruction)
  - Nice sw-only barrier: std::atomic_signal_fence(std::memory_order)
  - Nice hw/sw barrier: std::atomic_thread_fence(std::memory_order)
  - Atomic variables: std::atomic<..>

29 Hardware Memory Barriers
  - Prevent a core from swapping its loads/stores to memory
  - Four conceptual primitive kinds of barriers, which can be combined:
      load-store: default in Sparc TSO, x86 (ARM, POWER: only when the store depends on the load)
      load-load: default in Sparc TSO, x86 (ARM, POWER: only for dependent loads)
      store-store: default in Sparc TSO, x86
      store-load: default in none. Slow
  - Memory barriers affect all other loads or stores by the generating core, not just the memory location of the load or store in the instruction. E.g. a load-store barrier prevents any earlier load from being swapped with any later store by that core.
  - A volatile load causes a succeeding load-store and load-load, i.e. an acquire barrier. Other stores can float down.
  - A volatile store causes a preceding load-store and store-store, i.e. a release barrier (plus a store-load in x86 OpenJDK).
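
The store-load case is the one even x86 reorders, and it can be sketched with the classic "store buffering" test. This example assumes Java 9 or later and uses the static `VarHandle.fullFence()` method as an explicit full barrier on plain fields; the class and method names are illustrative. Each thread stores to one variable and then loads the other. With a store-load barrier between the two, at least one thread should observe the other's store, so both loads returning 0 is not expected.

```java
import java.lang.invoke.VarHandle;

class StoreLoadFenceDemo {
    static int x, y;    // plain (non-volatile) shared fields
    static int r1, r2;  // what each thread loaded

    // Runs one round; returns true if both threads loaded 0.
    static boolean bothSawZero() {
        x = 0; y = 0; r1 = -1; r2 = -1;
        Thread a = new Thread(() -> { x = 1; VarHandle.fullFence(); r1 = y; });
        Thread b = new Thread(() -> { y = 1; VarHandle.fullFence(); r2 = x; });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return r1 == 0 && r2 == 0;
    }

    // Counts rounds in which the store-load ordering appeared to be violated.
    static int countViolations(int rounds) {
        int violations = 0;
        for (int i = 0; i < rounds; i++) {
            if (bothSawZero()) violations++;
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(countViolations(2000)); // expected to print 0
    }
}
```

Declaring `x` and `y` volatile would have the same effect without explicit fences, which is the store-load barrier after a volatile store noted above. Thread start-up cost means the two threads rarely overlap here, so this illustrates the rule more than it stress-tests it.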

30 Reordering Comparison
  Increasing Processor Relaxed Memory Ordering Levels
  - Total store ordering: can switch Store-Load
      Sparc: TSO (total store ordering) mode
      AMD: x86, x64 instruction set architecture
      Intel: x86, x64
  - Partial ordering: can switch Store-Store and Store-Load
      Sparc PSO (obsolete)
  - Full reordering: can switch everything, including atomic load/store
      Sparc: RMO (obsolete)
      ARM v7 or later: depends on implementation architecture?
      POWER
      IA-64 (Intel Itanium)
      MIPS: hw implementation environment dependent
  - Full reordering plus dependent loads reordered
      DEC Alpha

31 Memory Barrier Instructions, from OpenJDK orderAccess.hpp

      //                 sparc RMO             ia64             x86
      //
      // fence           membar #LoadStore     mf               lock addl 0,(sp)
      //                        #StoreStore
      //                        #LoadLoad
      //                        #StoreLoad
      //
      // release         membar #LoadStore     st.rel [sp]=r0   movl $0,<dummy>
      //                        #StoreStore
      //                 st %g0,[]
      //
      // acquire         ld [%sp],%g0          ld.acq <r>=[sp]  movl (sp),<r>
      //                 membar #LoadLoad
      //                        #LoadStore
      //
      // release_store   membar #LoadStore     st.rel           <store>
      //                        #StoreStore
      //                 st
      //
      // store_fence     st                    st               lock xchg
      //                 fence                 mf
      //
      // load_acquire    ld                    ld.acq           <load>
      //                 membar #LoadLoad
      //                        #LoadStore
