Concurrent programming: From theory to practice. Concurrent Algorithms 2016 Tudor David


1 Concurrent programming: From theory to practice. Concurrent Algorithms 2016, Tudor David

2-5 From theory to practice
Theoretical (design)
- Impossibilities
- Upper/Lower bounds
- Techniques
- System models
- Correctness proofs
- Correctness
- Result: Design (pseudo-code)
Practical (design)
- System models: shared memory, message passing
- Finite memory
- Practicality issues
- Re-usable objects
- Performance
- Result: Design (pseudo-code, prototype)
Practical (implementation)
- Hardware
- Which atomic ops
- Memory consistency
- Cache coherence
- Locality
- Performance
- Scalability
- Result: Implementation (code)

6 Example: linked list implementations
[Figure: throughput (Mop/s) vs. number of cores for the pessimistic, "bad" linked list and "good" linked list implementations]

7 Outline
- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: Memory model & compiler
- Performance: Programming techniques

9-12 Why do we use caching?
- Core freq: 2GHz = 0.5 ns / instr
- Disk = ~ms
- Memory = ~100ns
- Cache: large = slow, medium = medium, small = fast
  - L3 = ~20ns
  - L2 = ~7ns
  - L1 = ~1ns

13 Typical server configurations
            Intel Xeon    AMD Opteron
Clock       … GHz         … GHz
L1          32KB          64KB
L2          256KB         512KB
L3          24MB          12MB
Memory      256GB         256GB

14 Experiment
Throughput of accessing some memory, depending on the memory size
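One way such an experiment could be set up is sketched below. It is not the benchmark used for the slides: the function name `ns_per_access`, the pointer-chasing layout and the use of `clock()` are all assumptions made for illustration. The buffer is linked into one random cycle so that every load depends on the previous one, which keeps the prefetcher from hiding the latency.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: average time (ns) per dependent load when
 * chasing pointers through a buffer of `bytes` bytes. */
double ns_per_access(size_t bytes, size_t accesses)
{
    size_t n = bytes / sizeof(size_t);
    if (n < 2) n = 2;
    size_t *next = malloc(n * sizeof *next);
    if (!next) return -1.0;

    /* Sattolo's algorithm: shuffle the identity into one single cycle,
     * so the walk visits every slot before repeating. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;          /* j in [0, i-1] */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    size_t pos = 0;
    clock_t t0 = clock();
    for (size_t k = 0; k < accesses; k++)
        pos = next[pos];                        /* dependent load */
    clock_t t1 = clock();

    volatile size_t sink = pos;                 /* keep the loop from being optimized away */
    (void)sink;
    free(next);
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)accesses;
}
```

Calling it for buffer sizes that double from a few KB up to tens of MB should, on most machines, show the time per access stepping up roughly where the buffer stops fitting in each cache level. The exact numbers depend on the hardware.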

16 Until ~2004: Single-cores
- Core freq: 3+GHz
- Hierarchy: Core, Cache (L1, L2, L3), Memory, Disk

17 After ~2004: Multi-cores
- Core freq: ~2GHz
- Hierarchy: Core 0 and Core 1, Cache (L1, L2, shared L3), Memory, Disk

18 Multi-cores with private caches
Private = multiple copies
[Diagram: Core 0 and Core 1 with private caches above L3, Memory and Disk]

19 Cache coherence for consistency
Core 0 has X and Core 1
- wants to write on X
- wants to read X
- did Core 0 write or read X?

20 Cache-coherence principles
To perform a write
- invalidate all readers, or
- previous writer
To perform a read
- find the latest copy

21 Cache coherence with MESI
A state diagram
State (per cache line)
- Modified: the only dirty copy
- Exclusive: the only clean copy
- Shared: a clean copy
- Invalid: useless data
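The four states can be read as a small state machine for one cache line in one core's cache. The sketch below follows the textbook MESI transitions only. The names `mesi_t`, `event_t` and `mesi_next` are hypothetical, and real processors use extended variants, so this illustrates the idea and does not model any particular CPU.

```c
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Events as seen by one cache for one line. */
typedef enum {
    LOCAL_READ,    /* this core loads from the line      */
    LOCAL_WRITE,   /* this core stores to the line       */
    REMOTE_READ,   /* another core loads from the line   */
    REMOTE_WRITE   /* another core stores to the line    */
} event_t;

/* `others_have_copy` only matters for a local read that misses. */
mesi_t mesi_next(mesi_t s, event_t e, bool others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                 /* M, E and S all satisfy a read */
    case LOCAL_WRITE:
        return MODIFIED;          /* other copies get invalidated  */
    case REMOTE_READ:
        if (s == INVALID)
            return INVALID;
        return SHARED;            /* M supplies the data; M, E, S end up shared */
    case REMOTE_WRITE:
        return INVALID;           /* the writer now owns the only copy */
    }
    return s;
}
```

For example, `mesi_next(SHARED, LOCAL_WRITE, true)` yields `MODIFIED`, while the same event seen from any other cache holding the line is a `REMOTE_WRITE` and yields `INVALID`.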

22-23 The ultimate goal for scalability
Possible states
- Modified: the only dirty copy
- Exclusive: the only clean copy
- Shared: a clean copy
- Invalid: useless data
Which state is our favorite?
Shared = threads can keep the data close (L1 cache) = faster

24 Experiment
The effects of false sharing
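False sharing arises when two threads update different variables that happen to sit on the same cache line. A minimal sketch of the two layouts such an experiment would compare is given below. It assumes 64-byte cache lines, and the struct and function names are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Both counters fall on the same cache line: updates by two
 * threads keep invalidating each other's copy. */
struct counters_shared {
    volatile unsigned long a;
    volatile unsigned long b;
};

/* Each counter starts on its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) volatile unsigned long a;
    alignas(CACHE_LINE) volatile unsigned long b;
};

/* The work each thread would do on "its" counter. */
void bump(volatile unsigned long *counter, unsigned long times)
{
    for (unsigned long i = 0; i < times; i++)
        (*counter)++;
}
```

Running `bump` from one thread on `a` and from another on `b`, and timing both layouts, is the usual way to make the effect visible. The padded layout trades memory for fewer coherence misses.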

26 Uniformity vs. non-uniformity
- Typical desktop machine: one set of caches in front of one memory = uniform
- Typical server machine: several sets of caches, each with its own memory = non-uniform (NUMA)

27-35 Latency (ns) to access data
[Figure: two memories, each behind its own L3, annotated step by step with access latencies; the surviving values are 1, 7, 20 and 40 ns]
Conclusion: we need to take care of locality

36 Experiment
The effects of locality

38 The Programmer's Toolbox: Hardware synchronization instructions
Depends on the processor
- CAS generally provided
- TAS and atomic increment not always provided
x86 processors (Intel, AMD):
- Atomic exchange, increment, decrement provided
- Memory barrier also available
Intel as of 2014 provides transactional memory

39 Example: Atomic ops in GCC
    type __sync_fetch_and_OP(type *ptr, type value);
    type __sync_OP_and_fetch(type *ptr, type value);
    // OP in {add,sub,or,and,xor,nand}
    type __sync_val_compare_and_swap(type *ptr, type oldval, type newval);
    bool __sync_bool_compare_and_swap(type *ptr, type oldval, type newval);
    __sync_synchronize(); // memory barrier
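As a small illustration of how these builtins are typically used, the sketch below shows a fetch-and-add and a compare-and-swap retry loop. The wrapper names `counter_add` and `atomic_max` are hypothetical. Only the `__sync_*` calls are GCC's.

```c
#include <stdbool.h>

/* Atomically add `delta` to *ctr and return the value it had before. */
long counter_add(long *ctr, long delta)
{
    return __sync_fetch_and_add(ctr, delta);
}

/* Atomically raise *max to `candidate` if it is larger.
 * Returns true if this call performed the update. */
bool atomic_max(long *max, long candidate)
{
    long seen = *max;
    while (candidate > seen) {
        long prev = __sync_val_compare_and_swap(max, seen, candidate);
        if (prev == seen)
            return true;      /* our CAS succeeded */
        seen = prev;          /* someone else changed it: re-check and retry */
    }
    return false;
}
```

The retry loop is the general pattern for building operations the hardware does not offer directly: read, compute, and publish with CAS only if nothing changed in between.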

40 Intel's transactional synchronization extensions (TSX)
1. Hardware lock elision (HLE)
- Instruction prefixes: XACQUIRE, XRELEASE
- Example (GCC): __hle_{acquire,release}_compare_exchange_n{1,2,4,8}
- Try to execute critical sections without acquiring/releasing the lock
- If conflict detected, abort and acquire the lock before re-doing the work

41 Intel's transactional synchronization extensions (TSX)
2. Restricted Transactional Memory (RTM)
    _xbegin(); _xabort(); _xtest(); _xend();
Limitations:
- Not starvation free
- Transactions can be aborted for various reasons
- Should have a non-transactional back-up
- Limited transaction size

42 Intel's transactional synchronization extensions (TSX)
2. Restricted Transactional Memory (RTM)
Example:
    if (_xbegin() == _XBEGIN_STARTED) {
        counter = counter + 1;
        _xend();
    } else {
        __sync_fetch_and_add(&counter, 1);
    }

44 Concurrent algorithm correctness
Designing correct concurrent algorithms:
1. Theoretical part
2. Practical part → involves implementation
The processor and the compiler optimize assuming no concurrency!

45 The memory consistency model
    // A, B shared variables, initially 0;
    // r1, r2 local variables;
    P1              P2
    A = 1;          B = 1;
    r1 = B;         r2 = A;
What values can r1 and r2 take? (assume x86 processor)
Answer: (0,1), (1,0), (1,1) and (0,0)

46 The memory consistency model
→ The order in which memory instructions appear to execute
What would the programmer like to see? Sequential consistency
- All operations executed in some sequential order;
- Memory operations of each thread in program order;
- Intuitive, but limits performance;

47 The memory consistency model
How can the processor reorder instructions to different memory addresses?
x86 (Intel, AMD): TSO variant
- Reads not reordered w.r.t. reads
- Writes not reordered w.r.t. writes
- Writes not reordered w.r.t. reads
- Reads may be reordered w.r.t. writes to different memory addresses
    // A, B, C globals
    int x, y, z;
    x = A;
    y = B;
    B = 3;
    A = 2;
    y = A;
    C = 4;
    z = B;

48 The memory consistency model
- Single thread reorderings transparent;
- Avoid reorderings: memory barriers
  - x86: implicit in atomic ops;
  - "volatile" in Java;
  - Expensive: use only when really necessary;
- Different processors, different memory models
  - e.g., ARM: relaxed memory model (anything goes!);
  - VMs (e.g. JVM, CLR) have their own memory models;
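The two-thread example from slide 45 can be turned into a small test of what a memory barrier buys. The sketch below is an assumption-laden harness, not the course's code: the names `p1`, `p2` and `count_both_zero` are hypothetical, and POSIX threads are used for brevity. With a full barrier between the store and the load in both threads, the outcome (r1, r2) = (0, 0) can no longer occur.

```c
#include <pthread.h>
#include <stddef.h>

static int A, B;        /* shared, reset to 0 before every run */
static int r1, r2;      /* what each thread observed           */
static int use_fence;   /* 1 = put a full barrier between store and load */

static void *p1(void *arg)
{
    (void)arg;
    __atomic_store_n(&A, 1, __ATOMIC_RELAXED);
    if (use_fence) __sync_synchronize();    /* store must complete before the load */
    r1 = __atomic_load_n(&B, __ATOMIC_RELAXED);
    return NULL;
}

static void *p2(void *arg)
{
    (void)arg;
    __atomic_store_n(&B, 1, __ATOMIC_RELAXED);
    if (use_fence) __sync_synchronize();
    r2 = __atomic_load_n(&A, __ATOMIC_RELAXED);
    return NULL;
}

/* Runs the test `iterations` times and returns how often
 * both threads read 0. */
int count_both_zero(int iterations, int fence)
{
    int hits = 0;
    use_fence = fence;
    for (int i = 0; i < iterations; i++) {
        pthread_t t1, t2;
        A = 0;
        B = 0;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0)
            hits++;
    }
    return hits;
}
```

Because a thread is created per iteration, the two bodies rarely overlap, so the unfenced count may well be 0 on a given run. Seeing the reordering in practice usually needs threads that spin on a start flag. The property the sketch does guarantee is that the fenced variant never reports (0, 0).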

49 Beware of the compiler
    void lock(volatile int *some_lock) {
        while (CAS(some_lock, 0, 1) != 0) {}
        asm volatile("" ::: "memory"); // compiler barrier
    }
    void unlock(volatile int *some_lock) {
        asm volatile("" ::: "memory"); // compiler barrier
        *some_lock = 0;
    }
    volatile int the_lock = 0;
    lock(&the_lock);
    ...
    unlock(&the_lock);
C "volatile" != Java "volatile"
The compiler can:
- reorder instructions
- remove instructions
- not write values to memory

51 Concurrent Programming Techniques
What techniques can we use to speed up our concurrent application?
Main idea: minimize contention on cache lines
Use case: Locks
- acquire() = lock()
- release() = unlock()

52 TAS: the simplest lock
Test-and-Set Lock
    typedef volatile uint lock_t;
    void acquire(lock_t *some_lock) {
        while (TAS(some_lock) != 0) {}
        asm volatile("" ::: "memory");
    }
    void release(lock_t *some_lock) {
        asm volatile("" ::: "memory");
        *some_lock = 0;
    }

53 How good is this lock?
A simple benchmark
- Have 48 threads continuously acquire a lock, update some shared data, and unlock
- Measure how many operations we can do in a second
Test-and-Set lock: 190K operations/second
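A benchmark of this shape could be sketched as follows. This is not the harness behind the 190K figure. It assumes POSIX threads, maps `TAS` onto GCC's `__sync_lock_test_and_set`, and uses hypothetical names (`worker`, `run_benchmark`). It checks that no update is lost and leaves the timing to the caller.

```c
#include <pthread.h>
#include <stddef.h>

typedef volatile unsigned int lock_t;

static void acquire(lock_t *l)
{
    /* Writes 1 and returns the previous value; 0 means we got the lock. */
    while (__sync_lock_test_and_set(l, 1) != 0) {}
}

static void release(lock_t *l)
{
    __sync_lock_release(l);     /* writes 0 with release semantics */
}

static lock_t the_lock;
static unsigned long shared_counter;    /* the "shared data" */

static void *worker(void *arg)
{
    unsigned long n = *(unsigned long *)arg;
    for (unsigned long i = 0; i < n; i++) {
        acquire(&the_lock);
        shared_counter++;               /* critical section */
        release(&the_lock);
    }
    return NULL;
}

/* Starts up to 64 threads that each do `iters` lock/update/unlock
 * rounds and returns the final counter value. */
unsigned long run_benchmark(int nthreads, unsigned long iters)
{
    pthread_t tid[64];
    if (nthreads > 64) nthreads = 64;
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return shared_counter;
}
```

Dividing the returned count by the wall-clock time of the call gives an operations-per-second figure comparable in spirit to the one on the slide. Swapping in another `acquire`/`release` pair lets the same harness compare lock designs.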

54 How can we improve things?
Avoid cache-line ping-pong: Test-and-Test-and-Set Lock
    void acquire(lock_t *some_lock) {
        while (1) {
            while (*some_lock != 0) {}
            if (TAS(some_lock) == 0) {
                break; // leave the loop so the barrier below is reached
            }
        }
        asm volatile("" ::: "memory");
    }
    void release(lock_t *some_lock) {
        asm volatile("" ::: "memory");
        *some_lock = 0;
    }

55 Performance comparison
[Figure: ops/second (thousands) for Test-and-Set and Test-and-Test-and-Set]

56 But we can do even better
Avoid thundering herd: Test-and-Test-and-Set with Back-off
    void acquire(lock_t *some_lock) {
        uint backoff = INITIAL_BACKOFF;
        while (1) {
            while (*some_lock != 0) {}
            if (TAS(some_lock) == 0) {
                break; // leave the loop so the barrier below is reached
            } else {
                lock_sleep(backoff);
                backoff = min(backoff * 2, MAXIMUM_BACKOFF);
            }
        }
        asm volatile("" ::: "memory");
    }
    void release(lock_t *some_lock) {
        asm volatile("" ::: "memory");
        *some_lock = 0;
    }
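A compilable version of this lock might look like the sketch below. The back-off constants, the busy-wait `lock_sleep` and the extra `try_acquire` helper are assumptions added for illustration, and `TAS` is mapped onto GCC's `__sync_lock_test_and_set`.

```c
typedef volatile unsigned int lock_t;

/* Hypothetical tuning values; good ones depend on the machine. */
enum { INITIAL_BACKOFF = 16, MAXIMUM_BACKOFF = 4096 };

/* Busy-wait for roughly `n` iterations without touching shared memory. */
static void lock_sleep(unsigned int n)
{
    for (volatile unsigned int i = 0; i < n; i++) {}
}

void acquire(lock_t *l)
{
    unsigned int backoff = INITIAL_BACKOFF;
    for (;;) {
        while (*l != 0) {}                          /* read-only spin: line stays Shared */
        if (__sync_lock_test_and_set(l, 1) == 0)    /* lock looked free: try to take it  */
            return;
        lock_sleep(backoff);                        /* lost the race: stay away a while  */
        backoff = backoff * 2 < MAXIMUM_BACKOFF ? backoff * 2 : MAXIMUM_BACKOFF;
    }
}

/* Single attempt, useful for testing: returns 1 if the lock was taken. */
int try_acquire(lock_t *l)
{
    return *l == 0 && __sync_lock_test_and_set(l, 1) == 0;
}

void release(lock_t *l)
{
    __sync_lock_release(l);     /* writes 0 with release semantics */
}
```

Doubling the delay after each failed attempt spreads the waiters out in time, so fewer of them hit the lock's cache line at the moment it is released.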

57 Performance comparison
[Figure: ops/second (thousands) for Test-and-Set, Test-and-Test-and-Set and Test-and-Test-and-Set w. backoff]

58 Are these locks fair?
[Figure: processed requests per thread for the Test-and-Set lock; number of processed requests vs. thread number]

59 What if we want fairness?
Use a FIFO mechanism: Ticket Locks
    typedef struct ticket_lock_t {
        volatile uint head;
        volatile uint tail;
    } ticket_lock_t;
    void acquire(ticket_lock_t *a_lock) {
        uint my_ticket = fetch_and_inc(&(a_lock->tail));
        while (a_lock->head != my_ticket) {}
        asm volatile("" ::: "memory");
    }
    void release(ticket_lock_t *a_lock) {
        asm volatile("" ::: "memory");
        a_lock->head++;
    }

60 What if we want fairness?
[Figure: processed requests per thread for Ticket Locks; number of processed requests vs. thread number]

61 Performance comparison
[Figure: ops/second (thousands)]

62 Can we back-off here as well?
Yes, we can: Proportional back-off
    void acquire(ticket_lock_t *a_lock) {
        uint my_ticket = fetch_and_inc(&(a_lock->tail));
        uint distance, current_ticket;
        while (1) {
            current_ticket = a_lock->head;
            if (current_ticket == my_ticket) break;
            distance = my_ticket - current_ticket;
            if (distance > 1) lock_sleep(distance * BASE_SLEEP);
        }
        asm volatile("" ::: "memory");
    }
    void release(ticket_lock_t *a_lock) {
        asm volatile("" ::: "memory");
        a_lock->head++;
    }
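The ticket lock with proportional back-off can be made compilable along the following lines. This is a sketch under assumptions: `fetch_and_inc` is mapped onto GCC's `__sync_fetch_and_add`, `BASE_SLEEP` and `lock_sleep` are placeholder choices, and `acquire` is split into the hypothetical helpers `take_ticket` and `wait_turn` so the FIFO order is easy to see.

```c
typedef struct {
    volatile unsigned int head;   /* ticket currently being served */
    volatile unsigned int tail;   /* next ticket to hand out       */
} ticket_lock_t;

enum { BASE_SLEEP = 32 };         /* hypothetical tuning value */

static void lock_sleep(unsigned int n)
{
    for (volatile unsigned int i = 0; i < n; i++) {}
}

/* Step 1: atomically draw the next ticket. */
unsigned int take_ticket(ticket_lock_t *l)
{
    return __sync_fetch_and_add(&l->tail, 1);
}

/* Step 2: wait until our ticket is served, sleeping longer
 * the further back in the queue we are. */
void wait_turn(ticket_lock_t *l, unsigned int my_ticket)
{
    for (;;) {
        unsigned int current = l->head;
        if (current == my_ticket)
            break;
        unsigned int distance = my_ticket - current;  /* unsigned: survives wrap-around */
        if (distance > 1)
            lock_sleep(distance * BASE_SLEEP);
    }
    __sync_synchronize();
}

void acquire(ticket_lock_t *l)
{
    wait_turn(l, take_ticket(l));
}

void release(ticket_lock_t *l)
{
    __sync_synchronize();         /* publish the critical section first */
    l->head = l->head + 1;        /* only the lock holder writes head   */
}
```

A thread that is next in line (`distance == 1`) keeps polling, while threads further back sleep in proportion to their position, which reduces traffic on the `head` cache line.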

63 Performance comparison
[Figure: ops/second (thousands)]

64 Still, everyone is spinning on the same variable.
Use a different address for each thread: Queue Locks
[Diagram: a queue of four threads; thread 1 runs and then leaves, thread 2 spins and then runs, thread 3 spins, thread 4 arrives and spins]
Use with care:
1. storage overheads
2. complexity
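One well-known queue lock is the MCS lock, sketched below with GCC's `__atomic` builtins. The slides do not say which queue lock they measured, so this is an illustrative choice, and the type and function names are hypothetical. Each thread brings its own node and spins only on that node's `locked` flag.

```c
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *next;   /* successor in the queue, if any         */
    int locked;              /* 1 while the owner of this node must wait */
} mcs_node_t;

typedef struct {
    mcs_node_t *tail;        /* last node in the queue, NULL if free */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me)
{
    me->next = NULL;
    me->locked = 1;
    /* Append ourselves and learn who was last before us. */
    mcs_node_t *pred = __atomic_exchange_n(&lock->tail, me, __ATOMIC_ACQ_REL);
    if (pred != NULL) {
        __atomic_store_n(&pred->next, me, __ATOMIC_RELEASE);
        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE)) {}   /* spin on our own node */
    }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me)
{
    mcs_node_t *succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);
    if (succ == NULL) {
        /* Nobody visible behind us: try to mark the lock free. */
        mcs_node_t *expected = me;
        if (__atomic_compare_exchange_n(&lock->tail, &expected, NULL, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;
        /* A successor swapped itself in but has not linked yet: wait for it. */
        while ((succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)) == NULL) {}
    }
    __atomic_store_n(&succ->locked, 0, __ATOMIC_RELEASE);   /* hand the lock over */
}
```

The per-thread node is the storage overhead mentioned on the slide, and the hand-over race in `mcs_release` is a good example of the added complexity.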

65 Performance comparison
[Figure: ops/second (thousands)]

66 To summarize on locks
1. Reading before trying to write
2. Pausing when it's not our turn
3. Ensuring fairness (does not always bring ++)
4. Accessing disjoint addresses (cache lines)
More than 10x performance gain!

67 Conclusion
Concurrent algorithm design
- Theoretical design
- Practical design (may be just as important)
- Implementation
You need to know your hardware
- For correctness
- For performance
