Concurrent Programming Introduction

1 Concurrent Programming Introduction Frédéric Haziza Department of Computer Systems Uppsala University Ericsson - Fall 2007

2 Outline
1. Good to know
2. Scenario
3. Definitions
4. Hardware
5. Classical Paradigms: Iterative Parallelism, Recursive Parallelism, Producer/Consumer, Client/Server, Interacting Peers

3 Literature Gregory Andrews. Foundations of Multithreaded, Parallel and Distributed Programming. Addison-Wesley, 1999 (ISBN: )

4 Schedule
Date, Time, Comments
Tue 6 Nov: Setting the decor
Tue 13 Nov: Locks, Barriers + Lab
Tue 27 Nov: Remainder...

5 Scenario
Several cars want to drive from point A to point B.
They can compete for space on the same road and end up either following each other or competing for positions (and having accidents!).
Or they could drive in parallel lanes, thus arriving at about the same time without getting in each other's way.
Or they could travel different routes, using separate roads.

6 Scenario
Several cars want to drive from point A to point B.
Sequential Programming: they can compete for space on the same road and end up either following each other or competing for positions (and having accidents!).
Parallel Programming: or they could drive in parallel lanes, thus arriving at about the same time without getting in each other's way.
Distributed Programming: or they could travel different routes, using separate roads.

7 Definitions
Concurrent program: two or more processes working together to perform a task. Each process is a sequential program (a sequence of statements executed one after another).
Single thread of control vs. multiple threads of control.
Communication: shared variables or message passing.
Synchronization: mutual exclusion or condition synchronization.

8 Correctness
Wanna write a concurrent program? What kinds of processes? How many processes? How should they interact?
Correctness: ensure that process interaction is properly synchronized.
Mutual exclusion: ensuring that critical sections of statements do not execute at the same time.
Condition synchronization: delaying a process until a given condition is true.
Our focus: imperative programs and asynchronous execution.
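The two kinds of synchronization named on this slide can be sketched in a few lines of Python. This is only an illustration under assumed names (`run_demo`, `add_many`, `waiter` and `setter` are not from the slides): a lock protects a critical section, and a condition variable delays one thread until a flag becomes true.

```python
import threading

def run_demo(num_threads=4, increments=10_000):
    counter = 0
    lock = threading.Lock()            # guards the critical section
    ready = False
    cond = threading.Condition()       # used for condition synchronization
    seen = []

    def add_many():
        nonlocal counter
        for _ in range(increments):
            with lock:                 # mutual exclusion: one thread at a time
                counter += 1

    def waiter():
        with cond:
            while not ready:           # delay until the condition is true
                cond.wait()
        seen.append("waiter saw ready")

    def setter():
        nonlocal ready
        with cond:
            ready = True
            cond.notify_all()          # wake up the delayed thread

    threads = [threading.Thread(target=add_many) for _ in range(num_threads)]
    threads += [threading.Thread(target=waiter), threading.Thread(target=setter)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, seen
```

Calling `run_demo()` returns the final counter and the waiter's log; with the lock in place, the counter equals `num_threads * increments`.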

9 Amdahl's law
P is the fraction of a calculation that can be parallelized; (1 - P) is the fraction that is sequential (i.e. cannot benefit from parallelization). With N processors:
$$\text{maximum speedup} = \frac{1}{(1 - P) + P/N}$$
Example: if P = 90%, the maximum speedup is 10, no matter how large the value of N used (i.e. as $N \to \infty$).
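The formula is easy to evaluate numerically. The helper below is a minimal sketch (the name `amdahl_speedup` is an assumption, not from the slides) that shows how the speedup for P = 0.9 approaches, but never reaches, 10.

```python
def amdahl_speedup(p, n):
    """Maximum speedup for parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    for n in (1, 2, 10, 100, 10_000):
        print(n, round(amdahl_speedup(0.9, n), 3))
```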

10 Single-Processor Machine

11 Memory Hierarchy
[Figure: Main Memory, Level 2 cache, Level 1 cache, CPU]

12 Why do we miss in the cache?
Compulsory miss: touching the data for the first time.
Capacity miss: the cache is too small.
Conflict miss: non-ideal cache implementation (data hash to the same cache line).
[Figure: the CPU hits in the cache; a miss goes on to main memory]

13 Locality
Temporal locality and spatial locality.
Inner loop stepping through an array: A, B, C, A+1, B, C, A+2, B, C, ...
Spatial: A, A+1, A+2 (neighbouring addresses). Temporal: B, C (the same addresses, reused).

14 MultiProcessor world - Taxonomy SIMD MIMD Message Passing Shared Memory Fine-grained Coarse-grained UMA NUMA COMA

15 Shared-Memory Multiprocessors
[Figure: several memory modules connected through an interconnection network / bus to several CPUs, each with its own cache]

16 Programming Model
[Figure: one shared memory; nine threads, each with its own cache ($)]

17 Cache coherency
[Figure: shared memory holding A and B; three threads, each with its own cache ($). Thread 1: Read A, Read A, Read A. Thread 2: ..., Read A, ..., Write A. Thread 3: Read B, ..., Read A]

18 Summing up Coherence
There can be many copies of a datum, but only one value. (Too strong!!!)
There is a single global order of value changes to each datum.

19 Memory Ordering
Coherence defines a per-datum order of value changes.
The memory model defines the order of value changes for all the data.
What ordering does the memory system guarantee? It is a contract between the HW and the SW developers.
Without it, we can't say much about the result of a parallel execution.

20 What order for these threads?
A' denotes a modified value of the datum at address A (the value written by ST A).
Thread 1: LD A, ST B, LD C, ST D, LD E, ...
Thread 2: ST A, LD B, ST C, LD D, ST E, ...
LD A happens before ST A.

21 Other possible orders?
Thread 1: LD A, ST B, LD C, ST D, LD E, ...
Thread 2: ST A, LD B, ST C, LD D, ST E, ...
[Figure: the same two instruction streams drawn twice, with different orderings between the threads]

22 Memory model flavors
Sequentially Consistent: programmer's intuition
Total Store Order: almost programmer's intuition
Weak/Release Consistency: no guarantee

23 Dekker's algorithm
Initially A = 0, B = 0
fork
  Left:  A := 1; if (B == 0) print("A wins");
  Right: B := 1; if (A == 0) print("B wins");
Can both A and B win? Does the write become globally visible before the read is performed?
Left: the read (i.e. the test B == 0) can bypass the store (A := 1).
Right: the read (i.e. the test A == 0) can bypass the store (B := 1).
Both loads can be performed before any of the stores.
Yes, it is possible that both win!

24 Dekker's algorithm for Total Store Order
Initially A = 0, B = 0
fork
  Left:  A := 1; Membar #StoreLoad; if (B == 0) print("A wins");
  Right: B := 1; Membar #StoreLoad; if (A == 0) print("B wins");
Can both A and B win? Does the write become globally visible before the read is performed?
Membar: the read is started after all previous stores have been globally ordered. The program then behaves as on a sequentially consistent machine.
No, they won't both win. Good job Mister Programmer!

25 Dekker's algorithm, in general
Initially A = 0, B = 0
fork
  Left:  A := 1; if (B == 0) print("A wins");
  Right: B := 1; if (A == 0) print("B wins");
Can both A and B win? The answer depends on the memory model.
Remember? ... It is the contract between the HW and SW developers.
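The dependence on the memory model can be checked by brute force. The sketch below assumes a deliberately simplified store-buffer model (not taken from the slides): each store first sits in its thread's private buffer and becomes globally visible only at a later "flush" event, and a fence forces the flush to happen before the following load. The function name `outcomes` and the event encoding are illustrative.

```python
from itertools import permutations

def outcomes(fence):
    """Return every reachable pair (value read by Left, value read by Right).

    Thread 0 (Left) stores A then loads B; thread 1 (Right) stores B then loads A.
    memory[0] is A and memory[1] is B. (0, 0) means both threads win.
    """
    events = [(t, kind) for t in (0, 1) for kind in ("store", "flush", "load")]
    results = set()
    for order in permutations(events):
        pos = {event: i for i, event in enumerate(order)}
        legal = True
        for t in (0, 1):
            # Program order: the store is issued before the load,
            # and a store can only be flushed after it was issued.
            if pos[(t, "store")] > pos[(t, "load")]:
                legal = False
            if pos[(t, "store")] > pos[(t, "flush")]:
                legal = False
            # Fence: the store must be globally visible before the load starts.
            if fence and pos[(t, "flush")] > pos[(t, "load")]:
                legal = False
        if not legal:
            continue
        memory = [0, 0]
        read = [None, None]
        for t, kind in order:
            if kind == "flush":
                memory[t] = 1              # the buffered store reaches memory
            elif kind == "load":
                read[t] = memory[1 - t]    # load the other thread's variable
        results.add(tuple(read))
    return results
```

In this model `(0, 0)` is reachable without fences and unreachable with them, which mirrors the answers on the Dekker slides.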

26 So... Memory Model is a tricky issue

27 New issues
Compulsory miss, capacity miss, conflict miss, and now also:
Communication miss: cache-to-cache transfer.
False sharing: side effect of large cache lines.
What about the compiler? Code reordering? The volatile keyword in C...
[Figure: several memory modules connected through an interconnection network / bus to several CPUs, each with its own cache]

28 Good to know
Performance, use of cache, memory hierarchy, consistency problems.
To get maximal performance on a given machine, the programmer has to know the characteristics of the memory system and has to write programs that account for them.

29 Distributed Memory Architecture
[Figure: several nodes, each with its own CPU, cache and memory, connected by an interconnection network]
Communication through message passing.
Own cache, but memory not shared.
No coherency problems.

30 Classical Paradigms Data Parallel Task Parallel 5 paradigms: Iterative parallelism Recursive parallelism Producer/Consumer Client/Server Interacting peers

31 Iterative Parallelism: Matrix multiplication
 1: double a[n,n], b[n,n], c[n,n];
 2: for [i = 0 to n-1] {            # iterating through the rows
 3:   for [j = 0 to n-1] {          # iterating through the columns
 4:     # computes inner product of a[i,*] and b[*,j]
 5:     c[i,j] = 0.0;
 6:     for [k = 0 to n-1] {
 7:       c[i,j] = c[i,j] + a[i,k]*b[k,j];
 8:     }
 9:   }
10: }
What can we parallelize? In lines 5 to 7, c[i,j] is written to, while a[i,k] and b[k,j] are only read, so every c[i,j] computation can be parallelized!

32 Iterative Parallelism: Matrix multiplication
Parallelizing the rows
co [i = 0 to n-1] {               # compute rows in parallel
  for [j = 0 to n-1] {
    c[i,j] = 0.0;
    for [k = 0 to n-1] {
      c[i,j] = c[i,j] + a[i,k]*b[k,j];
    }
  }
}

33 Iterative Parallelism: Matrix multiplication
Parallelizing the columns
co [j = 0 to n-1] {               # compute columns in parallel
  for [i = 0 to n-1] {
    c[i,j] = 0.0;
    for [k = 0 to n-1] {
      c[i,j] = c[i,j] + a[i,k]*b[k,j];
    }
  }
}

34 Iterative Parallelism: Matrix multiplication
Parallelizing all rows and columns
co [i = 0 to n-1, j = 0 to n-1] {
  c[i,j] = 0.0;
  for [k = 0 to n-1] {
    c[i,j] = c[i,j] + a[i,k]*b[k,j];
  }
}
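The row-parallel variant maps naturally onto a thread pool. The following is a rough Python sketch under assumed names (`compute_row`, `parallel_matmul`), using nested lists for matrices. Each task writes one row of the result and only reads `a` and `b`, so no locking is needed. In CPython the global interpreter lock limits the actual speedup, so this illustrates the structure of the `co` statement rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_row(a, b, i):
    """Compute row i of the product a*b (inner products with every column of b)."""
    inner = len(b)
    cols = len(b[0])
    return [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]

def parallel_matmul(a, b, workers=4):
    """One task per row, like 'co [i = 0 to n-1]' in the pseudocode."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in row order
        return list(pool.map(lambda i: compute_row(a, b, i), range(len(a))))
```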

35 Recursive Parallelism: Adaptive Quadrature
[Figure: the curve y = f(x) between x = a and x = b]
Goal: approximate $\int_a^b f(x)\,dx$

36 Recursive Parallelism: Adaptive Quadrature
double fleft = f(a), fright, area = 0.0;
double width = (b-a) / INTERVALS;
for [x = (a+width) to b by width] {
  fright = f(x);
  # compute the area of the small trapezoid
  area = area + (fleft + fright) * width / 2;
  fleft = fright;                 # the right-hand value becomes the new left-hand value
}
[Figure: the area under f(x) approximated by narrow trapezoids along the x axis]
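A direct Python transcription of the iterative loop might look as follows. It is a sketch: the function name `trapezoid` and the `intervals` parameter (standing in for INTERVALS) are assumptions, and the loop steps by index rather than by `x` to avoid accumulating floating-point drift.

```python
def trapezoid(f, a, b, intervals):
    """Approximate the integral of f over [a, b] with equal-width trapezoids."""
    width = (b - a) / intervals
    fleft = f(a)
    area = 0.0
    for i in range(1, intervals + 1):
        fright = f(a + i * width)
        area += (fleft + fright) * width / 2   # area of one small trapezoid
        fleft = fright                         # right value becomes the new left value
    return area
```

For example, `trapezoid(lambda x: x * x, 0.0, 1.0, 1000)` approximates 1/3.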

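37 Divide and Conquer
[Figure: the interval under f(x) is split in two and the area is recomputed on each half]
Keep subdividing while |area_new - area_old| > EPSILON.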

38 Divide and Conquer
double quad(double left, right, fleft, fright, oldarea) {
  double mid = (left + right)/2;                    # find the middle point
  double fmid = f(mid);                             # get its value
  double larea = (fleft + fmid) * (mid - left)/2;
  double rarea = (fmid + fright) * (right - mid)/2;
  if (abs((larea + rarea) - oldarea) > EPSILON) {
    # recurse to integrate both halves
    larea = quad(left, mid, fleft, fmid, larea);
    rarea = quad(mid, right, fmid, fright, rarea);
  }
  return (larea + rarea);
}
$\int_a^b f(x)\,dx \approx$ quad(a, b, f(a), f(b), (f(a) + f(b)) * (b - a)/2);

39 Divide and Conquer - Parallel
double quad(double left, right, fleft, fright, oldarea) {
  double mid = (left + right)/2;                    # find the middle point
  double fmid = f(mid);                             # get its value
  double larea = (fleft + fmid) * (mid - left)/2;
  double rarea = (fmid + fright) * (right - mid)/2;
  if (abs((larea + rarea) - oldarea) > EPSILON) {
    # recurse to integrate both halves, in parallel!
    co [] {
      larea = quad(left, mid, fleft, fmid, larea);
      rarea = quad(mid, right, fmid, fright, rarea);
    }
    # must wait for larea and rarea
  }
  return (larea + rarea);
}
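One way to try the parallel recursion in Python is sketched below. Several things here are assumptions rather than part of the slides: `f` is passed as a parameter, the helper `integrate` supplies the initial estimate, and `max_parallel_depth` caps thread creation so that only the first few recursion levels fork (spawning a thread per recursive call would create far too many threads). The `join()` plays the role of "must wait for larea and rarea".

```python
import threading

EPSILON = 1e-9

def quad(f, left, right, fleft, fright, oldarea, depth=0, max_parallel_depth=2):
    mid = (left + right) / 2                 # find the middle point
    fmid = f(mid)                            # get its value
    larea = (fleft + fmid) * (mid - left) / 2
    rarea = (fmid + fright) * (right - mid) / 2
    if abs((larea + rarea) - oldarea) > EPSILON:
        if depth < max_parallel_depth:
            halves = {}

            def run(key, *args):
                halves[key] = quad(f, *args, depth + 1, max_parallel_depth)

            # Left half in a new thread, right half in the current one.
            t = threading.Thread(target=run, args=("l", left, mid, fleft, fmid, larea))
            t.start()
            run("r", mid, right, fmid, fright, rarea)
            t.join()                         # wait for both halves
            larea, rarea = halves["l"], halves["r"]
        else:
            # Deep in the recursion: fall back to the sequential version.
            larea = quad(f, left, mid, fleft, fmid, larea, depth + 1, max_parallel_depth)
            rarea = quad(f, mid, right, fmid, fright, rarea, depth + 1, max_parallel_depth)
    return larea + rarea

def integrate(f, a, b):
    """Start the recursion with a single-trapezoid estimate."""
    fa, fb = f(a), f(b)
    return quad(f, a, b, fa, fb, (fa + fb) * (b - a) / 2)
```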

40 Producer / Consumer
[Figure: a producer and a consumer communicating through a shared resource]
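As a small illustration of this paradigm, the sketch below uses Python's standard `queue.Queue` as the shared resource. The function name, the `capacity` parameter and the `DONE` sentinel are illustrative choices, not part of the slides.

```python
import queue
import threading

def run_producer_consumer(items, capacity=2):
    buffer = queue.Queue(maxsize=capacity)   # the shared resource
    consumed = []
    DONE = object()                          # sentinel marking the end of the stream

    def producer():
        for item in items:
            buffer.put(item)                 # blocks while the buffer is full
        buffer.put(DONE)

    def consumer():
        while True:
            item = buffer.get()              # blocks while the buffer is empty
            if item is DONE:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

The bounded queue provides both forms of synchronization at once: its internal lock gives mutual exclusion, and the blocking `put`/`get` calls give condition synchronization.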

41 Client / Server
[Figure: clients 1 to n each send a request to the server, and the server sends a reply back to each of them]
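The request/reply exchange can be mimicked with threads and queues. In this sketch (all names are assumptions) each client sends its id together with a private reply queue, and the server answers with the square of the id; the squaring is just a stand-in for real work.

```python
import queue
import threading

def run_client_server(num_clients=3):
    requests = queue.Queue()                 # shared request channel
    STOP = object()                          # tells the server to shut down
    replies = {}

    def server():
        while True:
            message = requests.get()
            if message is STOP:
                break
            value, reply_box = message
            reply_box.put(value * value)     # send the reply to this client only

    def client(client_id):
        reply_box = queue.Queue(maxsize=1)   # private reply channel
        requests.put((client_id, reply_box)) # request
        replies[client_id] = reply_box.get() # wait for the reply

    server_thread = threading.Thread(target=server)
    clients = [threading.Thread(target=client, args=(i,))
               for i in range(1, num_clients + 1)]
    server_thread.start()
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    requests.put(STOP)
    server_thread.join()
    return replies
```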

42 Interacting Peers - Coordinator/Workers
[Figure: a coordinator sends data to workers 1 to n-1 and collects their results]

43 Interacting Peers - Circular Pipeline
[Figure: workers 1 to n-1 connected in a ring]

44 Interacting Peers
Coordinator/Workers: [Figure: a coordinator sends data to workers 1 to n-1 and collects their results]
Circular pipeline: [Figure: workers 1 to n-1 connected in a ring]
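A circular pipeline can be sketched with one inbox queue per worker. The example problem is an assumption chosen for illustration (the slides do not name one): every worker holds one value, and all of them learn the global minimum and maximum. A partial result travels once around the ring to be completed, then the final result travels around once more to be distributed.

```python
import queue
import threading

def ring_min_max(values):
    n = len(values)
    inbox = [queue.Queue() for _ in range(n)]   # inbox[i] feeds worker i
    result = [None] * n

    def worker(i):
        successor = inbox[(i + 1) % n]
        if i == 0:
            successor.put((values[0], values[0]))   # start the first pass
            result[0] = inbox[0].get()              # back after a full circle
            successor.put(result[0])                # start the second pass
        else:
            low, high = inbox[i].get()              # first pass: partial result
            successor.put((min(low, values[i]), max(high, values[i])))
            result[i] = inbox[i].get()              # second pass: global result
            if i != n - 1:
                successor.put(result[i])            # the last worker stops the ring

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

In a coordinator/workers layout the same problem would instead have every worker send its value to one coordinator, which computes the result and sends it back.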
