Parallel Programming Patterns
1 Parallel Programming Patterns Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna
2 Copyright 2013, 2017, 2018 Moreno Marzolla, Università di Bologna, Italy ( This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA. 2
3 What is a pattern? A design pattern is a general solution to a recurring engineering problem A design pattern is not a ready-made solution to a given problem......rather, it is a description of how a certain kind of problem can be solved 3
4 Architectural patterns The term architectural pattern was first used by architect Christopher Alexander to denote common design decisions that have been used by architects and engineers to realize buildings and constructions in general Christopher Alexander (1936--), A Pattern Language: Towns, Buildings, Construction 4
5 Example Building a bridge across a river You do not invent a brand new type of bridge each time Instead, you adapt an already existing type of bridge 5
6 Example 6
9 Patterns covered: Embarrassingly Parallel, Partition, Master-Worker, Stencil, Reduce, Scan 9
10 Parallel programming patterns: Embarrassingly parallel 10
11 Embarrassingly Parallel Applies when the computation can be decomposed into independent tasks that require little or no communication Examples: vector sum, Mandelbrot set, 3D rendering, brute-force password cracking, ... (figure: vector sum c[] = a[] + b[], with the array elements split among processors 0, 1, 2) 11
12 Parallel programming patterns: Partition 12
13 Partition The input data space (in short, domain) is split in disjoint regions called partitions Each processor operates on one partition This pattern is particularly useful when the application exhibits locality of reference i.e., when processors can refer to their own partition only and need little or no communication with other processors 13
14 Example Matrix-vector product Ax = b Matrix A[][] is partitioned into P horizontal blocks Each processor operates on one block of A[][] and on a full copy of x[], and computes a portion of the result b[] (figure: A[][] times x[] = b[], with the rows of A and the elements of b split among cores 0-3) 14
15 Partition Types of partition Regular: the domain is split into partitions of roughly the same size and shape, e.g., matrix-vector product. Irregular: partitions do not necessarily have the same size or shape, e.g., heat transfer on irregular solids. Size of partitions (granularity) Fine-grained: a large number of small partitions. Coarse-grained: a few large partitions 15
16 1-D Partitioning (figure: block partitioning assigns one contiguous chunk of the array to each of cores 0-3; cyclic partitioning assigns elements to the cores in round-robin order) 16
17 2-D Block Partitioning Block, * *, Block Block, Block Core 0 Core 1 Core 2 Core 3 17
18 2-D Cyclic Partitioning Cyclic, * *, Cyclic 18
19 2-D Cyclic Partitioning Cyclic-cyclic 19
20 Irregular partitioning example A lake surface is approximated with a triangular mesh Colors indicate the mapping of mesh elements to processors Source: 20
21 Computation Fine-grained vs coarse-grained partitioning. Fine-grained partitioning: better load balancing, especially if combined with the master-worker pattern (see later); however, if granularity is too fine, the computation/communication ratio might become too low (communication dominates computation). Coarse-grained partitioning: in general improves the computation/communication ratio; however, it might cause load imbalance. The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use 21
22 Example: Mandelbrot set The Mandelbrot set is the set of points c on the complex plane such that the sequence z_n(c), defined as z_0(c) = 0 and z_n(c) = z_{n-1}(c)^2 + c for n > 0, does not diverge as n goes to +∞ 22
23 Mandelbrot set in color If the modulus of z_n(c) does not exceed 2 after nmax iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set) Otherwise, the color depends on the number of iterations required for the modulus of z_n(c) to become > 2 23
24 Pseudocode Embarrassingly parallel structure: the color of each pixel can be computed independently from the other pixels; (cx, cy) are the complex coordinates of pixel (x0, y0)
maxit = 1000
for each pixel (x0, y0) {
    x = 0; y = 0; it = 0;
    while ( it < maxit AND x*x + y*y <= 2*2 ) {
        xnew = x*x - y*y + cx;
        ynew = 2*x*y + cy;
        x = xnew; y = ynew;
        it = it + 1;
    }
    plot(x0, y0, it);
}
Source: 24
25 Mandelbrot set A regular partitioning can result in uneven load distribution Black pixels require maxit iterations Other pixels require fewer iterations 25
26 Load balancing Ideally, each processor should perform the same amount of work If the tasks synchronize at the end of the computation, the execution time will be that of the slowest task (figure: tasks 0-3 with different busy times; faster tasks sit idle until the barrier synchronization) 26
27 Load balancing howto The workload is balanced if each processor performs more or less the same amount of work Ways to achieve load balancing: use fine-grained partitioning, but beware of the possible communication overhead if the tasks need to communicate; use dynamic task allocation (master-worker paradigm), but beware that dynamic task allocation might incur higher overhead than static task allocation 27
28 Master-worker paradigm (process farm, work pool) Apply a fine-grained partitioning: number of tasks >> number of cores The master assigns a task to the first available worker (figure: master distributing a bag of tasks of possibly different duration to workers 0, 1, ..., P-1) 28
29 Choosing the partition size The optimal partition size is in general system- and application-dependent; it might be estimated by measurement Too small = higher scheduling overhead; too large = unbalanced workload (figure: wall-clock time as a function of partition size, with a minimum at the optimal partition size) 29
30 (figure: Mandelbrot image rows mapped to processors P0-P3 under three schemes: coarse-grained static task assignment; static task assignment with block size 64; dynamic master-worker task assignment with block size 64) 30
31 Example omp-mandelbrot.c
Coarse-grained partitioning: OMP_SCHEDULE="static" ./omp-mandelbrot
Cyclic, fine-grained partitioning (64 rows per block): OMP_SCHEDULE="static,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (64 rows per block): OMP_SCHEDULE="dynamic,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (1 row per block): OMP_SCHEDULE="dynamic" ./omp-mandelbrot
31
32 Parallel programming patterns: Stencil 32
33 Stencils Stencil computations involve a grid whose values are updated according to a fixed pattern called stencil Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5×5 neighborhood 33
34 2D Stencils 5-point 2-axis 2D stencil 9-point 2-axis 2D stencil 9-point 1-plane 2D stencil 34
35 3D Stencils 13-point 3-axis 3D stencil 7-point 3-axis 3D stencil 35
36 3D Stencils 72-point 3-plane 3D stencil 36
37 2D Stencils 2D stencil computations usually employ two grids to keep the current and next values Values are read from the current grid; new values are written to the next grid; the current and next grids are exchanged at the end of each phase 37
38 Ghost Cells How do we handle cells on the border of the domain? We might assume that cells outside the border have some fixed, application-dependent value, or we may assume periodic boundary conditions, where sides are glued together to form a torus In either case, we extend the domain with ghost cells, so that cells on the border do not require any special treatment (figure: domain surrounded by a frame of ghost cells) 38
39 Periodic boundary conditions: How to fill ghost cells 39
40 2D Stencil Example: Game of Life 2D cyclic domain; each cell has two possible states (0 = dead, 1 = alive) The state of a cell at time t + 1 depends on the state of that cell at time t and on the number of alive cells at time t among its 8 neighbors Rules: an alive cell with fewer than two alive neighbors dies; an alive cell with two or three alive neighbors lives; an alive cell with more than three alive neighbors dies; a dead cell with exactly three alive neighbors becomes alive 40
41 Example: Game of Life See game-of-life.c 41
42 Periodic boundary conditions: Another way to fill ghost cells 42
48 Parallelizing stencil computations Computing the next grid from the current one has embarrassingly parallel structure:
Initialize current grid
while (!terminated) {
    Fill ghost cells
    Compute next grid
    Exchange current and next grids
}
However, domain partitioning on distributed-memory architectures requires special care 48
49 Ghost cells Partitions are again augmented with ghost cells (halo); they contain a copy of logically adjacent cells The width of the halo depends on the shape of the stencil (figure: two partitions, each surrounded by its halo) 49
50 Example: 2D partitioning with 5P stencil Periodic boundary P0 P1 P2 P3 P4 P5 P6 P7 P8 50
55 Example: 2D partitioning with 9P stencil 55
57 Example: 2D (Block, *) partitioning with 5P stencil Periodic boundary P0 P1 P2 57
61 Parallelizing 2D stencil computations on distributed-memory architectures Let us consider a 2D domain of size N×N subject to a 5P-2D stencil, on a distributed-memory machine with P = 4 processors Compare the following types of decomposition: (Block, *), where the first N/P rows are assigned to the first processor, the next N/P rows to the second processor, and so on; and (Block, Block), where the domain is decomposed into four square subdomains Consider both periodic and non-periodic boundary conditions Goal: minimize the number of ghost cells that must be exchanged among processors 61
62 Choosing a decomposition (Block, *) (Block, Block) P0 P0 P1 P2 P3 P1 P2 P3 62
63 Choosing a decomposition (Block, *), periodic boundary conditions N P0 P1 The ghost cells at the sides are not exchanged across processors, so they do not contribute to the total messages size 8 N ghost cells P2 P3 63
64 Choosing a decomposition (Block, *), non-periodic boundary conditions N P0 P1 6 N ghost cells P2 P3 64
65 Choosing a decomposition (Block, Block), periodic boundary conditions N/2 N/2 P0 P1 8 N ghost cells P2 P3 65
66 Choosing a decomposition (Block, Block), non-periodic boundary conditions N/2 N/2 P0 P1 4 N ghost cells P2 P3 66
67 Recap
                (Block, *)   (Block, Block)
Periodic           8N            8N
Non-periodic       6N            4N
67
68 1D Stencil Example: Rule 30 Cellular Automaton The state of a cell at time t + 1 depends on the states of the red cells (the cell itself and its two neighbors) at time t (figure: space-time diagram of the automaton at times t, t+1, t+2) 68
69 Example Rule 30 cellular automaton Initial configuration Configuration at time 1 Configuration at time 2 69
70 Rule 30 cellular automaton Conus textile shell Rule 30 CA 70
71 1D Cellular Automata On distributed-memory architectures, care must be taken to properly handle cells on the border Again, we use ghost cells to augment each subdomain P0 P1 P2 Cur Next 71
72 Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 Communication Compute next step Communication Compute next step Communication 72
73 Note In the Rule 30 example, with one ghost cell per side it is possible to compute one step of the CA After that, it is necessary to fill the ghost cells with the new values from the neighbors If we use two ghost cells per side, we can compute two steps of the CA before communicating 73
74 Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 74
75 Why? Using more ghost cells means fewer communication operations, but each communication involves more data; overall, the number of bytes exchanged remains more or less the same However, data transfers of large blocks are usually handled more efficiently than small blocks 75
76 Parallel programming patterns: Reduce 76
77 Reduce A reduction is the application of an associative binary operator (e.g., sum, product, min, max, ...) to the elements of an array [x0, x1, ..., xn-1]: sum-reduce( [x0, x1, ..., xn-1] ) = x0 + x1 + ... + xn-1; min-reduce( [x0, x1, ..., xn-1] ) = min { x0, x1, ..., xn-1 } A reduction can be realized in O(log2 n) parallel steps 77
78 Example: sum
83 Example: sum (see reduction.c)
int d, i;
/* compute largest power of two < n */
for (d=1; 2*d < n; d *= 2)
    ;
/* do reduction */
for ( ; d > 0; d /= 2) {
    for (i=0; i<d; i++) {
        if (i+d < n) x[i] += x[i+d];
    }
}
return x[0];
83
84 Work efficiency How many sums are computed by the parallel reduction algorithm? n/2 sums at the first level, n/4 sums at the second level, ..., n/2^j sums at the j-th level, ..., 1 sum at the (log2 n)-th level Total: O(n) sums The tree-structured reduction algorithm is work-efficient, which means that it performs the same amount of work as the optimal serial algorithm 84
85 Parallel programming patterns: Scan 85
86 Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator op (e.g., sum, product, min, max, ...): [y0, y1, ..., yn-1] = inclusive-scan( op, [x0, x1, ..., xn-1] ), where
y0   = x0
y1   = x0 op x1
y2   = x0 op x1 op x2
...
yn-1 = x0 op x1 op ... op xn-1
86
87 Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator op (e.g., sum, product, min, max, ...): [y0, y1, ..., yn-1] = exclusive-scan( op, [x0, x1, ..., xn-1] ), where
y0   = 0   (the neutral element of the binary operator: zero for sum, one for product, ...)
y1   = x0
y2   = x0 op x1
...
yn-1 = x0 op x1 op ... op xn-2
87
88 Blelloch Scan 88
89 Exclusive scan: Up-sweep (figure: tree of partial sums x[0..1], x[2..3], ..., built bottom-up from x[0], ..., x[7])
for ( d=1; d<n/2; d *= 2 ) {
    for ( k=0; k<n; k += 2*d ) {
        x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
    }
}
O(n) additions 89
90 Exclusive scan: Down-sweep (figure: the last element is zeroed, then partial sums are pushed back down the tree, yielding 0, x[0], x[0..1], ..., x[0..6])
x[n-1] = 0;
for ( ; d > 0; d >>= 1 ) {
    for ( k=0; k<n; k += 2*d ) {
        float t = x[k+d-1];
        x[k+d-1] = x[k+2*d-1];
        x[k+2*d-1] = t + x[k+2*d-1];
    }
}
O(n) additions. See prefix-sum.c 90
91 Example: Line of Sight n peaks of heights h[0], ..., h[n-1]; the distance between consecutive peaks is one Which peaks are visible from peak 0? (figure: skyline of peaks h[0]..h[7], some visible from peak 0, others hidden) 91
92 Line of sight Source: Guy E. Blelloch, Prefix Sums and Their Applications 92
102 Serial algorithm For each i = 0, ..., n-1: let a[i] be the slope of the line connecting peak 0 to peak i: a[0] = -∞, and a[i] = arctan( ( h[i] - h[0] ) / i ) if i > 0 Let amax[0] = -∞, and amax[i] = max { a[0], a[1], ..., a[i-1] } if i > 0 For each i = 0, ..., n-1: if a[i] ≥ amax[i] then peak i is visible, otherwise peak i is not visible 102
103 Serial algorithm
bool[0..n-1] Line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do
        a[i] ← arctan( ( h[i] - h[0] ) / i )
    endfor
    amax[0] ← -∞
    for i ← 1 to n-1 do
        amax[i] ← max{ a[i-1], amax[i-1] }
    endfor
    for i ← 0 to n-1 do
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
103
104 Serial algorithm The same pseudocode, annotated: the loop computing a[] and the loop computing v[] are embarrassingly parallel; the amax[] loop carries a sequential dependency (it is a max-scan) 104
105 Parallel algorithm
bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do in parallel
        a[i] ← arctan( ( h[i] - h[0] ) / i )
    endfor
    amax ← exclusive-scan( max, a )
    for i ← 0 to n-1 do in parallel
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
105
106 Conclusions A parallel programming pattern defines: a partitioning of the input data and a communication structure among parallel tasks Parallel programming patterns can help to define efficient algorithms Many problems can be solved using one or more known patterns 106
More informationTransform & Conquer. Presorting
Transform & Conquer Definition Transform & Conquer is a general algorithm design technique which works in two stages. STAGE : (Transformation stage): The problem s instance is modified, more amenable to
More informationProgramming as Successive Refinement. Partitioning for Performance
Programming as Successive Refinement Not all issues dealt with up front Partitioning often independent of architecture, and done first View machine as a collection of communicating processors balancing
More informationPrinciples of Parallel Algorithm Design: Concurrency and Mapping
Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday
More informationAlgorithms PART I: Embarrassingly Parallel. HPC Fall 2012 Prof. Robert van Engelen
Algorithms PART I: Embarrassingly Parallel HPC Fall 2012 Prof. Robert van Engelen Overview Ideal parallelism Master-worker paradigm Processor farms Examples Geometrical transformations of images Mandelbrot
More informationConcurrent Programming with OpenMP
Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed
More informationDigital Computer Arithmetic
Digital Computer Arithmetic Part 6 High-Speed Multiplication Soo-Ik Chae Spring 2010 Koren Chap.6.1 Speeding Up Multiplication Multiplication involves 2 basic operations generation of partial products
More informationLecture 18 Representation and description I. 2. Boundary descriptors
Lecture 18 Representation and description I 1. Boundary representation 2. Boundary descriptors What is representation What is representation After segmentation, we obtain binary image with interested regions
More informationSC12 HPC Educators session: Unveiling parallelization strategies at undergraduate level
SC12 HPC Educators session: Unveiling parallelization strategies at undergraduate level E. Ayguadé, R. M. Badia, D. Jiménez, J. Labarta and V. Subotic August 31, 2012 Index Index 1 1 The infrastructure:
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More informationScanning Real World Objects without Worries 3D Reconstruction
Scanning Real World Objects without Worries 3D Reconstruction 1. Overview Feng Li 308262 Kuan Tian 308263 This document is written for the 3D reconstruction part in the course Scanning real world objects
More informationCSC630/COS781: Parallel & Distributed Computing
CSC630/COS781: Parallel & Distributed Computing Algorithm Design Chapter 3 (3.1-3.3) 1 Contents Preliminaries of parallel algorithm design Decomposition Task dependency Task dependency graph Granularity
More informationMPI Case Study. Fabio Affinito. April 24, 2012
MPI Case Study Fabio Affinito April 24, 2012 In this case study you will (hopefully..) learn how to Use a master-slave model Perform a domain decomposition using ghost-zones Implementing a message passing
More informationCS 677: Parallel Programming for Many-core Processors Lecture 6
1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11
More informationParallel Computing. Parallel Algorithm Design
Parallel Computing Parallel Algorithm Design Task/Channel Model Parallel computation = set of tasks Task Program Local memory Collection of I/O ports Tasks interact by sending messages through channels
More informationWeek 3: MPI. Day 04 :: Domain decomposition, load balancing, hybrid particlemesh
Week 3: MPI Day 04 :: Domain decomposition, load balancing, hybrid particlemesh methods Domain decompositon Goals of parallel computing Solve a bigger problem Operate on more data (grid points, particles,
More informationData parallel algorithms 1
Data parallel algorithms (Guy Steele): The data-parallel programming style is an approach to organizing programs suitable for execution on massively parallel computers. In this lecture, we will characterize
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationDetermining Line Segment Visibility with MPI
Determining Line Segment Visibility with MPI CSE 633: Parallel Algorithms Fall 2012 Jayan Patel Problem Definition Computational Geometry From Algorithms Sequential and Parallel: Given a set of n pair-wise
More informationHigh Performance Computing in C and C++
High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University Announcement No change in lecture schedule: Timetable remains the same: Monday 1 to 2 Glyndwr C Friday
More informationCS 664 Segmentation. Daniel Huttenlocher
CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationGPU-accelerated data expansion for the Marching Cubes algorithm
GPU-accelerated data expansion for the Marching Cubes algorithm San Jose (CA) September 23rd, 2010 Christopher Dyken, SINTEF Norway Gernot Ziegler, NVIDIA UK Agenda Motivation & Background Data Compaction
More informationEE382N (20): Computer Architecture - Parallelism and Locality Lecture 10 Parallelism in Software I
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 10 Parallelism in Software I Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality (c) Rodric Rabbah, Mattan
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationDesign of Parallel Programs Algoritmi e Calcolo Parallelo. Daniele Loiacono
Design of Parallel Programs Algoritmi e Calcolo Parallelo Web: home.dei.polimi.it/loiacono Email: loiacono@elet.polimi.it References q The material in this set of slide is taken from two tutorials by Blaise
More informationDynamic load balancing in OSIRIS
Dynamic load balancing in OSIRIS R. A. Fonseca 1,2 1 GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal 2 DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal Maintaining parallel load balance
More informationProblem 3. (12 points):
Problem 3. (12 points): This problem tests your understanding of basic cache operations. Harry Q. Bovik has written the mother of all game-of-life programs. The Game-of-life is a computer game that was
More informationBasic Communication Operations (Chapter 4)
Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:
More informationParallel Techniques. Embarrassingly Parallel Computations. Partitioning and Divide-and-Conquer Strategies
slides3-1 Parallel Techniques Embarrassingly Parallel Computations Partitioning and Divide-and-Conquer Strategies Pipelined Computations Synchronous Computations Asynchronous Computations Load Balancing
More informationCost-Effective Parallel Computational Electromagnetic Modeling
Cost-Effective Parallel Computational Electromagnetic Modeling, Tom Cwik {Daniel.S.Katz, cwik}@jpl.nasa.gov Beowulf System at PL (Hyglac) l 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory,
More informationOutline: Embarrassingly Parallel Problems. Example#1: Computation of the Mandelbrot Set. Embarrassingly Parallel Problems. The Mandelbrot Set
Outline: Embarrassingly Parallel Problems Example#1: Computation of the Mandelbrot Set what they are Mandelbrot Set computation cost considerations static parallelization dynamic parallelizations and its
More informationMultidimensional Indexes [14]
CMSC 661, Principles of Database Systems Multidimensional Indexes [14] Dr. Kalpakis http://www.csee.umbc.edu/~kalpakis/courses/661 Motivation Examined indexes when search keys are in 1-D space Many interesting
More informationGhost Cell Pattern. Fredrik Berg Kjolstad. January 26, 2010
Ghost Cell Pattern Fredrik Berg Kjolstad University of Illinois Urbana-Champaign, USA kjolsta1@illinois.edu Marc Snir University of Illinois Urbana-Champaign, USA snir@illinois.edu January 26, 2010 Problem
More informationCPS343 Parallel and High Performance Computing Project 1 Spring 2018
CPS343 Parallel and High Performance Computing Project 1 Spring 2018 Assignment Write a program using OpenMP to compute the estimate of the dominant eigenvalue of a matrix Due: Wednesday March 21 The program
More informationWhy Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends
Imagine stream processor; Bill Dally, Stanford Connection Machine CM; Thinking Machines Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid Jeffrey Bolz Eitan Grinspun Caltech Ian Farmer
More informationKevin J. Barker. Scott Pakin and Darren J. Kerbyson
Experiences in Performance Modeling: The Krak Hydrodynamics Application Kevin J. Barker Scott Pakin and Darren J. Kerbyson Performance and Architecture Laboratory (PAL) http://www.c3.lanl.gov/pal/ Computer,
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming David Lifka lifka@cac.cornell.edu May 23, 2011 5/23/2011 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor or computer to complete
More informationWhat are Cellular Automata?
What are Cellular Automata? It is a model that can be used to show how the elements of a system interact with each other. Each element of the system is assigned a cell. The cells can be 2-dimensional squares,
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationCoE4TN4 Image Processing
CoE4TN4 Image Processing Chapter 11 Image Representation & Description Image Representation & Description After an image is segmented into regions, the regions are represented and described in a form suitable
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationCS535 Fall Department of Computer Science Purdue University
Spatial Data Structures and Hierarchies CS535 Fall 2010 Daniel G Aliaga Daniel G. Aliaga Department of Computer Science Purdue University Spatial Data Structures Store geometric information Organize geometric
More information