Parallel Computing. Parallel Algorithm Design

Eileen Welch
6 years ago

1 Parallel Computing Parallel Algorithm Design

2 Task/Channel Model
- Parallel computation = set of tasks
- Task
  - Program
  - Local memory
  - Collection of I/O ports
- Tasks interact by sending messages through channels
2010@FEUP Parallel Algorithm Design 2
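The task/channel model can be sketched in Python with threads standing in for tasks and a `queue.Queue` standing in for a channel. This is only an illustration: the names `producer`, `consumer` and `run_pipeline` are hypothetical, and the choice of a non-blocking send with a blocking receive is an assumption that the slide does not state.

```python
import queue
import threading

def producer(out_port, values):
    """Task 1: send every value through the channel, then a sentinel."""
    for v in values:
        out_port.put(v)       # send does not block (unbounded queue)
    out_port.put(None)        # sentinel meaning "no more messages"

def consumer(in_port, result):
    """Task 2: receive values until the sentinel arrives and add them up."""
    total = 0
    while True:
        v = in_port.get()     # receive blocks until a message is available
        if v is None:
            break
        total += v
    result.append(total)      # local memory of this task, exposed for inspection

def run_pipeline(values):
    channel = queue.Queue()   # one channel connecting the two tasks
    result = []
    tasks = [
        threading.Thread(target=producer, args=(channel, values)),
        threading.Thread(target=consumer, args=(channel, result)),
    ]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return result[0]
```

Each task only touches its own data and the channel it was given, which is the property the model relies on.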

3 Task/Channel Model (figure labels: Task, Channel)

4 Foster's Design Methodology
1. Partitioning
2. Communication
3. Agglomeration
4. Mapping
(figure labels: Problem, Partitioning, Communication, Agglomeration, Mapping)

5 1. Partitioning
- Dividing computation and data into pieces
- Domain decomposition
  - Divide data into pieces
  - e.g., an array into sub-arrays (reduction); a loop into sub-loops (matrix multiplication); a search space into sub-spaces (chess)
- Functional decomposition
  - Divide computation into pieces
  - e.g., pipelines (floating-point multiplication), workflows (payroll processing)
  - Determine how to associate data with computations
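As a small illustration of domain decomposition, the sketch below splits an array into contiguous sub-arrays of nearly equal size. The function names `block_range` and `partition` are hypothetical, and block (contiguous) decomposition is just one possible choice.

```python
def block_range(rank, p, n):
    """Half-open index range [lo, hi) of the piece with index `rank` out of `p`."""
    lo = rank * n // p
    hi = (rank + 1) * n // p
    return lo, hi

def partition(data, p):
    """Split `data` into p contiguous sub-arrays whose sizes differ by at most 1."""
    n = len(data)
    return [data[slice(*block_range(rank, p, n))] for rank in range(p)]

pieces = partition(list(range(10)), 4)
# pieces == [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

Every element lands in exactly one piece, so a per-piece computation (for example a partial sum) can proceed independently.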

6 Partitioning
- The individual pieces are called primitive tasks.
- Desirable attributes for a partition
  - Many more primitive tasks than processors on the target computer.
  - Tasks of roughly equal size (in computation and data).
  - Number of tasks increases with problem size.

7 Example of domain decomposition (figure)

8 Example of functional decomposition (figure)

9 2. Communication
- Determine values passed among tasks
- Local communication
  - Task needs values from a small number of other tasks
  - Create channels illustrating data flow
- Global communication
  - Significant number of tasks contribute data to perform a computation
  - Don't create channels for them early in design

10 Desirable attributes for communication
- Balanced: communication operations are balanced among tasks
- Small degree: each task communicates with only a small group of neighbors
- Concurrency
  - Tasks can perform communications concurrently
  - Tasks can perform computations concurrently

11 3. Agglomeration
- Agglomeration is the process of grouping tasks into larger tasks to improve performance.
- Here, minimizing communication is typically a design goal.
  - Grouping tasks that communicate with each other eliminates the need for that communication; this is called increasing the locality.
  - Grouping tasks can also allow us to combine multiple communications into one.

12 Desirable attributes of agglomeration
- Increases the locality of the parallel algorithm
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks is as small as possible, yet at least as great as the number of processors on the target computer

13 4. Mapping
- Mapping is the process of assigning agglomerated tasks to the processors.
- Here, we're thinking of a distributed-memory machine.
- If we choose the number of agglomerated tasks to equal the number of processors, then the mapping is already done: each processor gets one agglomerated task.

14 Mapping Goals
- Processor utilization: we would like processors to have roughly equal computational and communication costs
- Minimize interprocessor communication
- This can be posed as a graph partitioning problem:
  - Each partition should have roughly the same number of nodes
  - The partition should cut a minimal number of edges
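The two graph-partitioning criteria can be made concrete with a short sketch. The helper names `edge_cut` and `loads` are hypothetical, and the 4-node path graph is invented purely for illustration: tasks are nodes, channels are edges, and a mapping assigns each node to a processor.

```python
def edge_cut(edges, assignment):
    """Number of edges whose endpoints are mapped to different processors."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def loads(assignment, p):
    """Number of nodes (tasks) mapped to each of the p processors."""
    counts = [0] * p
    for proc in assignment.values():
        counts[proc] += 1
    return counts

# Path graph 0 - 1 - 2 - 3 mapped onto two processors in two different ways.
edges = [(0, 1), (1, 2), (2, 3)]
contiguous = {0: 0, 1: 0, 2: 1, 3: 1}   # neighbours kept together
alternating = {0: 0, 1: 1, 2: 0, 3: 1}  # neighbours split apart

# Both mappings are perfectly balanced, but the contiguous one cuts
# one edge while the alternating one cuts all three.
```

Both mappings satisfy the balance criterion equally well, so the edge cut is what distinguishes them here.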

15 Partitioning a graph (figure: three alternative partitions of a graph between P0 and P1)
- Equalizing processor utilization and minimizing interprocessor communication are often competing forces

16 Mapping heuristics
- Static number of tasks
  - Structured communication
    - Constant computation time per task
      - Agglomerate tasks to minimize communication
      - Create one task per processor
    - Variable computation time per task
      - Cyclically map tasks to processors
  - Unstructured communication
    - Use a static load balancing algorithm
- Dynamic number of tasks
  - Use a run-time task-scheduling algorithm, e.g., a master/slave strategy
  - Use a dynamic load balancing algorithm, e.g., share load among neighboring processors; remap periodically
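To see why a cyclic mapping can help when computation time varies per task, the sketch below compares it with a block mapping. The function names and the cost vector (task t costs t + 1 time units) are assumptions made up for the example.

```python
def block_map(num_tasks, p):
    """Map contiguous runs of tasks to the same processor."""
    return [t * p // num_tasks for t in range(num_tasks)]

def cyclic_map(num_tasks, p):
    """Deal tasks out to processors round-robin."""
    return [t % p for t in range(num_tasks)]

def load_per_processor(costs, mapping, p):
    """Total computation cost assigned to each processor."""
    load = [0] * p
    for cost, proc in zip(costs, mapping):
        load[proc] += cost
    return load

# Hypothetical workload in which later tasks are more expensive.
costs = [1, 2, 3, 4, 5, 6, 7, 8]
block_load = load_per_processor(costs, block_map(8, 2), 2)    # [10, 26]
cyclic_load = load_per_processor(costs, cyclic_map(8, 2), 2)  # [16, 20]
```

With this workload the slowest processor finishes at time 26 under the block mapping and at time 20 under the cyclic mapping.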

17 Example 1: Boundary value problems (figure labels: ice water, rod, insulation)

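18 Boundary Value Problem
Heat conduction physics:
$\frac{\partial u}{\partial t} = a \frac{\partial^2 u}{\partial x^2}, \quad a = \frac{k}{c\rho}$
Discretization ($u_{i,j}$ = temperature at position $i$ and time $j$):
$\frac{\partial u}{\partial t} \approx \frac{u_{i,j+1} - u_{i,j}}{\Delta t}$
$\frac{\partial^2 u}{\partial x^2} \approx \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{(\Delta x)^2}$
$u_{i,j+1} = r\,u_{i-1,j} + (1 - 2r)\,u_{i,j} + r\,u_{i+1,j}, \quad r = \frac{a\,\Delta t}{(\Delta x)^2}$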
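A minimal sequential sketch of the explicit finite-difference update, in which each new interior temperature is r times the left neighbour, plus (1 - 2r) times the point itself, plus r times the right neighbour. The function name `heat_step`, the fixed end temperatures and the sample rod are assumptions made for illustration.

```python
def heat_step(u, r):
    """Advance the temperatures of a rod by one time step.

    u[i] is the temperature at position i; the two end points are
    boundary values and are held fixed (an assumption of this sketch).
    """
    new_u = list(u)
    for i in range(1, len(u) - 1):
        new_u[i] = r * u[i - 1] + (1 - 2 * r) * u[i] + r * u[i + 1]
    return new_u

# Hypothetical rod: both ends at 0 degrees, a hot point in the middle.
rod = [0.0, 100.0, 0.0]
after_one_step = heat_step(rod, 0.25)   # [0.0, 50.0, 0.0]
```

Each interior point depends only on itself and its two neighbours at the previous time step, which is what makes the update easy to distribute across tasks.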

19 Boundary Value Problem
- Partition
  - One data item per grid point
  - Associate one primitive task with each grid point
  - Two-dimensional domain decomposition
- Communication
  - Identify communication pattern between primitive tasks
  - Each interior primitive task has three incoming and three outgoing channels

20 Boundary Value Problem: agglomeration and mapping (figure label: Agglomeration)

21 Model Analysis
- Sequential execution
  - $\chi$: time to update one element
  - $n$: number of elements
  - $m$: number of iterations
  - Sequential execution time: $m\,n\,\chi$
- Parallel execution
  - $p$: number of processors
  - Message time: $\lambda + q/\beta$, which is approximately $\lambda$ if $q \ll \beta$
  - Parallel execution time: $m\,(\chi\,n/p + 2\lambda)$
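The model can be evaluated numerically with a short sketch. Here `chi` stands for the time to update one element and `lam` for the message latency; these symbol names follow Quinn's notation and are an assumption, as are the function names and the sample numbers.

```python
def sequential_time(m, n, chi):
    """m iterations, each updating n elements at chi time units per element."""
    return m * n * chi

def parallel_time(m, n, p, chi, lam):
    """Per iteration: n/p local updates plus two messages of latency lam each."""
    return m * (chi * n / p + 2 * lam)

# Hypothetical numbers: 10 iterations, 1000 elements, 4 processors.
t_seq = sequential_time(10, 1000, chi=1.0)              # 10000.0
t_par = parallel_time(10, 1000, 4, chi=1.0, lam=5.0)    # 2600.0
speedup = t_seq / t_par
```

The 2 * lam term does not shrink as p grows, so it bounds the achievable speedup for a fixed problem size.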

22 Example: Parallel reduction
- Given an associative operator $\oplus$, compute $a_0 \oplus a_1 \oplus a_2 \oplus \dots \oplus a_{n-1}$
- Examples: add; multiply; and, or; maximum, minimum
- Data decomposition: one task per value to operate on (one of the $a$'s)

23 Parallel reduction: further steps to reach a binomial tree (figure)

24 Parallel reduction (figure)

25 Parallel reduction (figure)

26 Parallel reduction (figure)

27 Parallel reduction (figure)

28 Parallel reduction: binomial tree (figure)
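The binomial-tree communication pattern can be simulated sequentially. In the sketch below, at each step the task at index i + stride "sends" its partial result to the task at index i; the function name `tree_reduce` and the list-based simulation are assumptions for illustration, not a message-passing implementation.

```python
import operator

def tree_reduce(values, op=operator.add):
    """Reduce `values` with an associative `op` along a binomial tree.

    Returns (result, number_of_communication_steps). One task per value.
    """
    partial = list(values)
    p = len(partial)
    steps = 0
    stride = 1
    while stride < p:
        for i in range(0, p, 2 * stride):
            if i + stride < p:
                # Task i+stride sends its partial result to task i.
                partial[i] = op(partial[i], partial[i + stride])
        stride *= 2
        steps += 1
    return partial[0], steps

total, steps = tree_reduce(range(8))   # (28, 3): 8 tasks need log2(8) = 3 steps
```

Operands are always combined in left-to-right order, so the sketch only requires the operator to be associative, not commutative.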

29 Agglomeration (figure: four agglomerated tasks, each labelled "sum")

30 Analysis
- Parallel running time
  - $\chi$: time to perform the binary operation
  - $\lambda$: time to communicate a value via a channel
  - $n$ values and $p$ tasks
- Time for each task to perform its inner calculations: $(n/p - 1)\,\chi$
- Communication steps: $\log p$
- After each received communication there is an operation
- Total time: $(n/p - 1)\,\chi + \log p\,(\lambda + \chi)$
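A small sketch that evaluates this running-time estimate. It writes `chi` for the time of one binary operation and `lam` for the time to communicate one value (symbol names assumed, following Quinn's notation), and it assumes p is a power of two that divides n so that no rounding is needed.

```python
import math

def reduction_time(n, p, chi, lam):
    """Estimated time of a parallel reduction of n values on p tasks."""
    local = (n / p - 1) * chi            # each task reduces its own n/p values
    tree = math.log2(p) * (lam + chi)    # log p steps: one receive, one operation
    return local + tree

# Hypothetical numbers: 1024 values, 16 tasks.
t = reduction_time(1024, 16, chi=1.0, lam=10.0)   # 63 + 4 * 11 = 107.0
```

With a single task the formula collapses to the sequential cost of n - 1 operations.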

31 Example: the N-body problem (figure: bodies B1, B2, B3; a body of mass m at position (x, y) with velocity v, subject to forces f1 and f2)

32 The N-body problem (figure)

33 The N-body problem: partitioning
- Domain partitioning
  - Assume one task per particle
  - Task has particle's position, velocity vector and mass
- Iteration
  - Get positions and masses of all other particles
  - Compute new position and velocity
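One iteration of this scheme can be sketched sequentially in two dimensions. The code below is an illustration under stated assumptions: Newtonian gravity with a constant `G` defaulting to 1, a simple Euler-style update (velocity first, then position), and a hypothetical tuple layout `(x, y, vx, vy, m)` per particle. It does not handle two particles at the same position.

```python
import math

def nbody_step(bodies, dt, G=1.0):
    """Advance every particle by one time step of length dt.

    Each particle is a tuple (x, y, vx, vy, m). Returns a new list.
    """
    updated = []
    for i, (x, y, vx, vy, m) in enumerate(bodies):
        ax = ay = 0.0
        # "Get positions and masses of all other particles".
        for j, (ox, oy, _, _, om) in enumerate(bodies):
            if i == j:
                continue
            dx, dy = ox - x, oy - y
            dist = math.hypot(dx, dy)
            ax += G * om * dx / dist**3
            ay += G * om * dy / dist**3
        # "Compute new position and velocity".
        vx, vy = vx + ax * dt, vy + ay * dt
        updated.append((x + vx * dt, y + vy * dt, vx, vy, m))
    return updated

# Two unit masses at rest, one unit apart: they accelerate towards each other.
bodies = [(0.0, 0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0, 1.0)]
moved = nbody_step(bodies, dt=1.0)
```

The inner loop is where each task needs the data of all other particles, which is what motivates the all-gather communication.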

34 Gather and All-Gather operations (figure)
- Gather operation (sequential) (p-1)
- All-Gather operation

35 All-Gather
- To avoid conflicts, all-gather is performed in $\log p$ steps, doubling the data in each step
- Communication time for $n$ items: $\lambda + n/\beta$
- With $p$ tasks there are $\log p$ iterations
- The number of items doubles at each iteration:
$\sum_{i=1}^{\log p} \left( \lambda + \frac{2^{i-1}\,n}{\beta\,p} \right) = \lambda \log p + \frac{n(p-1)}{\beta\,p}$
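The doubling pattern can be simulated sequentially. In this sketch each task exchanges everything it holds with a partner whose rank differs in one bit (a recursive-doubling or hypercube exchange, which is an assumed realisation of the log p steps); the function name `all_gather` is hypothetical and p must be a power of two.

```python
def all_gather(local_items):
    """Simulate an all-gather among p = len(local_items) tasks.

    Returns (data held by every task at the end, items sent per task per step).
    """
    p = len(local_items)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    # held[r] maps the rank of the original owner to that owner's items.
    held = [{rank: list(items)} for rank, items in enumerate(local_items)]
    sent_per_step = []
    distance = 1
    while distance < p:
        sent_per_step.append(sum(len(v) for v in held[0].values()))
        # Every task r swaps all it holds with partner r XOR distance.
        held = [{**held[r], **held[r ^ distance]} for r in range(p)]
        distance *= 2
    gathered = [[x for owner in sorted(h) for x in h[owner]] for h in held]
    return gathered, sent_per_step

# n = 8 items spread over p = 4 tasks, 2 items each.
gathered, sent = all_gather([[0, 1], [2, 3], [4, 5], [6, 7]])
# sent == [2, 4]: message sizes double, and 2 + 4 == n * (p - 1) / p == 6
```

The message sizes 2 and 4 are the per-step item counts that the sum in the cost estimate adds up.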

36 Analysis: N-body problem, parallel version
- $n$ bodies and $p$ tasks
- $m$ iterations over time
- Total time excluding I/O:
$m \left( \lambda \log p + \frac{n(p-1)}{\beta\,p} + \chi\,\frac{n}{p} \right)$

37 Considering I/O
- Reading or writing $n$ items of data through an I/O channel: $\lambda_{io} + n/\beta_{io}$
- In the N-body problem the initial values must be transmitted to the other tasks

38 Scatter operation: improving
1. First task transmits n/2 items to another task
2. The 2 tasks transmit n/4 items to 2 other tasks
3. The 4 tasks transmit n/8 items to 4 other tasks
4. And so on
$\sum_{i=1}^{\log p} \left( \lambda + \frac{2^{i-1}\,n}{\beta\,p} \right) = \lambda \log p + \frac{n(p-1)}{\beta\,p}$
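The halving steps listed above can be simulated sequentially. In this sketch every task that currently holds data keeps the lower half and sends the upper half to a task that has none yet; the function name `scatter` and the particular choice of partner ranks are assumptions, and p must be a power of two that divides the number of items.

```python
def scatter(items, p):
    """Simulate a log p step scatter of `items` from task 0 to p tasks.

    Returns (items held by each task, items sent by each sender per step).
    """
    n = len(items)
    assert p > 0 and p & (p - 1) == 0 and n % p == 0, "sketch assumes p = 2^k dividing n"
    held = {0: list(items)}          # rank -> items currently held
    sent_per_step = []
    distance = p // 2
    while distance >= 1:
        sent_per_step.append(len(held[0]) // 2)
        new_held = {}
        for rank, data in held.items():
            half = len(data) // 2
            new_held[rank] = data[:half]              # keep the lower half
            new_held[rank + distance] = data[half:]   # send the upper half
        held = new_held
        distance //= 2
    return [held[r] for r in range(p)], sent_per_step

pieces, sent = scatter(list(range(8)), 4)
# pieces == [[0, 1], [2, 3], [4, 5], [6, 7]], sent == [4, 2]
```

The per-step message sizes n/2, n/4, and so on add up to n(p-1)/p, the same volume as in the all-gather estimate but in the opposite order.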

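39 Analysis considering I/O
Total time after $m$ iterations:
- Initial reading + final writing: $2 \left( \lambda_{io} + \frac{n}{\beta_{io}} \right)$
- Initial scattering + final gathering: $2 \left( \lambda \log p + \frac{n(p-1)}{\beta\,p} \right)$
- Computing $m$ iterations: $m \left( \lambda \log p + \frac{n(p-1)}{\beta\,p} + \chi\,\frac{n}{p} \right)$
$2 \left( \lambda_{io} + \frac{n}{\beta_{io}} \right) + 2 \left( \lambda \log p + \frac{n(p-1)}{\beta\,p} \right) + m \left( \lambda \log p + \frac{n(p-1)}{\beta\,p} + \chi\,\frac{n}{p} \right)$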
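The total can be evaluated with a short sketch. The parameter names are assumptions following Quinn's notation: `lam` and `beta` are channel latency and bandwidth, `lam_io` and `beta_io` are their I/O counterparts, and `chi` is the per-particle computation time. The function names and sample numbers are made up, and p is assumed to be a power of two.

```python
import math

def collective_time(n, p, lam, beta):
    """Time of one scatter, gather or all-gather of n items over p tasks."""
    return lam * math.log2(p) + n * (p - 1) / (beta * p)

def nbody_total_time(n, p, m, chi, lam, beta, lam_io, beta_io):
    """Estimated N-body time including I/O, scatter, m iterations and gather."""
    io = 2 * (lam_io + n / beta_io)                    # initial read + final write
    distribute = 2 * collective_time(n, p, lam, beta)  # scatter + final gather
    iterate = m * (collective_time(n, p, lam, beta) + chi * n / p)
    return io + distribute + iterate

# Hypothetical unit costs: n = 8 bodies, p = 4 tasks, m = 2 iterations.
t = nbody_total_time(8, 4, 2, chi=1.0, lam=1.0, beta=1.0, lam_io=1.0, beta_io=1.0)
# 2 * (1 + 8) + 2 * (2 + 6) + 2 * (2 + 6 + 2) = 18 + 16 + 20 = 54.0
```

For large m the iteration term dominates, so the one-off I/O and distribution costs matter mostly for short runs.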

Parallel Algorithm Design. Overview: Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP. Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html