Parallel Computing
Parallel Algorithm Design
Task/Channel Model
- Parallel computation = set of tasks
- A task consists of: a program, local memory, and a collection of I/O ports
- Tasks interact by sending messages through channels
2010@FEUP Parallel Algorithm Design
Task/Channel Model
[Figure: a directed graph whose nodes are tasks and whose edges are channels]
Foster's Design Methodology
1. Partitioning
2. Communication
3. Agglomeration
4. Mapping
[Figure: the four stages applied in order, from the problem through partitioning, communication, agglomeration, and mapping]
1. Partitioning
Dividing the computation and data into pieces.
- Domain decomposition: divide the data into pieces, e.g. an array into sub-arrays (reduction), a loop into sub-loops (matrix multiplication), or a search space into sub-spaces (chess)
- Functional decomposition: divide the computation into pieces, e.g. pipelines (floating-point multiplication) or workflows (payroll processing)
- Then determine how to associate data with computations
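As a sketch, the two decomposition styles above can be expressed in a few lines of Python (the helper names are illustrative, not part of the methodology):

```python
def domain_decompose(data, n_tasks):
    """Domain decomposition: split data into n_tasks roughly equal sub-arrays."""
    k, r = divmod(len(data), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        size = k + (1 if i < r else 0)   # spread the remainder over the first tasks
        chunks.append(data[start:start + size])
        start += size
    return chunks

def functional_decompose(x, stages):
    """Functional decomposition: run a value through a pipeline of stages."""
    for stage in stages:
        x = stage(x)
    return x

print(domain_decompose(list(range(10)), 3))
# → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(functional_decompose(4, [lambda v: v + 1, lambda v: v * 2]))
# → 10
```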
Partitioning
The individual pieces are called primitive tasks.
Desirable attributes of a partition:
- Many more primitive tasks than processors on the target computer
- Tasks of roughly equal size (in computation and data)
- The number of tasks increases with problem size
Example of domain decomposition
Example of Functional Decomposition
2. Communication
Determine the values passed among tasks.
- Local communication: a task needs values from a small number of other tasks; create channels illustrating the data flow
- Global communication: a significant number of tasks contribute data to perform a computation; don't create channels for them early in the design
Desirable attributes for communication
- Balanced: communication operations are balanced among tasks
- Small degree: each task communicates with only a small group of neighbors
- Concurrency: tasks can perform their communications concurrently, and tasks can perform their computations concurrently
3. Agglomeration
Agglomeration is the process of grouping tasks into larger tasks to improve performance; here, minimizing communication is typically a design goal.
- Grouping tasks that communicate with each other eliminates the need for that communication; this is called increasing the locality
- Grouping tasks can also allow us to combine multiple communications into one
Desirable attributes of agglomeration
- Increased locality of the parallel algorithm
- Agglomerated tasks have similar computational and communication costs
- The number of tasks increases with problem size
- The number of tasks is as small as possible, yet at least as great as the number of processors on the target computer
4. Mapping
Mapping is the process of assigning agglomerated tasks to processors; here we are thinking of a distributed-memory machine.
If we choose the number of agglomerated tasks to equal the number of processors, then the mapping is already done: each processor gets one agglomerated task.
Mapping Goals
- Processor utilization: we would like processors to have roughly equal computational and communication costs
- Minimize interprocessor communication
This can be posed as a graph partitioning problem:
- Each partition should have roughly the same number of nodes
- The partition should cut a minimal number of edges
Partitioning a graph
[Figure: three candidate partitions of the same task graph between processors P0 and P1]
Equalizing processor utilization and minimizing interprocessor communication are often competing forces.
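The two partitioning criteria can be scored directly; here is a hypothetical Python helper (names are illustrative) that measures both the node balance and the edge cut of a candidate node-to-processor assignment:

```python
from collections import Counter

def partition_cost(edges, assignment):
    """Return (imbalance, edge_cut) for a node -> processor assignment.

    imbalance = difference between the largest and smallest partition sizes;
    edge_cut  = number of edges whose endpoints sit on different processors.
    """
    counts = Counter(assignment.values())
    imbalance = max(counts.values()) - min(counts.values())
    edge_cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return imbalance, edge_cut

# A 4-node path graph 0-1-2-3 split two ways between processors 0 and 1:
edges = [(0, 1), (1, 2), (2, 3)]
print(partition_cost(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # → (0, 1)
print(partition_cost(edges, {0: 0, 1: 1, 2: 0, 3: 1}))  # → (0, 3)
```

Both assignments are perfectly balanced, but the first cuts one edge and the second cuts three, illustrating why the edge cut must be minimized separately from the balance.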
Mapping heuristics
- Static number of tasks
  - Structured communication
    - Constant computation time per task: agglomerate tasks to minimize communication; create one task per processor
    - Variable computation time per task: cyclically map tasks to processors
  - Unstructured communication: use a static load-balancing algorithm
- Dynamic number of tasks
  - Use a run-time task-scheduling algorithm, e.g. a master/slave strategy
  - Use a dynamic load-balancing algorithm, e.g. share load among neighboring processors, remapping periodically
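A sketch of the two static mapping rules named above, assuming tasks and processors are numbered from 0 (the function names are illustrative):

```python
def block_map(task, n_tasks, p):
    """Block (contiguous) mapping: suits constant computation time per task."""
    return task * p // n_tasks

def cyclic_map(task, p):
    """Cyclic (round-robin) mapping: suits variable computation time per task."""
    return task % p

print([block_map(t, 8, 2) for t in range(8)])   # → [0, 0, 0, 0, 1, 1, 1, 1]
print([cyclic_map(t, 2) for t in range(8)])     # → [0, 1, 0, 1, 0, 1, 0, 1]
```

Cyclic mapping spreads expensive neighboring tasks across processors at the price of more interprocessor edges, which is exactly the trade-off the heuristics above encode.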
Example 1. Boundary value problems
[Figure: a thin rod surrounded by insulation, with both ends immersed in ice water]
Boundary Value Problem
Heat conduction physics:
$\frac{\partial u}{\partial t} = a^2 \frac{\partial^2 u}{\partial x^2}$, with $a^2 = \frac{k}{c\rho}$
Discretization ($u_{i,j}$ = temperature at position $i$ and time $j$):
$\frac{\partial u}{\partial t} \approx \frac{u_{i,j+1} - u_{i,j}}{\Delta t}$
$\frac{\partial^2 u}{\partial x^2} \approx \frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{(\Delta x)^2}$
Update rule: $u_{i,j+1} = r\,u_{i-1,j} + (1 - 2r)\,u_{i,j} + r\,u_{i+1,j}$, where $r = \frac{a^2\,\Delta t}{(\Delta x)^2}$
Boundary Value Problem
Partitioning:
- One data item per grid point
- Associate one primitive task with each grid point
- This is a two-dimensional domain decomposition (position x time)
Communication:
- Identify the communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
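The update rule for the rod can be sketched in Python, with all grid points at one time step agglomerated into a single array (the temperatures and the value of r below are illustrative):

```python
def step(u, r):
    """One time step of u[i,j+1] = r*u[i-1,j] + (1-2r)*u[i,j] + r*u[i+1,j].

    u: temperatures along the rod at time j; the two endpoints are boundary
    values and stay fixed. Returns the temperatures at time j+1.
    """
    nxt = u[:]                                   # endpoints keep their values
    for i in range(1, len(u) - 1):
        nxt[i] = r * u[i - 1] + (1 - 2 * r) * u[i] + r * u[i + 1]
    return nxt

# A hot rod with both ends held at 0 (ice water), r = 0.25:
print(step([0.0, 100.0, 100.0, 100.0, 0.0], 0.25))
# → [0.0, 75.0, 100.0, 75.0, 0.0]
```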
Boundary Value Problem
Agglomeration and mapping
[Figure: agglomerating the primitive tasks and mapping the agglomerated tasks to processors]
Model Analysis
Sequential execution:
- $\chi$ = time to update one element
- $n$ = number of elements, $m$ = number of iterations
- Sequential execution time: $m\,n\,\chi$
Parallel execution:
- $p$ = number of processors
- time to send a message of $q$ values: $\lambda + q/\beta \approx \lambda$, if $q \ll \beta$
- Parallel execution time: $m\,(\chi \lceil n/p \rceil + 2\lambda)$
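Plugging illustrative numbers into the two expressions above (a sketch; the parameter values are made up, with chi standing for $\chi$ and lam for $\lambda$):

```python
import math

def seq_time(m, n, chi):
    """Sequential time: m iterations, each updating n elements at cost chi."""
    return m * n * chi

def par_time(m, n, p, chi, lam):
    """Parallel time: each iteration updates ceil(n/p) elements and exchanges
    boundary values with two neighbors (two messages of latency lam)."""
    return m * (chi * math.ceil(n / p) + 2 * lam)

# e.g. n = 1000 elements, m = 100 iterations, chi = 1, lam = 10, p = 10:
print(seq_time(100, 1000, 1))        # → 100000
print(par_time(100, 1000, 10, 1, 10))  # → 12000
```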
Example: Parallel reduction
Given an associative operator $\oplus$, compute $a_0 \oplus a_1 \oplus a_2 \oplus \cdots \oplus a_{n-1}$.
Examples:
- Add
- Multiply
- And, Or
- Maximum, Minimum
Data decomposition: one task per value to operate on (one of the $a_i$)
Parallel reduction
[Figure: further steps to reach a binomial tree]
Parallel reduction
Combining 16 partial values pairwise, step by step:
- 16 values: 4 2 0 7 -3 5 -6 -3 8 1 2 3 -4 4 6 -1
- 8 values: 1 7 -6 4 4 5 8 2
- 4 values: 8 -2 9 10
- 2 values: 17 8
- Result at the root of the binomial tree: 25
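The combining steps above can be simulated directly; this sketch assumes the number of tasks is a power of two:

```python
def binomial_reduce(values, op):
    """Reduce a power-of-two list of per-task values along a binomial tree.

    In each step, task i + half sends its value to task i, which applies op;
    the number of active tasks halves until one result remains."""
    vals = list(values)
    half = len(vals) // 2
    while half >= 1:
        for i in range(half):
            vals[i] = op(vals[i], vals[i + half])   # task i+half sends to task i
        half //= 2
    return vals[0]

partials = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
print(binomial_reduce(partials, lambda a, b: a + b))  # → 25
```

The same routine works for any associative operator from the slide, e.g. `max` or multiplication.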
Agglomeration
[Figure: each agglomerated task computes a local sum of its block of values before the combining steps]
Analysis
Parallel running time:
- $\chi$ = time to perform the binary operation, $\lambda$ = time to communicate a value via a channel
- $n$ values and $p$ tasks
- Time for each task to perform its inner calculations: $(n/p - 1)\,\chi$
- Communication steps: $\lceil \log p \rceil$; after each receiving communication there is an operation
- Total time: $(n/p - 1)\,\chi + \lceil \log p \rceil\,(\lambda + \chi)$
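A quick evaluation of the total-time formula (a sketch; the parameter values are illustrative, with chi standing for $\chi$ and lam for $\lambda$):

```python
import math

def reduction_time(n, p, chi, lam):
    """Total time (n/p - 1)*chi + ceil(log2 p)*(lam + chi) for the reduction."""
    return (n / p - 1) * chi + math.ceil(math.log2(p)) * (lam + chi)

# n = 1024 values, p = 8 tasks, chi = 1, lam = 10:
# local sums cost 127, then 3 combining steps cost 3 * 11 = 33.
print(reduction_time(1024, 8, 1, 10))  # → 160.0
```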
Example: the N-body problem
[Figure: a body with mass m, position (x, y) and velocity v, subject to forces f1 and f2 exerted by bodies B1, B2, and B3]
The N-body problem
The N-body problem: partitioning
Domain partitioning:
- Assume one task per particle
- Each task holds its particle's position, velocity vector, and mass
Each iteration:
- Get the positions and masses of all other particles
- Compute the particle's new position and velocity
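One iteration of the per-particle task can be sketched as follows, using simple 2-D gravitational attraction; the constants G and DT are illustrative, not from the slides:

```python
G, DT = 1.0, 0.01   # illustrative gravitational constant and time step

def step_particle(me, others):
    """One iteration of a task owning one particle.

    me: (x, y, vx, vy, m); others: list of (x, y, m) gathered from all
    other tasks. Accumulates gravitational acceleration, then updates the
    velocity and position. Returns the updated particle tuple."""
    x, y, vx, vy, m = me
    ax = ay = 0.0
    for ox, oy, om in others:
        dx, dy = ox - x, oy - y
        r2 = dx * dx + dy * dy
        if r2 == 0.0:                 # skip coincident bodies
            continue
        a = G * om / r2               # magnitude of acceleration toward the body
        r = r2 ** 0.5
        ax += a * dx / r
        ay += a * dy / r
    vx += ax * DT
    vy += ay * DT
    return (x + vx * DT, y + vy * DT, vx, vy, m)
```

A body at rest at the origin pulled by a unit mass at (1, 0) acquires velocity `G * DT = 0.01` toward it after one step.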
Gather and All-Gather operations
- Gather: collect the data of all tasks in a single task; done sequentially it requires $p-1$ messages
- All-Gather: every task ends up with the data of all tasks
All-Gather
To avoid conflicts, the all-gather is performed in $\lceil \log p \rceil$ steps, doubling each task's data in every step.
- Communicating $n$ items costs $\lambda + n/\beta$
- With $p$ tasks there are $\lceil \log p \rceil$ iterations, and the number of items doubles at each iteration:
$\sum_{i=1}^{\log p} \left(\lambda + \frac{2^{i-1}\,n}{\beta\,p}\right) = \lambda \lceil \log p \rceil + \frac{n(p-1)}{\beta\,p}$
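The recursive-doubling all-gather described above can be simulated in a few lines; this sketch assumes the number of tasks is a power of two:

```python
def all_gather(blocks):
    """blocks[i] = the list held by task i (len(blocks) a power of two).

    In step with distance d, task i exchanges its whole buffer with task
    i XOR d, so each buffer doubles; after log2(p) steps every task holds
    all the data in rank order. Returns the final buffer of every task."""
    p = len(blocks)
    bufs = [list(b) for b in blocks]
    dist = 1
    while dist < p:                      # log2(p) exchange steps
        nxt = []
        for i in range(p):
            partner = i ^ dist           # exchange partner at distance dist
            lo, hi = min(i, partner), max(i, partner)
            nxt.append(bufs[lo] + bufs[hi])   # lower-ranked data first
        bufs = nxt
        dist *= 2
    return bufs

print(all_gather([[0], [1], [2], [3]]))
# → every task ends with [0, 1, 2, 3]
```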
Analysis
N-body problem, parallel version:
- $n$ bodies, $p$ tasks, $m$ iterations over time
Total time excluding I/O:
$m \left(\lambda \lceil \log p \rceil + \frac{n(p-1)}{\beta\,p} + \chi\,\frac{n}{p}\right)$
Considering I/O
Reading or writing $n$ items of data through an I/O channel costs $\lambda_{io} + n/\beta_{io}$.
In the N-body problem, the initial values must be transmitted to the other tasks.
Scatter operation
Improving the initial distribution:
1. The first task transmits $n/2$ items to another task
2. The 2 tasks transmit $n/4$ items to 2 other tasks
3. The 4 tasks transmit $n/8$ items to 4 other tasks
4. And so on
Total cost:
$\sum_{i=1}^{\log p} \left(\lambda + \frac{2^{i-1}\,n}{\beta\,p}\right) = \lambda \lceil \log p \rceil + \frac{n(p-1)}{\beta\,p}$
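The recursive-halving scatter described above, as a sketch (it assumes p is a power of two that divides the item count):

```python
def scatter(data, p):
    """Distribute data from task 0 to p tasks by recursive halving.

    At each step, every task that already holds data sends the upper half of
    its buffer to a task `step` ranks away, so the number of holders doubles.
    Returns the block each task ends with, in rank order."""
    bufs = {0: list(data)}               # task 0 starts with everything
    step = p
    while step > 1:
        step //= 2
        for task in list(bufs):          # snapshot: we add holders as we go
            buf = bufs[task]
            half = len(buf) // 2
            bufs[task + step] = buf[half:]   # upper half goes step ranks away
            bufs[task] = buf[:half]
    return [bufs[i] for i in range(p)]

print(scatter(list(range(8)), 4))  # → [[0, 1], [2, 3], [4, 5], [6, 7]]
```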
Analysis considering I/O
Total time after $m$ iterations = initial reading + scattering, computing $m$ iterations, and final gathering + writing:
$2\left(\lambda_{io} + \frac{n}{\beta_{io}}\right) + 2\left(\lambda \lceil \log p \rceil + \frac{n(p-1)}{\beta\,p}\right) + m\left(\lambda \lceil \log p \rceil + \frac{n(p-1)}{\beta\,p} + \chi\,\frac{n}{p}\right)$
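Evaluating the full expression as a sketch (parameter names mirror the symbols above: chi for $\chi$, lam for $\lambda$, beta for $\beta$, and lam_io, beta_io for the I/O channel; all test values are made up):

```python
import math

def nbody_total_time(n, p, m, chi, lam, beta, lam_io, beta_io):
    """Total N-body time: read + scatter, m compute/all-gather steps,
    then gather + write."""
    io = 2 * (lam_io + n / beta_io)                   # initial read + final write
    moves = lam * math.ceil(math.log2(p)) + n * (p - 1) / (beta * p)
    return io + 2 * moves + m * (moves + chi * n / p)  # scatter+gather, m steps
```

With toy values n=8, p=4, m=1, chi=1, lam=0, beta=1, lam_io=0, beta_io=1 this gives 16 (I/O) + 12 (scatter and gather) + 8 (one iteration) = 36 time units.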