COSC 335 Software Design: Parallel Design Patterns (II), Spring 2008

Parallelization Strategy
- Finding Concurrency: structure the problem to expose exploitable concurrency
- Algorithm Structure: structure the algorithm to take advantage of concurrency
- Supporting Structure: intermediate stage between Algorithm Structure and Implementation; program structuring and definition of shared data structures
- Implementation Mechanism: mapping of the higher-level patterns onto a programming environment
Finding concurrency - Overview
- Decomposition: Task Decomposition, Data Decomposition
- Dependency Analysis: Group Tasks, Order Tasks, Data Sharing
- Design Evaluation

Overview (II)
- Is the problem large enough to justify the effort of parallelizing it?
- Are the key features of the problem and the data elements within the problem well understood?
- Which parts of the problem are most computationally intensive?
Task decomposition
How can a problem be decomposed into tasks that can execute concurrently?
- Goal: a collection of (nearly) independent tasks
- Initially, try to find as many tasks as possible (they can be merged later to form larger tasks)
A task can correspond to
- a function call
- distinct iterations of loops within the algorithm (loop splitting)
- any independent sections of the source code

Task decomposition: goals and constraints
- Flexibility: the design should be flexible enough to handle any number of processes; e.g. the number and size of the tasks should be parameters of the task decomposition
- Efficiency: each task should include enough work to compensate for the overhead of generating tasks and managing their dependencies
- Simplicity: tasks should be defined so that debugging and maintenance remain easy
Task decomposition (II)
Two tasks A and B, with input sets I_A, I_B and output sets O_A, O_B, are considered independent if
  O_A ∩ O_B = ∅,  I_A ∩ O_B = ∅,  I_B ∩ O_A = ∅
i.e. neither task writes data that the other one reads or writes.

Task decomposition - example
Solving a system of linear equations using the Conjugate Gradient method. Given the vectors x_0, A, b:
  r_0 := b − A x_0
  p_0 := r_0
  for i = 0, 1, 2, ...
      α = r_i^T p_i         (can be executed in parallel)
      β = (A p_i)^T p_i     (can be executed in parallel)
      λ = α / β
      x_{i+1} = x_i + λ p_i
      r_{i+1} = r_i − λ A p_i
      p_{i+1} = r_{i+1} − ((A r_{i+1})^T p_i / β) p_i
Data decomposition
How can a problem's data be decomposed into units that can be operated on (relatively) independently?
- The most computationally intensive parts of a problem typically deal with large data structures
Common mechanisms
- Array-based decomposition (e.g. see the example on the next slides)
- Recursive data structures (e.g. decomposing the parallel update of a large tree data structure)

Data decomposition: goals and constraints
- Flexibility: size and number of data chunks should be flexible (granularity)
- Efficiency: data chunks have to be large enough that the amount of work on a chunk compensates for the cost of managing dependencies; load balancing
- Simplicity: complex data types are difficult to debug; a mapping of local indices to global indices is often required
Numerical differentiation - recap
Forward difference formula:
  f'(x) ≈ [f(x+h) − f(x)] / h
Central difference formula for the 1st derivative:
  f'(x) ≈ [f(x+h) − f(x−h)] / (2h)
Central difference formula for the 2nd derivative:
  f''(x) ≈ [f(x+h) − 2 f(x) + f(x−h)] / h^2

Finite Differences Approach for Solving Differential Equations
Idea: replace the derivatives in the differential equation by an according approximation formula, typically central differences:
  y'(t) ≈ [y(t+h) − y(t−h)] / (2h)
  y''(t) ≈ [y(t+h) − 2 y(t) + y(t−h)] / h^2
Example: boundary value problem of an ordinary differential equation
  d²y/dx² = f(x, y, dy/dx),   a ≤ x ≤ b,   y(a) = α,   y(b) = β
Finite Differences Approach (II)
For simplicity, let's assume the points are equally spaced:
  x_i = a + i·h,   i = 0, ..., n+1,   h = (b − a) / (n + 1)
A two-point boundary value problem then becomes
  y_0 = α,   y_{n+1} = β
  (y_{i+1} − 2 y_i + y_{i−1}) / h² = f(x_i, y_i, (y_{i+1} − y_{i−1}) / (2h)),   i = 1, ..., n     (*)
Equation (*) leads to a system of n equations; solving this system for y_1, ..., y_n gives the solution of the ODE at the distinct interior points.

Example (I)
Solve the following two-point boundary value problem using the finite difference method with h = 0.2:
  d²y/dx² + 2 dy/dx + 10x = 0,   0 ≤ x ≤ 1,   y(0) = 1,   y(1) = 2
Since h = 0.2, the mesh points are
  x_0 = 0, x_1 = 0.2, x_2 = 0.4, x_3 = 0.6, x_4 = 0.8, x_5 = 1.0
Thus y_0 = y(x_0) = 1 and y_5 = y(x_5) = 2 are known; y_1 to y_4 are unknown.
Example (II)
Discrete version of the ODE using central differences:
  (y_{i+1} − 2 y_i + y_{i−1}) / h² + 2 (y_{i+1} − y_{i−1}) / (2h) + 10 x_i = 0
With h = 0.2 (so 1/h² = 25 and 1/h = 5):
  25 (y_{i+1} − 2 y_i + y_{i−1}) + 5 (y_{i+1} − y_{i−1}) + 10 x_i = 0
  20 y_{i−1} − 50 y_i + 30 y_{i+1} = −10 x_i

Example (III)
  i = 1:  20 y_0 − 50 y_1 + 30 y_2 = −10 x_1   →   −50 y_1 + 30 y_2 = −22
  i = 2:  20 y_1 − 50 y_2 + 30 y_3 = −4
  i = 3:  20 y_2 − 50 y_3 + 30 y_4 = −6
  i = 4:  20 y_3 − 50 y_4 + 30 y_5 = −10 x_4   →   20 y_3 − 50 y_4 = −68
or, in matrix form A y = b:
  | −50  30   0   0 | | y_1 |   | −22 |
  |  20 −50  30   0 | | y_2 | = |  −4 |
  |   0  20 −50  30 | | y_3 |   |  −6 |
  |   0   0  20 −50 | | y_4 |   | −68 |
Solving A y = b using Bi-CGSTAB
Given A, b, and an initial guess y_0:
  r_0 = b − A y_0
  choose r̂_0 such that r̂_0^T r_0 ≠ 0
  ρ_0 = α = ω_0 = 1;   v_0 = p_0 = 0
  for i = 1, 2, ...
      ρ_i = r̂_0^T r_{i−1}                        (scalar product)
      β = (ρ_i / ρ_{i−1}) · (α / ω_{i−1})
      p_i = r_{i−1} + β (p_{i−1} − ω_{i−1} v_{i−1})
      v_i = A p_i                                 (matrix-vector multiplication)
      α = ρ_i / (r̂_0^T v_i)                      (scalar product)
      s = r_{i−1} − α v_i
      t = A s                                     (matrix-vector multiplication)
      ω_i = (t^T s) / (t^T t)                     (scalar products)
      y_i = y_{i−1} + α p_i + ω_i s
      r_i = s − ω_i t

Scalar product in parallel
  s = Σ_{i=0}^{n−1} a[i]·b[i]
    = Σ_{i=0}^{n/2−1} a[i]·b[i]  +  Σ_{i=n/2}^{n−1} a[i]·b[i]
    = Σ_{i=0}^{n/2−1} a_local[i]·b_local[i] (rank 0)  +  Σ_{i=0}^{n/2−1} a_local[i]·b_local[i] (rank 1)
The process with rank 0 holds a(0 ... n/2−1), b(0 ... n/2−1); the process with rank 1 holds a(n/2 ... n−1), b(n/2 ... n−1). Combining the partial sums requires communication between the processes.
Matrix-vector product in parallel
  | −50  30   0   0 | | x_1 |   | rhs_1 |
  |  20 −50  30   0 | | x_2 | = | rhs_2 |
  |   0  20 −50  30 | | x_3 |   | rhs_3 |
  |   0   0  20 −50 | | x_4 |   | rhs_4 |
Process 0 owns rows 1 and 2; process 1 owns rows 3 and 4:
  Process 0:  −50 x_1 + 30 x_2           = rhs_1
               20 x_1 − 50 x_2 + 30 x_3  = rhs_2
  Process 1:   20 x_2 − 50 x_3 + 30 x_4  = rhs_3
               20 x_3 − 50 x_4           = rhs_4
Process 0 needs x_3; process 1 needs x_2.

Matrix-vector product in parallel (II)
Introduction of ghost cells:
  Process 0 holds x_1, x_2 and a ghost copy of x_3
  Process 1 holds x_3, x_4 and a ghost copy of x_2
Looking at the source code, e.g.
  p_i = r_{i−1} + β (p_{i−1} − ω_{i−1} v_{i−1})
  v_i = A p_i
since the vector used in the matrix-vector multiplication changes every iteration, the ghost cells always have to be updated before doing the calculation.
Matrix-vector product in parallel (III)
So the parallel algorithm for the same part of the code is:
  p_i = r_{i−1} + β (p_{i−1} − ω_{i−1} v_{i−1})
  update the ghost cells of p, e.g.
    - process 0 sends p(2) to process 1
    - process 1 sends p(3) to process 0
  v_i = A p_i

2-D example: Laplace equation
- Parallel domain decomposition
- Data exchange at the process boundaries is required
- Halo cells / ghost cells: a copy of the last row/column of data from the neighbor process
Parallelization strategy (recap): Finding Concurrency → Algorithm Structure → Supporting Structure → Implementation Mechanism.

Finding concurrency - Result
- A task decomposition that identifies tasks that can execute concurrently
- A data decomposition that identifies data local to each task
- A way of grouping tasks and ordering them according to temporal constraints
Algorithm structure
- Organize by tasks: Task Parallelism; Divide and Conquer
- Organize by data decomposition: Geometric Decomposition; Recursive Data
- Organize by flow of data: Pipeline; Event-Based Coordination

Task parallelism (I)
- The problem can be decomposed into a collection of tasks that can execute concurrently
- Tasks can be completely independent (embarrassingly parallel) or can have dependencies among them
- All tasks might be known at the beginning, or tasks might be generated dynamically
Task parallelism (II)
Tasks:
- There should be at least as many tasks as UEs (units of execution), and typically many, many more
- The computation associated with each task should be large enough to offset the overhead associated with managing tasks and handling dependencies
Dependencies:
- Ordering constraints: sequential composition of task-parallel computations
- Shared-data dependencies: several tasks have to access the same data structure

Shared-data dependencies
Shared-data dependencies can be categorized as follows:
Removable dependencies: an apparent dependency that can be removed by a code transformation. Here ii and jj look like loop-carried dependencies, but both have closed forms (ii = i+1 and jj = 1 + 2 + ... + ii), so every iteration can compute its indices independently:

int i, ii = 0, jj = 0;
for (i = 0; i < N; i++) {
    ii = ii + 1;
    d[ii] = big_time_consuming_work(ii);
    jj = jj + ii;
    a[jj] = other_big_time_consuming_work(jj);
}

/* transformed: the iterations are now independent */
for (i = 1; i <= N; i++) {
    d[i] = big_time_consuming_work(i);
    a[(i*i + i) / 2] = other_big_time_consuming_work((i*i + i) / 2);
}
Shared-data dependencies (II)
- Separable dependencies: replicate the shared data structure and combine the copies into a single structure at the end. Remember the matrix-vector multiply using a column-wise block distribution in the first MPI lecture?
- Other dependencies: not resolvable; they have to be followed

Task scheduling
- Schedule: the way in which tasks are assigned to UEs for execution
- Goal: load balance, i.e. minimize the overall execution time of all tasks
Two classes of schedules:
- Static schedule: the distribution of tasks to UEs is determined at the start of the computation and is not changed afterwards
- Dynamic schedule: the distribution of tasks to UEs changes as the computation proceeds
Task scheduling - example
Six independent tasks A ... F of different sizes are mapped to 4 UEs. A poor mapping places the large tasks on the same UE, so one UE finishes long after the others; a good mapping spreads the large tasks so that all UEs finish at roughly the same time. (Figure: poor vs. good mapping of tasks A-F to 4 UEs.)

Static schedule
- Tasks are grouped into blocks; blocks are assigned to UEs
- Each UE should take approximately the same amount of time to complete its tasks
A static schedule is usually used when
- the availability of computational resources is predictable (e.g. dedicated usage of nodes)
- the UEs are identical (e.g. a homogeneous parallel computer)
- the size of each task is nearly identical
Dynamic scheduling
Used when
- the effort associated with each task varies widely / is unpredictable
- the capabilities of the UEs vary widely (heterogeneous parallel machine)
Common implementations:
- Task queues: when a UE finishes its current task, it removes the next task from the task queue
- Work stealing: each UE has its own work queue; once its queue is empty, a UE steals work from the task queue of another UE

Dynamic scheduling - trade-offs
- Fine-grained (i.e. shorter, smaller) tasks allow for a better load balance
- Fine-grained tasks have higher costs for task management and dependency management
Divide and Conquer algorithms
(Figure: the problem is repeatedly split; the smallest subproblems are handled by a base solve, and the sub-solutions are merged back up the tree.)

Divide and Conquer
- A problem is split into a number of smaller subproblems
- Each subproblem is solved independently
- The sub-solutions of the subproblems are merged into the solution of the original problem
Problems of Divide and Conquer for parallel computing:
- The amount of exploitable concurrency decreases over the lifetime of the computation
- Trivial parallel implementation: each call to solve is a task of its own. For small problems, no new task should be generated; instead, the base solve should be applied
Divide and Conquer (II)
Implementation:
- On shared-memory machines, a divide and conquer algorithm can easily be mapped to a fork/join model: a new task is forked (created); after this task is done, it joins (is merged back into) the original task
- On distributed-memory machines: task queues, often implemented using the master/worker framework discussed later in this course

int solve(Problem P) {
    int solution;
    /* check whether we can further partition the problem */
    if (basecase(P)) {
        solution = basesolve(P);          /* no, we can't */
    } else {                              /* yes, we can  */
        Problem subproblems[N];
        int subsolutions[N];
        split(P, subproblems);            /* partition the problem */
        for (int i = 0; i < N; i++)
            subsolutions[i] = solve(subproblems[i]);
        solution = merge(subsolutions);
    }
    return solution;
}
Task Parallelism using the Master-Worker framework
(Figure: a master process maintains a task queue and a result queue; worker processes 1 and 2 take tasks from the task queue and place results in the result queue.)

Task Parallelism using work stealing
(Figure: worker processes 1 and 2 each have their own task queue; an idle worker steals tasks from the other worker's queue.)
Geometric decomposition
- For all applications relying on data decomposition
- All processes apply the same operations to different data items
Key elements:
- Data decomposition
- Exchange and update operations
- Data distribution and task scheduling

Algorithm structure - Pipeline pattern
- The calculation can be viewed in terms of data flowing through a sequence of stages
- The computation is performed on many data sets (compare to pipelining in processors at the instruction level)
(Figure: four pipeline stages processing data sets C1 ... C6 over time; at each time step, every stage works on a different data set.)
Pipeline pattern (II)
- The amount of concurrency is limited to the number of stages of the pipeline
- The pattern works best if the amount of work performed by the various stages is roughly equal
- Filling the pipeline: some stages will be idle
- Draining the pipeline: some stages will be idle
- Non-linear pipeline: the pattern allows for different execution paths for different data items (e.g. stage 3a vs. stage 3b between stages 2 and 4)

Pipeline pattern (III)
Implementation:
- Each stage is typically assigned to a process/thread
- A stage might be a data-parallel task itself
- The computation per task has to be large enough to compensate for the communication costs between the tasks