Parallelization Strategy

COSC 335 Software Design - Parallel Design Patterns (II), Spring 2008

Parallelization Strategy
- Finding Concurrency: structure the problem to expose exploitable concurrency
- Algorithm Structure: structure the algorithm to take advantage of concurrency
- Supporting Structure: intermediate stage between Algorithm Structure and Implementation; program structuring and definition of shared data structures
- Implementation Mechanisms: mapping of the higher-level patterns onto a programming environment

Finding concurrency - Overview
- Decomposition: task decomposition, data decomposition
- Dependency analysis: group tasks, order tasks, data sharing
- Design evaluation

Overview (II)
- Is the problem large enough to justify the effort of parallelizing it?
- Are the key features of the problem and the data elements within the problem well understood?
- Which parts of the problem are most computationally intensive?

Task decomposition
How can a problem be decomposed into tasks that can execute concurrently?
- Goal: a collection of (nearly) independent tasks
- Initially, try to find as many tasks as possible (they can be merged later to form larger tasks)
- A task can correspond to
  - a function call
  - distinct iterations of loops within the algorithm (loop splitting, see the sketch below)
  - any independent sections in the source code

Task decomposition - goals and constraints
- Flexibility: the design should be flexible enough to handle any number of processes, e.g. the number and the size of the tasks should be parameters of the task decomposition
- Efficiency: each task should include enough work to compensate for the overhead of generating tasks and managing their dependencies
- Simplicity: tasks should be defined such that debugging and maintenance stay easy
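
A minimal sketch of loop splitting with MPI, assuming the loop iterations are independent; the work function do_work() and the block distribution of iterations are illustrative and not part of the original slides.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical per-iteration work; stands in for any independent loop body. */
    static double do_work(int i) { return (double)i * i; }

    int main(int argc, char **argv)
    {
        const int N = 1000;                      /* total number of iterations */
        int rank, size, i;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Loop splitting: each process executes one contiguous block of iterations. */
        int chunk = (N + size - 1) / size;       /* ceiling division */
        int start = rank * chunk;
        int end   = (start + chunk < N) ? start + chunk : N;

        for (i = start; i < end; i++)
            local += do_work(i);

        /* Combine the partial results of the tasks. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("result = %f\n", global);
        MPI_Finalize();
        return 0;
    }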

Task decomposition
Two tasks A and B are considered independent if
  I_A ∩ O_B = ∅,   I_B ∩ O_A = ∅,   O_A ∩ O_B = ∅
i.e. neither task reads data the other one writes, and their outputs do not overlap.

Task decomposition - example
Solving a system of linear equations using the Conjugate Gradient method. Given the matrix A, the vector b, and an initial guess x_0:
  p_0 = r_0 := b - A x_0
  for i = 0, 1, 2, ...
     α = r_i^T p_i           }  the two scalar products can be
     β = (A p_i)^T p_i       }  executed in parallel
     λ = α / β
     x_{i+1} = x_i + λ p_i      }  the updates of x and r can be
     r_{i+1} = r_i - λ A p_i    }  executed in parallel
     p_{i+1} = r_{i+1} - (((A r_{i+1})^T p_i) / β) p_i

Data decomposition
How can a problem's data be decomposed into units that can be operated on (relatively) independently?
- The most computationally intensive parts of a problem usually deal with large data structures
- Common mechanisms:
  - array-based decomposition (e.g. see the example on the next slides and the sketch below)
  - recursive data structures (e.g. decomposing the parallel update of a large tree data structure)

Data decomposition - goals and constraints
- Flexibility: size and number of data chunks should be flexible (granularity)
- Efficiency: data chunks have to be large enough that the amount of work on a chunk compensates for managing its dependencies; load balancing
- Simplicity: complex data types are difficult to debug; a mapping of local indices to global indices is often required
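
A minimal sketch of an array-based (block) decomposition, assuming a 1-D array of n elements distributed over p processes; the function name block_range is illustrative.

    /* Block decomposition of n elements over p processes: the first (n % p)
     * processes get one extra element, so local sizes differ by at most one. */
    static void block_range(int n, int p, int rank, int *lo, int *hi)
    {
        int base = n / p;                          /* minimum local size    */
        int rem  = n % p;                          /* leftover elements     */
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);   /* exclusive upper bound */
    }

For example, with n = 10 and p = 3 the processes get the index ranges [0,4), [4,7), and [7,10).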

Numerical differentiation - recap
Forward difference formula:
  f'(x) ≈ [f(x+h) - f(x)] / h
Central difference formula for the 1st derivative:
  f'(x) ≈ [f(x+h) - f(x-h)] / (2h)
Central difference formula for the 2nd derivative:
  f''(x) ≈ [f(x+h) - 2 f(x) + f(x-h)] / h^2

Finite Differences Approach for Solving Differential Equations
Idea: replace the derivatives in the differential equation by an according approximation formula, typically central differences:
  y'(t) ≈ [y(t+h) - y(t-h)] / (2h)
  y''(t) ≈ [y(t+h) - 2 y(t) + y(t-h)] / h^2
Example: boundary value problem of an ordinary differential equation
  d^2y/dx^2 = f(x, y, dy/dx),   a ≤ x ≤ b,   y(a) = α,   y(b) = β
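
A small C sketch of these difference formulas, assuming f is any smooth function of one variable and h is a suitably small step size; the function names are illustrative.

    /* Finite-difference approximations of the first and second derivative. */
    double forward_diff(double (*f)(double), double x, double h)
    {
        return (f(x + h) - f(x)) / h;                        /* f'(x),  O(h)   */
    }

    double central_diff1(double (*f)(double), double x, double h)
    {
        return (f(x + h) - f(x - h)) / (2.0 * h);            /* f'(x),  O(h^2) */
    }

    double central_diff2(double (*f)(double), double x, double h)
    {
        return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h); /* f''(x), O(h^2) */
    }

For example, central_diff1(sin, 1.0, 1e-4) approximates cos(1.0).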

Finite Differences Approach (II)
For simplicity, let's assume the points are equally spaced:
  x_i = a + i h,   i = 0, ..., n+1,   h = (b - a) / (n + 1)
The two-point boundary value problem then becomes
  y_0 = α
  (y_{i+1} - 2 y_i + y_{i-1}) / h^2 = f(x_i, y_i, (y_{i+1} - y_{i-1}) / (2h)),   i = 1, ..., n     (*)
  y_{n+1} = β
Equation (*) leads to a system of equations. Solving this system of linear equations gives the solution of the ODE at the distinct points x_0, x_1, ..., x_n, x_{n+1}.

Example (I)
Solve the following two-point boundary value problem using the finite difference method with h = 0.2:
  d^2y/dx^2 + 2 dy/dx + 10 x = 0,   0 ≤ x ≤ 1,   y(0) = 1,   y(1) = 2
Since h = 0.2, the mesh points are
  x_0 = 0, x_1 = 0.2, x_2 = 0.4, x_3 = 0.6, x_4 = 0.8, x_5 = 1.0
Thus y_0 = y(x_0) = 1 and y_5 = y(x_5) = 2 are known, while y_1 - y_4 are unknown.

Example (II)
Discrete version of the ODE using central differences:
  (y_{i+1} - 2 y_i + y_{i-1}) / h^2 + 2 (y_{i+1} - y_{i-1}) / (2h) + 10 x_i = 0
  (y_{i+1} - 2 y_i + y_{i-1}) / 0.04 + 2 (y_{i+1} - y_{i-1}) / 0.4 + 10 x_i = 0
  25 (y_{i+1} - 2 y_i + y_{i-1}) + 5 (y_{i+1} - y_{i-1}) + 10 x_i = 0
  20 y_{i-1} - 50 y_i + 30 y_{i+1} = -10 x_i

Example (III)
i=1:  20 y_0 - 50 y_1 + 30 y_2 + 10 * 0.2 = 0,  i.e.  20 * 1 - 50 y_1 + 30 y_2 + 2 = 0
i=1:  -50 y_1 + 30 y_2 = -22
i=2:   20 y_1 - 50 y_2 + 30 y_3 = -4
i=3:   20 y_2 - 50 y_3 + 30 y_4 = -6
i=4:   20 y_3 - 50 y_4 = -68
(the known boundary values y_0 = 1 and y_5 = 2 have been moved to the right-hand side)

or, written as A y = b:

  [ -50   30    0    0 ] [ y_1 ]   [ -22 ]
  [  20  -50   30    0 ] [ y_2 ] = [  -4 ]
  [   0   20  -50   30 ] [ y_3 ]   [  -6 ]
  [   0    0   20  -50 ] [ y_4 ]   [ -68 ]
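
A small C sketch that assembles this linear system for the example (h = 0.2, four unknowns); the dense storage of the tridiagonal matrix and the variable names are illustrative choices only.

    #include <stdio.h>

    #define N 4                                   /* interior points y_1 .. y_4 */

    int main(void)
    {
        double h = 0.2, ya = 1.0, yb = 2.0;       /* boundary values y(0), y(1) */
        double A[N][N] = {{0.0}}, b[N];
        /* Coefficients of 20*y_{i-1} - 50*y_i + 30*y_{i+1} = -10*x_i */
        double lower = 1.0/(h*h) - 1.0/h;         /*  20 */
        double diag  = -2.0/(h*h);                /* -50 */
        double upper = 1.0/(h*h) + 1.0/h;         /*  30 */

        for (int i = 0; i < N; i++) {
            double xi = (i + 1) * h;              /* x_1 .. x_4 */
            A[i][i] = diag;
            b[i]    = -10.0 * xi;
            if (i > 0)     A[i][i-1] = lower;
            else           b[i] -= lower * ya;    /* move y_0 to the RHS */
            if (i < N - 1) A[i][i+1] = upper;
            else           b[i] -= upper * yb;    /* move y_5 to the RHS */
        }

        for (int i = 0; i < N; i++)
            printf("%6.1f %6.1f %6.1f %6.1f | %8.1f\n",
                   A[i][0], A[i][1], A[i][2], A[i][3], b[i]);
        return 0;
    }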

Solving A y = b using Bi-CGSTAB
Given A, b and an initial guess y_0:
  r_0 = b - A y_0
  choose r̂_0 such that r̂_0^T r_0 ≠ 0
  ρ_0 = α = ω_0 = 1
  v_0 = p_0 = 0
  for i = 1, 2, ...
     ρ_i = r̂_0^T r_{i-1}                          <- scalar product
     β   = (ρ_i / ρ_{i-1}) (α / ω_{i-1})
     p_i = r_{i-1} + β (p_{i-1} - ω_{i-1} v_{i-1})
     v_i = A p_i                                   <- matrix-vector multiplication
     α   = ρ_i / (r̂_0^T v_i)                      <- scalar product
     s   = r_{i-1} - α v_i
     t   = A s                                     <- matrix-vector multiplication
     ω_i = (t^T s) / (t^T t)                       <- scalar products
     y_i = y_{i-1} + α p_i + ω_i s
     r_i = s - ω_i t

Parallel algorithm: scalar product in parallel
  s = Σ_{i=0}^{n-1} a[i]*b[i]
    = Σ_{i=0}^{n/2-1} a[i]*b[i]  +  Σ_{i=n/2}^{n-1} a[i]*b[i]
    = Σ_{i=0}^{n/2-1} a_local[i]*b_local[i] (on rank 0)  +  Σ_{i=0}^{n/2-1} a_local[i]*b_local[i] (on rank 1)
The process with rank 0 holds a(0 ... n/2-1) and b(0 ... n/2-1); the process with rank 1 holds a(n/2 ... n-1) and b(n/2 ... n-1). Combining the two partial sums requires communication between the processes.
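
A minimal MPI sketch of this distributed scalar product, assuming every process already holds its local blocks a_local and b_local of length n_local; the function name is illustrative.

    #include <mpi.h>

    /* Distributed scalar product: each process sums over its local block,
     * then the partial sums are combined across all processes.            */
    double parallel_dot(const double *a_local, const double *b_local,
                        int n_local, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++)
            local += a_local[i] * b_local[i];

        /* The required communication: every process obtains the full sum. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }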

Matrix-vector product in parallel

  [ -50   30    0    0 ] [ x_1 ]   [ rhs_1 ]
  [  20  -50   30    0 ] [ x_2 ] = [ rhs_2 ]
  [   0   20  -50   30 ] [ x_3 ]   [ rhs_3 ]
  [   0    0   20  -50 ] [ x_4 ]   [ rhs_4 ]

Process 0 owns the first two rows:
  -50 x_1 + 30 x_2          = rhs_1
   20 x_1 - 50 x_2 + 30 x_3 = rhs_2
Process 1 owns the last two rows:
   20 x_2 - 50 x_3 + 30 x_4 = rhs_3
   20 x_3 - 50 x_4          = rhs_4
Process 0 needs x_3; Process 1 needs x_2.

Matrix-vector product in parallel (II)
Introduction of ghost cells:
  Process zero stores x_1, x_2 plus a ghost cell for x_3
  Process one stores x_3, x_4 plus a ghost cell for x_2
Looking at the source code, e.g.
  p_i = r_{i-1} + β (p_{i-1} - ω_{i-1} v_{i-1})
  v_i = A p_i
Since the vector used in the matrix-vector multiplication changes every iteration, you always have to update the ghost cells before doing the calculation.

Matrix-vector product in parallel (III)
So the parallel algorithm for the same section is:
  p_i = r_{i-1} + β (p_{i-1} - ω_{i-1} v_{i-1})
  update the ghost cells of p, e.g.
    - Process 0 sends p(2) to Process 1
    - Process 1 sends p(3) to Process 0
  v_i = A p_i

2-D example: Laplace equation
- Parallel domain decomposition
- Data exchange at the process boundaries is required
- Halo cells / ghost cells: a copy of the last row/column of data from the neighbor process
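
A minimal sketch of this ghost-cell update for the two-process example, assuming each process stores its two entries of p plus one ghost cell in a small local array; the storage layout is an illustrative choice.

    #include <mpi.h>

    /* Ghost-cell exchange for the two-process example:
     * rank 0 stores p[0] = p(1), p[1] = p(2), p[2] = ghost for p(3);
     * rank 1 stores p[0] = ghost for p(2), p[1] = p(3), p[2] = p(4). */
    void update_ghost_cells(double *p, int rank, MPI_Comm comm)
    {
        if (rank == 0) {
            /* send own boundary value p(2), receive the neighbor's p(3) */
            MPI_Sendrecv(&p[1], 1, MPI_DOUBLE, 1, 0,
                         &p[2], 1, MPI_DOUBLE, 1, 0,
                         comm, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* send own boundary value p(3), receive the neighbor's p(2) */
            MPI_Sendrecv(&p[1], 1, MPI_DOUBLE, 0, 0,
                         &p[0], 1, MPI_DOUBLE, 0, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }

Using MPI_Sendrecv for the exchange avoids the deadlock that two blocking sends could cause.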

Parallelization Strategy
- Finding Concurrency: structure the problem to expose exploitable concurrency
- Algorithm Structure: structure the algorithm to take advantage of concurrency
- Supporting Structure: intermediate stage between Algorithm Structure and Implementation; program structuring and definition of shared data structures
- Implementation Mechanisms: mapping of the higher-level patterns onto a programming environment

Finding concurrency - Result
- A task decomposition that identifies tasks that can execute concurrently
- A data decomposition that identifies data local to each task
- A way of grouping tasks and ordering them according to temporal constraints

Algorithm structure
- Organize by tasks: Task Parallelism, Divide and Conquer
- Organize by data decomposition: Geometric Decomposition, Recursive Data
- Organize by flow of data: Pipeline, Event-based Coordination

Task parallelism (I)
- The problem can be decomposed into a collection of tasks that can execute concurrently
- Tasks can be completely independent (embarrassingly parallel) or can have dependencies among them
- All tasks might be known at the beginning, or might be generated dynamically

Task parallelism (II)
Tasks:
- There should be at least as many tasks as UEs (typically many, many more)
- The computation associated with each task should be large enough to offset the overhead associated with managing tasks and handling dependencies
Dependencies:
- Ordering constraints: sequential composition of task-parallel computations
- Shared-data dependencies: several tasks have to access the same data structure

Shared data dependencies
Shared data dependencies can be categorized as follows:
- Removable dependencies: an apparent dependency that can be removed by code transformation, e.g.

    int i, ii = 0, jj = 0;
    for (i = 0; i < N; i++) {
        ii = ii + 1;
        d[ii] = big_time_consuming_work(ii);
        jj = jj + ii;
        a[jj] = other_big_time_consuming_work(jj);
    }

  can be transformed into a loop whose iterations are independent:

    for (i = 1; i <= N; i++) {
        d[i] = big_time_consuming_work(i);
        a[(i*i+i)/2] = other_big_time_consuming_work((i*i+i)/2);
    }

Shared data dependencies (II)
- Separable dependencies: replicate the shared data structure and combine the copies into a single structure at the end (remember the matrix-vector multiply using column-wise block distribution in the first MPI lecture? A sketch follows below.)
- Other dependencies: not resolvable, they have to be followed

Task scheduling
- Schedule: the way in which tasks are assigned to UEs for execution
- Goal: load balance - minimize the overall execution time of all tasks
- Two classes of schedule:
  - Static schedule: the distribution of tasks to UEs is determined at the start of the computation and is not changed anymore
  - Dynamic schedule: the distribution of tasks to UEs changes as the computation proceeds
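
A minimal sketch of such a separable dependency: a matrix-vector multiply with a column-wise block distribution, where each process accumulates into its own copy of the result vector and the copies are combined at the end. The storage layout and function name are illustrative, not the reference code from the MPI lecture.

    #include <stdlib.h>
    #include <mpi.h>

    /* y = A*x with a column-wise block distribution: each process owns
     * n_local columns of A (stored column-major) and the matching part of x.
     * It accumulates into a private, replicated copy of y; the copies are
     * combined by an element-wise sum at the end.                          */
    void matvec_columnwise(const double *A_local, const double *x_local,
                           int n, int n_local, double *y, MPI_Comm comm)
    {
        double *y_partial = calloc(n, sizeof(double));   /* replicated copy */

        for (int j = 0; j < n_local; j++)                /* local columns   */
            for (int i = 0; i < n; i++)
                y_partial[i] += A_local[j * n + i] * x_local[j];

        /* Combine the replicated copies into the final result. */
        MPI_Allreduce(y_partial, y, n, MPI_DOUBLE, MPI_SUM, comm);
        free(y_partial);
    }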

Task scheduling - example
Six independent tasks A, B, C, D, E, F of different sizes. (Figure: a poor mapping to 4 UEs leaves some UEs idle while others are still busy; a good mapping balances the work so all UEs finish at about the same time.)

Static schedule
- Tasks are grouped into blocks; blocks are assigned to UEs
- Each UE should take approximately the same amount of time to complete its tasks
- A static schedule is usually used when:
  - the availability of computational resources is predictable (e.g. dedicated usage of nodes)
  - the UEs are identical (e.g. a homogeneous parallel computer)
  - the size of each task is nearly identical

Dynamic scheduling
Used when:
- the effort associated with each task varies widely / is unpredictable
- the capabilities of the UEs vary widely (heterogeneous parallel machine)
Common implementations:
- task queues: if a UE finishes its current task, it removes the next task from the task queue (see the sketch below)
- work stealing: each UE has its own work queue; once its queue is empty, a UE steals work from the task queue of another UE

Dynamic scheduling - trade-offs
- Fine-grained (= shorter, smaller) tasks allow for better load balance
- Fine-grained tasks have higher costs for task management and dependency management
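
A minimal shared-memory sketch of dynamic scheduling with a task queue, assuming independent tasks identified by an integer index; here the "queue" is simply a shared counter protected by a mutex, and the names and task count are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_TASKS   64
    #define NUM_THREADS 4

    static int next_task = 0;                   /* head of the task queue */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void process_task(int t) { (void)t; /* stand-in for real work */ }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int t = (next_task < NUM_TASKS) ? next_task++ : -1;  /* take a task */
            pthread_mutex_unlock(&lock);
            if (t < 0) break;                   /* queue is empty */
            process_task(t);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);
        printf("all %d tasks processed\n", NUM_TASKS);
        return 0;
    }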

Divide and Conquer algorithms
(Figure: the problem is split repeatedly into sub-problems; the base cases are solved and the sub-solutions are merged back up.)

Divide and Conquer
- A problem is split into a number of smaller sub-problems
- Each sub-problem is solved independently
- The sub-solutions of the sub-problems are merged into the solution of the overall problem
Problems of Divide and Conquer for parallel computing:
- The amount of exploitable concurrency decreases over the lifetime of the computation
- Trivial parallel implementation: each function call to solve is a task of its own; for small problems, no new task should be generated, but baseSolve should be applied directly

Divide and Conquer
Implementation:
- On shared memory machines, a divide and conquer algorithm can easily be mapped to a fork/join model: a new task is forked (= created); after this task is done, it joins the original task (= is destroyed)
- On distributed memory machines: task queues, often implemented using the Master/Worker framework discussed later in this course

int solve ( Problem P )
{
    int solution;

    /* Check whether we can further partition the problem */
    if ( baseCase(P) ) {
        solution = baseSolve(P);            /* No, we can't */
    } else {                                /* Yes, we can  */
        Problem subproblems[N];
        int     subsolutions[N];
        int     i;

        subproblems = split(P);             /* Partition the problem */
        for ( i = 0; i < N; i++ ) {
            subsolutions[i] = solve( subproblems[i] );
        }
        solution = merge(subsolutions);
    }
    return solution;
}

Task Parallelism using the Master-Worker framework
(Figure: a master process manages a task queue and a result queue; worker processes 1 and 2 take tasks from the task queue and put their results into the result queue.)

Task Parallelism using work stealing
(Figure: worker processes 1 and 2 each maintain their own task queue; an idle worker steals tasks from the queue of the other worker.)
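
A minimal MPI sketch of the Master/Worker framework, assuming a pool of independent tasks identified by an integer index; the message tags, the data layout, and compute_task() are illustrative choices, not the course's reference implementation.

    #include <mpi.h>
    #include <stdio.h>

    #define NUM_TASKS 100
    #define TAG_WORK  1
    #define TAG_STOP  2

    /* Stand-in for the real per-task computation. */
    static double compute_task(int task) { return (double)task; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                         /* ---- master ---- */
            int next = 0, active = 0;
            double result, sum = 0.0;
            MPI_Status status;

            /* hand one initial task to every worker */
            for (int w = 1; w < size && next < NUM_TASKS; w++, next++, active++)
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

            /* dynamic scheduling: whoever returns a result gets the next task */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                sum += result;
                if (next < NUM_TASKS) {
                    MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    active--;
                }
            }
            printf("sum of all results: %f\n", sum);
        } else {                                 /* ---- worker ---- */
            int task;
            MPI_Status status;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
                         &status);
                if (status.MPI_TAG == TAG_STOP) break;
                double result = compute_task(task);
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }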

Geometric decomposition
For all applications relying on data decomposition:
- All processes should apply the same operations to different data items
Key elements:
- Data decomposition
- Exchange and update operation
- Data distribution and task scheduling

Algorithm structure - Pipeline pattern
- The calculation can be viewed in terms of data flowing through a sequence of stages
- The computation is performed on many data sets (compare to instruction-level pipelining in processors)
(Figure: four pipeline stages processing data sets C1 ... C6 over time; at any point in time, each stage works on a different data set.)

Pipeline pattern (II)
- The amount of concurrency is limited to the number of stages of the pipeline
- The pattern works best if the amount of work performed by the various stages is roughly equal
- Filling the pipeline: some stages will be idle
- Draining the pipeline: some stages will be idle
- Non-linear pipeline: the pattern allows for different execution paths for different data items, e.g.
  Stage 1 -> Stage 2 -> Stage 3a or Stage 3b -> Stage 4

Pipeline pattern (III)
Implementation:
- Each stage is typically assigned to a process/thread
- A stage might be a data-parallel task itself
- The computation per task has to be large enough to compensate for the communication costs between the tasks
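
A minimal MPI sketch of a linear pipeline, assuming each rank implements one stage and the data items flow from rank 0 to rank size-1; the stage function and the number of items are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    #define NUM_ITEMS 6     /* data sets C1 ... C6 flowing through the pipeline */

    /* Stand-in for the work one pipeline stage performs on one data item. */
    static double stage_work(double item, int stage) { return item + stage; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* one stage per process */

        for (int c = 0; c < NUM_ITEMS; c++) {
            double item;
            if (rank == 0)
                item = (double)c;                /* first stage produces the item */
            else
                MPI_Recv(&item, 1, MPI_DOUBLE, rank - 1, c, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

            item = stage_work(item, rank);       /* this stage's computation */

            if (rank < size - 1)
                MPI_Send(&item, 1, MPI_DOUBLE, rank + 1, c, MPI_COMM_WORLD);
            else
                printf("item %d left the pipeline with value %f\n", c, item);
        }
        MPI_Finalize();
        return 0;
    }

While the items flow through, different stages work on different items at the same time, which is the source of the pipeline's concurrency.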