Parallel Computing. Parallel Algorithm Design

Parallel Computing Parallel Algorithm Design

Task/Channel Model Parallel computation = set of tasks. A task = a program, local memory, and a collection of I/O ports. Tasks interact by sending messages through channels. 2010@FEUP Parallel Algorithm Design 2

Task/Channel Model [figure: tasks (nodes) connected by channels (directed edges)] 2010@FEUP Parallel Algorithm Design 3

Foster's Design Methodology 1. Partitioning 2. Communication 3. Agglomeration 4. Mapping [figure: Problem -> Partitioning -> Communication -> Agglomeration -> Mapping] 2010@FEUP Parallel Algorithm Design 4

1. Partitioning Dividing computation and data into pieces. Domain decomposition: divide data into pieces, e.g., an array into sub-arrays (reduction), a loop into sub-loops (matrix multiplication), a search space into sub-spaces (chess). Functional decomposition: divide computation into pieces, e.g., pipelines (floating-point multiplication), workflows (payroll processing). Then determine how to associate data with computations. 2010@FEUP Parallel Algorithm Design 5
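To make domain decomposition concrete, here is a minimal C sketch (not from the slides) that splits an n-element array into p contiguous blocks of nearly equal size; the helper names block_low and block_high are illustrative.

/* Block domain decomposition: task id owns elements block_low(id)..block_high(id);
 * block sizes differ by at most one element. */
#include <stdio.h>

static long block_low(int id, int p, long n)  { return (long)id * n / p; }
static long block_high(int id, int p, long n) { return (long)(id + 1) * n / p - 1; }

int main(void) {
    long n = 10;                 /* illustrative problem size */
    int  p = 4;                  /* illustrative number of tasks */
    for (int id = 0; id < p; id++)
        printf("task %d: elements %ld..%ld\n",
               id, block_low(id, p, n), block_high(id, p, n));
    return 0;
}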

Partitioning The individual pieces are called primitive tasks. Desirable attributes of a partition: many more primitive tasks than processors on the target computer; tasks of roughly equal size (in computation and data); the number of tasks increases with problem size. 2010@FEUP Parallel Algorithm Design 6

Example of domain decomposition 2010@FEUP Parallel Algorithm Design 7

Example of Functional Decomposition 2010@FEUP Parallel Algorithm Design 8

2. Communication Determine the values passed among tasks. Local communication: a task needs values from a small number of other tasks; create channels illustrating the data flow. Global communication: a significant number of tasks contribute data to perform a computation; don't create channels for them early in the design. 2010@FEUP Parallel Algorithm Design 9

Desirable attributes for communication Balanced: communication operations are balanced among tasks. Small degree: each task communicates with only a small group of neighbors. Concurrency: tasks can perform their communications concurrently, and tasks can perform computations concurrently. 2010@FEUP Parallel Algorithm Design 10

3. Agglomeration Agglomeration is the process of grouping tasks into larger tasks to improve performance. Here, minimizing communication is typically a design goal. Grouping tasks that communicate with each other eliminates that communication; this is called increasing locality. Grouping tasks can also allow us to combine multiple communications into one. 2010@FEUP Parallel Algorithm Design 11

Desirable attributes of agglomeration Increased locality of the parallel algorithm. Agglomerated tasks have similar computational and communication costs. The number of tasks increases with problem size. The number of tasks is as small as possible, yet at least as great as the number of processors on the target computer. 2010@FEUP Parallel Algorithm Design 12

4. Mapping Mapping is the process of assigning agglomerated tasks to processors. Here we're thinking of a distributed-memory machine. If we choose the number of agglomerated tasks to equal the number of processors, then the mapping is already done: each processor gets one agglomerated task. 2010@FEUP Parallel Algorithm Design 13

Mapping Goals Processor utilization: we would like processors to have roughly equal computational and communication costs. Minimize interprocessor communication. This can be posed as a graph partitioning problem: each partition should have roughly the same number of nodes, and the partition should cut a minimal number of edges. 2010@FEUP Parallel Algorithm Design 14

Partitioning a graph [figure: three example partitions of the same graph between P0 and P1] Equalizing processor utilization and minimizing interprocessor communication are often competing forces. 2010@FEUP Parallel Algorithm Design 15

Mapping heuristics Static number of tasks: structured communication with constant computation time per task: agglomerate tasks to minimize communication and create one task per processor; structured communication with variable computation time per task: cyclically map tasks to processors; unstructured communication: use a static load-balancing algorithm. Dynamic number of tasks: use a run-time task-scheduling algorithm (e.g., a master/slave strategy) or a dynamic load-balancing algorithm (e.g., share load among neighboring processors, remapping periodically). 2010@FEUP Parallel Algorithm Design 16
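As a small illustration of two of the branches above, the following sketch (assumed code, not part of the original slides) contrasts a block mapping, which suits tasks of constant cost, with a cyclic mapping, which spreads tasks of variable cost more evenly across processors.

/* Two static ways of assigning task i to one of p processors. */
#include <stdio.h>

/* Block mapping: contiguous tasks go to the same processor. */
static int block_owner(int i, int p, int n)  { return (int)(((long)(i + 1) * p - 1) / n); }
/* Cyclic mapping: tasks are dealt out round-robin. */
static int cyclic_owner(int i, int p)        { return i % p; }

int main(void) {
    int n = 8, p = 3;            /* illustrative task and processor counts */
    for (int i = 0; i < n; i++)
        printf("task %d -> block: P%d, cyclic: P%d\n",
               i, block_owner(i, p, n), cyclic_owner(i, p));
    return 0;
}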

Example 1. Boundary value problem [figure: a rod surrounded by insulation, with its ends in ice water] 2010@FEUP Parallel Algorithm Design 17

Boundary Value Problem Heat conduction physics: ∂u/∂t = a² ∂²u/∂x², where the diffusivity a² = k/(cρ) depends on the rod material. Discretization: u_{i,j} = temperature at position i and time j; ∂²u/∂x² ≈ (u_{i-1,j} - 2u_{i,j} + u_{i+1,j}) / (Δx)²; ∂u/∂t ≈ (u_{i,j+1} - u_{i,j}) / Δt. Update rule: u_{i,j+1} = r·u_{i-1,j} + (1 - 2r)·u_{i,j} + r·u_{i+1,j}, with r = a²·Δt / (Δx)². 2010@FEUP Parallel Algorithm Design 18
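A minimal sequential C sketch of the update rule above, assuming illustrative values for a², Δx and Δt and boundaries held fixed (not part of the original slides):

/* Explicit finite-difference update for the 1-D heat equation:
 * u[i] at time j+1 = r*u[i-1] + (1 - 2r)*u[i] + r*u[i+1], with r = a^2*dt/(dx*dx). */
#include <stdio.h>

#define N 10                     /* number of grid points (illustrative) */

int main(void) {
    double u[N], unew[N];
    double a2 = 1.0, dx = 0.1, dt = 0.001;     /* assumed material constant and step sizes */
    double r = a2 * dt / (dx * dx);            /* r = 0.1 here, a stable choice */

    for (int i = 0; i < N; i++) u[i] = 0.0;
    u[0] = 100.0;                              /* one boundary held at a fixed temperature */

    for (int j = 0; j < 1000; j++) {           /* m time steps */
        for (int i = 1; i < N - 1; i++)
            unew[i] = r * u[i-1] + (1.0 - 2.0 * r) * u[i] + r * u[i+1];
        unew[0] = u[0];                        /* boundary values do not change */
        unew[N-1] = u[N-1];
        for (int i = 0; i < N; i++) u[i] = unew[i];
    }
    printf("u[N/2] after 1000 steps: %f\n", u[N/2]);
    return 0;
}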

Boundary Value Problem Partition: one data item per grid point; associate one primitive task with each grid point; a two-dimensional domain decomposition. Communication: identify the communication pattern between primitive tasks; each interior primitive task has three incoming and three outgoing channels. 2010@FEUP Parallel Algorithm Design 19

Boundary Value Problem Agglomeration and mapping [figure: agglomerating the primitive tasks] 2010@FEUP Parallel Algorithm Design 20

Model Analysis χ = time to update one element; n = number of elements; m = number of iterations. Sequential execution time: m·n·χ. Parallel execution: p = number of processors; time to send a message of q items: λ + q/β, which is ≈ λ for small messages (λ = message latency, β = bandwidth). Parallel execution time: m·(χ·⌈n/p⌉ + 2λ). 2010@FEUP Parallel Algorithm Design 21
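As a worked example of these formulas, the snippet below evaluates the predicted sequential and parallel times; the cost parameters χ and λ and the problem sizes are assumptions chosen only for illustration.

/* Evaluate the performance model: sequential time m*n*chi,
 * parallel time m*(chi*ceil(n/p) + 2*lambda). */
#include <math.h>
#include <stdio.h>

int main(void) {
    double chi = 1e-8;           /* assumed time to update one element (s) */
    double lambda = 1e-4;        /* assumed message latency (s) */
    double n = 1e6, m = 1000, p = 16;
    double t_seq = m * n * chi;
    double t_par = m * (chi * ceil(n / p) + 2.0 * lambda);
    printf("sequential: %.2f s, parallel (p=%.0f): %.2f s, predicted speedup: %.1f\n",
           t_seq, p, t_par, t_seq / t_par);
    return 0;
}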

Example Parallel reduction Given an associative operator ⊕, compute a_0 ⊕ a_1 ⊕ a_2 ⊕ ... ⊕ a_{n-1}. Examples: add; multiply; and, or; maximum, minimum. Data decomposition: one task per value to operate on (one of the a's). 2010@FEUP Parallel Algorithm Design 22

Parallel reduction Further steps to reach a binomial tree 2010@FEUP Parallel Algorithm Design 23

Parallel reduction [figure: 16 initial values: 4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1] 2010@FEUP Parallel Algorithm Design 24

Parallel reduction [figure: after the first combining step: 1, 7, -6, 4, 4, 5, 8, 2] 2010@FEUP Parallel Algorithm Design 25

Parallel reduction [figure: after the second combining step: 8, -2, 9, 10] 2010@FEUP Parallel Algorithm Design 26

Parallel reduction [figure: after the third combining step: 17, 8] 2010@FEUP Parallel Algorithm Design 27

Parallel reduction Binomial tree [figure: final result 25] 2010@FEUP Parallel Algorithm Design 28

Agglomeration [figure: the values are agglomerated into four tasks, each computing a local sum before the partial sums are combined] 2010@FEUP Parallel Algorithm Design 29
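One possible MPI realization of the agglomerated reduction: each task first sums its own block, then the partial sums are combined along a binomial tree with point-to-point messages. This is a sketch for illustration; in practice a single MPI_Reduce call does the same job.

/* Binomial-tree summation of one partial result per task toward rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = rank + 1.0;   /* stands in for the sum of this task's block of values */

    /* In each step, half of the remaining tasks send their partial sum
     * to a partner and drop out; ceil(log2 p) steps in total. */
    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank & mask) {
            MPI_Send(&local, 1, MPI_DOUBLE, rank - mask, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + mask < p) {
            double partner;
            MPI_Recv(&partner, 1, MPI_DOUBLE, rank + mask, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += partner;
        }
    }
    if (rank == 0)
        printf("global sum = %f (expected %f)\n", local, p * (p + 1) / 2.0);

    MPI_Finalize();
    return 0;
}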

Analysis Parallel running time: χ = time to perform the binary operation; λ = time to communicate a value via a channel; n values and p tasks. Time for each task to perform its local calculations: (⌈n/p⌉ - 1)·χ. Communication steps: ⌈log p⌉, and after each receive there is one operation. Total time: (⌈n/p⌉ - 1)·χ + ⌈log p⌉·(λ + χ). 2010@FEUP Parallel Algorithm Design 30

Example: the N-body problem [figure: bodies B1, B2, B3; B1 has mass m, position (x, y) and velocity v, and experiences forces f1 and f2 from the other bodies] 2010@FEUP Parallel Algorithm Design 31

The N-body problem 2010@FEUP Parallel Algorithm Design 32

The N-body problem partitioning Domain partitioning: assume one task per particle. A task holds its particle's position, velocity vector and mass. Each iteration: get the positions and masses of all other particles; compute the new position and velocity. 2010@FEUP Parallel Algorithm Design 33

Gather and All-Gather operations Gather operation (sequential): one task receives the data from the other p - 1 tasks, one message after another. All-Gather operation: every task ends up with a copy of all the data. 2010@FEUP Parallel Algorithm Design 34

All-Gather To avoid conflicts, the all-gather is performed in ⌈log p⌉ steps, doubling the data held by each task in each step. Communication time for n items: λ + n/β. With p tasks there are ⌈log p⌉ iterations and the number of items exchanged doubles at each iteration: Σ_{i=1..log p} (λ + 2^{i-1}·n/(p·β)) = λ·⌈log p⌉ + n(p-1)/(p·β). 2010@FEUP Parallel Algorithm Design 35
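A sketch of how this all-gather step might appear in an MPI N-body code: every task owns n/p particle positions and MPI_Allgather leaves each task with all n positions. The buffer names are illustrative and n is assumed to be divisible by p.

/* Exchange of particle positions before the force computation of one iteration. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p, n = 1024;                    /* n = total number of particles (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = n / p;                      /* particles owned by this task */
    double *my_pos  = malloc(3 * local_n * sizeof(double));
    double *all_pos = malloc(3 * n * sizeof(double));
    for (int i = 0; i < 3 * local_n; i++)
        my_pos[i] = rank + 0.001 * i;         /* placeholder coordinates */

    /* Each task contributes its block; afterwards all_pos holds every position. */
    MPI_Allgather(my_pos, 3 * local_n, MPI_DOUBLE,
                  all_pos, 3 * local_n, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ... compute forces on the local particles using all_pos,
     *     then update the local positions and velocities ... */

    free(my_pos);
    free(all_pos);
    MPI_Finalize();
    return 0;
}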

Analysis N-body problem, parallel version: n bodies, p tasks, m iterations over time. Total time excluding I/O: m·(λ·⌈log p⌉ + n(p-1)/(p·β) + χ·n/p). 2010@FEUP Parallel Algorithm Design 36

Considering I/O Reading or writing n items of data through an I/O channel costs λ_io + n/β_io. In the N-body problem the initial values must be transmitted to the other tasks. 2010@FEUP Parallel Algorithm Design 37

Scatter operation Improving the distribution of the initial values: 1. The first task transmits n/2 items to another task. 2. The 2 tasks transmit n/4 items each to 2 other tasks. 3. The 4 tasks transmit n/8 items each to 4 other tasks. 4. And so on. Total time: Σ_{i=1..log p} (λ + n/(2^i·β)) = λ·⌈log p⌉ + n(p-1)/(p·β). 2010@FEUP Parallel Algorithm Design 38
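A minimal MPI sketch of this scatter step, assuming rank 0 holds the n initial items and that p divides n; MPI_Scatter is typically implemented internally with a tree-structured pattern like the one described above.

/* Rank 0 distributes n/p items of the initial data to every task. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p, n = 1024;                    /* n = total number of items (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int local_n = n / p;
    double *all_data = NULL;
    double *my_data  = malloc(local_n * sizeof(double));

    if (rank == 0) {                          /* only the root holds the full input */
        all_data = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++)
            all_data[i] = (double)i;          /* stands in for reading the input file */
    }

    MPI_Scatter(all_data, local_n, MPI_DOUBLE,
                my_data,  local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) free(all_data);
    free(my_data);
    MPI_Finalize();
    return 0;
}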

Analysis considering I/O Total time after m iterations = initial reading + scattering + computing m iterations + final gathering + writing = 2·(λ_io + n/β_io) + 2·(λ·⌈log p⌉ + n(p-1)/(p·β)) + m·(λ·⌈log p⌉ + n(p-1)/(p·β) + χ·n/p). 2010@FEUP Parallel Algorithm Design 39