Parallel Algorithm Design. CS595, Fall 2010


Programming Models

The programming model
o determines the basic concepts of the parallel implementation, and
o abstracts from the hardware as well as from the programming language or API.
The names used for programming models differ in the literature.

Programming Models

1. Sequential Model: The sequential program is automatically parallelized.
   o Advantage: Familiar programming model
   o Disadvantage: Limitations in compiler analysis
2. Message Passing Model: The application consists of a set of processes with separate address spaces. The processes exchange messages by explicit send/receive operations.
   o Advantage: Full control of performance aspects
   o Disadvantage: Complex programming

MP Programming Model

[Figure: two processes with separate memories, one on Node A and one on Node B; the sender executes send(Y) and the receiver executes receive(Y'), with the value of Y transferred as a message.]

MP Programming Model

o Locality is fully transparent.
o Error-prone programming.
o Difficult to build performance-portable programs.
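As a concrete illustration, here is a minimal sketch of the message passing model in C with MPI (not part of the original slides; the variable y, the value 42.0, and the use of ranks 0 and 1 are illustrative choices): rank 0 owns its copy of Y and sends it, rank 1 receives the value into its own, separate address space.

    /* Minimal sketch of the message passing model in C with MPI (illustrative):
     * rank 0 owns its copy of Y and sends it; rank 1 receives the value into
     * its own, separate address space. Run with at least two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double y = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            y = 42.0;                                            /* Y on node A */
            MPI_Send(&y, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* send(Y)     */
        } else if (rank == 1) {
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                         /* receive(Y') */
            printf("rank 1 received %f\n", y);
        }

        MPI_Finalize();
        return 0;
    }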

3. Shared Memory Programming Model

[Figure: several threads (or processes) on a processor/memory system perform read(X) and write(X) on the shared variable X held in a common memory.]
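A corresponding sketch of the shared memory model, assuming OpenMP as the threading API (the variable x and the use of a critical section are illustrative): all threads access the same variable x in a single address space, and the updates are serialized to avoid a data race.

    /* Minimal sketch of the shared memory model with OpenMP (an assumed API
     * choice): all threads read and write the same shared variable x; the
     * critical section serializes the updates to avoid a data race. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 0.0;                    /* shared variable, one copy in memory */

        #pragma omp parallel shared(x)
        {
            double local = omp_get_thread_num();   /* each thread's contribution */
            #pragma omp critical
            x += local;                            /* write(x) under mutual exclusion */
        }

        printf("x = %f\n", x);
        return 0;
    }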

Shared or Distributed Memory?

From the PETSc FAQ (www.mcs.anl.gov/petsc/petsc-as/documentation/faq.html#computers):

What kind of parallel computers or clusters are needed to use PETSc?
PETSc can be used with any kind of parallel system that supports MPI. BUT for any decent performance one needs:
o a fast, low-latency interconnect; any Ethernet, even 10 GigE, simply cannot provide the needed performance;
o high per-CPU memory performance. Each CPU (core, in dual-core systems) needs its own memory bandwidth of roughly 2 or more gigabytes per second. For example, standard dual-processor "PCs" will not provide better performance when the second processor is used; that is, you will not see speed-up when you use the second processor. This is because the speed of sparse matrix computations is almost totally determined by the speed of the memory, not the speed of the CPU.

Concepts of Parallel Programming

o Task: an arbitrary piece of work performed by a single process
o Thread: an abstract entity within a process that performs tasks; defines a unit for scheduling
o Process: an active entity with resources that performs tasks
o Processor: a physical resource that executes processes

Phases in the Parallelization Process

[Figure: a sequential computation is turned into a parallel program in four steps: decomposition into tasks, assignment of tasks to processes, orchestration into a parallel program, and mapping of the processes (P0 .. P3) onto processors. Decomposition and assignment together form the partitioning.]

Decomposition

Dividing computation and data into pieces.

Domain decomposition
o Divide data into pieces
o Determine how to associate computations with the data

Functional decomposition
o Divide computation into pieces
o Determine how to associate data with the computations
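For domain decomposition, a common concrete choice is a block distribution of the data. The following small helper is illustrative and not from the slides; the names n, p, rank, lo, hi are assumptions. It computes which contiguous block of n elements a given process owns.

    /* Illustrative helper (not from the slides): block domain decomposition of
     * n data elements over p processes; process 'rank' owns the half-open
     * index range [*lo, *hi). */
    void block_range(int n, int p, int rank, int *lo, int *hi)
    {
        int base = n / p;                 /* minimum block size               */
        int rem  = n % p;                 /* first 'rem' blocks get one extra */
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }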

Example of Decomposition: UEDGE 2D Decomposition

[Figure: 2D decomposition of the UEDGE physical mesh, which is based on magnetic flux surfaces (here DIII-D); the panels show ion density n_i (m^-3) and electron temperature T_e (eV) versus poloidal distance (m), from the core across the separatrix to the wall.]

Recent new capability: computing the sparse parallel Jacobian using matrix coloring. The original parallel UEDGE Jacobian supported block Jacobi only; recent progress provides complete Jacobian data, enabling the use of many different preconditioners.

Functional Decomposition

Breaking the computation into a collection of tasks:
o Tasks may be of variable length
o Tasks require data to be executed
o Tasks may become available dynamically

Data vs. Functional Parallelism

Data parallelism
o The same operations are executed in parallel on the elements of large data structures, e.g. arrays.
o Tasks are the operations on each individual element or on subsets of the elements.
o Whether tasks are of the same or of variable length depends on the application; many applications have tasks of the same length.

Example: Data Parallelism

for (i=0; i<n; i++)
    a[i] = b[i] + c[i];

[Figure: each element a[i] = b[i] + c[i] is an independent task (Task0 .. Task6).]
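One way to express this data-parallel loop in code is with an OpenMP parallel for (an assumed API choice, not prescribed by the slides; vector_add is a hypothetical wrapper): each iteration is an independent task and the runtime distributes the iterations across threads.

    /* The data-parallel loop from the slide written with an OpenMP parallel for
     * (an assumed API choice): each iteration is an independent task and the
     * runtime distributes the iterations across threads. */
    void vector_add(double *a, const double *b, const double *c, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }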

Example: Data Parallelism, Variable Length

for (i=0; i<n; i++)
    for (j=0; j<=i; j++)
        a[i] = a[i] + b[i][j];

[Figure: each element a[i] accumulates i+1 values of b[i][j], so the tasks (Task0 .. Task6) have different lengths.]

Data vs. Functional Parallelism

Functional parallelism
o Entirely different calculations can be performed concurrently on either the same or different data.
o The tasks are usually specified via different functions or different code regions.
o The degree of available functional parallelism is usually modest.
o Tasks are of different lengths in most cases.

Example: Functional Parallelism

f(a);
g(b);
i = i + 10;
j = 2*k + z**2;

The functions or statements can be executed in parallel. Since these are different operations, this is functional parallelism.
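A sketch of how such functional parallelism could be expressed with OpenMP sections (an assumed API choice; f, g and the variables are the placeholders from the slide, and functional_example is a hypothetical wrapper): each independent statement becomes its own section and may run on a different thread.

    /* The slide's four independent statements as OpenMP sections
     * (an assumed API choice); f, g and the variables are placeholders. */
    void f(double a);   /* assumed external functions from the slide */
    void g(double b);

    void functional_example(double a, double b, int *i, double *j, double k, double z)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            f(a);
            #pragma omp section
            g(b);
            #pragma omp section
            *i = *i + 10;
            #pragma omp section
            *j = 2*k + z*z;   /* z**2 from the slide written as z*z in C */
        }
    }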

Phases in the Parallelization Process

[Figure repeated: decomposition, assignment, orchestration, mapping.]

Assignment

Assignment means specifying the mechanism by which tasks will be distributed among processes.

Goals:
o Balance the workload
o Reduce interprocess communication
o Reduce assignment overhead

Assignment time:
o Static: fixed assignment during compilation or program creation
o Dynamic: adaptive assignment during execution

Example: Balance Workload

for (i=0; i<n; i++)
    a[i] = b[i] + c[i];

[Figure: equal-length tasks (Task0 .. Task6), one per element of a.]

Example: Balance Workload

for (i=0; i<n; i++)
    for (j=0; j<i; j++)
        a[i] = a[i] + b[i][j];

[Figure: tasks (Task0 .. Task6) of increasing length, one per element of a.]

Static load balancing
o 2 different assignments
Dynamic load balancing
o Postpone the assignment until execution time
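In OpenMP terms (an assumed API choice; the function names are hypothetical), the two strategies correspond to the loop schedule: schedule(static) fixes the assignment of iterations before execution, while schedule(dynamic) postpones it until threads become idle, which helps for the triangular loop above where task i performs i units of work.

    /* Sketch of the two assignment strategies for the triangular loop above,
     * using OpenMP loop schedules (an assumed API choice). */
    void triangular_static(double *a, double **b, int n)
    {
        /* static: the iteration-to-thread assignment is fixed before execution */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < i; j++)
                a[i] = a[i] + b[i][j];
    }

    void triangular_dynamic(double *a, double **b, int n)
    {
        /* dynamic: idle threads fetch the next iteration at run time */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < i; j++)
                a[i] = a[i] + b[i][j];
    }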

Phases in the Parallelization Process

[Figure repeated: decomposition, assignment, orchestration, mapping.]

Orchestration

Implementation in a given programming model; provides the means for
o Naming and accessing data
o Exchanging data
o Synchronization

Questions:
o How to organize data structures?
o How to schedule the assigned tasks to improve locality?
o Whether to communicate in large or small messages?

Performance goals:
o Reduce communication and synchronization overhead
o Balance the load

The goals can be conflicting, e.g. reduction of communication vs. load balancing.

Phases in the Parallelization Process

[Figure repeated: decomposition, assignment, orchestration, mapping.]

Mapping

Mapping assigns processes to processors. It is done by the program and/or the operating system:
o Shared memory system: done by the operating system
o Distributed memory system: done by the user or by a runtime library such as MPI

Mapping Goals

o If there are more processes than processors, put multiple related processes on the same processor.
o This may also be an option for heavily interacting processes, no matter how many processors are available.
o Exploit locality in the network topology: place processes close to the data they need.
o Maximize processor utilization; minimize interprocessor communication.

Optimal Mapping

Finding an optimal mapping is NP-hard, so one must rely on heuristics.

Performance Goals

Decomposition (mostly architecture-independent):
o Expose enough concurrency, but not too much

Assignment (mostly architecture-independent):
o Balance the workload
o Reduce communication volume

Orchestration (architecture-dependent):
o Reduce unnecessary communication via data locality
o Reduce communication and synchronization cost as seen by the processor
o Reduce serialization at shared resources

Mapping (architecture-dependent):
o Put related processes on the same processor if necessary
o Exploit locality in the network topology

Application Structure

Frequently used patterns for parallel applications:
o Single Program Multiple Data (SPMD)
o Embarrassingly Parallel
o Master / Slave
o Work Pool
o Divide and Conquer
o Pipeline
o Competition

Structure: Single Program Multiple Data

A single program is executed in a replicated fashion: processes or threads execute the same operations on different data.

Loosely synchronous: a sequence of phases of computation and communication/synchronization.

[Figure: over time, computation phases alternate with synchronization/communication phases across all processes.]
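A loosely synchronous SPMD skeleton might look as follows in C with MPI (an assumed choice; local, local_n, nsteps and the placeholder computation are illustrative): every process runs the same code on its own partition and all processes synchronize between phases.

    /* Loosely synchronous SPMD skeleton in C with MPI (illustrative): every
     * process executes the same code on its own partition 'local' and all
     * processes synchronize between phases. */
    #include <mpi.h>

    void spmd_phases(double *local, int local_n, int nsteps)
    {
        for (int step = 0; step < nsteps; step++) {
            /* computation phase on the local part of the data */
            for (int i = 0; i < local_n; i++)
                local[i] = 0.5 * local[i];        /* placeholder computation */

            /* communication/synchronization phase between the computation phases */
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }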

Structure: Embarrassingly Parallel

Multiple processes are spawned at the beginning and execute totally independently of each other. The application terminates after all processes have terminated.

[Figure: processes P0 .. P6 run side by side over time without any interaction.]

Structure: Master / Slave

One process acts as the master: it distributes tasks to the slaves and receives the results from them. The slaves execute the assigned tasks, usually independently of the other slaves. Frequently used on workstation networks.

[Figure: one master connected to several slaves.]
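A master/slave sketch in C with MPI (assumed; the tag 0, the use of a task index as the unit of work, and the value -1 as a stop message are illustrative choices): the master hands out one task at a time and collects the results.

    /* Master/slave sketch with MPI (illustrative): the unit of work is a task
     * index, and the value -1 is used as a stop message. */
    #include <mpi.h>

    void master(int ntasks, int nslaves)
    {
        int next = 0, done = 0;
        /* give every slave an initial task (or a stop message if none are left) */
        for (int s = 1; s <= nslaves; s++) {
            int msg = (next < ntasks) ? next++ : -1;
            MPI_Send(&msg, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
        }
        /* collect results; reply with a new task or a stop message */
        while (done < ntasks) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            done++;
            int msg = (next < ntasks) ? next++ : -1;
            MPI_Send(&msg, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
    }

    void slave(void)
    {
        for (;;) {
            int task;
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task < 0) break;                      /* stop message */
            double result = (double)task;             /* placeholder for real work */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }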

Structure: Work Pool

Processes fetch tasks from a pool and insert new tasks into the pool. The pool requires synchronization; large parallel machines require a distributed work pool. This structure leads to better load balancing.

[Figure: processes P0 .. P2 fetching tasks from, and inserting tasks into, a shared work pool.]
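A minimal shared work pool for a shared memory system, sketched with POSIX threads (an assumed API; the pool here is just a shared task counter protected by a mutex, which stands in for the synchronization the slide mentions, and work_pool_t, pool_fetch and worker are hypothetical names).

    /* Minimal shared work pool with POSIX threads (illustrative): the pool is a
     * shared task counter protected by a mutex. Initialize e.g. as
     *   work_pool_t pool = { PTHREAD_MUTEX_INITIALIZER, 0, ntasks };
     * and start the workers with pthread_create(..., worker, &pool). */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;   /* the synchronization the slide mentions */
        int next_task;          /* next task to hand out                  */
        int ntasks;             /* total number of tasks in the pool      */
    } work_pool_t;

    /* fetch the next task index, or -1 when the pool is empty */
    static int pool_fetch(work_pool_t *pool)
    {
        pthread_mutex_lock(&pool->lock);
        int task = (pool->next_task < pool->ntasks) ? pool->next_task++ : -1;
        pthread_mutex_unlock(&pool->lock);
        return task;
    }

    void *worker(void *arg)
    {
        work_pool_t *pool = (work_pool_t *)arg;
        int task;
        while ((task = pool_fetch(pool)) >= 0) {
            /* process 'task'; new tasks could be inserted here under the lock */
            (void)task;
        }
        return NULL;
    }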

Structure: Divide and Conquer

Recursive partitioning of tasks and collection of the results. Used on systems with background load.

Problems:
o load balancing
o least granularity

[Figure: a task tree; each task is split recursively into subtasks.]
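A divide-and-conquer sketch using OpenMP tasks (an assumed API choice; recursive_sum and the cutoff parameter are illustrative): the range is split recursively, the halves run as parallel tasks, and the cutoff controls the least granularity.

    /* Divide-and-conquer sketch with OpenMP tasks (an assumed API choice):
     * the index range is split recursively, the halves run as parallel tasks,
     * and 'cutoff' controls the least granularity. Call from inside
     * #pragma omp parallel followed by #pragma omp single. */
    double recursive_sum(const double *a, int lo, int hi, int cutoff)
    {
        if (hi - lo <= cutoff) {               /* small enough: solve directly */
            double s = 0.0;
            for (int i = lo; i < hi; i++)
                s += a[i];
            return s;
        }
        int mid = lo + (hi - lo) / 2;
        double left, right;
        #pragma omp task shared(left)
        left = recursive_sum(a, lo, mid, cutoff);
        #pragma omp task shared(right)
        right = recursive_sum(a, mid, hi, cutoff);
        #pragma omp taskwait                   /* collect the results of the subtasks */
        return left + right;
    }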

Structure: Pipeline

o Different functions are applied to the data: functional decomposition.
o Parallel execution of the functions for different data.

Examples:
o Signal and image processing
o Groundwater flow, flow of pollutants, visualization

Almost no example offers high parallelism.

[Figure: stages P0, P1, P2 connected in a pipeline.]

Pipelining: Example

We would like to prepare and mail N = 1000 envelopes, each containing a 4-page document, to the members of an association.

Stages:
o XEROX: T1 = 10 sec
o Staple & Fold: T2 = 5 sec
o Insert into Envelope: T3 = 5 sec
o Seal, write address, put stamp: T4 = 15 sec

At what interval do we see a new envelope prepared for mailing?
  max(T1, T2, ..., Tk) = Tmax = 15 sec

What is the total time to get N envelopes prepared?
  Time = Cold_Start_Time + Tmax * (N-1) ≈ N * Tmax

What is the total time we would have spent if pipelining is not used?
  N * Σi Ti

Pipelining: Example (cont'd)

Same N = 1000 mailings and stage times (T1 = 10 sec, T2 = 5 sec, T3 = 5 sec, T4 = 15 sec).

How much speedup do we get?
  Speedup = Tseq / Tpipe = [N * Σi Ti] / [N * Tmax] = Σi Ti / Tmax = 35/15

If you cannot do much about the completion time of one task (i.e. Σi Ti), what can you do to maximize the speedup?
  (i) Create as many stations (stages) as possible, and
  (ii) Try to balance the load at each station, i.e. T1 = T2 = ... = Tk.
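As a quick numeric check of these formulas (plain C, illustrative only, using the stage times from the example):

    /* Numeric check of the pipeline formulas with the stage times from the
     * example (illustrative): pipelined time = cold start + Tmax*(N-1),
     * sequential time = N * sum(Ti), speedup = sum(Ti) / Tmax. */
    #include <stdio.h>

    int main(void)
    {
        double t[] = { 10.0, 5.0, 5.0, 15.0 };   /* stage times in seconds */
        int k = 4, n = 1000;
        double sum = 0.0, tmax = 0.0;

        for (int i = 0; i < k; i++) {
            sum += t[i];
            if (t[i] > tmax) tmax = t[i];
        }
        /* the first envelope must pass through every stage: cold start = sum */
        printf("pipelined time  = %.0f sec\n", sum + tmax * (n - 1));  /* 15020 */
        printf("sequential time = %.0f sec\n", sum * n);               /* 35000 */
        printf("speedup         = %.2f\n", sum / tmax);                /* 2.33  */
        return 0;
    }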

Pipelining: Example (cont'd)

One possible configuration to maximize speedup: use two copiers (XEROX-1 and XEROX-2) and split the Seal / write address / put stamp step, so that each resulting station takes the same time, T1 = T2 = ... = T6 = 5 sec.

At what interval do we see a new envelope prepared for mailing?
  max(T1, T2, ..., Tk) = Tmax = 5 sec

What is the speedup now?
  Speedup = 30/5 = 6 = the number of stages in the pipeline!

Structure: Competition

Multiple solution strategies are evaluated in parallel; it might be unknown which strategy will succeed or which one is the fastest. With k processors, k strategies can be tested. If one of the additional strategies, not tested in the sequential program, is very fast, the speedup can be more than k (superlinear speedup).

Summary: Parallel Programming

Programming models for SM and DM systems:
o Sequential, Message Passing, Shared Memory

Phases in the parallelization process:
o Decomposition, assignment, orchestration, mapping

Phases in the Parallelization Process

[Figure repeated: decomposition, assignment, orchestration, mapping.]

Performance Goals

Decomposition (mostly architecture-independent):
o Expose enough concurrency, but not too much

Assignment (mostly architecture-independent):
o Balance the workload
o Reduce communication volume

Orchestration (architecture-dependent):
o Reduce unnecessary communication via data locality
o Reduce communication and synchronization cost as seen by the processor
o Reduce serialization at shared resources

Mapping (architecture-dependent):
o Put related processes on the same processor if necessary
o Exploit locality in the network topology

Application Structure

Frequently used patterns for parallel applications:
o Single Program Multiple Data (SPMD)
o Embarrassingly Parallel
o Master / Slave
o Work Pool
o Divide and Conquer
o Pipeline
o Competition