Principles of Parallel Algorithm Design: Concurrency and Decomposition

Principles of Parallel Algorithm Design: Concurrency and Decomposition. John Mellor-Crummey, Department of Computer Science, Rice University, johnmc@rice.edu. COMP 422/534, Lecture 2, 12 January 2017.

Parallel Algorithm: a recipe to solve a problem using multiple processors. Typical steps for constructing a parallel algorithm:
- identify what pieces of work can be performed concurrently
- partition and map work onto independent processors
- distribute a program's input, output, and intermediate data
- coordinate accesses to shared data: avoid conflicts
- ensure proper order of work using synchronization
Why "typical"? Some of the steps may be omitted:
- if data is in shared memory, distributing it may be unnecessary
- if using message passing, there may not be shared data
- the mapping of work to processors can be done statically by the programmer or dynamically by the runtime

Topics for Today
- Introduction to parallel algorithms: tasks and decomposition, threads and mapping, threads versus cores
- Decomposition techniques (part 1): recursive decomposition, data decomposition

Decomposing Work for Parallel Execution
- Divide work into tasks that can be executed concurrently
- Many different decompositions are possible for any computation
- Tasks may be the same size, different sizes, or even of indeterminate size
- Tasks may be independent or have non-trivial ordering constraints
- Conceptualize tasks and ordering as a task dependency DAG: node = task, edge = control dependence
[Figure: example task dependency DAG with tasks T1-T17]

Example: Dense Matrix-Vector Product
[Figure: n x n matrix A multiplied by vector b to produce vector y; Task i computes element i of y]
- Computing each element of the output vector y is independent
- Easy to decompose the dense matrix-vector product into tasks: one per element of y
- Observations: task size is uniform; no control dependences between tasks; tasks share b
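A minimal sketch of this one-task-per-output-element decomposition. The lecture does not prescribe a programming model; C with OpenMP is assumed here, with each iteration of the outer loop playing the role of one task.

#include <stdlib.h>

/* Dense matrix-vector product y = A * b (A is n x n, row-major).
   Each iteration of the outer loop is one task: it computes a single
   element y[i] and only reads the shared vector b, so tasks are independent. */
void dense_mv(long n, const double *A, const double *b, double *y)
{
    #pragma omp parallel for              /* one fine-grain task per element of y */
    for (long i = 0; i < n; i++) {
        double sum = 0.0;
        for (long j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];   /* row i of A times the shared vector b */
        y[i] = sum;
    }
}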

Example: Database Query Processing
Consider the execution of the query
  MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:

  ID#   Model    Year  Color  Dealer  Price
  4523  Civic    2002  Blue   MN      $18,000
  3476  Corolla  1999  White  IL      $15,000
  7623  Camry    2001  Green  NY      $21,000
  9834  Prius    2001  Green  CA      $18,000
  6734  Civic    2001  White  OR      $17,000
  5342  Altima   2001  Green  FL      $19,000
  3845  Maxima   2001  Blue   NY      $22,000
  8354  Accord   2000  Green  VT      $18,000
  4395  Civic    2001  Red    CA      $17,000
  7352  Civic    2002  Red    WA      $18,000

Example: Database Query Processing
- Task: compute the set of elements that satisfy a predicate; the task's result is the table of entries that satisfy the predicate
- Edge: the output of one task serves as input to the next
- Query: MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")

Example: Database Query Processing
- Alternate task decomposition for the query MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
- Different decompositions may yield different parallelism and different amounts of work

Granularity of Task Decompositions
- Granularity = task size; depends on the number of tasks
- Fine-grain = large number of tasks; coarse-grain = small number of tasks
- Granularity examples for dense matrix-vector multiply: fine-grain: each task computes an individual element of y; coarser-grain: each task computes 3 elements of y
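A coarser-grain variant of the matrix-vector sketch above, again assuming C with OpenMP; each task now computes a block of up to 3 consecutive elements of y instead of a single one.

#include <stdlib.h>

/* Coarser-grain decomposition of y = A * b: one task per block of 3 rows. */
void dense_mv_coarse(long n, const double *A, const double *b, double *y)
{
    long nblocks = (n + 2) / 3;                 /* number of coarse-grain tasks */
    #pragma omp parallel for schedule(static, 1)
    for (long t = 0; t < nblocks; t++) {
        long hi = (3 * t + 3 < n) ? 3 * t + 3 : n;
        for (long i = 3 * t; i < hi; i++) {     /* up to 3 elements of y per task */
            double sum = 0.0;
            for (long j = 0; j < n; j++)
                sum += A[i * n + j] * b[j];
            y[i] = sum;
        }
    }
}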

Degree of Concurrency
- Definition: the number of tasks that can execute in parallel; may change during program execution
- Metrics: maximum degree of concurrency = the largest number of concurrent tasks at any point in the execution; average degree of concurrency = the average number of tasks that can be processed in parallel
- Degree of concurrency vs. task granularity: inverse relationship

Example: Dense Matrix-Vector Product
[Figure: n x n matrix A multiplied by vector b to produce vector y; Task i computes element i of y]
- Computing each element of the output vector y is independent
- Easy to decompose the dense matrix-vector product into tasks: one per element of y
- Observations: task size is uniform; no control dependences between tasks; tasks share b
- Question: Is n the maximum number of tasks possible?

Critical Path
- Each edge in the task dependency graph represents a task serialization
- Critical path = the longest weighted path through the graph
- Critical path length = a lower bound on parallel execution time
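A sketch of how the critical path length can be computed, assuming the task dependency DAG is given as adjacency lists of successors with a cost per task (this representation is an illustrative choice, not from the lecture). It processes tasks in topological order and tracks earliest finish times.

#include <stdlib.h>

/* Critical path length of a task dependency DAG.
   cost[v]       = execution time of task v
   adj[v], deg[v] = successors of task v and their count
   Returns the largest finish time over all tasks, i.e. the length of the
   longest weighted path: a lower bound on parallel execution time. */
double critical_path_length(int n, const double *cost, int **adj, const int *deg)
{
    int *indeg = calloc(n, sizeof(int));
    for (int v = 0; v < n; v++)
        for (int k = 0; k < deg[v]; k++)
            indeg[adj[v][k]]++;

    double *finish = malloc(n * sizeof(double));
    int *queue = malloc(n * sizeof(int));
    int head = 0, tail = 0;
    for (int v = 0; v < n; v++) {
        finish[v] = cost[v];                    /* finish time if v had no predecessors */
        if (indeg[v] == 0) queue[tail++] = v;   /* start from source tasks */
    }
    double longest = 0.0;
    while (head < tail) {                       /* Kahn-style topological traversal */
        int v = queue[head++];
        if (finish[v] > longest) longest = finish[v];
        for (int k = 0; k < deg[v]; k++) {
            int w = adj[v][k];
            if (finish[v] + cost[w] > finish[w])
                finish[w] = finish[v] + cost[w];
            if (--indeg[w] == 0) queue[tail++] = w;
        }
    }
    free(indeg); free(finish); free(queue);
    return longest;
}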

Critical Path Length
Examples: database query task dependency graphs (the number in each vertex represents task cost). Questions:
- What are the tasks on the critical path for each dependency graph?
- What is the shortest parallel execution time for each decomposition?
- How many processors are needed to achieve the minimum time?
- What is the maximum degree of concurrency?
- What is the average parallelism?

Critical Path Length
Example: dependency graph for the dense matrix-vector product.
[Figure: n x n matrix A multiplied by vector b to produce vector y; Task i computes element i of y]
Questions:
- What does the task dependency graph look like for the dense matrix-vector product?
- What is the shortest parallel execution time for the graph?
- How many processors are needed to achieve the minimum time?
- What is the maximum degree of concurrency?
- What is the average parallelism?

Limits on Parallel Performance
What bounds parallel execution time?
- minimum task granularity: e.g., dense matrix-vector multiplication has at most n^2 concurrent tasks
- dependencies between tasks
- parallelization overheads, e.g., the cost of communication between tasks
- the fraction of application work that can't be parallelized (more about Amdahl's law in a later lecture)
Measures of parallel performance:
- speedup = T1/Tp
- parallel efficiency = T1/(p Tp)
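A worked example with illustrative numbers (not from the lecture): if a computation takes T1 = 100 seconds on one thread and Tp = 30 seconds on p = 4 threads, then speedup = 100/30 ≈ 3.3 and parallel efficiency = 100/(4 × 30) ≈ 0.83.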

Task Interaction Graphs
- Tasks generally exchange data with other tasks; example: in dense matrix-vector multiply, if vector b is not replicated in all tasks, tasks will have to communicate elements of b
- Task interaction graph: node = task, edge = interaction or data exchange
- Task interaction graphs vs. task dependency graphs: task interaction graphs represent data dependences; task dependency graphs represent control dependences

Task Interaction Graph Example
Sparse matrix-vector multiplication
- Computation of each result element is an independent task
- Only the non-zero elements of the sparse matrix A participate
- If b is partitioned among the tasks, the structure of the task interaction graph = the graph of the matrix A (i.e., the graph for which A represents the adjacency structure)
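A sketch of the sparse case, assuming compressed sparse row (CSR) storage (an illustrative choice). The entries of b that task i touches are exactly the column indices of the nonzeros in row i, which is what the task interaction graph captures.

/* Sparse matrix-vector product y = A * b with A in CSR format.
   Task i computes y[i]; the entries of b it reads are exactly the column
   indices of the nonzeros in row i, so the task interaction graph has the
   same structure as the graph of the matrix A. */
void sparse_mv(int n, const int *rowptr, const int *colind,
               const double *val, const double *b, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * b[colind[k]];   /* interaction with the owner of b[colind[k]] */
        y[i] = sum;
    }
}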

Interaction Graphs, Granularity, and Communication
- Finer task granularity increases communication overhead
- Example: sparse matrix-vector product interaction graph
- Assumptions: each node takes unit time to process; each interaction (edge) incurs an overhead of one unit of time
- If node 0 is a task: communication = 3, computation = 4
- If nodes 0, 4, and 5 form one task: communication = 5, computation = 15
- A coarser-grain decomposition gives a smaller ratio of communication to computation

Tasks, Threads, and Mapping
- Generally the number of tasks exceeds the number of threads available, so a parallel algorithm must map tasks to threads
- Why threads rather than CPU cores? Aggregate tasks into threads: a thread is a processing or computing agent that performs work; assign a collection of tasks and associated data to a thread
- The operating system maps threads to physical cores; operating systems often enable binding a thread to a core; for multithreaded cores, the OS can bind multiple software threads to distinct hardware threads associated with a core

Tasks, Threads, and Mapping
- Mapping tasks to threads is critical for parallel performance
- On what basis should one choose mappings?
- Using task dependency graphs: schedule independent tasks on separate threads for minimum idling and optimal load balance
- Using task interaction graphs: want threads to have minimum interaction with one another, i.e., minimum communication

Tasks, Threads, and Mapping
A good mapping must minimize parallel execution time by:
- mapping independent tasks to different threads
- assigning tasks on the critical path to threads as soon as possible
- minimizing interactions between threads: map tasks with dense interactions to the same thread
Difficulty: these criteria often conflict with one another; e.g., decomposing into a single task minimizes interactions but yields no speedup!

Tasks, Threads, and Mapping Example
Example: mapping database queries to threads
- Consider the dependency graph in levels: no nodes within a level depend upon one another
- Compute the levels using topological sort (a sketch of this computation follows below)
- Assign all tasks within a level to different threads
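A sketch of the level computation, using the same hypothetical adjacency-list DAG representation as the critical-path example above: level(v) = 1 + the maximum level over v's predecessors, so tasks in the same level are independent and can be mapped to different threads.

#include <stdlib.h>

/* Assign each task a level such that no two tasks in the same level depend
   on one another; tasks within one level can be mapped to different threads. */
void compute_levels(int n, int **adj, const int *deg, int *level)
{
    int *indeg = calloc(n, sizeof(int));
    for (int v = 0; v < n; v++)
        for (int k = 0; k < deg[v]; k++)
            indeg[adj[v][k]]++;

    int *queue = malloc(n * sizeof(int));
    int head = 0, tail = 0;
    for (int v = 0; v < n; v++) {
        level[v] = 0;                          /* source tasks sit at level 0 */
        if (indeg[v] == 0) queue[tail++] = v;
    }
    while (head < tail) {                      /* topological (level-order) traversal */
        int v = queue[head++];
        for (int k = 0; k < deg[v]; k++) {
            int w = adj[v][k];
            if (level[v] + 1 > level[w]) level[w] = level[v] + 1;
            if (--indeg[w] == 0) queue[tail++] = w;
        }
    }
    free(indeg); free(queue);
}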

Topics for Today
- Introduction to parallel algorithms: tasks and decomposition, threads and mapping, threads versus cores
- Decomposition techniques (part 1): recursive decomposition, data decomposition

Decomposition Techniques
How should one decompose a task into various subtasks? There is no single universal recipe. In practice, a variety of techniques are used, including:
- recursive decomposition
- data decomposition
- exploratory decomposition
- speculative decomposition

Recursive Decomposition
Suitable for problems solvable using divide-and-conquer. Steps:
1. decompose a problem into a set of sub-problems
2. recursively decompose each sub-problem
3. stop decomposition when the minimum desired granularity is reached

Recursive Decomposition for Quicksort
To sort a vector v (see the sketch below):
1. select a pivot
2. partition v around the pivot into vleft and vright
3. in parallel, sort vleft and sort vright
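A hedged sketch of this recursive decomposition, again assuming C with OpenMP tasks (the lecture does not prescribe a programming model); the 1024-element serial cutoff and the middle-element pivot are arbitrary illustrative choices.

#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Recursive decomposition of quicksort: partition v[lo..hi] around a pivot,
   then sort the left and right parts as two independent tasks. */
static void quicksort_rec(double *v, long lo, long hi)
{
    if (lo >= hi)
        return;
    if (hi - lo < 1024) {                       /* minimum granularity: sort serially */
        qsort(v + lo, (size_t)(hi - lo + 1), sizeof(double), cmp_double);
        return;
    }
    double pivot = v[(lo + hi) / 2];            /* 1. select a pivot               */
    long i = lo, j = hi;
    while (i <= j) {                            /* 2. partition v around the pivot */
        while (v[i] < pivot) i++;
        while (v[j] > pivot) j--;
        if (i <= j) {
            double t = v[i]; v[i] = v[j]; v[j] = t;
            i++; j--;
        }
    }
    #pragma omp task shared(v)                  /* 3. sort vleft and vright in parallel */
    quicksort_rec(v, lo, j);
    #pragma omp task shared(v)
    quicksort_rec(v, i, hi);
    #pragma omp taskwait
}

void parallel_quicksort(double *v, long n)
{
    #pragma omp parallel
    #pragma omp single
    quicksort_rec(v, 0, n - 1);
}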

Recursive Decomposition for Min
Finding the minimum in a vector using divide-and-conquer:

procedure SERIAL_MIN(A, n)
begin
  min := A[0];
  for i := 1 to n - 1 do
    if (A[i] < min) min := A[i];
  return min;
end

procedure RECURSIVE_MIN(A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := spawn RECURSIVE_MIN(&A[0], n/2);
    rmin := spawn RECURSIVE_MIN(&A[n/2], n - n/2);
    sync;
    if (lmin < rmin) then min := lmin;
    else min := rmin;
  return min;
end

Applicable to other associative operations, e.g., sum, AND.
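A runnable counterpart to the RECURSIVE_MIN pseudocode, sketched with C and OpenMP tasks: the pseudocode's "spawn" is mapped to "#pragma omp task", and the 4096-element serial cutoff is an arbitrary granularity choice.

/* Recursive decomposition of a min reduction over A[0..n-1]. */
static double recursive_min(const double *A, long n)
{
    if (n <= 4096) {                       /* minimum granularity reached: go serial */
        double min = A[0];
        for (long i = 1; i < n; i++)
            if (A[i] < min) min = A[i];
        return min;
    }
    double lmin, rmin;
    #pragma omp task shared(lmin)          /* spawn on the left half  */
    lmin = recursive_min(A, n / 2);
    #pragma omp task shared(rmin)          /* spawn on the right half */
    rmin = recursive_min(A + n / 2, n - n / 2);
    #pragma omp taskwait                   /* sync before combining partial results */
    return lmin < rmin ? lmin : rmin;
}

double parallel_min(const double *A, long n)
{
    double result;
    #pragma omp parallel
    #pragma omp single
    result = recursive_min(A, n);
    return result;
}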

Data Decomposition
Steps:
1. identify the data on which computations are performed
2. partition the data across various tasks; the partitioning induces a decomposition of the problem
Data can be partitioned in various ways, and appropriate partitioning is critical to parallel performance. Decomposition can be based on:
- input data
- output data
- input + output data
- intermediate data

Decomposition Based on Input Data
- Applicable if each output is computed as a function of the input
- May be the only natural decomposition if the output is unknown; examples: finding the minimum in a set or other reductions, sorting a vector
- Associate a task with each partition of the input data: the task performs computation on its part of the data; subsequent processing combines the partial results from earlier tasks

Example: Decomposition Based on Input Data
Count the frequency of item sets in database transactions. Partition the computation by partitioning the set of transactions (a sketch of this pattern follows below):
- a task computes a local count for each item set over its transactions
- sum the local count vectors to produce the total count vector
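A sketch of this input-data decomposition in C with OpenMP. The transaction and item-set representations are abstracted behind a hypothetical matches() callback (not from the lecture); the point is the structure: local counts per task, then a combination step.

#include <stdlib.h>
#include <string.h>

/* Each task counts item-set occurrences in its own share of the transactions,
   then the local count vectors are summed into the global result. */
void count_itemsets(int num_trans, int num_sets,
                    int (*matches)(int transaction, int itemset),
                    long *total_count)
{
    memset(total_count, 0, num_sets * sizeof(long));
    #pragma omp parallel
    {
        long *local = calloc(num_sets, sizeof(long));    /* per-task local counts */
        #pragma omp for                                   /* partition the transactions */
        for (int t = 0; t < num_trans; t++)
            for (int s = 0; s < num_sets; s++)
                if (matches(t, s)) local[s]++;
        #pragma omp critical                              /* combine partial results */
        for (int s = 0; s < num_sets; s++)
            total_count[s] += local[s];
        free(local);
    }
}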

Decomposition Based on Output Data
- If each element of the output can be computed independently, partition the output data across tasks and have each task perform the computation for its outputs
- Example: dense matrix-vector multiply
[Figure: n x n matrix A multiplied by vector b to produce vector y; Task i computes element i of y]

Output Data Decomposition: Example
Matrix multiplication: C = A x B, with A, B, and C partitioned into 2 x 2 blocks. The computation of C can be partitioned into four tasks, one per block of C:
- Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
- Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
- Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
- Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
Other task decompositions are possible. (A sketch of this four-task decomposition follows below.)
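A sketch of the four-task output-data decomposition, assuming C with OpenMP tasks, square matrices of even order n in row-major order, and a zero-initialized C; block indices 0 and 1 correspond to the block subscripts 1 and 2 above.

/* Accumulate one block product: C block (ci,cj) += A block (ci,k) * B block (k,cj).
   Blocks are m x m, the full matrices are n x n, block indices are 0 or 1. */
static void block_mm(int m, int n, const double *A, const double *B, double *C,
                     int ci, int cj, int k)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) {
            double sum = 0.0;
            for (int l = 0; l < m; l++)
                sum += A[(ci * m + i) * n + (k * m + l)]
                     * B[(k * m + l) * n + (cj * m + j)];
            C[(ci * m + i) * n + (cj * m + j)] += sum;
        }
}

/* Output-data decomposition: each of the four tasks owns one block of C and
   performs all computation for it. C is assumed to be zero-initialized. */
void matmul_four_tasks(int n, const double *A, const double *B, double *C)
{
    int m = n / 2;                              /* assume n is even */
    #pragma omp parallel
    #pragma omp single
    for (int ci = 0; ci < 2; ci++)
        for (int cj = 0; cj < 2; cj++) {
            #pragma omp task firstprivate(ci, cj)   /* one task per block of C */
            {
                block_mm(m, n, A, B, C, ci, cj, 0);
                block_mm(m, n, A, B, C, ci, cj, 1);
            }
        }
    /* the implicit barrier at the end of the parallel region waits for all tasks */
}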

Example: Decomposition Based on Output Data
Count the frequency of item sets in database transactions. Partition the computation by partitioning the item sets to count:
- each task computes the total count for each of its item sets
- append the total counts for the item-set subsets to produce the result

Partitioning Input and Output Data
- Partition on both input and output for more concurrency
- Example: item set counting

Intermediate Data Partitioning
If the computation is a sequence of transforms (from input data to output data), one can decompose based on the data of the intermediate stages.

Example: Intermediate Data Partitioning
Dense matrix multiply with 2 x 2 blocks. Decomposition of the intermediate data D (where Dk,i,j = Ai,k Bk,j) yields 8 + 4 tasks.
Stage I (intermediate products D1,1,1, D1,2,1, D2,1,1, D2,2,1, D1,1,2, D1,2,2, D2,1,2, D2,2,2):
- Task 01: D1,1,1 = A1,1 B1,1
- Task 02: D2,1,1 = A1,2 B2,1
- Task 03: D1,1,2 = A1,1 B1,2
- Task 04: D2,1,2 = A1,2 B2,2
- Task 05: D1,2,1 = A2,1 B1,1
- Task 06: D2,2,1 = A2,2 B2,1
- Task 07: D1,2,2 = A2,1 B1,2
- Task 08: D2,2,2 = A2,2 B2,2
Stage II (sums):
- Task 09: C1,1 = D1,1,1 + D2,1,1
- Task 10: C1,2 = D1,1,2 + D2,1,2
- Task 11: C2,1 = D1,2,1 + D2,2,1
- Task 12: C2,2 = D1,2,2 + D2,2,2

Intermediate Data Partitioning: Example
Tasks for the dense matrix multiply decomposition based on intermediate data (Tasks 01-12 as listed above), together with the resulting task dependency graph: each Stage II sum task (Tasks 09-12) depends on the two Stage I product tasks that compute its operands.

Owner Computes Rule
- Each datum is assigned to a thread; each thread computes the values associated with its data
- Implications: with an input data decomposition, all computations that use an input datum are performed by its thread; with an output data decomposition, an output is computed by the thread assigned to that output datum

Topics for Next Class
- Decomposition techniques (part 2): exploratory decomposition, hybrid decomposition
- Characteristics of tasks and interactions: task generation, granularity, and context; characteristics of task interactions
- Mapping techniques for load balancing: static mappings, dynamic mappings
- Methods for minimizing interaction overheads
- Parallel algorithm design templates

References
- Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
- Based on Chapter 3 of Introduction to Parallel Computing by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.