Introduction to Multithreaded Algorithms

Introduction to Multithreaded Algorithms. CCOM5050: Design and Analysis of Algorithms, Chapter VII: Selected Topics. T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, 3rd Edition, MIT Press, 2009. April 23, 2012. E. Orozco

Multithreaded Algorithms. Learning objectives: at the end of this chapter, students are expected to
1. understand the importance of parallel computation;
2. identify the abstract model of dynamic multithreaded programming as a concurrency platform;
3. identify dynamic multithreaded programming as an extension of sequential programming;
4. identify work, span, speedup, and parallelism as metrics for analyzing multithreaded algorithms;
5. compute and interpret the work, span, speedup, and parallelism of a given multithreaded algorithm on a particular input;
6. analyze simple multithreaded algorithms;
7. compute the running time of simple multithreaded algorithms;
8. represent a multithreaded computation as a directed acyclic graph;
9. identify concurrent for loops;
10. identify nested parallelism;
11. write simple parallel algorithms using the spawn, sync, parallel for, and return control structures.

Multithreaded Algorithms. Key concepts: abstract model of multithreaded computation; parallel control (spawn, sync, return); nested parallelism; concurrent for loops; performance metrics (work, span, speedup, parallelism); directed acyclic graph (dag); race conditions.

Dynamic multithreaded programming (DMP). A thread is a software abstraction of a virtual processor: threads share a common memory, but each thread maintains its own stack and program counter and executes code independently of the others. The OS loads a thread onto a processor, executes it, and switches to another thread when needed. A concurrency platform coordinates, schedules, and manages the parallel computing resources. DMP allows the programmer to specify parallelism without worrying about communication protocols or load balancing. A concurrency platform for DMP contains a scheduler, which automatically load-balances the computation; supports nested parallelism, which allows a subroutine to be spawned; and provides parallel loops, whose iterations can execute concurrently.

Dynamic multithreaded programming (DMP). The programmer needs to specify only the logical parallelism within a computation; the concurrency platform schedules and load-balances the computation among the threads.

Reasons for choosing multithreading as a parallel computation model: It is a simple extension of sequential programming: a parallel algorithm can be described by adding the keywords parallel, spawn, and sync. It provides a theoretically clean way of quantifying parallelism through the notions of work and span. It fits the divide-and-conquer paradigm through nested parallelism. Multithreading platforms exist: Cilk, OpenMP, the Task Parallel Library, and Intel Threading Building Blocks (TBB). Grand-challenge problems are being solved with multithreaded algorithms. Hardware supports it: supercomputers (e.g., Cray) and multi-core processors.

Basics of dynamic multithreading. Example: multithreading the computation of Fibonacci numbers, F_0 = 0, F_1 = 1, F_i = F_{i-1} + F_{i-2} for i ≥ 2. Sequential version:
FIB(n)
1. if n ≤ 1
2.     return n
3. else
4.     x = FIB(n - 1)
5.     y = FIB(n - 2)
6.     return x + y
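For concreteness, the sequential pseudocode translates almost line for line into C. This is a minimal sketch; the function name fib and the use of long are our own choices, not part of the chapter.

    #include <stdio.h>

    /* Sequential Fibonacci, mirroring FIB(n); exponential time, for illustration only. */
    long fib(long n)
    {
        if (n <= 1)
            return n;              /* base case: F_0 = 0, F_1 = 1 */
        long x = fib(n - 1);
        long y = fib(n - 2);
        return x + y;
    }

    int main(void)
    {
        printf("FIB(6) = %ld\n", fib(6));   /* prints FIB(6) = 8 */
        return 0;
    }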

Basics of dynamic multithreading. Example: recursive computation of the Fibonacci numbers. The tree of recursive calls to compute FIB(6): FIB(6) calls FIB(5) and FIB(4), FIB(5) calls FIB(4) and FIB(3), and so on down to the base cases FIB(1) and FIB(0); subproblems such as FIB(3) and FIB(2) are computed repeatedly.

Basics of dynamic multithreading. Example: recursive computation of the Fibonacci numbers. Running time analysis: T(n) = T(n-1) + T(n-2) + Θ(1). Exercise: show that T(n) = Θ(Φ^n), where Φ = (1 + √5)/2 is the golden ratio. This is a grossly inefficient algorithm, but it illustrates the key concepts for analyzing multithreaded algorithms.

Basics of dynamic multithreading. Multithreading the computation of Fibonacci numbers: the recursive calls to FIB(n-1) and FIB(n-2) in lines 4 and 5 are independent of each other, so they can execute in parallel.
FIB(n)
1 if n ≤ 1
2     return n
3 else
4     x = FIB(n - 1)
5     y = FIB(n - 2)
6     return x + y
Task dependency graph for FIB(n): the calls FIB(n-1) and FIB(n-2) have no dependence between them, and both feed into the final return x + y.

Basics of dynamic multithreading. Example: multithreading the computation of Fibonacci numbers. Parallel version:
P-FIB(n)
1. if n ≤ 1
2.     return n
3. else
4.     x = spawn P-FIB(n - 1)
5.     y = P-FIB(n - 2)
6.     sync
7.     return x + y
The task dependency graph is the same as for FIB(n) above.
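As a rough idea of what P-FIB looks like in a real concurrency platform, here is a sketch in Cilk-style C (Cilk Plus / OpenCilk). The keywords cilk_spawn and cilk_sync come from the <cilk/cilk.h> header; the function name p_fib, the use of long, and the availability of a Cilk-enabled compiler are our own assumptions.

    #include <cilk/cilk.h>   /* provides cilk_spawn and cilk_sync */

    /* Parallel Fibonacci, mirroring P-FIB(n). */
    long p_fib(long n)
    {
        if (n <= 1)
            return n;
        long x = cilk_spawn p_fib(n - 1);   /* line 4: the child may run in parallel with the parent */
        long y = p_fib(n - 2);              /* line 5: the parent continues with the second call */
        cilk_sync;                          /* line 6: wait for the spawned child before using x */
        return x + y;                       /* line 7 */
    }

Exactly as in the pseudocode, spawn (cilk_spawn) only expresses logical parallelism: it says the call may run in parallel, and the scheduler decides whether it actually does.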

Basics of dynamic multithreading. Nested parallelism occurs when the keyword spawn precedes a procedure call.

A model of multithreaded execution. A multithreaded computation is the set of runtime instructions executed by a processor on behalf of a multithreaded program. A directed acyclic graph G = (V, E), called a computation dag, is used to represent a multithreaded computation. Vertices in V represent instructions; edges in E represent dependencies between instructions: edge (u, v) in E means instruction u must execute before instruction v. A continuation edge (u, u'), drawn horizontally, connects a strand u to its successor u' within the same procedure instance. A spawn edge (u, v), pointing downward, means that strand u spawns strand v. Call edges, pointing downward, represent normal procedure calls. A return edge (u, x), pointing upward, means that strand u returns to its calling procedure, and x is the strand immediately following the next sync in the calling procedure.

A model of multithreaded execution. A strand is a chain of instructions that contains no parallel control (spawn, sync, or return from a spawn). A computation starts with a single initial strand and ends with a single final strand. If there is a directed path from strand u to strand v, the two strands are logically in series; otherwise, they are logically in parallel. A multithreaded computation is thus a dag of strands embedded in a tree of procedure instances.

A model of multithreaded execution. Example: recursive computation of the Fibonacci numbers. The tree of recursive calls to compute FIB(4): FIB(4) calls FIB(3) and FIB(2); FIB(3) calls FIB(2) and FIB(1); each FIB(2) calls FIB(1) and FIB(0).

A model of multithreaded execution. Directed acyclic graph representing P-FIB(4). The dag contains one procedure instance for each recursive call: P-FIB(4), P-FIB(3), two instances of P-FIB(2), three of P-FIB(1), and two of P-FIB(0). Circles represent strands; within each procedure instance the strands are: the base case, or the part of the procedure up to the spawn of P-FIB(n-1) in line 4; the part of the procedure that calls P-FIB(n-2) in line 5 up to the sync in line 6, where it suspends until the spawned P-FIB(n-1) returns; and the part of the procedure after the sync, where it sums x and y up to the point where it returns the result. Rectangles group the strands belonging to the same procedure instance, with spawned procedures distinguished from called procedures. The edges are of four kinds: spawn edges, call edges, continuation edges, and return edges; the computation begins at the initial strand and ends at the final strand.

An ideal parallel computer for multithreaded computation consists of: a set of homogeneous processors, all with equal computing power; and sequentially consistent shared memory, i.e., the shared memory produces the same results as if, at each step, exactly one instruction from one of the processors were executed. Performance assumption: no scheduling overhead (scheduling costs are ignored).

Performance measures for multithreaded algorithms. Work: the total time to execute the entire computation on a single processor; work = Σ t_i, where t_i is the time taken by strand i. If each strand takes unit time, the work equals the number of vertices in the dag. Span: the longest time to execute the strands along any path in the dag. If each strand takes unit time, the span equals the number of vertices on a longest (critical) path in the dag.

Performance measures for multithreaded algorithms. For the computation dag of P-FIB(4) shown above: work = 17 time units, span = 8 time units.

Performance measures for multithreaded algorithms. The running time of a multithreaded computation on p processors, T_p, depends on the work, the span, p, and the scheduler. T_1 = the work, i.e., the running time on a single processor. T_∞ = the running time if we could run each strand on its own processor (i.e., with an unlimited number of processors). Span = T_∞.

Performance measures for multithreaded algorithms. Speedup = T_1 / T_p. Speedup ≤ p; linear speedup means speedup = Θ(p), and perfect linear speedup means speedup = p. Parallelism = T_1 / T_∞. Interpretations of parallelism: as an upper bound, it is the maximum possible speedup that can be achieved on any number of processors; as a limit, it bounds the possibility of attaining perfect linear speedup: once the number of processors exceeds the parallelism, perfect linear speedup is no longer possible.
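Written out as formulas (this only restates the definitions on this slide, using the notation T_p, T_1, T_∞ from above):

\[ \text{speedup} = \frac{T_1}{T_p} \le p, \qquad \text{parallelism} = \frac{T_1}{T_\infty}, \qquad \frac{T_1}{T_p} \le \frac{T_1}{T_\infty} \quad \text{for every } p. \]

The last inequality is just the statement that the parallelism upper-bounds the speedup achievable on any number of processors.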

Performance measures for multithreaded algorithms. Example: the computation of P-FIB(4). Assume each strand takes unit time. Work = 17, span = 8, so parallelism = T_1 / T_∞ = 17/8 = 2.125. Interpretation: no matter how many processors are used, the computation of P-FIB(4) can achieve a speedup of at most 2.125, i.e., little more than double.

Performance measures for multithreaded algorithms Class exercise: Draw the computation dag that results from executing P-FIB(5). Assuming that each strand in the computation takes unit time, what are the work, span, and parallelism of the computation?

Analysis of multithreaded algorithms. Analysis of the work: the work is simply the running time of the serial algorithm; for P-FIB(n), T_1(n) = Θ(Φ^n). Analysis of the span: consider two subcomputations A and B. If A and B are composed in series, the work is T_1(A ∪ B) = T_1(A) + T_1(B) and the span is T_∞(A ∪ B) = T_∞(A) + T_∞(B). If A and B are composed in parallel, the work is T_1(A ∪ B) = T_1(A) + T_1(B) and the span is T_∞(A ∪ B) = max(T_∞(A), T_∞(B)).
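The same composition rules in display form, with a small made-up numeric example (the numbers are illustrative only, not from the text):

\[ \text{series: } T_1(A \cup B) = T_1(A) + T_1(B), \qquad T_\infty(A \cup B) = T_\infty(A) + T_\infty(B) \]
\[ \text{parallel: } T_1(A \cup B) = T_1(A) + T_1(B), \qquad T_\infty(A \cup B) = \max\bigl(T_\infty(A), T_\infty(B)\bigr) \]

For instance, if T_1(A) = 6, T_∞(A) = 3, T_1(B) = 4, and T_∞(B) = 4, then composing A and B in series gives work 10 and span 7, whereas composing them in parallel gives work 10 and span max(3, 4) = 4.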

Analysis of multithreaded algorithms. Example: analysis of P-FIB(n). Work: T_1(n) = Θ(Φ^n). Span: P-FIB(n-1) and P-FIB(n-2) run in parallel, so T_∞(n) = max(T_∞(n-1), T_∞(n-2)) + Θ(1) = T_∞(n-1) + Θ(1) = Θ(n). Parallelism: T_1(n) / T_∞(n) = Θ(Φ^n / n), which grows rapidly as n gets large.

Parallel loops. Example: matrix-vector multiplication. Let A = (a_ij) be an n × n matrix and let x = (x_j) be an n-vector. The product of A and x is another n-vector y = (y_i) defined by y_i = Σ_{j=1}^{n} a_ij x_j for i = 1, 2, ..., n.
Serial code for matrix-vector multiplication:
MAT-VEC-SEQ(A, x)
1 n = A.rows
2 let y be a new vector of length n
3 for i = 1 to n
4     y_i = 0
5 for i = 1 to n
6     for j = 1 to n
7         y_i = y_i + a_ij x_j
8 return y
Parallel code for matrix-vector multiplication:
MAT-VEC(A, x)
1 n = A.rows
2 let y be a new vector of length n
3 parallel for i = 1 to n
4     y_i = 0
5 parallel for i = 1 to n
6     for j = 1 to n
7         y_i = y_i + a_ij x_j
8 return y
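In Cilk-style C, the parallel for loops of MAT-VEC map to cilk_for. The sketch below is our own rendering, under the assumption that A is stored row-major in a flat array of doubles; the function name mat_vec is also an assumption.

    #include <cilk/cilk.h>   /* provides cilk_for */

    /* y = A x for an n-by-n matrix A stored row-major in a flat array (0-based indexing). */
    void mat_vec(const double *A, const double *x, double *y, int n)
    {
        cilk_for (int i = 0; i < n; i++)     /* lines 3-4: initialize y in parallel */
            y[i] = 0.0;
        cilk_for (int i = 0; i < n; i++) {   /* lines 5-7: iterations of the outer loop run in parallel */
            for (int j = 0; j < n; j++)      /* the inner loop stays serial, as in the pseudocode */
                y[i] += A[i * n + j] * x[j];
        }
    }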

Parallel loops. Example: matrix-vector multiplication. A divide-and-conquer parallel implementation of the parallel loop in MAT-VEC(A, x):
MAT-VEC-MAIN-LOOP(A, x, y, n, i, i')
1 if i == i'
2     for j = 1 to n
3         y_i = y_i + a_ij x_j
4 else mid = floor((i + i') / 2)
5     spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
6     MAT-VEC-MAIN-LOOP(A, x, y, n, mid + 1, i')
7     sync
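The recursive spawning that a compiler generates for a parallel loop can also be written out by hand, mirroring MAT-VEC-MAIN-LOOP. The sketch below keeps the 1-based range arguments (i, i') of the pseudocode but indexes the flat row-major arrays 0-based; as before, the names and the array layout are our own assumptions, and y is assumed to have been zeroed by the caller.

    #include <cilk/cilk.h>

    /* Handles iterations lo..hi (inclusive, 1-based) of the outer loop of MAT-VEC. */
    void mat_vec_main_loop(const double *A, const double *x, double *y,
                           int n, int lo, int hi)
    {
        if (lo == hi) {                          /* a single iteration: compute row lo serially */
            for (int j = 1; j <= n; j++)
                y[lo - 1] += A[(lo - 1) * n + (j - 1)] * x[j - 1];
        } else {
            int mid = (lo + hi) / 2;             /* floor((lo + hi) / 2) */
            cilk_spawn mat_vec_main_loop(A, x, y, n, lo, mid);   /* first half, possibly in parallel */
            mat_vec_main_loop(A, x, y, n, mid + 1, hi);          /* second half in the parent */
            cilk_sync;                           /* wait for the spawned half */
        }
    }

Calling mat_vec_main_loop(A, x, y, n, 1, n) performs the same computation as the second cilk_for loop in the previous sketch.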

Parallel loops. Example: matrix-vector multiplication. Dag representing the computation of MAT-VEC-MAIN-LOOP(A, x, y, 8, 1, 8): the root handles the range (1, 8), which splits into (1, 4) and (5, 8), then into (1, 2), (3, 4), (5, 6), (7, 8), and finally into the leaves (1, 1), (2, 2), ..., (8, 8), each of which executes one iteration of the loop.

Parallel loops. Example: matrix-vector multiplication. Running time analysis of MAT-VEC(A, x). Analysis of work: T_1(n) = Θ(n^2) + h(n) = Θ(n^2) + Θ(n) = Θ(n^2), where h(n) is the overhead for the recursive spawning that implements the parallel loops. (Exercise: why is h(n) only Θ(n)?) Analysis of span: the span is the depth of the recursive spawning plus the span of the longest iteration i; since each iteration runs a serial inner loop of n steps, T_∞(n) = Θ(lg n) + Θ(n) = Θ(n). Analysis of parallelism: T_1(n) / T_∞(n) = Θ(n^2) / Θ(n) = Θ(n).

Multithreaded Algorithms Assessment.
1. Understanding the multithreaded model: students will be given a sequential algorithm such as FIB(n) and will be asked to answer questions such as: a. What parts of the algorithm present opportunities for parallelization? b. What type of parallelism, if any, is suitable for this algorithm (nested parallelism, loop parallelism)? Why? c. Write a multithreaded version of the algorithm.
2. Understanding the performance of multithreaded algorithms: students will be given a multithreaded algorithm such as P-FIB(n) or MAT-VEC-MAIN-LOOP(A, x, y, n, i, i') and will be asked to answer questions such as: a. Identify whether or not this algorithm is race-condition free. b. Draw a dag for MAT-VEC-MAIN-LOOP(A, x, y, n, 1, 4). c. What is the work? d. What is the span? e. What is the speedup? f. What is the parallelism?
3. Understanding the performance of multithreaded algorithms: students are asked to consider the P-TRANSPOSE(A) procedure, which transposes an n × n matrix A in place, and to analyze the work, span, and parallelism of this algorithm.
P-TRANSPOSE(A)
1 n = A.rows
2 parallel for j = 2 to n
3     parallel for i = 1 to j - 1
4         exchange a_ij with a_ji

Multithreaded Algorithms Assessment. Students will be given the opportunity to work in teams (2 to 3 students) on one of the following projects:
1. Design of a multithreaded algorithm: students will be asked to provide pseudocode for an efficient multithreaded algorithm that transposes an n × n matrix in place by using divide-and-conquer to divide the matrix recursively into four n/2 × n/2 submatrices. They should also provide an analysis of their algorithm.
2. Parallel-loop to nested-parallelism conversion: students are given a multithreaded algorithm that performs pairwise addition on n-element arrays A[1..n] and B[1..n], storing the sums in C[1..n], and are asked to rewrite the parallel loop in the SUM-ARRAYS procedure using nested parallelism (spawn and sync). They must also provide an analysis of the parallelism of their implementation.
SUM-ARRAYS(A, B, C)
1 parallel for i = 1 to A.length
2     C[i] = A[i] + B[i]

Multithreaded Algorithms Assessment.
3. Sequential to parallel conversion: students are given the LU-DECOMPOSITION procedure and are asked to provide pseudocode for a corresponding multithreaded version. They are asked to make it as parallel as possible and to analyze its work, span, and parallelism.
LU-DECOMPOSITION(A)
1 n = A.rows
2 let L and U be new n × n matrices
3 initialize U with 0s below the diagonal
4 initialize L with 1s on the diagonal and 0s above the diagonal
5 for k = 1 to n
6     u_kk = a_kk
7     for i = k + 1 to n
8         l_ik = a_ik / u_kk    // l_ik holds v_i
9         u_ki = a_ki           // u_ki holds w^T_i
10    for i = k + 1 to n
11        for j = k + 1 to n
12            a_ij = a_ij - l_ik u_kj
13 return L and U