Parallel Computing: Parallel Algorithm Design Examples Jin, Hai


Jin, Hai. School of Computer Science and Technology, Huazhong University of Science and Technology

Parallel Reduction
- Given an associative operator ⊕, compute a_0 ⊕ a_1 ⊕ a_2 ⊕ ... ⊕ a_(n-1)
- Examples: add, multiply, and, or, maximum, minimum
- Parallel reduction uses divide & conquer
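The divide & conquer pattern can be sketched as a sequential simulation of the parallel rounds (the function name and structure are mine, not from the slides; each inner loop models operations that would run concurrently):

```python
from operator import add

def tree_reduce(values, op):
    """Divide-and-conquer reduction: in each round, element i absorbs
    element i + half, halving the number of active values. With one
    value per processor this takes ceil(log2 n) parallel rounds."""
    vals = list(values)
    n = len(vals)
    while n > 1:
        half = (n + 1) // 2
        for i in range(n - half):      # these ops run concurrently in parallel
            vals[i] = op(vals[i], vals[i + half])
        n = half
    return vals[0]

print(tree_reduce(range(1, 9), add))   # 1+2+...+8 = 36
```

Any associative operator works: `op=max` computes a global maximum with the same communication structure.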

Parallel Reduction Evolution (figures)

Finding Global Sum
- Initial values (16): 4 2 0 7 -3 5 -6 -3 8 1 2 3 -4 4 6 -1
- After step 1 (8 partial sums): 1 7 -6 4 4 5 8 2
- After step 2 (4 partial sums): 8 -2 9 10
- After step 3 (2 partial sums): 17 8
- After step 4, the global sum: 25 (communication pattern: binomial tree)
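The binomial-tree combining above can be reproduced with a small simulation (an illustrative helper of my own, assuming the number of values is a power of two):

```python
def binomial_sum(vals):
    """Binomial-tree reduction toward node 0: at step k (step = 2**k),
    each node whose id is a multiple of 2*step receives and adds the
    partial sum held by node id + step."""
    vals = list(vals)              # one value per node; len must be a power of two
    p = len(vals)
    step = 1
    while step < p:
        for i in range(0, p, 2 * step):
            vals[i] += vals[i + step]
        step *= 2
    return vals[0]

# the 16 sample values from the slides
print(binomial_sum([4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]))  # 25
```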

Agglomeration
- Group the primitive tasks: each agglomerated task first computes the local sum of its own values, and only the partial sums enter the reduction tree

Mapping: Binomial Tree

Mapping: 2D Torus
- A 4x4 torus with nodes labeled 00-33 (row, column)
- Question: how to add all values up to node 00?
- (In the figure, edge labels 1-4 mark the step at which each link carries a partial sum.)

Mapping: Hypercube
- A 4-dimensional hypercube with 16 nodes labeled 0000-1111
- Question: how to broadcast a message from node 0000?
- (In the figure, edge labels 1-4 mark the step at which each link forwards the message.)
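One-to-all broadcast on a hypercube doubles the set of informed nodes per step, reaching all 2^d nodes in d steps. A sketch (my own illustrative function):

```python
def hypercube_broadcast(dim, root=0):
    """One-to-all broadcast on a dim-dimensional hypercube: in step k,
    every node that already holds the message forwards it to the
    neighbor across dimension k (id with bit k flipped), so all
    2**dim nodes are reached in dim steps."""
    have = {root}
    for k in range(dim):
        have |= {node ^ (1 << k) for node in have}   # flip bit k
    return sorted(have)

print(hypercube_broadcast(4))   # after 4 steps all 16 nodes hold the message
```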

- Vector x with components x_i, i = 0 .. n-1: x = [x_0, x_1, ..., x_(n-2), x_(n-1)]^T
- Matrix A has m*n elements:
        a_00       a_01       ...  a_0(n-1)
  A =   a_10       a_11       ...  a_1(n-1)
        ...
        a_(m-1)0   a_(m-1)1   ...  a_(m-1)(n-1)

- Matrix-vector product: y = A*x (y is a vector of length m)
    y_0        a_00       a_01  ...  a_0(n-1)        x_0
    y_1     =  a_10       a_11  ...  a_1(n-1)    *   x_1
    ...        ...                                   ...
    y_(m-1)    a_(m-1)0   ...   a_(m-1)(n-1)         x_(n-1)

- An expanded form:
    y_0     = a_00*x_0     + a_01*x_1     + ... + a_0(n-1)*x_(n-1)
    y_1     = a_10*x_0     + a_11*x_1     + ... + a_1(n-1)*x_(n-1)
    ...
    y_i     = a_i0*x_0     + a_i1*x_1     + ... + a_i(n-1)*x_(n-1)
    ...
    y_(m-1) = a_(m-1)0*x_0 + a_(m-1)1*x_1 + ... + a_(m-1)(n-1)*x_(n-1)
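The expanded form translates directly into code, one inner product per row:

```python
def matvec(A, x):
    """Row-oriented y = A*x: y_i = a_i0*x_0 + a_i1*x_1 + ... + a_i(n-1)*x_(n-1)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1, 2], [3, 4], [5, 6]]       # m = 3, n = 2
print(matvec(A, [1, 1]))           # [3, 7, 11]
```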

- Addition is:
  - Commutative: a+b = b+a
  - Associative: (a+b)+c = a+(b+c)
- For regular data structures:
  - Keep the data structure regular, e.g. use row-wise, column-wise, or mixed partitioning schemes
  - This yields regular and local communication patterns

- Partition:
  - Divide the matrix into rows
  - Each primitive task holds one row of A and the two scalars x_i and y_i
- Communication:
  - Each primitive task must eventually see every element x_j of x
  - Organize the tasks into a ring

- Agglomeration and mapping:
  - Fixed number of tasks, each requiring the same amount of computation
  - Regular communication among tasks
  - Strategy: assign each process a contiguous group of rows

- A(i) refers to the n/p by n block row that process i owns (assume m = n)
- x(i) and y(i) (both n/p by 1) similarly refer to the segments of x and y owned by process i
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- Process i uses the formula y(i) = y(i) + A(i)*x = y(i) + Σ_j A(i,j)*x(j)

With 2x2 sub-blocks (n/p = 2):
  [y_0]   [a_00 a_01][x_0]   [a_02 a_03][x_2]         [a_0(n-2) a_0(n-1)][x_(n-2)]
  [y_1] = [a_10 a_11][x_1] + [a_12 a_13][x_3] + ... + [a_1(n-2) a_1(n-1)][x_(n-1)]
  [y_2]   [a_20 a_21][x_0]   [a_22 a_23][x_2]         [a_2(n-2) a_2(n-1)][x_(n-2)]
  [y_3] = [a_30 a_31][x_1] + [a_32 a_33][x_3] + ... + [a_3(n-2) a_3(n-1)][x_(n-1)]
In block form:
  y(0) = A(0,0)x(0) + A(0,1)x(1) + ... + A(0,p-1)x(p-1)
  y(1) = A(1,0)x(0) + A(1,1)x(1) + ... + A(1,p-1)x(p-1)

- 1D array/ring system:
- Algorithm 1 (broadcast):
    for processor i:
        broadcast x(i)
        store all received x(j) in x
        compute y(i) = y(i) + A(i)*x
  Needs a temporary vector x of size n.
- Algorithm 2 (broadcast):
    for j = 0 to p-1:
        processor j broadcasts x(j)
        all processors compute y(i) = y(i) + A(i,j)*x(j)
  Needs only a temporary vector of size n/p.
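A sequential simulation of Algorithm 2 (the function name and block indexing are my own; block boundaries are computed by index arithmetic rather than actual message passing):

```python
def blocked_matvec_broadcast(A, x, p):
    """Algorithm 2 simulated: in step j the block x(j) is 'broadcast',
    and every process i accumulates A(i,j)*x(j) into its y(i).
    Only one n/p-sized piece of x is live at a time.
    Assumes n is divisible by p."""
    n = len(A)
    b = n // p                                   # block size n/p
    y = [0] * n
    for j in range(p):                           # processor j broadcasts x(j)
        xj = x[j * b:(j + 1) * b]
        for i in range(p):                       # all processes work concurrently
            for r in range(i * b, (i + 1) * b):  # rows of block row A(i)
                y[r] += sum(A[r][j * b + c] * xj[c] for c in range(b))
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(blocked_matvec_broadcast(A, [1, 1, 1, 1], p=2))   # [10, 26, 42, 58]
```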

(Figure: snapshots of the broadcast algorithm on four processes p_0..p_3. Initially each p_i holds A(i), x(i), y(i); at step i, p_i broadcasts x(i) and every process accumulates A(*,i)*x(i) into its y block.)

y(0) = A(0,0)x(0) + A(0,1)x(1) + A(0,2)x(2) + A(0,3)x(3)
y(1) = A(1,0)x(0) + A(1,1)x(1) + A(1,2)x(2) + A(1,3)x(3)
y(2) = A(2,0)x(0) + A(2,1)x(1) + A(2,2)x(2) + A(2,3)x(3)
y(3) = A(3,0)x(0) + A(3,1)x(1) + A(3,2)x(2) + A(3,3)x(3)

- 1D array/ring system (cont.):
- Algorithm 3 (shift):
    for processor i:
        compute y(i) = y(i) + A(i,i)*x(i)
        for k = 1 to p-1:
            shift the current x block to the left neighbor; process i now holds x(j), j = (i+k) % p
            compute y(i) = y(i) + A(i,j)*x(j)
  No broadcast of data: a one-to-all broadcast costs (t_s + t_w*m) log p, whereas each shift is a send/recv pair costing 2(t_s + t_w*m), done concurrently on all processors in each step.
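A sequential simulation of Algorithm 3 (my own sketch; the ring traffic is modeled by the index j = (i + k) % p rather than real sends):

```python
def blocked_matvec_shift(A, x, p):
    """Algorithm 3 simulated: after k shift steps process i holds block
    x(j) with j = (i + k) % p and multiplies it by A(i,j). No broadcast
    is needed; every step is a concurrent neighbor send/recv."""
    n = len(A)
    b = n // p                                   # block size n/p
    y = [0] * n
    for k in range(p):                           # k = 0 uses the local block A(i,i)
        for i in range(p):
            j = (i + k) % p                      # block currently held by process i
            for r in range(i * b, (i + 1) * b):
                y[r] += sum(A[r][j * b + c] * x[j * b + c] for c in range(b))
    return y
```

It accumulates the same terms as the broadcast version, only in a rotated order per process.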

(Figure: snapshots of the shift algorithm on four processes p_0..p_3. Initially p_i holds A(i), x(i), y(i) and computes with A(i,i); the x blocks then circulate around the ring, so at step k process i works on A(i,(i+k)%p) and x((i+k)%p).)

The shift algorithm accumulates exactly the same products, so each y(i) = Σ_j A(i,j)x(j) as before.

- 2D mesh system:
  - A 2D blocked layout uses (column) broadcast and (row) reduction functions on a subset of processes
  - Each subset holds sqrt(p) processes for a square processor grid:
      P0  P1  P2  P3
      P4  P5  P6  P7
      P8  P9  P10 P11
      P12 P13 P14 P15

- Computing C = C + A*B (assume matrices are n x n), with c(i) the ith row of C and a(i) the ith row of A:
    for i = 0 to n-1
        c(i) = c(i) + a(i)*B        // computing the ith row
- or, in terms of inner products:
    for i = 0 to n-1
        for j = 0 to n-1
            c_ij = c_ij + Σ_k a_ik*b_kj
- or, fully expanded:
    for i = 0 to n-1
        for j = 0 to n-1
            for k = 0 to n-1
                c_ij = c_ij + a_ik*b_kj
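The fully expanded triple loop in runnable form (accumulating into an existing C, as the slides' formulation does):

```python
def matmul_add(C, A, B):
    """The innermost formulation above: c_ij = c_ij + a_ik * b_kj,
    accumulated into an existing C, i.e. C = C + A*B."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

C = [[0, 0], [0, 0]]
print(matmul_add(C, [[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```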

Row-oriented Algorithm
- Each c_ij is an inner product (dot product) of a row of A and a column of B

- Comparing sequential performance:
  - The block version is better: computing a row of C with the row version requires accessing every element of B, and B gets too big for the cache

Block Matrix Multiplication
- Replace scalar multiplication with matrix (sub-block) multiplication
- Replace scalar addition with matrix (sub-block) addition

- 2D block distribution: block matrix multiplication (figure)

- Continue to divide until the blocks are small enough

C(0,0) C(0,1) C(0,2)     A(0,0) A(0,1) A(0,2)     B(0,0) B(0,1) B(0,2)
C(1,0) C(1,1) C(1,2)  =  A(1,0) A(1,1) A(1,2)  *  B(1,0) B(1,1) B(1,2)
C(2,0) C(2,1) C(2,2)     A(2,0) A(2,1) A(2,2)     B(2,0) B(2,1) B(2,2)
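A blocked multiplication sketch (my own function; block size b is a free parameter, and the outer three loops walk the sub-block grid exactly as in the block equation above):

```python
def block_matmul(A, B, b):
    """Blocked C = A*B: the scalar triple loop is applied to b-by-b
    sub-blocks, so each block of B is reused while it fits in cache.
    Assumes n is divisible by b."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for I in range(0, n, b):
        for J in range(0, n, b):
            for K in range(0, n, b):             # C(I,J) += A(I,K) * B(K,J)
                for i in range(I, I + b):
                    for j in range(J, J + b):
                        C[i][j] += sum(A[i][k] * B[k][j] for k in range(K, K + b))
    return C
```

With b = n this degenerates to the plain triple loop; smaller b trades loop overhead for cache reuse.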


- Parallel algorithm for a 1D array/ring system:
  - Partitioning:
    - Divide the matrices into rows
    - Each primitive task has the corresponding rows of the three matrices
  - Communication:
    - Each task must eventually see every row of B
    - Organize the tasks into a ring
  - Agglomeration and mapping:
    - Assign each process a contiguous group of rows

- Assume n is divisible by p
- A(i) refers to the n/p by n block row that process i owns (similarly for B(i) and C(i))
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- The formula: C(i) = C(i) + A(i)*B = C(i) + Σ_j A(i,j)*B(j)
- Don't accumulate the whole B on each processor: that is a memory-usage issue (not scalable)
- Instead, A(i) is further divided into the A(i,j) and the B(j) blocks move during the computation

- 1D block row-wise distribution: A(i), B(i), and C(i) are n/p by n sub-blocks
    C(0)     A(0)     B(0)
    C(1)  =  A(1)  *  B(1)
    C(2)     A(2)     B(2)

- 1D block row-wise distribution:
  - A(i) is further partitioned column-wise: A(i,j) is an n/p by n/p sub-block
  - C(i) = C(i) + Σ_j A(i,j)*B(j)  (e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2))
    C(0)     A(0,0) A(0,1) A(0,2)     B(0)
    C(1)  =  A(1,0) A(1,1) A(1,2)  *  B(1)
    C(2)     A(2,0) A(2,1) A(2,2)     B(2)

(Figure: broadcast version on four processes p_0..p_3, analogous to matrix-vector Algorithm 2. At step i, p_i broadcasts B(i) and every process accumulates A(*,i)*B(i) into its C block.)

(Figure: shift version on four processes p_0..p_3. Each process starts with A(i,i) and B(i); the B blocks then circulate around the ring step by step, so at step k process i works on A(i,(i+k)%p) and B((i+k)%p).)

- Parallel algorithms for a 2D mesh system:
  - Communication: need to move both the A(i,j) and the B(i,j) blocks

(Figure: broadcast-based 2D algorithm on a 3x3 grid. At step k, A(i,k) is broadcast along processor row i and B(k,j) along processor column j, and each process accumulates A(i,k)*B(k,j).)
Check: C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2)

(Figure: shift-based 2D algorithm on a 3x3 grid.)
- Initialization: A(i,j) shifts left i steps and B(i,j) shifts up j steps
- After this alignment, each step multiplies the resident blocks, then shifts A one step left and B one step up
Check: C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2)
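This skew-and-shift scheme is commonly known as Cannon's algorithm. A simulation with one scalar per grid position, i.e. each "block" is a single element (my own sketch; the shifts are modeled with modular index arithmetic):

```python
def cannon_matmul(A, B):
    """Cannon-style C = A*B on a q x q grid, one scalar per position:
    initial alignment shifts A(i,j) left by i and B(i,j) up by j, then
    q rounds of multiply-accumulate alternate with single-step
    left shifts of A and up shifts of B."""
    q = len(A)
    a = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]  # A row i left by i
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]  # B col j up by j
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]                       # local multiply
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift A left
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up
    return C
```

After the skew, position (i,j) always holds a matching pair A[i][k], B[k][j] with k = (i + j + t) % q at round t, so every k is covered exactly once.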

Sorting
- One of the most commonly used and well-studied kernels
- Sorting can be comparison-based or non-comparison-based
- We focus here on comparison-based sorting algorithms
- The fundamental operation of comparison-based sorting is compare-exchange
- The lower bound on any comparison-based sort of n numbers is Ω(n log n)

- Sequential algorithm (odd-even transposition sort):
  - After n phases of odd-even exchanges, the sequence is sorted
  - Each phase of the algorithm (either odd or even) requires Θ(n) comparisons
  - Serial complexity is Θ(n^2)
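The sequential algorithm in runnable form (a standard formulation; within one phase every compare-exchange is independent, which is what the parallel version exploits):

```python
def odd_even_sort(a):
    """Sequential odd-even transposition sort: n phases alternating
    even pairs (0,1),(2,3),... and odd pairs (1,2),(3,4),...; each
    phase costs Theta(n) comparisons, Theta(n^2) in total."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):   # compare-exchange one pair
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 9, 4, 3, 1, 2, 8, 7]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```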

- If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id
- If we have more than one element per processor, we call this operation a compare-split (or merge-split if the two partial lists were sorted)
- Assume each of two processors P_i and P_j has n/p elements. After the compare-split operation, the smaller n/p elements are at processor P_i and the larger n/p elements at P_j, where i < j

(Figures: two implementations of the compare-split operation, Version 1 and Version 2.)

- Parallel algorithm 1:
  - Divide A (of n elements) into p blocks A(i) of equal size, one block of n/p elements per processor
  - The initial step is a local sort
  - In each subsequent step, the compare-exchange operation is replaced by the merge-split operation
  - After p steps the elements are sorted
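A sequential simulation of this block algorithm (my own sketch; each merge-split is modeled by merging two neighbor blocks and keeping the lower half on the left):

```python
def block_odd_even_sort(blocks):
    """Parallel algorithm 1 simulated: each of the p processes sorts
    its n/p elements locally, then p phases of merge-split between
    neighbor pairs (even pairs, then odd pairs, alternating) sort the
    whole distributed array. The smaller half stays on the left process."""
    p = len(blocks)
    b = len(blocks[0])
    blocks = [sorted(blk) for blk in blocks]          # initial local sort
    for phase in range(p):
        for i in range(phase % 2, p - 1, 2):          # neighbor pairs, as in odd-even
            merged = sorted(blocks[i] + blocks[i + 1])
            blocks[i], blocks[i + 1] = merged[:b], merged[b:]
    return blocks

print(block_odd_even_sort([[5, 9, 1], [8, 2, 7], [3, 6, 4]]))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

(A real merge-split would merge two already-sorted blocks in O(n/p) time instead of re-sorting; `sorted` is used here only for brevity.)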

- Parallel algorithm:
  - Each parallel step performs merge-split operations on 2 blocks and send/receive operations on 2 blocks
  - Total cost ~ p*(2n/p) merge-split ops + p*(2n/p) communication ops = 2n merge-split + 2n communication

- Simple and quite efficient:
  - In p (or 2p) steps of merge-split the array is sorted
  - The number of steps can be reduced by testing whether the array is already sorted, but in general it is still O(p)
  - Merge-split operations occur only between neighbors
- Can we do merge-split operations between other (non-neighbor) processors?

Parallel Shell Sort
- Let n be the number of elements to be sorted and p be the number of processors
- Two phases:
  - During the first phase, processors that are far away from each other in the array compare-split their elements
  - During the second phase, the algorithm switches to an odd-even transposition sort

- An example of the first phase of parallel shell sort (p = 8, one element per processor)
- Each processor performs d = log p compare-split operations
    Initial:                               0 3 4 5 6 7 2 1
    After compare-split across the array:  0 2 4 5 6 7 3 1
    After compare-split within each half:  0 2 4 5 1 3 7 6
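One reading of this phase that reproduces the trace above: in each of the log p rounds, every current segment pairs processor i with its mirror image inside the segment, then the segments halve. A simulation sketch (my own function, assuming p is a power of two):

```python
def shell_phase(a):
    """First phase of parallel shell sort, one element per processor:
    in each round, every segment pairs position base+i with its mirror
    base+seg-1-i and compare-splits (smaller value to the left slot);
    segments then halve, for log2(p) rounds in total."""
    a = list(a)
    p = len(a)
    trace = []
    seg = p
    while seg > 1:
        for base in range(0, p, seg):
            for i in range(seg // 2):                 # mirror pairs within a segment
                lo, hi = base + i, base + seg - 1 - i
                if a[lo] > a[hi]:
                    a[lo], a[hi] = a[hi], a[lo]
        trace.append(list(a))
        seg //= 2
    return trace

for row in shell_phase([0, 3, 4, 5, 6, 7, 2, 1]):
    print(row)    # the first two rows match the slides' trace
```

After this phase most elements are close to their final positions, so the subsequent odd-even transposition phase typically needs far fewer than p steps.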