Parallel Computing: Parallel Algorithm Design Examples
Jin, Hai
School of Computer Science and Technology
Huazhong University of Science and Technology
Reduction
- Given an associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ a(n-1)
- Examples: add; multiply; and, or; maximum, minimum
- Parallel reduction: divide & conquer
Parallel Reduction Evolution
Finding Global Sum (16 values, combined pairwise level by level)
- Level 0: 4 2 0 7 -3 5 -6 -3 8 1 2 3 -4 4 6 -1
- Level 1: 1 7 -6 4 4 5 8 2
- Level 2: 8 -2 9 10
- Level 3: 17 8
- Level 4: 25
The communication pattern forms a binomial tree.
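The level-by-level reduction above can be sketched sequentially. `tree_reduce` (a helper name chosen here for illustration, not from the slides) combines values pairwise until one remains, mirroring the binomial-tree figures; any associative operator works:

```python
def tree_reduce(values, op):
    """Reduce `values` with associative operator `op`, pairwise per level,
    as a binomial-tree reduction would do in parallel."""
    vals = list(values)
    while len(vals) > 1:
        # combine adjacent pairs; each level halves the number of values
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2 == 1:          # an odd leftover passes through
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

# The slide's example: the 16 values reduce to the global sum 25.
data = [4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1]
print(tree_reduce(data, lambda a, b: a + b))  # -> 25
```

With n values the loop body runs ceil(log2 n) times, which is exactly the parallel depth of the tree.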
Agglomeration
- Each task first computes a local sum over its block; the partial sums are then combined by the tree reduction.
Mapping: Binomial Tree
Mapping onto a 2D torus (nodes 00 through 33)
Question: how to add all values up to node 00?
(Figure: 4 x 4 torus; edge labels 1-4 give the step in which each link is used.)
Mapping onto a hypercube (nodes 0000 through 1111)
Question: how to broadcast a message from node 0000?
(Figure: 4-dimensional hypercube; edge labels 1-4 give the step in which each link is used.)
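The broadcast question has a binomial-tree answer: at step k, every node that already holds the message forwards it across hypercube dimension k, doubling the set of holders each step. A small sketch (the function name and pair-list representation are ours, not from the slides):

```python
def hypercube_broadcast_order(dims, root=0):
    """Return, per step, the (sender, receiver) pairs of a binomial-tree
    broadcast on a `dims`-dimensional hypercube, starting at `root`."""
    have = {root}          # nodes that currently hold the message
    steps = []
    for d in range(dims):
        # every holder sends across dimension d (flip bit d of its id)
        pairs = [(node, node ^ (1 << d)) for node in sorted(have)]
        have |= {dst for _, dst in pairs}
        steps.append(pairs)
    return steps

for k, step in enumerate(hypercube_broadcast_order(4), 1):
    print(f"step {k}: {step}")
```

All 2^dims nodes are reached in dims steps, matching the edge labels 1-4 in the figure.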
- Vector x with components x_i, i = 0 … n-1:
  x = [x_0, x_1, …, x_(n-2), x_(n-1)]^T
- Matrix A with m*n elements a_ij:
      | a_00      a_01      …  a_0(n-1)     |
  A = | a_10      a_11      …  a_1(n-1)     |
      | …                                   |
      | a_(m-1)0  a_(m-1)1  …  a_(m-1)(n-1) |
- Matrix-vector product: y = A*x (y is a vector of length m)
  | y_0     |   | a_00      a_01      …  a_0(n-1)     |   | x_0     |
  | y_1     | = | a_10      a_11      …  a_1(n-1)     | * | x_1     |
  | …       |   | …                                   |   | …       |
  | y_(m-1) |   | a_(m-1)0  a_(m-1)1  …  a_(m-1)(n-1) |   | x_(n-1) |
- An expanded form:
  y_0     = a_00 x_0     + a_01 x_1     + … + a_0(n-1) x_(n-1)
  y_1     = a_10 x_0     + a_11 x_1     + … + a_1(n-1) x_(n-1)
  …
  y_i     = a_i0 x_0     + a_i1 x_1     + … + a_i(n-1) x_(n-1)
  …
  y_(m-1) = a_(m-1)0 x_0 + a_(m-1)1 x_1 + … + a_(m-1)(n-1) x_(n-1)
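The expanded form can be computed directly; a minimal sketch using plain lists (names are ours):

```python
def matvec(A, x):
    """y[i] = sum_j A[i][j] * x[j], the expanded form of y = A*x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

print(matvec([[1, 2], [3, 4]], [5, 6]))  # -> [17, 39]
```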
- Addition is:
  - Commutative: a + b = b + a
  - Associative: (a + b) + c = a + (b + c)
- For regular data structures:
  - Keep the data structure regular, e.g. use row-wise, column-wise, or mixed (block) partitioning schemes
  - This yields regular and local communication patterns
- Partitioning:
  - Divide the matrix into rows
  - Each primitive task has one row of A and the two scalars x_i and y_i
- Communication:
  - Each primitive task must eventually see every x_j of x
  - Organize the tasks into a ring
- Agglomeration and mapping:
  - Fixed number of tasks, each requiring the same amount of computation
  - Regular communication among tasks
  - Strategy: assign each process a contiguous group of rows
- A(i) refers to the n/p by n block row that process i owns (assume m = n and n is divisible by p)
- x(i) and y(i) (both n/p by 1) similarly refer to the segments of x and y owned by process i
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- Process i uses the formula y(i) = y(i) + A(i)*x = y(i) + Σ_j A(i,j)*x(j)
Grouping rows and columns into blocks, each block row of the expanded form becomes
  y(0) = A(0,0)x(0) + A(0,1)x(1) + … + A(0,p-1)x(p-1)
  y(1) = A(1,0)x(0) + A(1,1)x(1) + … + A(1,p-1)x(p-1)
  …
- 1D array/ring system:
  - Algorithm 1 (broadcast): for processor i
      broadcast x(i)
      store all received x(j)s in x
      compute y(i) = y(i) + A(i)*x
    Needs a temporary vector x of size n.
  - Algorithm 2 (broadcast): for j = 0 to p-1
      processor j broadcasts x(j)
      all processors compute y(i) = y(i) + A(i,j)*x(j)
    Needs only a temporary vector of size n/p.
(Figure: steps of the broadcast algorithm on p = 4 processes p_0 … p_3, showing which A(i,j), x(j), and y(i) blocks each process holds at each step. Caption: Broadcast: p_i broadcasts x(i) at step i.)
y(0) = A(0,0)x(0) + A(0,1)x(1) + A(0,2)x(2) + A(0,3)x(3)
y(1) = A(1,0)x(0) + A(1,1)x(1) + A(1,2)x(2) + A(1,3)x(3)
y(2) = A(2,0)x(0) + A(2,1)x(1) + A(2,2)x(2) + A(2,3)x(3)
y(3) = A(3,0)x(0) + A(3,1)x(1) + A(3,2)x(2) + A(3,3)x(3)
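A sequential sketch of Algorithm 2, simulating the p processes with loops; the function name and flat-list block layout are assumptions made for illustration (n must be divisible by p):

```python
def matvec_ring_broadcast(A, x, p):
    """Simulate Algorithm 2: at step j, x(j) is broadcast and every
    process i accumulates y(i) += A(i,j) * x(j)."""
    n = len(x)
    b = n // p                        # block size n/p
    y = [[0] * b for _ in range(p)]   # y(i) owned by process i
    for j in range(p):                # step j: x(j) is broadcast
        xj = x[j * b:(j + 1) * b]
        for i in range(p):            # all processes work concurrently
            for r in range(b):        # rows of sub-block A(i,j)
                row = A[i * b + r]
                y[i][r] += sum(row[j * b + c] * xj[c] for c in range(b))
    return [v for blk in y for v in blk]
```

Only one n/p-sized temporary (`xj`) is ever alive per step, matching the memory claim on the slide.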
- 1D array/ring system (cont.):
  - Algorithm 3 (shift): for processor i
      compute y(i) = y(i) + A(i,i)*x(i)
      for k = 1 to p-1
        shift the held x block to the left neighbor; now hold x(j), j = (i+k) % p
        compute y(i) = y(i) + A(i,j)*x(j)
  - No broadcast needed: instead of a one-to-all broadcast costing (t_s + t_w m) log p per block, each step is a send/recv pair costing 2(t_s + t_w m), done concurrently on all processors.
(Figure: steps of the shift algorithm on p = 4 processes; the x(i) blocks travel around the ring, so at step k process i holds x((i+k) % p). Caption: Shift: processes pass the x(i)s around a ring step by step.)
y(0) = A(0,0)x(0) + A(0,1)x(1) + A(0,2)x(2) + A(0,3)x(3)
y(1) = A(1,0)x(0) + A(1,1)x(1) + A(1,2)x(2) + A(1,3)x(3)
y(2) = A(2,0)x(0) + A(2,1)x(1) + A(2,2)x(2) + A(2,3)x(3)
y(3) = A(3,0)x(0) + A(3,1)x(1) + A(3,2)x(2) + A(3,3)x(3)
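Algorithm 3 (shift) can be sketched the same way; the rotation of the `held` list stands in for the concurrent send/recv around the ring (names and layout are ours, n divisible by p assumed):

```python
def matvec_ring_shift(A, x, p):
    """Simulate Algorithm 3: each process starts with its own x(i) and,
    for p-1 more steps, passes its current x block to the left neighbor,
    so process i works on x((i+k) % p) at step k."""
    n = len(x)
    b = n // p
    y = [[0] * b for _ in range(p)]
    held = [x[i * b:(i + 1) * b] for i in range(p)]  # block each proc holds
    for k in range(p):
        for i in range(p):
            j = (i + k) % p           # index of the block process i holds now
            for r in range(b):
                row = A[i * b + r]
                y[i][r] += sum(row[j * b + c] * held[i][c] for c in range(b))
        held = held[1:] + held[:1]    # shift: send block to the left neighbor
    return [v for blk in y for v in blk]
```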
- 2D mesh system:
  - A 2D blocked layout uses (column) broadcast and (row) reduction functions on a subset of processes
  - Each such subset has sqrt(p) processes for a square processor grid
(Figure: a 4 x 4 grid of processes P0 through P15.)
- Computing C = C + A*B (assume the matrices are n x n; c(i) is the ith row of C and a(i) the ith row of A):
    for i = 0 to n-1
      c(i) = c(i) + a(i)*B      // compute the ith row
  or
    for i = 0 to n-1
      for j = 0 to n-1
        c_ij = c_ij + Σ_k a_ik b_kj
  or
    for i = 0 to n-1
      for j = 0 to n-1
        for k = 0 to n-1
          c_ij = c_ij + a_ik b_kj
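The innermost formulation transcribes directly (a sketch with C initialized to zero; the name is ours):

```python
def matmul(A, B):
    """c_ij = c_ij + sum_k a_ik * b_kj via the three nested loops."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # -> [[19, 22], [43, 50]]
```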
Row-oriented Algorithm
- Built from inner product (dot product) operations
(Figure: one row of A times B produces one row of C.)
- Comparing sequential performance:
  - The block version is better because computing a row of C with the row-oriented version requires accessing every element of B, and B gets too big for the cache
Block Matrix Multiplication
- Replace scalar multiplication with matrix (block) multiplication
- Replace scalar addition with matrix (block) addition
(Figure: a block row of A times a block column of B produces one block of C.)
- 2D block distribution
- Block matrix multiplication
Block Matrix Multiplication
- Continue to divide until the blocks are small enough
| C(0,0) C(0,1) C(0,2) |   | A(0,0) A(0,1) A(0,2) |   | B(0,0) B(0,1) B(0,2) |
| C(1,0) C(1,1) C(1,2) | = | A(1,0) A(1,1) A(1,2) | * | B(1,0) B(1,1) B(1,2) |
| C(2,0) C(2,1) C(2,2) |   | A(2,0) A(2,1) A(2,2) |   | B(2,0) B(2,1) B(2,2) |
- Parallel algorithm for a 1D array/ring system:
  - Partitioning:
    - Divide the matrices into rows
    - Each primitive task has the corresponding rows of the three matrices
  - Communication:
    - Each task must eventually see every row of B
    - Organize the tasks into a ring
  - Agglomeration and mapping:
    - Assign each process a contiguous group of rows
- Assume n is divisible by p
- A(i) refers to the n/p by n block row that process i owns (similarly for B(i) and C(i))
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- The formula: C(i) = C(i) + A(i)*B = C(i) + Σ_j A(i,j)*B(j)
- Don't accumulate the whole B on each processor: that is a memory-usage issue (not scalable)
- Instead, A(i) is further divided into the A(i,j)s, and the B(j) blocks move during the computation
- 1D block row-wise distribution:
  - A(i), B(i) and C(i) are n/p by n sub-blocks
(Figure: C, A, and B each split into three block rows C(i), A(i), B(i).)
- 1D block row-wise distribution:
  - A(i) is further partitioned column-wise: A(i,j) is an n/p by n/p sub-block
  - C(i) = C(i) + Σ_j A(i,j)*B(j)
    (e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2))
  | C(0) |   | A(0,0) A(0,1) A(0,2) |   | B(0) |
  | C(1) | = | A(1,0) A(1,1) A(1,2) | * | B(1) |
  | C(2) |   | A(2,0) A(2,1) A(2,2) |   | B(2) |
(Figure: steps of the broadcast algorithm on p = 4 processes; at step i, p_i broadcasts B(i) and every process accumulates C(j) += A(j,i)*B(i). Caption: Broadcast: p_i broadcasts B(i) at step i.)
(Figure: steps of the shift algorithm on p = 4 processes; the B(i) blocks travel around the ring, so at step k process i holds B((i+k) % p). Caption: Shift: processes pass the B(i)s around a ring step by step.)
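The shift variant for matrix-matrix multiplication can be sketched the same way as for the matrix-vector case; `held` records which B(j) block each process currently has (names and the full-matrix simulation are ours, n divisible by p assumed):

```python
def matmul_ring_shift(A, B, p):
    """Simulate the ring-shift algorithm: process i owns block rows A(i)
    and C(i); the B(j) block rows circulate, and at the step when process
    i holds B(j) it accumulates C(i) += A(i,j) * B(j)."""
    n = len(A)
    b = n // p
    C = [[0] * n for _ in range(n)]
    held = list(range(p))              # held[i] = index j of B(j) at process i
    for step in range(p):
        for i in range(p):
            j = held[i]
            for r in range(i * b, (i + 1) * b):      # rows of C(i)
                for c in range(n):
                    for k in range(j * b, (j + 1) * b):
                        C[r][c] += A[r][k] * B[k][c]
        held = held[1:] + held[:1]     # shift B blocks to the left neighbor
    return C
```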
- Parallel algorithms for a 2D mesh system:
  - Communication: need to move both the A(i,j) and B(i,j) sub-blocks
(Figure: broadcast-based algorithm on a 3 x 3 process grid, panels (0)-(3). At step k, A(i,k) is replicated along process row i and B(k,j) moves along process column j, so process (i,j) computes C(i,j) += A(i,k)*B(k,j).)
For example: C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2)
(Figure: shift-based algorithm on a 3 x 3 process grid, panels (0)-(2). Initialization: A(i,j) shifts left i steps and B(i,j) shifts up j steps; each subsequent step multiplies the local blocks, then shifts A left and B up by one.)
For example: C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2)
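The initialization shown (A(i,j) left by i steps, B(i,j) up by j steps) is the skew used by Cannon's algorithm. A sequential sketch on a q x q grid, with the block bookkeeping written out explicitly (function and helper names are ours; n divisible by q assumed):

```python
def cannon_matmul(A, B, q):
    """Cannon's algorithm on a q x q process grid, simulated sequentially.
    After the initial skew, process (i,j) always holds matching blocks
    A(i,k) and B(k,j) with k = (i + j + step) % q."""
    n = len(A)
    b = n // q

    def blk(M, i, j):                       # extract n/q x n/q block (i,j)
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # initial skew: A(i,j) shifted left i steps, B(i,j) shifted up j steps
    a = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    C = [[0] * n for _ in range(n)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):              # local block multiply-accumulate
                for r in range(b):
                    for c in range(b):
                        C[i * b + r][j * b + c] += sum(
                            a[i][j][r][k] * bb[i][j][k][c] for k in range(b))
        a = [row[1:] + row[:1] for row in a]   # shift A blocks left by one
        bb = bb[1:] + bb[:1]                   # shift B blocks up by one
    return C
```

Unlike the broadcast variant, every step uses only nearest-neighbor shifts, which maps naturally onto a torus.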
- Sorting is one of the most commonly used and well-studied kernels
- Sorting can be comparison-based or noncomparison-based
- We focus here on comparison-based sorting algorithms
- The fundamental operation of comparison-based sorting is compare-exchange
- The lower bound on any comparison-based sort of n numbers is Ω(n log n)
- Sequential algorithm (odd-even transposition):
  - After n phases of odd-even exchanges, the sequence is sorted
  - Each phase of the algorithm (either odd or even) requires Θ(n) comparisons
  - Serial complexity is Θ(n²)
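The sequential algorithm can be sketched directly (the function name is ours):

```python
def odd_even_sort(a):
    """Odd-even transposition sort: n phases that alternate between
    compare-exchanging (even, even+1) and (odd, odd+1) index pairs."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1   # even phase, then odd phase
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # -> [1, 1, 2, 3, 4, 5, 6, 9]
```

Within one phase no two compare-exchanges share an element, which is exactly what makes each phase fully parallelizable.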
- If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id
- If we have more than one element per processor, we call this operation a compare-split (or merge-split if the two partial lists were sorted)
- Assume each of two processors has n/p elements. After the compare-split operation, the smaller n/p elements are at processor P_i and the larger n/p elements at P_j, where i < j
(Figures: two versions of the compare-split operation, Version 1 and Version 2.)
- Parallel algorithm 1:
  - Divide A (of n elements) into p blocks A(i) of equal size, one block of n/p elements per processor
  - The initial step is a local sort
  - In each subsequent step, the compare-exchange operation is replaced by the merge-split operation
  - After p steps the elements are sorted
- Parallel algorithm (cont.): each parallel step performs
  - merge-split operations on 2 blocks
  - send/receive operations on 2 blocks
- Total cost ~ p*(2n/p) merge-split ops + p*(2n/p) comm ops = 2n merge-split + 2n comm
(Figure: four processes p_0 … p_3 holding blocks A(0) … A(3), merge-splitting with alternating neighbors step by step.)
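The block version can be sketched sequentially: local sort first, then p odd-even phases of merge-split between neighboring blocks (names are ours; n divisible by p assumed):

```python
def block_odd_even_sort(a, p):
    """Block odd-even transposition sort: p blocks of n/p elements each;
    an initial local sort, then p phases of merge-split with alternating
    neighbor pairs."""
    n = len(a)
    b = n // p
    blocks = [sorted(a[i * b:(i + 1) * b]) for i in range(p)]  # local sort
    for phase in range(p):
        start = 0 if phase % 2 == 0 else 1
        for i in range(start, p - 1, 2):
            merged = sorted(blocks[i] + blocks[i + 1])   # merge-split:
            blocks[i] = merged[:b]                       # smaller half stays low
            blocks[i + 1] = merged[b:]                   # larger half goes high
    return [v for blk in blocks for v in blk]
```

With b = 1 this degenerates to plain odd-even transposition sort, since merge-split on single elements is just compare-exchange.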
- Simple and quite efficient
- In p (or 2p) steps of merge-split the array is sorted
- The number of steps can be reduced by testing whether the array is already sorted, but in general it is still O(p)
- Merge-split operations happen only between neighbors
- Can we do merge-split operations between other processors?
- Let n be the number of elements to be sorted and p be the number of processors
- Two phases:
  - During the first phase, processors that are far away from each other in the array compare-split their elements
  - During the second phase, the algorithm switches to an odd-even transposition sort
- An example of the first phase of parallel shell sort
- Each processor performs d = log p compare-split operations
(Example states from the figure:
  0 3 4 5 6 7 2 1
  0 2 4 5 6 7 3 1
  0 2 4 5 1 3 7 6)
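One common formulation of the first phase pairs each block with its mirror inside the current group and halves the group size every round, giving d = log p compare-split rounds; the exact pairing used on the slide may differ, so treat this as an assumed variant (names are ours):

```python
def shell_phase(blocks):
    """Sketch of parallel shell sort's first phase: within each group,
    block i compare-splits with its mirror block (the group's last minus
    i); the group then halves, for log p rounds in total."""
    p = len(blocks)
    b = len(blocks[0])
    group = p
    while group > 1:
        for base in range(0, p, group):
            for i in range(group // 2):
                lo, hi = base + i, base + group - 1 - i
                merged = sorted(blocks[lo] + blocks[hi])   # compare-split
                blocks[lo], blocks[hi] = merged[:b], merged[b:]
        group //= 2
    return blocks
```

After this phase most elements are close to their final blocks, so the second-phase odd-even transposition usually finishes in far fewer than p steps.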