Parallel Computing: Parallel Algorithm Design Examples (Jin, Hai)


1 Parallel Computing: Parallel Algorithm Design Examples
Jin, Hai
School of Computer Science and Technology
Huazhong University of Science and Technology

2 Parallel Reduction
- Given an associative operator ⊕, compute a_0 ⊕ a_1 ⊕ a_2 ⊕ ... ⊕ a_(n-1)
- Examples: add; multiply; and, or; maximum, minimum
- Parallel reduction = divide & conquer
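
Because the operator is associative, the operands can be combined pairwise in a tree: each step halves the number of partial results, so n values are reduced in ceil(log2 n) parallel steps. A minimal C sketch that simulates the parallel steps serially (function and variable names are illustrative, not from the slides):

    #include <stdio.h>

    /* Tree-based reduction: at stride s, element i absorbs element i+s.
     * With one processor per element, the s-loop iterations are the
     * parallel steps; the inner loop would run concurrently. */
    int reduce_sum(int *a, int n) {
        for (int s = 1; s < n; s *= 2)
            for (int i = 0; i + s < n; i += 2 * s)
                a[i] += a[i + s];
        return a[0];
    }

    int main(void) {
        int a[8] = {3, 1, 4, 1, 5, 9, 2, 6};
        printf("sum = %d\n", reduce_sum(a, 8));   /* prints sum = 31 */
        return 0;
    }

The same skeleton works for any of the listed operators (multiply, and/or, max/min) by swapping the combining operation.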

3-5 Parallel Reduction Evolution (figures only)

6-10 Finding Global Sum (figures only; the example's global sum is 25, and the communication pattern is labeled Binomial Tree)

11-12 Agglomeration (figures only; agglomerated task groups each compute a partial sum)

13 Mapping (figure only: binomial tree mapped onto processors)

14 Mapping (figure). Question: how to add everything up to node 00 on a 2D torus?

15 (figure only)

16 Mapping (figure). Question: how to broadcast a message from node 0000 on a hypercube?
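
One standard answer is recursive doubling: in step k, every node that already holds the message forwards it to the neighbor whose address differs in bit k, so a broadcast from node 0000 on a d-dimensional hypercube finishes in d = log p steps. A serial simulation sketch (array size and names are illustrative):

    #include <stdio.h>

    /* Recursive-doubling broadcast from node 0 on a d-dimensional hypercube:
     * in step k, each holder forwards across dimension k (flip bit k). */
    void hypercube_broadcast(int d) {
        int p = 1 << d;
        int has_msg[1 << 4] = {0};       /* supports d <= 4 */
        has_msg[0] = 1;                  /* source: node 0000 */
        for (int k = 0; k < d; k++)
            for (int node = 0; node < p; node++)
                if (has_msg[node])
                    has_msg[node ^ (1 << k)] = 1;   /* partner in dim k */
        for (int node = 0; node < p; node++)
            printf("node %2d: %s\n", node, has_msg[node] ? "has message" : "-");
    }

    int main(void) { hypercube_broadcast(4); return 0; }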

17
- Vector x with components x_i, i = 0 .. n-1: x = [x_0, x_1, ..., x_(n-2), x_(n-1)]^T
- Matrix A has m*n elements:

          | a_00      a_01      ...  a_0(n-1)     |
      A = | a_10      a_11      ...  a_1(n-1)     |
          | ...       ...            ...          |
          | a_(m-1)0  a_(m-1)1  ...  a_(m-1)(n-1) |

18 Matrix-vector product y = A*x (y is an m-vector):

      | y_0     |   | a_00      a_01      ...  a_0(n-1)     |   | x_0     |
      | y_1     | = | a_10      a_11      ...  a_1(n-1)     | * | x_1     |
      | ...     |   | ...                                   |   | ...     |
      | y_(m-1) |   | a_(m-1)0  a_(m-1)1  ...  a_(m-1)(n-1) |   | x_(n-1) |

19 An expanded form:

      y_0     = a_00 x_0     + a_01 x_1     + ... + a_0(n-1) x_(n-1)
      y_1     = a_10 x_0     + a_11 x_1     + ... + a_1(n-1) x_(n-1)
      ...
      y_i     = a_i0 x_0     + a_i1 x_1     + ... + a_i(n-1) x_(n-1)
      ...
      y_(m-1) = a_(m-1)0 x_0 + a_(m-1)1 x_1 + ... + a_(m-1)(n-1) x_(n-1)
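
The expanded form maps directly onto the sequential algorithm, one dot product per output element (a straightforward C sketch; the 3x3 sizes are illustrative):

    #include <stdio.h>

    #define M 3
    #define N 3

    /* Sequential matrix-vector product y = A*x: one dot product per row. */
    void matvec(const double A[M][N], const double x[N], double y[M]) {
        for (int i = 0; i < M; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];    /* y_i = sum_j a_ij * x_j */
        }
    }

    int main(void) {
        double A[M][N] = {{1,2,3},{4,5,6},{7,8,9}};
        double x[N] = {1, 1, 1}, y[M];
        matvec(A, x, y);
        for (int i = 0; i < M; i++) printf("y[%d] = %g\n", i, y[i]);  /* 6 15 24 */
        return 0;
    }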

20
- Addition is commutative (a+b = b+a) and associative ((a+b)+c = a+(b+c))
- For regular data structures: keep the data structure regular, e.g. use row-wise, column-wise, or mixed partitioning schemes
- Aim for regular and local communication patterns

21
- Partition: divide the matrix into rows; each primitive task holds one row and the two scalars x_i and y_i
- Communication: each primitive task must eventually see every element x_j of x
- Organize tasks into a ring

22 Agglomeration and mapping
- Fixed number of tasks, each requiring the same amount of computation
- Regular communication among tasks
- Strategy: assign each process a contiguous group of rows

23 Notation (assume m = n)
- A(i) is the n/p by n block row that process i owns
- x(i) and y(i) (both n/p by 1) similarly refer to the segments of x and y owned by process i
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- Process i uses the formula y(i) = y(i) + A(i)*x = y(i) + Σ_j A(i,j)*x(j)

24 Example with blocks of size 2:

      | y_0 |   | a_00 a_01 | | x_0 |   | a_02 a_03 | | x_2 |         | a_0(n-2) a_0(n-1) | | x_(n-2) |
      | y_1 | = | a_10 a_11 | | x_1 | + | a_12 a_13 | | x_3 | + ... + | a_1(n-2) a_1(n-1) | | x_(n-1) |

      | y_2 |   | a_20 a_21 | | x_0 |   | a_22 a_23 | | x_2 |         | a_2(n-2) a_2(n-1) | | x_(n-2) |
      | y_3 | = | a_30 a_31 | | x_1 | + | a_32 a_33 | | x_3 | + ... + | a_3(n-2) a_3(n-1) | | x_(n-1) |

  i.e., in block notation:

      y(0) = A(0,0)x(0) + A(0,1)x(1) + ... + A(0,p-1)x(p-1)
      y(1) = A(1,0)x(0) + A(1,1)x(1) + ... + A(1,p-1)x(p-1)

25 1D array/ring system:
- Algorithm 1 (broadcast), for processor i:
      broadcast x(i)
      store all x(j)s in a full vector x
      compute y(i) = y(i) + A(i)*x
  Needs a temporary vector x of size n.
- Algorithm 2 (broadcast):
      for j = 0 to p-1:
          processor j broadcasts x(j)
          all processors compute y(i) = y(i) + A(i,j)*x(j)
  Needs only a temporary vector of size n/p.
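
A minimal MPI sketch of Algorithm 2 (row-major block-row storage, p dividing n, and all names are illustrative assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    /* Algorithm 2: each process owns a block row A_loc (n/p x n, row-major),
     * plus its x block x_loc and y block y_loc (both n/p). At step j, x(j)
     * is broadcast and every process applies its sub-block A(i,j), so the
     * temporary buffer holds only n/p values. */
    void matvec_bcast(const double *A_loc, const double *x_loc, double *y_loc,
                      int n, int p, int rank) {
        int b = n / p;                          /* block size; assume p | n */
        double *xbuf = malloc(b * sizeof(double));
        for (int i = 0; i < b; i++) y_loc[i] = 0.0;
        for (int j = 0; j < p; j++) {
            if (rank == j)
                for (int i = 0; i < b; i++) xbuf[i] = x_loc[i];
            MPI_Bcast(xbuf, b, MPI_DOUBLE, j, MPI_COMM_WORLD);
            for (int r = 0; r < b; r++)         /* y(i) += A(i,j) * x(j) */
                for (int c = 0; c < b; c++)
                    y_loc[r] += A_loc[r * n + j * b + c] * xbuf[c];
        }
        free(xbuf);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int n = 4 * p, b = n / p;
        double *A = malloc(b * n * sizeof(double));
        double *x = malloc(b * sizeof(double)), *y = malloc(b * sizeof(double));
        for (int i = 0; i < b * n; i++) A[i] = 1.0;   /* toy data: all ones */
        for (int i = 0; i < b; i++)     x[i] = 1.0;   /* so each y entry = n */
        matvec_bcast(A, x, y, n, p, rank);
        free(A); free(x); free(y);
        MPI_Finalize();
        return 0;
    }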

26 Broadcast algorithm illustrated (figure: snapshots of A, x, y on p_0..p_3; initially p_i holds A(i), x(i), y(i); then p_i broadcasts x(i) at step i, so at step j every process holds x(j) and applies its block A(*,j))

27
      y(0) = A(0,0)x(0) + A(0,1)x(1) + A(0,2)x(2) + A(0,3)x(3)
      y(1) = A(1,0)x(0) + A(1,1)x(1) + A(1,2)x(2) + A(1,3)x(3)
      y(2) = A(2,0)x(0) + A(2,1)x(1) + A(2,2)x(2) + A(2,3)x(3)
      y(3) = A(3,0)x(0) + A(3,1)x(1) + A(3,2)x(2) + A(3,3)x(3)

28 1D array/ring system (cont.):
- Algorithm 3 (shift), for processor i:
      compute y(i) = y(i) + A(i,i)*x(i)
      for k = 1 to p-1:
          shift the x block to the left neighbor; process i now holds x(j), j = (i+k) % p
          compute y(i) = y(i) + A(i,j)*x(j)
- No broadcast needed: a one-to-all broadcast costs (t_s + t_w*m) log p, whereas each shift step is a send/recv pair costing 2(t_s + t_w*m), done concurrently on all processes.
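
The corresponding MPI sketch of Algorithm 3, with MPI_Sendrecv_replace doing the ring shift (same illustrative layout as the broadcast sketch, which can drive it with the same main):

    #include <mpi.h>
    #include <stdlib.h>

    /* Algorithm 3: start with the diagonal block A(i,i)*x(i), then shift the
     * x blocks around the ring; after k shifts process i holds x((i+k) % p). */
    void matvec_shift(const double *A_loc, const double *x_loc, double *y_loc,
                      int n, int p, int rank) {
        int b = n / p;                            /* block size; assume p | n */
        int left  = (rank - 1 + p) % p;           /* we send our block left   */
        int right = (rank + 1) % p;               /* and receive from right   */
        double *xbuf = malloc(b * sizeof(double));
        for (int i = 0; i < b; i++) { y_loc[i] = 0.0; xbuf[i] = x_loc[i]; }
        for (int k = 0; k < p; k++) {
            int j = (rank + k) % p;               /* owner of the block held  */
            for (int r = 0; r < b; r++)           /* y(i) += A(i,j) * x(j)    */
                for (int c = 0; c < b; c++)
                    y_loc[r] += A_loc[r * n + j * b + c] * xbuf[c];
            if (k < p - 1)
                MPI_Sendrecv_replace(xbuf, b, MPI_DOUBLE, left, 0,
                                     right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        free(xbuf);
    }

Each step moves only n/p elements per process, and all p send/receive pairs proceed concurrently, matching the 2(t_s + t_w*m) per-step cost noted above.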

29 Shift algorithm illustrated (figure: snapshots of A, x, y on p_0..p_3; processes shift the x(i)s around the ring step by step, so at step k process i holds x((i+k) % p) and applies A(i,(i+k) % p))

30 The shift algorithm yields the same sums, accumulated in a different order:
      y(0) = A(0,0)x(0) + A(0,1)x(1) + A(0,2)x(2) + A(0,3)x(3)
      y(1) = A(1,0)x(0) + A(1,1)x(1) + A(1,2)x(2) + A(1,3)x(3)
      y(2) = A(2,0)x(0) + A(2,1)x(1) + A(2,2)x(2) + A(2,3)x(3)
      y(3) = A(3,0)x(0) + A(3,1)x(1) + A(3,2)x(2) + A(3,3)x(3)

31 2D mesh system:
- A 2D blocked layout uses (column) broadcast and (row) reduction functions on a subset of processes
- The grid is sqrt(p) x sqrt(p) for a square processor grid
(figure: 1D layout P0..P3 vs. a 4x4 grid P0..P15)

32 Computing C = C + A*B (assume matrices are n x n; c(i) is the ith row of C and a(i) the ith row of A):

      for i = 0 to n-1
          c(i) = c(i) + a(i)*B            // computing the ith row

  or

      for i = 0 to n-1
          for j = 0 to n-1
              c_ij = c_ij + Σ_k a_ik b_kj

  or

      for i = 0 to n-1
          for j = 0 to n-1
              for k = 0 to n-1
                  c_ij = c_ij + a_ik b_kj

33 Row-oriented Algorithm: inner product (dot product) operations (figure)

34-35 Comparing Sequential Performance (figures)
- The block version is better: computing a row of C with the row-oriented version requires accessing every element of B, and B gets too big for the cache

36 Block Matrix Multiplication
- Replace scalar multiplication with matrix multiplication
- Replace scalar addition with matrix addition
(figure)

37 2D block distribution; block matrix multiplication (figure)

38 Block Matrix Multiplication: continue to divide until the blocks are small enough

39
      | C(0,0) C(0,1) C(0,2) |   | A(0,0) A(0,1) A(0,2) |   | B(0,0) B(0,1) B(0,2) |
      | C(1,0) C(1,1) C(1,2) | = | A(1,0) A(1,1) A(1,2) | * | B(1,0) B(1,1) B(1,2) |
      | C(2,0) C(2,1) C(2,2) |   | A(2,0) A(2,1) A(2,2) |   | B(2,0) B(2,1) B(2,2) |
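
A serial C sketch of block matrix multiplication, where the loop body is exactly the scalar algorithm applied to BS x BS sub-blocks (N and BS are illustrative; BS would be chosen so the working set fits in cache):

    #include <stdio.h>

    #define N  8
    #define BS 4   /* block size; assume BS divides N */

    /* Blocked C = C + A*B: the three innermost loops multiply one pair of
     * BS x BS sub-blocks, reusing them while they are cache-resident. */
    void block_matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int ib = 0; ib < N; ib += BS)
            for (int jb = 0; jb < N; jb += BS)
                for (int kb = 0; kb < N; kb += BS)
                    for (int i = ib; i < ib + BS; i++)
                        for (int j = jb; j < jb + BS; j++)
                            for (int k = kb; k < kb + BS; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
        block_matmul(A, B, C);
        printf("C[0][0] = %g (expect %d)\n", C[0][0], N);
        return 0;
    }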

40 (figure only)

41 Parallel algorithm for a 1D array/ring system:
- Partitioning: divide the matrices into rows; each primitive task has the corresponding rows of all three matrices
- Communication: each task must eventually see every row of B; organize tasks into a ring
- Agglomeration and mapping: assign each process a contiguous group of rows

42 Assume n is divisible by p
- A(i) refers to the n/p by n block row that process i owns (similarly for B(i) and C(i))
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- Formula: C(i) = C(i) + A(i)*B = C(i) + Σ_j A(i,j)*B(j)
- Don't accumulate the whole B on each processor: a memory-usage issue (not scalable)
- Instead, A(i) is further divided into the A(i,j) blocks and the B(j) blocks are moved during the computation

43 1D block row-wise distribution: A(i), B(i), and C(i) are n/p by n sub-blocks

      | C(0) |   | A(0) |   | B(0) |
      | C(1) | = | A(1) | * | B(1) |
      | C(2) |   | A(2) |   | B(2) |

44 1D block row-wise distribution:
- A(i) is further partitioned column-wise into n/p by n/p sub-blocks A(i,j)
- C(i) = C(i) + Σ_j A(i,j)*B(j) (e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2))

      | C(0) |   | A(0,0) A(0,1) A(0,2) |   | B(0) |
      | C(1) | = | A(1,0) A(1,1) A(1,2) | * | B(1) |
      | C(2) |   | A(2,0) A(2,1) A(2,2) |   | B(2) |

45 Broadcast version illustrated (figure: snapshots of A, B, C on p_0..p_3; p_i broadcasts B(i) at step i, so at step j every process holds B(j) and applies its block A(*,j))

46 Shift version illustrated (figure: snapshots of A, B, C on p_0..p_3; processes shift the B(i)s around the ring step by step, so at step k process i holds B((i+k) % p) and applies A(i,(i+k) % p))

47 Parallel algorithms for a 2D mesh system. Communication: need to move both the A(i,j) and the B(i,j) blocks

48 2D mesh, broadcast version (figure: at step k, A(i,k) is replicated along row i and B(k,j) along column j, and every process accumulates one term per step; e.g. C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2))

49 2D mesh, shift version (figure). Initialization: A(i,j) shifts left i steps and B(i,j) shifts up j steps, so process (i,j) starts with A(i,(i+j) % q) and B((i+j) % q, j); again C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2)
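
The initialization shown here is the skewing step of Cannon's algorithm: after it, q = sqrt(p) rounds of local multiply, shift A left by one, shift B up by one route every pair A(i,k), B(k,j) through process (i,j). A serial simulation over a q x q grid, with one scalar standing in for each block (names are illustrative):

    #include <stdio.h>

    #define Q 3   /* Q x Q process grid; one scalar stands in for each block */

    /* Cannon's algorithm, simulated serially: skew A left by i and B up by j,
     * then Q rounds of local multiply + unit shifts (A left, B up). */
    void cannon(const double A[Q][Q], const double B[Q][Q], double C[Q][Q]) {
        double a[Q][Q], b[Q][Q], ta[Q][Q], tb[Q][Q];
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++) {
                a[i][j] = A[i][(j + i) % Q];   /* A(i,j) shifted left i steps */
                b[i][j] = B[(i + j) % Q][j];   /* B(i,j) shifted up j steps   */
                C[i][j] = 0.0;
            }
        for (int step = 0; step < Q; step++) {
            for (int i = 0; i < Q; i++)
                for (int j = 0; j < Q; j++) {
                    C[i][j] += a[i][j] * b[i][j];    /* local multiply      */
                    ta[i][j] = a[i][(j + 1) % Q];    /* shift A left by one */
                    tb[i][j] = b[(i + 1) % Q][j];    /* shift B up by one   */
                }
            for (int i = 0; i < Q; i++)
                for (int j = 0; j < Q; j++) { a[i][j] = ta[i][j]; b[i][j] = tb[i][j]; }
        }
    }

    int main(void) {
        double A[Q][Q] = {{1,2,3},{4,5,6},{7,8,9}};
        double B[Q][Q] = {{1,0,0},{0,1,0},{0,0,1}};   /* identity: C = A */
        double C[Q][Q];
        cannon(A, B, C);
        printf("C[1][2] = %g (expect 6)\n", C[1][2]);
        return 0;
    }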

50 Sorting
- One of the most commonly used and well-studied kernels
- Sorting can be comparison-based or non-comparison-based
- We focus here on comparison-based sorting algorithms
- The fundamental operation of comparison-based sorting is compare-exchange
- The lower bound on any comparison-based sort of n numbers is Θ(n log n)

51 Sequential algorithm (figure)

52 Sequential algorithm (odd-even transposition sort):
- After n phases of odd-even exchanges, the sequence is sorted
- Each phase of the algorithm (either odd or even) requires Θ(n) comparisons
- Serial complexity is Θ(n^2)
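
A C sketch of the sequential odd-even transposition sort described above (phase parity picks which pairs are compared):

    #include <stdio.h>

    /* Odd-even transposition sort: phase p compares adjacent pairs starting
     * at index p % 2; after n phases the array is sorted. */
    void odd_even_sort(int *a, int n) {
        for (int phase = 0; phase < n; phase++)
            for (int i = phase % 2; i + 1 < n; i += 2)
                if (a[i] > a[i + 1]) {               /* compare-exchange */
                    int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
                }
    }

    int main(void) {
        int a[] = {5, 2, 9, 1, 7, 3};
        odd_even_sort(a, 6);
        for (int i = 0; i < 6; i++) printf("%d ", a[i]);  /* 1 2 3 5 7 9 */
        printf("\n");
        return 0;
    }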

53
- If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id
- With more than one element per processor, the operation is a compare-split (or merge-split if the two partial lists are already sorted)
- Assume each of the two processors holds n/p elements: after the compare-split operation, the smaller n/p elements are at processor P_i and the larger n/p elements at P_j, where i < j
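
A sketch of the merge-split step on two sorted blocks of equal length: merge, then keep the lower half on the lower-ranked side (a serial helper; in the parallel setting each processor first exchanges its block with its partner and then keeps its half):

    #include <stdlib.h>
    #include <string.h>

    /* Merge two sorted blocks of length m, then split: lo keeps the m
     * smallest elements, hi the m largest. */
    void merge_split(int *lo, int *hi, int m) {
        int *tmp = malloc(2 * m * sizeof(int));
        int i = 0, j = 0;
        for (int k = 0; k < 2 * m; k++)               /* two-way merge */
            if (j >= m || (i < m && lo[i] <= hi[j])) tmp[k] = lo[i++];
            else                                     tmp[k] = hi[j++];
        memcpy(lo, tmp,     m * sizeof(int));         /* smaller half to P_i */
        memcpy(hi, tmp + m, m * sizeof(int));         /* larger half to P_j  */
        free(tmp);
    }

    int main(void) {
        int a[] = {1, 4, 9}, b[] = {2, 3, 8};
        merge_split(a, b, 3);    /* a = {1,2,3}, b = {4,8,9} */
        return 0;
    }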

54-55 Version 1 and Version 2 of the compare-split operation (figures only)

56 Parallel algorithm 1:
- Divide A (of n elements) into p blocks of equal size A(i), one block of n/p elements per processor
- The initial step is a local sort
- In each subsequent step, the compare-exchange operation is replaced by the merge-split operation
- After p steps the elements are sorted (an MPI sketch follows)
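
A minimal MPI sketch of parallel algorithm 1, reusing the merge_split helper above; qsort provides the local sort, and the phase parity determines each process's partner (block size m = n/p and all names are illustrative assumptions):

    #include <mpi.h>
    #include <stdlib.h>

    void merge_split(int *lo, int *hi, int m);   /* from the sketch above */

    static int cmp_int(const void *x, const void *y) {
        int a = *(const int *)x, b = *(const int *)y;
        return (a > b) - (a < b);
    }

    /* Odd-even transposition over p processes, one block of m elements
     * each; after p phases the global sequence is sorted. */
    void parallel_odd_even(int *block, int m, int p, int rank) {
        qsort(block, m, sizeof(int), cmp_int);        /* initial local sort */
        int *other = malloc(m * sizeof(int));
        for (int phase = 0; phase < p; phase++) {
            int partner = (phase % 2 == rank % 2) ? rank + 1 : rank - 1;
            if (partner < 0 || partner >= p) continue;    /* idle this phase */
            MPI_Sendrecv(block, m, MPI_INT, partner, 0,
                         other, m, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (rank < partner) merge_split(block, other, m); /* keep low half  */
            else                merge_split(other, block, m); /* keep high half */
        }
        free(other);
    }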

57 Parallel algorithm (figure: blocks A(0)..A(3) on p_0..p_3 across successive steps):
- Each parallel step: merge-split operations on 2 blocks, send/receive operations on 2 blocks
- Total cost ~ p*(2n/p) merge-split ops + p*(2n/p) comm ops = 2n merge-split + 2n comm

58
- Simple and quite efficient
- In p (or 2p) merge-split steps the array is sorted
- The number of steps can be reduced by testing whether the array is already sorted, but in general it is still O(p)
- Merge-split operations occur only between neighbors; can we do merge-split operations between other processors?

59 Parallel shellsort
- Let n be the number of elements to be sorted and p the number of processors
- Two phases:
  - In the first phase, processors that are far away from each other in the array compare-split their elements
  - In the second phase, the algorithm switches to an odd-even transposition sort

60 An example of the first phase of parallel shellsort (figure): each processor performs d = log p compare-split operations
