Parallel Computing: Parallel Algorithm Design Examples Jin, Hai
1 Parallel Computing: Parallel Algorithm Design Examples. Jin, Hai, School of Computer Science and Technology, Huazhong University of Science and Technology
2 - Given an associative operator ⊕, compute a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ a(n-1)
- Examples of ⊕: add; multiply; and, or; maximum, minimum
- Parallel reduction: divide & conquer
3-5 Parallel Reduction Evolution (figures only)
6-10 Finding Global Sum (figures: partial sums are combined pairwise along a binomial tree; the example's global sum is 25)
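The pairwise combining in these figures can be sketched serially as below; the function name and sample values are illustrative (though, like the slides' example, the sample sums to 25).

```python
def tree_reduce(values, op):
    """Reduce a list with an associative operator by pairwise combining,
    mirroring the binomial-tree pattern: ceil(log2(n)) parallel steps."""
    vals = list(values)
    stride = 1
    while stride < len(vals):
        # in one parallel step, each "left" node absorbs its partner
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]
```

Any associative operator from slide 2 (add, multiply, and/or, maximum, minimum) can be plugged in as `op`.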
11-12 Agglomeration (figures: primitive tasks are agglomerated so each process computes a local sum before combining)
13 Mapping: Binomial Tree (figure)
14-15 Mapping question: how to add all values up to node 00 on a 2-D torus? (figures)
16 Mapping question: how to broadcast a message from node 0000 on a hypercube? (figure)
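The hypercube broadcast asked about above can be sketched as follows; this is an illustrative simulation (the function name is made up) that records the step at which each node receives the message.

```python
def hypercube_broadcast(p, root=0):
    """One-to-all broadcast on a p-node hypercube (p a power of two).
    At step k, every node that already has the message forwards it to
    its neighbor across dimension k. Returns {node: step received}."""
    d = p.bit_length() - 1              # number of dimensions, p = 2**d
    recv_step = {root: 0}
    for k in range(d):
        for node in list(recv_step):    # snapshot: only current holders send
            partner = node ^ (1 << k)   # flip bit k to get the neighbor
            recv_step.setdefault(partner, k + 1)
    return recv_step
```

After log2(p) steps every node has the message, which is where the (ts + tw*m)*log p broadcast cost cited later comes from.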
17 - A vector x with components x(i), i = 0 … n-1: x = [x0, x1, …, x(n-2), x(n-1)]^T
- A matrix A with m*n elements:

      | a00      a01      …  a0(n-1)      |
  A = | a10      a11      …  a1(n-1)      |
      | …                                 |
      | a(m-1)0  a(m-1)1  …  a(m-1)(n-1)  |
18 - Matrix-vector product: y = A*x (y is an m-element vector)

  | y0     |   | a00      a01      …  a0(n-1)      |   | x0     |
  | y1     | = | a10      a11      …  a1(n-1)      | * | x1     |
  | …      |   | …                                 |   | …      |
  | y(m-1) |   | a(m-1)0  a(m-1)1  …  a(m-1)(n-1)  |   | x(n-1) |
19 - An expanded form:
  y0     = a00*x0     + a01*x1     + … + a0(n-1)*x(n-1)
  y1     = a10*x0     + a11*x1     + … + a1(n-1)*x(n-1)
  …
  yi     = ai0*x0     + ai1*x1     + … + ai(n-1)*x(n-1)
  …
  y(m-1) = a(m-1)0*x0 + a(m-1)1*x1 + … + a(m-1)(n-1)*x(n-1)
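The expanded form above translates directly into a sequential kernel; this minimal sketch (name is illustrative) is the baseline the parallel versions decompose.

```python
def mat_vec(A, x):
    """y = A*x for an m-by-n matrix A (list of rows), following the
    expanded form: y_i = sum_j a_ij * x_j."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
```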
20 - Addition is:
  - Commutative: a+b = b+a
  - Associative: (a+b)+c = a+(b+c)
- For regular data structures:
  - Keep the data structure regular, e.g. use row-wise, column-wise, or mixed partitioning schemes
  - Regular and local communication patterns
21 - Partitioning:
  - Divide the matrix into rows
  - Each primitive task has one row and two scalars, x(i) and y(i)
- Communication:
  - Each primitive task must eventually see every element of x
  - Organize tasks into a ring
22 - Agglomeration and mapping:
  - Fixed number of tasks, each requiring the same amount of computation
  - Regular communication among tasks
  - Strategy: assign each process a contiguous group of rows
23 - A(i) refers to the n/p by n block row that process i owns (assume m = n)
- x(i) and y(i) (both n/p by 1) similarly refer to the segments of x and y owned by process i
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- Process i uses the formula y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j)
24 - With n/p = 2 elements per block:

  | y0 |   | a00 a01 | | x0 |   | a02 a03 | | x2 |         | a0(n-2) a0(n-1) | | x(n-2) |
  | y1 | = | a10 a11 | | x1 | + | a12 a13 | | x3 | + … + | a1(n-2) a1(n-1) | | x(n-1) |

  | y2 |   | a20 a21 | | x0 |   | a22 a23 | | x2 |         | a2(n-2) a2(n-1) | | x(n-2) |
  | y3 | = | a30 a31 | | x1 | + | a32 a33 | | x3 | + … + | a3(n-2) a3(n-1) | | x(n-1) |

  y(0) = A(0,0)*x(0) + A(0,1)*x(1) + … + A(0,p-1)*x(p-1)
  y(1) = A(1,0)*x(0) + A(1,1)*x(1) + … + A(1,p-1)*x(p-1)
25 - 1D array/ring system:
- Algorithm 1 (broadcast): for processor i
    broadcast x(i)
    store all the x(j)s in a temporary vector x
    compute y(i) = y(i) + A(i)*x
  Needs a temporary vector x of size n.
- Algorithm 2 (broadcast): for j = 0 to p-1
    processor j broadcasts x(j)
    all processors compute y(i) = y(i) + A(i,j)*x(j)
  Needs only a temporary vector of size n/p.
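A serial simulation of Algorithm 2 can make the n/p-temporary point concrete; the function below is a sketch (names are illustrative), with the processor loop standing in for parallel execution.

```python
def ring_matvec_broadcast(A, x, p):
    """Simulate Algorithm 2 for an n x n matrix (n divisible by p).
    Processor i owns block row A(i) and output segment y(i); in step j
    the segment x(j) is broadcast and every processor adds A(i,j)*x(j)
    into y(i), so only an n/p-sized temporary is ever needed."""
    n = len(A)
    b = n // p
    y = [[0] * b for _ in range(p)]           # y(i) on each processor
    for j in range(p):                        # step j: broadcast x(j)
        xj = x[j * b:(j + 1) * b]             # the n/p-sized temporary
        for i in range(p):                    # all processors, in parallel
            for r in range(b):
                y[i][r] += sum(A[i * b + r][j * b + c] * xj[c]
                               for c in range(b))
    return [v for seg in y for v in seg]      # concatenated y(i) segments
```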
26 (figures A-C: four processors p0-p3, each initially holding A(i), x(i), y(i). Broadcast: p(i) broadcasts x(i) at step i, and every processor p(j) then accumulates A(j,i)*x(i) into y(j))
27 y(0) = A(0,0)*x(0) + A(0,1)*x(1) + A(0,2)*x(2) + A(0,3)*x(3)
   y(1) = A(1,0)*x(0) + A(1,1)*x(1) + A(1,2)*x(2) + A(1,3)*x(3)
   y(2) = A(2,0)*x(0) + A(2,1)*x(1) + A(2,2)*x(2) + A(2,3)*x(3)
   y(3) = A(3,0)*x(0) + A(3,1)*x(1) + A(3,2)*x(2) + A(3,3)*x(3)
28 - 1D array/ring system (cont.):
- Algorithm 3 (shift): for processor i
    compute y(i) = y(i) + A(i,i)*x(i)
    for k = 1 to p-1:
      shift the current x segment to the left neighbor; processor i now holds x(j), j = (i+k)%p
      compute y(i) = y(i) + A(i,j)*x(j)
  No broadcast needed (a one-to-all broadcast costs (ts + tw*m)*log p); each step is a send/recv pair costing 2(ts + tw*m), done concurrently on all processors.
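Algorithm 3 can be simulated the same way; this sketch (names illustrative) rotates the held x segments to model the ring shift.

```python
def ring_matvec_shift(A, x, p):
    """Simulate Algorithm 3 (shift) for an n x n matrix (n divisible
    by p). Processor i starts with x(i); after k shifts around the
    ring it holds x((i+k) mod p) and multiplies it by A(i,(i+k)%p)."""
    n = len(A)
    b = n // p
    held = [x[i * b:(i + 1) * b] for i in range(p)]   # local x segment
    y = [[0] * b for _ in range(p)]
    for k in range(p):
        for i in range(p):
            j = (i + k) % p               # index of the segment now held
            for r in range(b):
                y[i][r] += sum(A[i * b + r][j * b + c] * held[i][c]
                               for c in range(b))
        held = held[1:] + held[:1]        # everyone sends to its left neighbor
    return [v for seg in y for v in seg]
```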
29 (figures A-C: shift algorithm on p0-p3. Processors shift the x(i) segments around the ring step by step, multiplying the held segment by the matching A(i,j) sub-block at each step)
30 y(0) = A(0,0)*x(0) + A(0,1)*x(1) + A(0,2)*x(2) + A(0,3)*x(3)
   y(1) = A(1,0)*x(0) + A(1,1)*x(1) + A(1,2)*x(2) + A(1,3)*x(3)
   y(2) = A(2,0)*x(0) + A(2,1)*x(1) + A(2,2)*x(2) + A(2,3)*x(3)
   y(3) = A(3,0)*x(0) + A(3,1)*x(1) + A(3,2)*x(2) + A(3,3)*x(3)
31 - 2D mesh system:
- A 2D blocked layout uses (column) broadcast and (row) reduction functions on a subset of processes
- Each subset has sqrt(p) processes for a square processor grid
  (figure: a 4x4 grid of processors P0-P15)
32 - Computing C = C + A*B (assume the matrices are n x n; c(i) is the ith row of C and a(i) the ith row of A):
    for i = 0 to n-1:
      c(i) = c(i) + a(i)*B    // compute the ith row
- or:
    for i = 0 to n-1:
      for j = 0 to n-1:
        cij = cij + Σk aik*bkj
- or:
    for i = 0 to n-1:
      for j = 0 to n-1:
        for k = 0 to n-1:
          cij = cij + aik*bkj
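The innermost triple-loop form runs as written; a minimal sketch (the function name is illustrative):

```python
def mat_mul_acc(C, A, B):
    """C = C + A*B via the triple loop above; C is updated in place
    and also returned for convenience."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```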
33 Row-oriented algorithm: each element of C is an inner product (dot product) of a row of A and a column of B (figure)
34-35 Comparing sequential performance: the block version is better, because computing a row of C with the row-oriented version requires accessing every element of B, and B gets too big for the cache (performance figures)
36 Block Matrix Multiplication
- Replace scalar multiplication with matrix (block) multiplication
- Replace scalar addition with matrix (block) addition
  (figure)
37 2D block distribution and block matrix multiplication (figures)
38 Block Matrix Multiplication: continue to divide until the blocks are small enough (figure)
39 | C(0,0) C(0,1) C(0,2) |   | A(0,0) A(0,1) A(0,2) |   | B(0,0) B(0,1) B(0,2) |
   | C(1,0) C(1,1) C(1,2) | = | A(1,0) A(1,1) A(1,2) | * | B(1,0) B(1,1) B(1,2) |
   | C(2,0) C(2,1) C(2,2) |   | A(2,0) A(2,1) A(2,2) |   | B(2,0) B(2,1) B(2,2) |
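The block equation above can be sketched in code, replacing scalar multiply/add with b x b block multiply/add; names and the blocking parameter are illustrative.

```python
def block_mat_mul(A, B, b):
    """n x n product where the scalar operations are replaced by b x b
    block operations: C(I,J) += A(I,K) * B(K,J) over block indices."""
    n = len(A)
    q = n // b                      # q x q grid of blocks
    C = [[0] * n for _ in range(n)]
    for I in range(q):
        for J in range(q):
            for K in range(q):      # block-level "scalar" multiply-add
                for i in range(b):
                    for j in range(b):
                        C[I * b + i][J * b + j] += sum(
                            A[I * b + i][K * b + k] * B[K * b + k][J * b + j]
                            for k in range(b))
    return C
```

With b = 1 this degenerates to the ordinary triple loop, so any block size gives the same result.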
41 Parallel algorithm for a 1D array/ring system:
- Partitioning:
  - Divide the matrices into rows
  - Each primitive task has the corresponding rows of the three matrices
- Communication:
  - Each task must eventually see every row of B
  - Organize tasks into a ring
- Agglomeration and mapping:
  - Assign each process a contiguous group of rows
42 - Assume n is divisible by p
- A(i) refers to the n/p by n block row that process i owns (similarly for B(i) and C(i))
- A(i,j) is the n/p by n/p sub-block of A(i) in columns j*n/p through (j+1)*n/p - 1
- The formula: C(i) = C(i) + A(i)*B = C(i) + Σj A(i,j)*B(j)
- Don't accumulate the whole B on each processor: that is a memory-usage issue (not scalable)
- Instead, A(i) is further divided into the A(i,j)s, and the B(i) blocks move during the computation
43 - 1D block row-wise distribution: A(i), B(i), and C(i) are n/p by n sub-blocks

  | C(0) |   | A(0) |   | B(0) |
  | C(1) | = | A(1) | * | B(1) |
  | C(2) |   | A(2) |   | B(2) |
44 - 1D block row-wise distribution:
- A(i) is further partitioned column-wise: A(i,j) is an n/p by n/p sub-block
- C(i) = C(i) + Σj A(i,j)*B(j)
  (e.g., C(0) = C(0) + A(0,0)*B(0) + A(0,1)*B(1) + A(0,2)*B(2))

  | C(0) |   | A(0,0) A(0,1) A(0,2) |   | B(0) |
  | C(1) | = | A(1,0) A(1,1) A(1,2) | * | B(1) |
  | C(2) |   | A(2,0) A(2,1) A(2,2) |   | B(2) |
45 (figures A-C: broadcast algorithm for C = C + A*B on p0-p3. Broadcast: p(i) broadcasts B(i) at step i, and each processor p(j) accumulates A(j,i)*B(i) into C(j))
46 (figures A-C: shift algorithm for C = C + A*B on p0-p3. Processors shift the B(i) blocks around the ring step by step, multiplying by the matching A(i,j) at each step)
47 Parallel algorithms for a 2D mesh system. Communication: need to move both the A(i,j) and B(i,j) blocks
48 (figure, steps 0-3 on a 3x3 grid: in step k, A(i,k) is broadcast along row i and B(k,j) along column j, and each processor accumulates C(i,j) += A(i,k)*B(k,j); e.g. C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2))
49 (figure on a 3x3 grid. Initialization: A(i,j) shifts left i steps and B(i,j) shifts up j steps; then each step multiplies the co-resident blocks and shifts A left by one and B up by one, so that e.g. C(1,2) = A(1,0)*B(0,2) + A(1,1)*B(1,2) + A(1,2)*B(2,2))
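The align-then-multiply-and-shift scheme in this figure is commonly known as Cannon's algorithm. A minimal serial sketch with one scalar per grid cell (the real algorithm uses n/sqrt(p) by n/sqrt(p) blocks; the function name is illustrative):

```python
def cannon(A, B, q):
    """Cannon-style multiplication on a q x q grid, one scalar per cell.
    Align: row i of A rotates left i steps, column j of B rotates up
    j steps; then q rounds of local multiply-add with unit shifts."""
    a = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]    # local multiply-add
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # A left 1
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # B up 1
    return C
```

After the alignment, cell (i,j) holds A(i,k) and B(k,j) with k = (i+j) mod q, so each of the q rounds contributes one term of the dot product.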
50 - Sorting is one of the most commonly used and well-studied kernels
- Sorting can be comparison-based or noncomparison-based
- We focus here on comparison-based sorting algorithms
- The fundamental operation of comparison-based sorting is compare-exchange
- The lower bound on any comparison-based sort of n numbers is Θ(n log n)
51 Sequential algorithm: (figure: odd-even transposition sort pseudocode)
52 - Sequential algorithm:
  - After n phases of odd-even exchanges, the sequence is sorted
  - Each phase of the algorithm (either odd or even) requires Θ(n) comparisons
  - Serial complexity is Θ(n^2)
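The sequential odd-even transposition sort can be sketched directly from these bullets (function name is illustrative):

```python
def odd_even_sort(a):
    """Odd-even transposition sort: n phases, each doing Θ(n)
    compare-exchanges, for Θ(n^2) serial complexity."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = phase % 2                 # even phase: (0,1),(2,3),...
        for i in range(start, n - 1, 2):  # odd phase:  (1,2),(3,4),...
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```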
53 - If each processor has one element, the compare-exchange operation stores the smaller element at the processor with the smaller id
- If we have more than one element per processor, we call this operation a compare-split (or merge-split if the two partial lists were sorted)
- Assume each of two processors has n/p elements. After the compare-split operation, the smaller n/p elements are at processor Pi and the larger n/p elements at Pj, where i < j
54-55 (figures: Version 1 and Version 2 of the compare-split operation)
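A compare-split on two sorted blocks can be sketched as below; `sorted` stands in for the linear-time merge a real implementation would use (names are illustrative).

```python
def compare_split(block_i, block_j):
    """Compare-split (merge-split when both blocks are sorted): after
    the exchange, the lower-id processor keeps the smaller half and
    the higher-id processor the larger half."""
    merged = sorted(block_i + block_j)   # stands in for a linear merge
    half = len(block_i)
    return merged[:half], merged[half:]
```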
56 Parallel algorithm 1:
- Divide A (of n elements) into p blocks of equal size A(i), one block of n/p elements per processor
- The initial step is a local sort
- In each subsequent step, the compare-exchange operation is replaced by the merge-split operation
- After p steps the elements are sorted
57 - Parallel algorithm, per parallel step:
  - Merge-split ops on 2 blocks
  - Send/receive ops on 2 blocks
- Total cost ~ p*(2n/p) merge-split ops + p*(2n/p) comm ops = 2n merge-split + 2n comm
  (figure: blocks A(0)-A(3) on p0-p3 across the steps)
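The block version of odd-even transposition sketched above, simulated serially (names illustrative; the inner loop models the concurrent neighbor exchanges):

```python
def parallel_odd_even(a, p):
    """Block odd-even transposition: local sort on p blocks, then p
    phases of merge-split between neighboring blocks."""
    b = len(a) // p
    blocks = [sorted(a[i * b:(i + 1) * b]) for i in range(p)]  # local sort
    for phase in range(p):
        for i in range(phase % 2, p - 1, 2):     # neighboring block pairs
            merged = sorted(blocks[i] + blocks[i + 1])
            blocks[i], blocks[i + 1] = merged[:b], merged[b:]
    return [v for blk in blocks for v in blk]
```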
58 - Simple and quite efficient
- In p (or 2p) steps of merge-split the array is sorted
- The number of steps can be reduced by testing whether the array is already sorted, but in general it is still O(p)
- Merge-split operations happen only between neighbors
- Can we do merge-split operations between other processors?
59 - Let n be the number of elements to be sorted and p be the number of processors
- Two phases:
  - During the first phase, processors that are far away from each other in the array compare-split their elements
  - During the second phase, the algorithm switches to an odd-even transposition sort
60 An example of the first phase of parallel shellsort: each processor performs d = log p compare-split operations (figure)
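One plausible reading of this first phase can be sketched as follows; the pairing scheme (mirror processors within halving groups) is an assumption, since the slides only say that "far away" processors compare-split.

```python
def shellsort_phase1(a, p):
    """First phase of parallel shellsort (pairing scheme assumed):
    after a local sort, log2(p) rounds pair each processor with the
    mirror processor of its current group, and the groups halve each
    round. Returns the p blocks; the array is not yet fully sorted."""
    b = len(a) // p
    blocks = [sorted(a[i * b:(i + 1) * b]) for i in range(p)]
    group = p
    while group > 1:                    # d = log2(p) compare-split rounds
        for start in range(0, p, group):
            for k in range(group // 2):
                i, j = start + k, start + group - 1 - k
                merged = sorted(blocks[i] + blocks[j])
                blocks[i], blocks[j] = merged[:b], merged[b:]
        group //= 2
    return blocks
```

After this phase the data is close to its final position, so the odd-even phase that follows typically finishes in few steps.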
More informationParallel Longest Increasing Subsequences in Scalable Time and Memory
Parallel Longest Increasing Subsequences in Scalable Time and Memory Peter Krusche Alexander Tiskin Department of Computer Science University of Warwick, Coventry, CV4 7AL, UK PPAM 2009 What is in this
More informationTransforming Imperfectly Nested Loops
Transforming Imperfectly Nested Loops 1 Classes of loop transformations: Iteration re-numbering: (eg) loop interchange Example DO 10 J = 1,100 DO 10 I = 1,100 DO 10 I = 1,100 vs DO 10 J = 1,100 Y(I) =
More informationModule 27: Chained Matrix Multiplication and Bellman-Ford Shortest Path Algorithm
Module 27: Chained Matrix Multiplication and Bellman-Ford Shortest Path Algorithm This module 27 focuses on introducing dynamic programming design strategy and applying it to problems like chained matrix
More informationIntroduction to MatLab. Introduction to MatLab K. Craig 1
Introduction to MatLab Introduction to MatLab K. Craig 1 MatLab Introduction MatLab and the MatLab Environment Numerical Calculations Basic Plotting and Graphics Matrix Computations and Solving Equations
More informationLecture 5. Applications: N-body simulation, sorting, stencil methods
Lecture 5 Applications: N-body simulation, sorting, stencil methods Announcements Quiz #1 in section on 10/13 Midterm: evening of 10/30, 7:00 to 8:20 PM In Assignment 2, the following variation is suggested
More informationCHAPTER 5 Pipelined Computations
CHAPTER 5 Pipelined Computations In the pipeline technique, the problem is divided into a series of tasks that have to be completed one after the other. In fact, this is the basis of sequential programming.
More informationBasic Communication Ops
CS 575 Parallel Processing Lecture 5: Ch 4 (GGKK) Sanjay Rajopadhye Colorado State University Basic Communication Ops n PRAM, final thoughts n Quiz 3 n Collective Communication n Broadcast & Reduction
More informationMessage-Passing Computing Examples
Message-Passing Computing Examples Problems with a very large degree of parallelism: Image Transformations: Shifting, Rotation, Clipping etc. Mandelbrot Set: Sequential, static assignment, dynamic work
More informationAll-Pairs Shortest Paths - Floyd s Algorithm
All-Pairs Shortest Paths - Floyd s Algorithm Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 31, 2011 CPD (DEI / IST) Parallel
More informationParallel Programming. Matrix Decomposition Options (Matrix-Vector Product)
Parallel Programming Matrix Decomposition Options (Matrix-Vector Product) Matrix Decomposition Sequential algorithm and its complexity Design, analysis, and implementation of three parallel programs using
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More information211: Computer Architecture Summer 2016
211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University
More informationParallel Algorithms CSE /22/2015. Outline of this lecture: 1 Implementation of cilk for. 2. Parallel Matrix Multiplication
CSE 539 01/22/2015 Parallel Algorithms Lecture 3 Scribe: Angelina Lee Outline of this lecture: 1. Implementation of cilk for 2. Parallel Matrix Multiplication 1 Implementation of cilk for We mentioned
More informationHomework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization
ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor
More informationChapter 5. Divide and Conquer. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.
Chapter 5 Divide and Conquer Slides by Kevin Wayne. Copyright 25 Pearson-Addison Wesley. All rights reserved. Divide-and-Conquer Divide-and-conquer. Break up problem into several parts. Solve each part
More informationLecture 8. Dynamic Programming
Lecture 8. Dynamic Programming T. H. Cormen, C. E. Leiserson and R. L. Rivest Introduction to Algorithms, 3rd Edition, MIT Press, 2009 Sungkyunkwan University Hyunseung Choo choo@skku.edu Copyright 2000-2018
More informationParallel Algorithm Design. Parallel Algorithm Design p. 1
Parallel Algorithm Design Parallel Algorithm Design p. 1 Overview Chapter 3 from Michael J. Quinn, Parallel Programming in C with MPI and OpenMP Another resource: http://www.mcs.anl.gov/ itf/dbpp/text/node14.html
More informationIntroduction to Algorithms
Introduction to Algorithms Dynamic Programming Well known algorithm design techniques: Brute-Force (iterative) ti algorithms Divide-and-conquer algorithms Another strategy for designing algorithms is dynamic
More informationIntroduction to Parallel Computing Errata
Introduction to Parallel Computing Errata John C. Kirk 27 November, 2004 Overview Book: Introduction to Parallel Computing, Second Edition, first printing (hardback) ISBN: 0-201-64865-2 Official book website:
More informationPARALLEL PROCESSING UNIT 3. Dr. Ahmed Sallam
PARALLEL PROCESSING 1 UNIT 3 Dr. Ahmed Sallam FUNDAMENTAL GPU ALGORITHMS More Patterns Reduce Scan 2 OUTLINES Efficiency Measure Reduce primitive Reduce model Reduce Implementation and complexity analysis
More information