Chapter 8 Dense Matrix Algorithms
1 Chapter 8 Dense Matrix Algorithms (Selected slides & additional slides) A. Grama, A. Gupta, G. Karypis, and V. Kumar. To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003.
2 Topic Overview: Matrix-Vector Multiplication; Matrix-Matrix Multiplication; Solving a System of Linear Equations.
3 Matrix Algorithms: Introduction. Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data decomposition. Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.
4 Matrix-Vector Multiplication. Multiply a dense n x n matrix A with an n x 1 input vector x to yield an n x 1 result vector y. The serial algorithm requires n^2 multiplications and n^2 additions, so the serial work is W = n^2. (1)
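As a reference point, the serial algorithm can be sketched in a few lines of Python (an illustrative sketch, not code from the text):

```python
def mat_vect(A, x):
    """Serial matrix-vector product y = A x for an n x n matrix A.

    The doubly nested loop performs exactly n^2 multiply-add pairs,
    which is the serial work W = n^2 quoted above.
    """
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y
```

For example, `mat_vect([[1, 2], [3, 4]], [1, 1])` yields `[3.0, 7.0]`.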
5 Matrix-Vector Multiplication: Rowwise 1-D Partitioning. The n x n matrix A is partitioned among n processes; each process stores one row of the matrix. The n x 1 vector x is distributed such that each process owns one of its elements.
6 Matrix-Vector Multiplication: Rowwise 1-D Partitioning. Figure: multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning; the panels show (a) the initial partitioning of the matrix and the starting vector x, (b) the distribution of the full vector among all the processes by all-to-all broadcast, (c) the entire vector distributed to each process after the broadcast, and (d) the final distribution of the matrix and the result vector y. For the one-row-per-process case, p = n.
7 Matrix-Vector Multiplication: Rowwise 1-D Partitioning. Each process starts with only one element of x; thus an all-to-all broadcast is required to distribute all the elements of x to all the processes. Process i then computes y[i] = sum over j = 0, ..., n-1 of A[i, j] * x[j]. The all-to-all broadcast and the computation of y[i] both take Theta(n) time. Therefore, the overall parallel time is Theta(n).
8 Matrix-Vector Multiplication: Rowwise 1-D Partitioning. Consider now the case when p < n and we use block 1-D partitioning. Each process initially stores n/p rows of the matrix A and a portion of the vector x of size n/p. The all-to-all broadcast takes place among p processes and involves messages of size n/p. This is followed by n/p local dot products. Thus, the parallel run time of this procedure is T_P = n^2/p + t_s log p + t_w n. (2) This formulation is cost-optimal.
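The block 1-D scheme can be simulated serially (a sketch; the all-to-all broadcast is modeled by simply concatenating the local vector pieces):

```python
def matvec_rowwise_1d(A, x, p):
    """Simulate y = A x with rowwise block 1-D partitioning on p processes.

    Each process owns n/p consecutive rows of A and the matching n/p
    elements of x. After the (simulated) all-to-all broadcast gives every
    process the full vector, each process performs its n/p local dot
    products independently.
    """
    n = len(A)
    nb = n // p  # rows and vector elements per process
    # Local vector pieces held by the p processes before communication.
    x_parts = [x[r * nb:(r + 1) * nb] for r in range(p)]
    # All-to-all broadcast: every process assembles the full vector.
    x_full = [v for part in x_parts for v in part]
    # Each process computes its own n/p entries of y.
    y = []
    for r in range(p):
        for i in range(r * nb, (r + 1) * nb):
            y.append(sum(A[i][j] * x_full[j] for j in range(n)))
    return y
```

In a real implementation the broadcast step would be a collective such as an all-gather over the p processes.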
9 Matrix-Vector Multiplication: Rowwise 1-D Partitioning. Scalability analysis: we know that T_o = pT_P - W; therefore, we have T_o = t_s p log p + t_w np. (3) For isoefficiency, we have W = KT_o, where K = E/(1 - E) for the desired efficiency E. From this, we have W = O(p^2) (from the t_w term). There is also a bound on isoefficiency because of concurrency: in this case p <= n, therefore W = n^2 = Omega(p^2). The overall isoefficiency is thus O(p^2).
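The isoefficiency relation can be checked numerically. The sketch below uses the model above with illustrative (hypothetical) values of t_s and t_w, and shows that efficiency stays roughly fixed when W = n^2 grows in proportion to p^2, i.e. when n grows linearly with p:

```python
import math

def efficiency_rowwise(n, p, ts=2.0, tw=1.0):
    """Efficiency E = W / (W + T_o) of the rowwise 1-D formulation.

    Uses W = n^2 and T_o = t_s * p * log p + t_w * n * p from the text;
    ts and tw are illustrative machine parameters, not measured values.
    """
    To = ts * p * math.log2(p) + tw * n * p
    return n * n / (n * n + To)

# Doubling p while doubling n (so W grows as p^2) holds E steady.
print(round(efficiency_rowwise(1000, 10), 3))  # 0.99
print(round(efficiency_rowwise(2000, 20), 3))  # 0.99
```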
10 Matrix-Vector Multiplication: 2-D Partitioning. The n x n matrix A is partitioned among n^2 processes such that each process owns a single element. The n x 1 vector x is distributed only in the last column of n processes.
11 Matrix-Vector Multiplication: 2-D Partitioning. Figure: matrix-vector multiplication with block 2-D partitioning; the panels show (a) the initial data distribution and the communication steps to align the vector along the diagonal, (b) the one-to-all broadcast of portions of the vector along process columns, (c) the all-to-one reduction of partial results, and (d) the final distribution of the result vector. For the one-element-per-process case, p = n^2 if the matrix size is n x n.
12 Matrix-Vector Multiplication: 2-D Partitioning. We must first distribute the vector x appropriately. The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. The second communication step copies the vector elements from each diagonal process to all the processes in the corresponding column, using n simultaneous broadcasts, one per column. Finally, the result vector is computed by performing an all-to-one reduction along each row.
13 Matrix-Vector Multiplication: 2-D Partitioning. Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row. Both the broadcast and the reduction take Theta(log n) time, so the overall parallel time is Theta(log n). The cost (process-time product) is Theta(n^2 log n); hence, the algorithm is not cost-optimal.
14 Matrix-Vector Multiplication: 2-D Partitioning. When using fewer than n^2 processes, each process owns an (n/sqrt(p)) x (n/sqrt(p)) block of the matrix A. The vector x is distributed in portions of n/sqrt(p) elements in the last process column only. In this case, the message sizes for the alignment, broadcast, and reduction are all n/sqrt(p). The computation is a product of an (n/sqrt(p)) x (n/sqrt(p)) submatrix with a vector of length n/sqrt(p).
15 Matrix-Vector Multiplication: 2-D Partitioning. The initial alignment step takes time t_s + t_w n/sqrt(p). The broadcast and reduction steps each take time (t_s + t_w n/sqrt(p)) log(sqrt(p)). The local matrix-vector products take time t_c n^2/p. The total time is T_P ~ n^2/p + t_s log p + t_w (n/sqrt(p)) log p. (4)
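The 2-D block scheme can also be simulated serially. In this sketch (illustrative, not from the text), the vector segment that column j receives after the alignment-and-broadcast steps is simply x[j*nb:(j+1)*nb], and the all-to-one reduction along each process row becomes an in-place accumulation:

```python
def matvec_2d(A, x, q):
    """Simulate y = A x with block 2-D partitioning on a q x q grid.

    Process (i, j) owns an (n/q) x (n/q) block of A. After x is aligned
    on the diagonal and broadcast down each process column, process
    (i, j) multiplies its block by vector segment j; the partial results
    are then reduced along each process row (modeled here by summing
    into y).
    """
    n = len(A)
    nb = n // q
    y = [0.0] * n
    for i in range(q):          # process row
        for j in range(q):      # process column
            seg = x[j * nb:(j + 1) * nb]  # segment held after broadcast
            for r in range(nb):
                # Contribution of block (i, j), accumulated by the
                # row-wise all-to-one reduction.
                y[i * nb + r] += sum(A[i * nb + r][j * nb + c] * seg[c]
                                     for c in range(nb))
    return y
```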
16 Matrix-Vector Multiplication: 2-D Partitioning. Scalability analysis: T_o = pT_P - W = t_s p log p + t_w n sqrt(p) log p. Equating T_o with W, term by term, for isoefficiency, we have W = K^2 t_w^2 p log^2 p as the dominant term. The isoefficiency due to concurrency is O(p). The overall isoefficiency is O(p log^2 p) (due to the network bandwidth). For cost optimality, we have W = n^2 = p log^2 p. From this, we have p = O(n^2 / log^2 n).
17 Matrix-Matrix Multiplication. Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B. The serial complexity is O(n^3). A useful concept in this case is that of block operations: an n x n matrix A can be regarded as a q x q array of blocks A_{i,j} (0 <= i, j < q) such that each block is an (n/q) x (n/q) submatrix. In this view, we perform q^3 matrix multiplications, each involving (n/q) x (n/q) matrices.
18 Matrix-Matrix Multiplication. Consider two n x n matrices A and B partitioned into p blocks A_{i,j} and B_{i,j} (0 <= i, j < sqrt(p)) of size (n/sqrt(p)) x (n/sqrt(p)) each. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix. Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 <= k < sqrt(p). All-to-all broadcast the blocks of A along rows and the blocks of B along columns, then perform the local submatrix multiplications.
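The scheme above can be simulated serially (a sketch; the two all-to-all broadcasts are implicit in the fact that each (i, j) block loop is free to read every A[i][k] and B[k][j]):

```python
def matmul_broadcast(A, B, q):
    """Simulate the all-to-all broadcast matrix multiplication algorithm.

    A and B are viewed as q x q grids of (n/q) x (n/q) blocks. After the
    row broadcast of A-blocks and the column broadcast of B-blocks, the
    process holding block (i, j) has every A[i][k] and B[k][j] and
    accumulates C[i][j] = sum over k of A[i][k] * B[k][j] locally.
    """
    n = len(A)
    nb = n // q
    C = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # Blocks A[i][k] and B[k][j] arrive via the two broadcasts.
            for k in range(q):
                for r in range(nb):
                    for c in range(nb):
                        C[i * nb + r][j * nb + c] += sum(
                            A[i * nb + r][k * nb + t] * B[k * nb + t][j * nb + c]
                            for t in range(nb))
    return C
```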
19 Matrix-Matrix Multiplication. The two broadcasts take time 2(t_s log(sqrt(p)) + t_w (n^2/p)(sqrt(p) - 1)). The computation requires sqrt(p) multiplications of (n/sqrt(p)) x (n/sqrt(p)) sized submatrices. The parallel run time is approximately T_P = n^3/p + t_s log p + 2 t_w n^2/sqrt(p). (5) The algorithm is cost-optimal, and the isoefficiency is O(p^1.5) due to the bandwidth term t_w and to concurrency. Its major drawback is that it is not memory optimal: after the broadcasts each process holds sqrt(p) blocks of A and of B, so the total memory requirement grows as Theta(n^2 sqrt(p)) rather than Theta(n^2).
20 Matrix-Matrix Multiplication: Cannon's Algorithm. In this algorithm, we schedule the computations of the sqrt(p) processes of the ith row such that, at any given time, each process is using a different block A_{i,k}. These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh A_{i,k} after each rotation.
21 Matrix-Matrix Multiplication: Cannon's Algorithm. Figure: the communication steps in Cannon's algorithm on 16 processes; the panels show (a) the initial alignment of A, (b) the initial alignment of B, (c) A and B after the initial alignment, and (d)-(f) the submatrix locations after the first, second, and third shifts.
22 Matrix-Matrix Multiplication: Cannon's Algorithm. Align the blocks of A and B in such a way that each process can multiply its local submatrices. This is done by shifting all submatrices A_{i,j} to the left (with wraparound) by i steps and all submatrices B_{i,j} up (with wraparound) by j steps. Perform a local block multiplication. Then each block of A moves one step left and each block of B moves one step up (again with wraparound). Perform the next block multiplication, add to the partial result, and repeat until all sqrt(p) blocks have been multiplied.
23 Matrix-Matrix Multiplication: Cannon's Algorithm. In the alignment step, since the maximum distance over which a block shifts is sqrt(p) - 1, the two shift operations require a total of 2(t_s + t_w n^2/p) time. Each of the sqrt(p) single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w n^2/p time. The computation time for multiplying sqrt(p) matrices of size (n/sqrt(p)) x (n/sqrt(p)) is n^3/p. The parallel time is approximately T_P = n^3/p + 2 sqrt(p) t_s + 2 t_w n^2/sqrt(p). (6) The cost-optimality and isoefficiency of the algorithm are identical to those of the first algorithm, except that Cannon's algorithm is memory optimal.
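The alignment and compute-and-shift phases can be simulated serially; the sketch below (illustrative, pure Python) stores one block per virtual process and rotates the block arrays exactly as the text describes:

```python
def cannon_matmul(A, B, q):
    """Simulate Cannon's algorithm for C = A B on a q x q process grid.

    Blocks A_{i,j} are pre-shifted left by i and blocks B_{i,j} up by j;
    then q multiply-accumulate steps follow, each ending with a
    single-step left shift of the A-blocks and up shift of the B-blocks.
    """
    n = len(A)
    nb = n // q

    def block(M, i, j):
        """Extract block (i, j) of M as an nb x nb list of lists."""
        return [[M[i * nb + r][j * nb + c] for c in range(nb)] for r in range(nb)]

    # Initial alignment: process (i, j) holds A[i][(j+i) mod q] and
    # B[(i+j) mod q][j], so both carry the same inner index k.
    Ablk = [[block(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bblk = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    C = [[0.0] * n for _ in range(n)]
    for _ in range(q):
        # Local block multiply-accumulate on every process.
        for i in range(q):
            for j in range(q):
                a, b = Ablk[i][j], Bblk[i][j]
                for r in range(nb):
                    for c in range(nb):
                        C[i * nb + r][j * nb + c] += sum(
                            a[r][t] * b[t][c] for t in range(nb))
        # Single-step rotations: A moves left, B moves up (wraparound).
        Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```

Note that at every step each process holds exactly one block of A and one of B, which is why the algorithm is memory optimal.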
24 Matrix-Matrix Multiplication: DNS Algorithm. Uses a 3-D partitioning. Visualize the matrix multiplication algorithm as a cube: matrices A and B come in along two orthogonal faces, and the result C comes out along the third orthogonal face. Each internal node in the cube represents a single add-multiply operation (the number of such nodes accounts for the Theta(n^3) complexity). The DNS algorithm partitions this cube using a 3-D block scheme.
25 Matrix-Matrix Multiplication: DNS Algorithm. Assume an n x n x n mesh of processes. Move the columns of A and the rows of B into position and perform the broadcasts. Each process computes a single add-multiply. This is followed by an accumulation along the C dimension. Since each add-multiply takes constant time and the accumulation and broadcast each take Theta(log n) time, the total runtime is Theta(log n). This is not cost-optimal. It can be made cost-optimal by using n/log n processes along the direction of accumulation.
26 Matrix-Matrix Multiplication: DNS Algorithm. Figure: the communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes; the panels show (a) the initial distribution of A and B, (b) the state after moving A[i,j] from process (i, j, 0) to process (i, j, j), (c) the state after broadcasting A[i,j] along the j axis, and (d) the corresponding distribution of B, so that C[i,j] is formed by accumulating the products A[i,k] x B[k,j] along the k axis.
27 Matrix-Matrix Multiplication: DNS Algorithm. Using fewer than n^3 processes: assume that the number of processes p is equal to q^3 for some q < n. The two matrices are partitioned into blocks of size (n/q) x (n/q), so each matrix can be regarded as a q x q two-dimensional square array of blocks. The algorithm follows from the previous one, except that, in this case, we operate on blocks rather than on individual elements.
28 Matrix-Matrix Multiplication: DNS Algorithm. Using fewer than n^3 processes: the first one-to-one communication step is performed for both A and B, and takes t_s + t_w (n/q)^2 time for each matrix. The two one-to-all broadcasts take 2(t_s log q + t_w (n/q)^2 log q) time. The reduction takes time t_s log q + t_w (n/q)^2 log q. Multiplication of (n/q) x (n/q) submatrices takes (n/q)^3 time. The parallel time is approximated by T_P = n^3/p + t_s log p + t_w (n^2 / p^(2/3)) log p. (7) The isoefficiency function is Theta(p (log p)^3).
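The 3-D data flow of DNS can be written down directly as a serial sketch. This is only the data-flow view (an assumption-free simulation, not a distributed implementation): in the real algorithm the k loop is what gets spread across the third mesh dimension and then reduced.

```python
def dns_matmul(A, B):
    """Data-flow sketch of the DNS 3-D view (serial simulation).

    Conceptually, virtual process (i, j, k) holds A[i][k] and B[k][j]
    after the alignment and broadcast steps and computes the single
    product A[i][k] * B[k][j]; C[i][j] is then obtained by an all-to-one
    reduction along the k dimension. Here both phases collapse into the
    inner sum over k.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Products of all cells (i, j, k), reduced along k.
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))
    return C
```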
29 Solving a System of Linear Equations. Consider the problem of solving linear equations of the kind:
a_{0,0} x_0 + a_{0,1} x_1 + ... + a_{0,n-1} x_{n-1} = b_0,
a_{1,0} x_0 + a_{1,1} x_1 + ... + a_{1,n-1} x_{n-1} = b_1,
...
a_{n-1,0} x_0 + a_{n-1,1} x_1 + ... + a_{n-1,n-1} x_{n-1} = b_{n-1}.
This is written as Ax = b, where A is an n x n matrix with A[i, j] = a_{i,j}, b is an n x 1 vector [b_0, b_1, ..., b_{n-1}]^T, and x is the desired solution vector.
30 Solving a System of Linear Equations. The two steps in the solution are: reduction to triangular form, and back-substitution. The triangular form is:
x_0 + u_{0,1} x_1 + u_{0,2} x_2 + ... + u_{0,n-1} x_{n-1} = y_0,
x_1 + u_{1,2} x_2 + ... + u_{1,n-1} x_{n-1} = y_1,
...
x_{n-1} = y_{n-1}.
We write this as Ux = y. A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.
31 Gaussian Elimination.
1. procedure GAUSSIAN_ELIMINATION (A, b, y)
2. begin
3.   for k := 0 to n - 1 do /* Outer loop */
4.   begin
5.     for j := k + 1 to n - 1 do
6.       A[k, j] := A[k, j] / A[k, k]; /* Division step */
7.     y[k] := b[k] / A[k, k];
8.     A[k, k] := 1;
9.     for i := k + 1 to n - 1 do
10.    begin
11.      for j := k + 1 to n - 1 do
12.        A[i, j] := A[i, j] - A[i, k] x A[k, j]; /* Elimination step */
13.      b[i] := b[i] - A[i, k] x y[k];
14.      A[i, k] := 0;
15.    endfor; /* Line 9 */
16.  endfor; /* Line 3 */
17. end GAUSSIAN_ELIMINATION
Serial Gaussian elimination.
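The pseudocode above translates directly into runnable Python (a sketch that, like the slides, does no pivoting, so every A[k][k] must be nonzero when it is used as the divisor):

```python
def gaussian_eliminate(A, b):
    """Runnable version of the serial procedure above.

    Reduces A in place to unit upper-triangular form U and returns the
    transformed right-hand side y, so that U x = y; b is also modified
    in place.
    """
    n = len(A)
    y = [0.0] * n
    for k in range(n):
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]            # division step
        y[k] = b[k] / A[k][k]
        A[k][k] = 1.0
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # elimination step
            b[i] -= A[i][k] * y[k]
            A[i][k] = 0.0
    return y
```

For the system 2x_0 + x_1 = 3, x_0 + 3x_1 = 5, this produces U = [[1, 0.5], [0, 1]] and y = [1.5, 1.4], and back-substitution then yields x = [0.8, 1.4].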
32 Gaussian Elimination. The computation has three nested loops: in the kth iteration of the outer loop, the algorithm performs about (n - k)^2 computations. Summing over k = 1, ..., n, we get roughly n^3/3 multiplications and subtractions. Figure: a typical computation in Gaussian elimination. Row k is the pivot row, where the division step A[k, j] := A[k, j] / A[k, k] is applied for columns j > k; rows i > k and columns j > k form the active part, where the elimination step A[i, j] := A[i, j] - A[i, k] x A[k, j] is applied; the rows above k are inactive.
33 Parallel Gaussian Elimination. Assume p = n, with each row assigned to a process. The first step of the algorithm normalizes the row. This is a serial operation and takes time (n - k - 1) in the kth iteration. In the second step, the normalized row is broadcast to all the processes. This takes time (t_s + t_w (n - k - 1)) log n. Each process can then independently eliminate this row from its own row. This requires (n - k - 1) multiplications and subtractions. The total parallel time can be computed by summing over k = 0, ..., n - 1 as T_P = (3/2) n (n - 1) + t_s n log n + (1/2) t_w n (n - 1) log n. (8) The formulation is not cost-optimal because of the t_w term.
34 Parallel Gaussian Elimination. Figure: Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes; the panels show (a) the computation A[k, j] := A[k, j] / A[k, k] for k < j < n, with A[k, k] := 1; (b) the one-to-all broadcast of row A[k, *]; and (c) the computation A[i, j] := A[i, j] - A[i, k] x A[k, j] for k < i < n and k < j < n, with A[i, k] := 0 for k < i < n.
35 Parallel Gaussian Elimination: Pipelined Execution. In the previous formulation, the (k + 1)st iteration starts only after all the computation and communication for the kth iteration is complete. In the pipelined version, there are three steps: normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion. A process P_k waits to receive and eliminate all rows prior to k. Once it has done this, it forwards its own row to process P_{k+1}.
36 Parallel Gaussian Elimination: Pipelined Execution. Figure: pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process; the sixteen panels (a)-(p) trace how the communication and computation of successive iterations k = 0, 1, 2, 3, 4 start, overlap, and end across the processes.
37 Parallel Gaussian Elimination: Pipelined Execution. The total number of steps in the entire pipelined procedure is Theta(n). In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. The parallel time is therefore O(n^2). This is cost-optimal.
38 Parallel Gaussian Elimination: Block 1-D with p < n. The above algorithm can be easily adapted to the case when p < n. In the kth iteration, a process whose rows all belong to the active part of the matrix performs (n - k - 1) n/p multiplications and subtractions. In the pipelined version, for n > p, computation dominates communication. The parallel time is given by 2(n/p) times the sum over k = 0, ..., n - 1 of (n - k - 1), or approximately n^3/p. The algorithm is cost-optimal in the asymptotic sense, but the cost of the parallel algorithm is higher than the sequential run time by a constant factor of 3/2.
39 Parallel Gaussian Elimination: Block 1-D with p < n. Figure: computation load on the different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3; panel (a) shows the block 1-D mapping, in which only the processes holding the lower rows remain busy, and panel (b) shows the cyclic 1-D mapping, in which the active rows are spread over all four processes.
40 Parallel Gaussian Elimination: Cyclic 1-D Mapping. The load-imbalance problem can be alleviated by using a cyclic mapping. In this case, other than during the processing of the last p rows, there is no load imbalance.
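The difference between the two mappings is easy to demonstrate (a small illustrative sketch matching the 8 x 8, p = 4, k = 3 figure above):

```python
def owner_block(i, n, p):
    """Process owning row i under block 1-D mapping (n rows, p processes)."""
    return i // (n // p)

def owner_cyclic(i, p):
    """Process owning row i under cyclic 1-D mapping."""
    return i % p

# During iteration k, only rows i > k remain active. With n = 8, p = 4,
# k = 3, the block mapping leaves processes 0 and 1 idle, while the
# cyclic mapping keeps every process working on exactly one active row.
n, p, k = 8, 4, 3
active = range(k + 1, n)
print(sorted({owner_block(i, n, p) for i in active}))   # [2, 3]
print(sorted({owner_cyclic(i, p) for i in active}))     # [0, 1, 2, 3]
```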
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. Fall 2017 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More informationLinear systems of equations
Linear systems of equations Michael Quinn Parallel Programming with MPI and OpenMP material do autor Terminology Back substitution Gaussian elimination Outline Problem System of linear equations Solve
More informationLecture 5: Matrices. Dheeraj Kumar Singh 07CS1004 Teacher: Prof. Niloy Ganguly Department of Computer Science and Engineering IIT Kharagpur
Lecture 5: Matrices Dheeraj Kumar Singh 07CS1004 Teacher: Prof. Niloy Ganguly Department of Computer Science and Engineering IIT Kharagpur 29 th July, 2008 Types of Matrices Matrix Addition and Multiplication
More informationIntroduction to Parallel & Distributed Computing Parallel Graph Algorithms
Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental
More informationBasic Communication Operations (Chapter 4)
Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:
More informationCSC630/CSC730 Parallel & Distributed Computing
CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2
More informationProject Report. 1 Abstract. 2 Algorithms. 2.1 Gaussian elimination without partial pivoting. 2.2 Gaussian elimination with partial pivoting
Project Report Bernardo A. Gonzalez Torres beaugonz@ucsc.edu Abstract The final term project consist of two parts: a Fortran implementation of a linear algebra solver and a Python implementation of a run
More informationMATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix.
MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. Row echelon form A matrix is said to be in the row echelon form if the leading entries shift to the
More informationLecture 4: Graph Algorithms
Lecture 4: Graph Algorithms Definitions Undirected graph: G =(V, E) V finite set of vertices, E finite set of edges any edge e = (u,v) is an unordered pair Directed graph: edges are ordered pairs If e
More informationParallel Systems Course: Chapter VIII. Sorting Algorithms. Kumar Chapter 9. Jan Lemeire ETRO Dept. November Parallel Sorting
Parallel Systems Course: Chapter VIII Sorting Algorithms Kumar Chapter 9 Jan Lemeire ETRO Dept. November 2014 Overview 1. Parallel sort distributed memory 2. Parallel sort shared memory 3. Sorting Networks
More informationMatrices. D. P. Koester, S. Ranka, and G. C. Fox. The Northeast Parallel Architectures Center (NPAC) Syracuse University
Parallel LU Factorization of Block-Diagonal-Bordered Sparse Matrices D. P. Koester, S. Ranka, and G. C. Fox School of Computer and Information Science and The Northeast Parallel Architectures Center (NPAC)
More informationAim. Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity. Structure and matrix sparsity: Overview
Aim Structure and matrix sparsity: Part 1 The simplex method: Exploiting sparsity Julian Hall School of Mathematics University of Edinburgh jajhall@ed.ac.uk What should a 2-hour PhD lecture on structure
More informationA New Parallel Matrix Multiplication Algorithm on Tree-Hypercube Network using IMAN1 Supercomputer
A New Parallel Matrix Multiplication Algorithm on Tree-Hypercube Network using IMAN1 Supercomputer Orieb AbuAlghanam, Mohammad Qatawneh Computer Science Department University of Jordan Hussein A. al Ofeishat
More informationHIGH PERFORMANCE NUMERICAL LINEAR ALGEBRA. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA
1 HIGH PERFORMANCE NUMERICAL LINEAR ALGEBRA Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 BLAS BLAS 1, 2, 3 Performance GEMM Optimized BLAS Parallel
More informationBlocking SEND/RECEIVE
Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F
More information1 2 (3 + x 3) x 2 = 1 3 (3 + x 1 2x 3 ) 1. 3 ( 1 x 2) (3 + x(0) 3 ) = 1 2 (3 + 0) = 3. 2 (3 + x(0) 1 2x (0) ( ) = 1 ( 1 x(0) 2 ) = 1 3 ) = 1 3
6 Iterative Solvers Lab Objective: Many real-world problems of the form Ax = b have tens of thousands of parameters Solving such systems with Gaussian elimination or matrix factorizations could require
More informationCSCE 5160 Parallel Processing. CSCE 5160 Parallel Processing
HW #9 10., 10.3, 10.7 Due April 17 { } Review Completing Graph Algorithms Maximal Independent Set Johnson s shortest path algorithm using adjacency lists Q= V; for all v in Q l[v] = infinity; l[s] = 0;
More informationDense LU Factorization
Dense LU Factorization Dr.N.Sairam & Dr.R.Seethalakshmi School of Computing, SASTRA Univeristy, Thanjavur-613401. Joint Initiative of IITs and IISc Funded by MHRD Page 1 of 6 Contents 1. Dense LU Factorization...
More informationSequential and Parallel Algorithms for Cholesky Factorization of Sparse Matrices
Sequential and Parallel Algorithms for Cholesky Factorization of Sparse Matrices Nerma Baščelija Sarajevo School of Science and Technology Department of Computer Science Hrasnicka Cesta 3a, 71000 Sarajevo
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationSummary. A simple model for point-to-point messages. Small message broadcasts in the α-β model. Messaging in the LogP model.
Summary Design of Parallel and High-Performance Computing: Distributed-Memory Models and lgorithms Edgar Solomonik ETH Zürich December 9, 2014 Lecture overview Review: α-β communication cost model LogP
More informationCHAPTER 5 Pipelined Computations
CHAPTER 5 Pipelined Computations In the pipeline technique, the problem is divided into a series of tasks that have to be completed one after the other. In fact, this is the basis of sequential programming.
More informationSection 3.1 Gaussian Elimination Method (GEM) Key terms
Section 3.1 Gaussian Elimination Method (GEM) Key terms Rectangular systems Consistent system & Inconsistent systems Rank Types of solution sets RREF Upper triangular form & back substitution Nonsingular
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing George Karypis Sorting Outline Background Sorting Networks Quicksort Bucket-Sort & Sample-Sort Background Input Specification Each processor has n/p elements A ordering
More informationCS 770G - Parallel Algorithms in Scientific Computing
CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,
More informationCSE Introduction to Parallel Processing. Chapter 5. PRAM and Basic Algorithms
Dr Izadi CSE-40533 Introduction to Parallel Processing Chapter 5 PRAM and Basic Algorithms Define PRAM and its various submodels Show PRAM to be a natural extension of the sequential computer (RAM) Develop
More informationCME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh HW#3 Due at the beginning of class Thursday 03/02/17
CME 305: Discrete Mathematics and Algorithms Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 Due at the beginning of class Thursday 03/02/17 1. Consider a model of a nonbipartite undirected graph in which
More informationHypercubes. (Chapter Nine)
Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 7: parallel algorithms for QR factorization
CS 598: Communication Cost Analysis of Algorithms Lecture 7: parallel algorithms for QR factorization Edgar Solomonik University of Illinois at Urbana-Champaign September 14, 2016 Parallel Householder
More information1. (a) O(log n) algorithm for finding the logical AND of n bits with n processors
1. (a) O(log n) algorithm for finding the logical AND of n bits with n processors on an EREW PRAM: See solution for the next problem. Omit the step where each processor sequentially computes the AND of
More informationAddition/Subtraction flops. ... k k + 1, n (n k)(n k) (n k)(n + 1 k) n 1 n, n (1)(1) (1)(2)
1 CHAPTER 10 101 The flop counts for LU decomposition can be determined in a similar fashion as was done for Gauss elimination The major difference is that the elimination is only implemented for the left-hand
More informationPAPER Design of Optimal Array Processors for Two-Step Division-Free Gaussian Elimination
1503 PAPER Design of Optimal Array Processors for Two-Step Division-Free Gaussian Elimination Shietung PENG and Stanislav G. SEDUKHIN Nonmembers SUMMARY The design of array processors for solving linear
More informationCS 140 : Numerical Examples on Shared Memory with Cilk++
CS 140 : Numerical Examples on Shared Memory with Cilk++ Matrix-matrix multiplication Matrix-vector multiplication Hyperobjects Thanks to Charles E. Leiserson for some of these slides 1 Work and Span (Recap)
More informationMatrix Multiplication on an Experimental Parallel System With Hybrid Architecture
Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture SOTIRIOS G. ZIAVRAS and CONSTANTINE N. MANIKOPOULOS Department of Electrical and Computer Engineering New Jersey Institute
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationLecture 57 Dynamic Programming. (Refer Slide Time: 00:31)
Programming, Data Structures and Algorithms Prof. N.S. Narayanaswamy Department of Computer Science and Engineering Indian Institution Technology, Madras Lecture 57 Dynamic Programming (Refer Slide Time:
More informationParallel Reduction from Block Hessenberg to Hessenberg using MPI
Parallel Reduction from Block Hessenberg to Hessenberg using MPI Viktor Jonsson May 24, 2013 Master s Thesis in Computing Science, 30 credits Supervisor at CS-UmU: Lars Karlsson Examiner: Fredrik Georgsson
More informationSparse matrices, graphs, and tree elimination
Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More information0_PreCNotes17 18.notebook May 16, Chapter 12
Chapter 12 Notes BASIC MATRIX OPERATIONS Matrix (plural: Matrices) an n x m array of elements element a ij Example 1 a 21 = a 13 = Multiply Matrix by a Scalar Distribute scalar to all elements Addition
More informationCSC630/CSC730 Parallel & Distributed Computing
CSC630/CSC730 Parallel & Distributed Computing Parallel Sorting Chapter 9 1 Contents General issues Sorting network Bitonic sort Bubble sort and its variants Odd-even transposition Quicksort Other Sorting
More informationMessage-Passing Computing Examples
Message-Passing Computing Examples Problems with a very large degree of parallelism: Image Transformations: Shifting, Rotation, Clipping etc. Mandelbrot Set: Sequential, static assignment, dynamic work
More informationSoftware Pipelining 1/15
Software Pipelining 1/15 Pipelining versus Parallel Suppose it takes 3 units of time to assemble a widget. Furthermore, suppose the assembly consists of three steps A, B, and C and each step requires exactly
More informationQuaternion Rotations AUI Course Denbigh Starkey
Major points of these notes: Quaternion Rotations AUI Course Denbigh Starkey. What I will and won t be doing. Definition of a quaternion and notation 3 3. Using quaternions to rotate any point around an
More informationOverview of Parallel Programming
Overview of arallel rogramming Architectural Issues in arallel rocessing Convex Exemplar Architecture: Cache CU Memory Cache CU Memory rivate Memory : Hypernode Agent Agent Hypernode Interconnect Global
More informationFail-Stop Failure ABFT for Cholesky Decomposition
Fail-Stop Failure ABFT for Cholesky Decomposition Doug Hakkarinen, Student Member, IEEE, anruo Wu, Student Member, IEEE, and Zizhong Chen, Senior Member, IEEE Abstract Modeling and analysis of large scale
More information1.5D PARALLEL SPARSE MATRIX-VECTOR MULTIPLY
.D PARALLEL SPARSE MATRIX-VECTOR MULTIPLY ENVER KAYAASLAN, BORA UÇAR, AND CEVDET AYKANAT Abstract. There are three common parallel sparse matrix-vector multiply algorithms: D row-parallel, D column-parallel
More informationScalability of Heterogeneous Computing
Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor
More informationn = 1 What problems are interesting when n is just 1?
What if n=1??? n = 1 What problems are interesting when n is just 1? Sorting? No Median finding? No Addition? How long does it take to add one pair of numbers? Multiplication? How long does it take to
More information