CS 770G - Parallel Algorithms in Scientific Computing
1 CS 770G - Parallel Algorithms in Scientific Computing
Dense Matrix Computation II: Solving Linear Systems
Lecture 6, May 28, 2001
2 References
Introduction to Parallel Computing, Kumar, Grama, Gupta, Karypis, Benjamin Cummings.
Numerical Linear Algebra for High-Performance Computers, Dongarra, Duff, Sorensen, van der Vorst, SIAM.
A portion of the notes comes from Prof. J. Demmel's CS267 course at UC Berkeley.
3 Review of Gaussian Elimination (GE) for Solving A*x = b
Add multiples of each row to later rows to make A upper triangular.
Solve the resulting triangular system U*x = c by substitution.
... for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    ... for each row j below row i, add a multiple of row i to row j:
    for j = i+1 to n
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
4 Refine GE Algorithm (1)
Initial version:
for i = 1 to n-1
    for j = i+1 to n
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Remove computation of the constant A(j,i)/A(i,i) from the inner loop:
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)
5 Refine GE Algorithm (2)
Last version:
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i to n
            A(j,k) = A(j,k) - m * A(i,k)
Don't compute what we already know: zeros below the diagonal in column i.
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)
6 Refine GE Algorithm (3)
Last version:
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)
Store the multipliers m below the diagonal in the zeroed entries for later use:
for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
7 Refine GE Algorithm (4)
Last version:
for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
Express using matrix operations (BLAS):
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
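The final BLAS form translates almost line for line into NumPy. A minimal sketch of the unpivoted kernel, with function and variable names of my own choosing:

    import numpy as np

    def lu_in_place(A):
        # Right-looking LU without pivoting, mirroring the slide's loop:
        # multipliers are stored below the diagonal, U on and above it.
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n - 1):
            A[i+1:, i] /= A[i, i]                               # scale column i of L
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
        return A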
8 What GE Really Computes
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
Call the strictly lower triangular matrix of multipliers M, and let L = I + M. Call the upper triangle of the final matrix U.
Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero), then A = L*U.
Solving A*x = b using GE:
    Factorize A = L*U using GE (cost 2/3 n^3 flops)
    Solve L*y = b for y, using substitution (cost n^2 flops)
    Solve U*x = y for x, using substitution (cost n^2 flops)
Thus A*x = (L*U)*x = L*(U*x) = L*y = b, as desired.
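As an illustration of the three-step solve, here is how the factors produced by the lu_in_place sketch above could be combined with SciPy's triangular solvers (again unpivoted, so this assumes a matrix for which GE without pivoting is safe):

    import numpy as np
    from scipy.linalg import solve_triangular

    def solve_with_lu(A, b):
        F = lu_in_place(A)                        # packed factors: M below, U above
        n = F.shape[0]
        L = np.tril(F, -1) + np.eye(n)            # L = I + M
        U = np.triu(F)
        y = solve_triangular(L, b, lower=True)    # L*y = b   (n^2 flops)
        x = solve_triangular(U, y, lower=False)   # U*x = y   (n^2 flops)
        return x                                  # A*x = L*(U*x) = L*y = b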
9 Problems with the Basic GE Algorithm
What if some A(i,i) is zero? Or very small?
The result may not exist, or may be unstable, so we need to pivot.
The current computation is all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest:
for i = 1 to n-1
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)                             ... BLAS 1 (scale a vector)
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)    ... BLAS 2 (rank-1 update)
(Figure: performance bars, from fastest to slowest: Peak, BLAS 3, BLAS 2, BLAS 1.)
10 Pivoting in Gaussian Elimination
A = [ 0 1 ]  fails completely, even though A is easy.
    [ 1 0 ]
Illustrate the problems in 3-decimal-digit arithmetic:
A = [ 1e-4 1 ]  and  b = [ 1 ];  the correct answer to 3 places is x = [ 1 ].
    [ 1    1 ]           [ 2 ]                                        [ 1 ]
The result of LU decomposition is:
L = [ 1           0 ] = [ 1   0 ]          ... no roundoff error yet
    [ fl(1/1e-4)  1 ]   [ 1e4 1 ]
U = [ 1e-4  1           ] = [ 1e-4  1    ] ... error in the 4th decimal place
    [ 0     fl(1-1e4*1) ]   [ 0     -1e4 ]
Check: L*U = [ 1e-4 1 ]                    ... the (2,2) entry is entirely wrong
             [ 1    0 ]
The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5.
Numerical instability: the computed solution x is totally inaccurate.
Cure: Pivot (swap rows of A) so that the entries of L and U are bounded.
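The same failure can be reproduced in IEEE double precision by pushing the pivot below machine epsilon; the values below are my own, chosen to mimic the slide's 3-digit example:

    import numpy as np

    A = np.array([[1e-20, 1.0],
                  [1.0,   1.0]])     # true solution of A*x = [1, 2] is x ~ [1, 1]
    b = np.array([1.0, 2.0])

    m = A[1, 0] / A[0, 0]            # multiplier 1e20
    u22 = A[1, 1] - m * A[0, 1]      # fl(1 - 1e20) = -1e20: the 1 is "forgotten"
    y2 = b[1] - m * b[0]             # forward substitution: fl(2 - 1e20) = -1e20
    x2 = y2 / u22                    # = 1.0 (fine)
    x1 = (b[0] - x2) / A[0, 0]       # = 0.0, completely wrong
    print(x1, x2)                    # unpivoted GE gives x = [0, 1]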
11 Gaussian Elimination with Partial Pivoting (GEPP)
Partial Pivoting: swap rows so that each multiplier satisfies |L(j,i)| = |A(j,i)/A(i,i)| <= 1.
for i = 1 to n-1
    find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
        ... i.e. the largest entry in the rest of column i
    if A(k,i) = 0
        exit with a warning that A is singular, or nearly so
    elseif k != i
        swap rows i and k of A
    end if
    A(i+1:n,i) = A(i+1:n,i) / A(i,i)       ... each quotient lies in [-1,1]
    A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix.
Since each entry |L(j,i)| <= 1, this algorithm is considered numerically stable.
For details see the LAPACK code and Dongarra's book.
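A compact NumPy sketch of GEPP as stated on the slide; the permutation is returned as an index vector rather than an explicit matrix P, and all names are mine:

    import numpy as np

    def gepp(A):
        A = A.astype(float).copy()
        n = A.shape[0]
        p = np.arange(n)                         # row permutation: P*A = L*U
        for i in range(n - 1):
            k = i + np.argmax(np.abs(A[i:, i]))  # largest entry in rest of column i
            if A[k, i] == 0.0:
                raise ValueError("matrix is singular, or nearly so")
            if k != i:
                A[[i, k]] = A[[k, i]]            # swap rows i and k
                p[[i, k]] = p[[k, i]]
            A[i+1:, i] /= A[i, i]                # each quotient lies in [-1, 1]
            A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
        return p, A                              # packed L (below diag) and U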
12 Summary
You have seen an example of how a typical (and important) matrix operation can be reduced to calls to lower-level BLAS routines that have been optimized for a given hardware platform.
Any other matrix operation you may think of (matrix-matrix multiply, Cholesky factorization, QR, Householder method, etc.) can be constructed from BLAS subprograms in a similar fashion, and in fact has been, in the package called LAPACK.
Note that only Level 1 and 2 BLAS routines have been used in the LU decomposition so far.
Efficiency considerations?
13 Overview of LAPACK
Standard library for dense/banded linear algebra:
    Linear systems: A*x = b
    Least squares problems: min over x of ||A*x - b||_2
    Eigenvalue problems: A*x = λ*x, A*x = λ*B*x
    Singular value decomposition (SVD): A = U*Σ*V^T
Algorithms reorganized to use BLAS 3 as much as possible.
Basis of the math libraries on many computers.
Many algorithmic innovations remain.
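LAPACK's GEPP routines are directly callable from SciPy, for example; a small usage sketch (sizes and seed are arbitrary):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 1000))
    b = rng.standard_normal(1000)

    lu, piv = lu_factor(A)             # wraps LAPACK dgetrf: blocked GEPP, A = P*L*U
    x = lu_solve((lu, piv), b)         # wraps LAPACK dgetrs: the two triangular solves
    print(np.linalg.norm(A @ x - b))   # small residual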
14 Performance of LAPACK (n = 1000) (figure)
15 Performance of LAPACK (n = 100) (figure)
16 Summary, Cont'd
We need to devise algorithms that can make use of Level 3 BLAS (matrix-matrix) routines, for several reasons:
Level 3 routines are known to run much more efficiently, due to a larger ratio of computation to memory references/communication.
Parallel algorithms on distributed-memory machines will require that we decompose the original matrix into blocks which reside on each processor (similar to HW1).
Parallel algorithms will require that we minimize the surface-to-volume ratio of our decompositions, and blocking becomes the natural approach.
17 Converting BLAS 2 to BLAS 3 in GEPP
Blocking
    Used to optimize matrix multiplication.
    Harder here because of the data dependencies in GEPP.
Delayed Updates
    Save up updates to the trailing matrix from several consecutive BLAS 2 updates.
    Apply many saved updates simultaneously in one BLAS 3 operation.
The same idea works for much of dense linear algebra; open questions remain.
Need to choose a block size b (k in Dongarra's book):
    The algorithm will save and apply b updates.
    b must be small enough so that the active submatrix consisting of b columns of A fits in cache.
    b must be large enough to make BLAS 3 fast.
18 Blocked Algorithms - LU Factorization (figure)
19 Blocked Algorithms - LU Factorization
With these relationships, we can develop different algorithms by choosing the order in which operations are performed.
The block size b needs to be chosen carefully: b = 1 produces the usual algorithm, while b > 1 will improve performance on a single processor.
Three natural variants: Left-Looking, Right-Looking, Crout.
20 Blocked Algorithms - LU Factorization
Assume you have already done the first row and column of the GEPP, and you have the sub-block below left to work on.
Notice that the LU decomposition of this sub-block is independent of the portion you have already completed.
21 Blocked Algorithms - LU Factorization
For simplicity, change notation and write the (pivoted) sub-block as
P*A = [ A11 A12 ] = [ L11 0   ] * [ U11 U12 ]
      [ A21 A22 ]   [ L21 L22 ]   [ 0   U22 ]
Notice that, once you have done Gaussian elimination on A11 and A21 (the first block column), you have already obtained L11, U11, and L21.
Now you can re-arrange the block equations by substituting
    U12 = L11^(-1) * A12  and  Ã22 = A22 - L21 * U12,
and repeat the procedure recursively on Ã22.
22 Blocked Algorithms - LU Factorization
A graphical view of what is going on is given in the accompanying figure.
23 Blocked Algorithms - LU Factorization
Left-Looking, Right-Looking, Crout.
Variations in the algorithm are due to the order in which submatrix operations are performed.
There are slight advantages to Crout's algorithm (a hybrid of the first two).
(Figure legend: pre-computed sub-blocks vs. sub-blocks currently being operated on.)
24 Review: BLAS 3 (Blocked) GEPP
for ib = 1 to n-1, step b      ... process the matrix b columns at a time
    end = ib + b - 1           ... point to the end of the block of b columns
    apply the BLAS 2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U'
    ... let LL denote the strict lower triangular part of A(ib:end, ib:end), plus I
    A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)   ... update next b rows of U
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                          - A(end+1:n, ib:end) * A(ib:end, end+1:n)
        ... apply the delayed updates with a single matrix multiply with inner dimension b
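For concreteness, a serial sketch of this blocked loop in NumPy, with pivoting omitted so the delayed-update structure stays visible (block size and names are my own):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_lu(A, b=64):
        # Unpivoted blocked right-looking LU: panel factorization (BLAS 2),
        # block row of U (triangular solve), one delayed BLAS 3 update.
        A = A.astype(float).copy()
        n = A.shape[0]
        for ib in range(0, n, b):
            end = min(ib + b, n)
            for i in range(ib, end):               # BLAS 2 panel factorization
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
            if end < n:
                LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
                A[ib:end, end:] = solve_triangular(LL, A[ib:end, end:], lower=True)
                A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]  # inner dim b
        return A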
25 Parallel Algorithms for Dense Matrices
All that follows is applicable to dense (full) matrices only.
Square matrices are discussed, but the arguments are valid for rectangular matrices as well.
Typical parallelization steps:
    Decomposition: identify parallel work and its partitioning.
    Mapping: which procs execute which portion of the work.
    Assignment: load-balance the work among procs.
    Organization: communication and synchronization.
26 Parallel Algorithms for Dense Matrices
The proc that owns a given portion of a matrix is responsible for doing all of the computation that involves that portion of the matrix.
This is the sensible thing to do, since it minimizes communication (although, due to data dependencies within the matrix, some communication will still be necessary).
The question is: how should we subdivide a matrix so that parallel efficiency is maximized? There are various options.
27 Different Data Layouts for Parallel GE
(Annotations from the layout figure:)
    Column blocked: bad load balance - P0 is idle after the first n/4 steps.
    Column cyclic: load balanced, but can't easily use BLAS 2 or BLAS 3.
    Column block cyclic: can trade load balance against BLAS 2/3 performance by choosing b, but factorization of the block column is a bottleneck.
    Row and column block cyclic: the winner!
    Block skewed: complicated addressing.
28 Row and Column Block Cyclic Layout
The matrix is composed of brow-by-bcol blocks.
Procs are distributed in a 2D array indexed by (pi, pj), with 0 <= pi < Prow and 0 <= pj < Pcol.
Element a(i,j) is mapped to proc (pi, pj) using the formulas:
    pi = floor(i / brow) mod Prow
    pj = floor(j / bcol) mod Pcol
In the figure, p = 4 and Prow = Pcol = brow = bcol = 2.
There is Pcol-fold parallelism in any column, and calls to BLAS 2 and BLAS 3 operate on matrices of size brow-by-bcol.
The serial bottleneck is eased.
The layout need not be symmetric in rows and columns.
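The mapping formulas are easy to check in a few lines; the defaults below mirror the slide's example (p = 4, Prow = Pcol = brow = bcol = 2):

    def owner(i, j, brow=2, bcol=2, Prow=2, Pcol=2):
        # Proc (pi, pj) that owns element a(i, j) in the 2D block cyclic layout.
        pi = (i // brow) % Prow
        pj = (j // bcol) % Pcol
        return pi, pj

    print(owner(5, 2))   # -> (0, 1): block row 2 wraps back to proc row 0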
29 Row and Column Block Cyclic Layout
In LU factorization, the distribution of work becomes uneven as the computation progresses.
Larger block sizes result in greater load imbalance but reduce the frequency of communication between procs: the block size controls this tradeoff.
Some procs need to do more work between synchronization points than others (e.g. partial pivoting over the rows in a single block column while the other procs stay idle; the computation of each block row of U requires the solution of a lower triangular system across the procs in a single row). The processor decomposition controls this type of tradeoff.
30 Parallel GE with a 2D Block Cyclic Layout
The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol.
Shaded regions indicate busy processors or communication being performed.
It is unnecessary to have a barrier between each step of the algorithm; e.g. steps 9 and 10 can be pipelined. See Dongarra's book for more details.
31-33 (Figures showing successive steps of the blocked parallel GE.)
Slide 32 caption: matrix multiply of green = green - blue * pink.
34 Parallel Matrix Transpose
The transpose of a matrix A is defined by
    A^T(i,j) = A(j,i),   0 <= i, j < n.
All elements below the diagonal move above the diagonal, and vice versa.
Assume it takes 1 unit of time to exchange a pair of matrix elements. The sequential time for transposing an n x n matrix is then
    T_s = (n^2 - n) / 2.
Consider parallel architectures organized as both 2D mesh and hypercube structures.
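A minimal serial sketch making the (n^2 - n)/2 count explicit: each below-diagonal element is exchanged exactly once with its mirror image.

    def transpose_in_place(A):
        # n*(n-1)/2 pairwise exchanges for an n x n matrix (list of lists).
        n = len(A)
        for i in range(n):
            for j in range(i + 1, n):
                A[i][j], A[j][i] = A[j][i], A[i][j]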
35 Parallel Matrix Transpose - 2D Mesh
(Figure: a 4x4 grid of procs P0..P15, shown in the initial matrix and in the final, transposed matrix.)
Elements/blocks in the lower-left part of the matrix move up to the diagonal, and then right to their final location. Each step taken requires communication.
Elements/blocks in the upper-right part of the matrix move down to the diagonal, and then left to their final location. Each step taken requires communication.
36 Parallel Matrix Transpose - 2D Mesh
(Same figure as the previous slide.)
If each of the p procs contains a single number, then after all of these communication steps the matrix has been transposed.
However, if each proc contains a sub-block of the matrix, then after all blocks have been communicated to their final locations they still need to be locally transposed.
Each sub-block contains (n/√p) x (n/√p) elements, and the cost of communication is higher than before.
The cost of communication is dominated by the elements/blocks that reside in the top-right and bottom-left corners; they have to take approximately 2√p hops.
37 Parallel Matrix Transpose - 2D Mesh
Each block contains n^2/p elements, so it takes at most
    2 (t_s + t_w n^2/p) √p
for all blocks to move to their final destinations. After that, the local blocks need to be transposed, which can be done in time approximately equal to
    n^2 / (2p).
Thus the total wall-clock time is
    T_P = n^2/(2p) + 2 t_s √p + 2 t_w n^2/√p.
Summing over all p processors, the total time consumed by the parallel algorithm is of order
    T_TOT = Θ(n^2 √p),
which is higher than the sequential complexity, Θ(n^2). This algorithm on a 2D mesh is therefore not cost-optimal. The same is true regardless of whether store-and-forward or cut-through routing is used.
38 Parallel Matrix Transpose - Hypercube
(Figure: the proc grid is split into quadrants; the off-diagonal quadrants are exchanged, and the subdivision is repeated within each quadrant.)
This algorithm is called recursive subdivision, and it maps naturally onto a hypercube.
After the blocks have all been transposed, the elements inside each block (local to a proc) still need to be transposed.
Wall-clock time:
    T_P = n^2/(2p) + (t_s + t_w n^2/p) log p
Total time:
    T_TOT = Θ(n^2 log p)
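A serial NumPy sketch of the recursive subdivision pattern (assuming n is a power of two; on a hypercube each quadrant swap becomes one pairwise block exchange between procs):

    import numpy as np

    def recursive_transpose(A):
        # Swap the two off-diagonal quadrants, then recurse into all four:
        # A^T = [[A11^T, A21^T], [A12^T, A22^T]].
        n = A.shape[0]
        if n <= 1:
            return
        h = n // 2
        A[:h, h:], A[h:, :h] = A[h:, :h].copy(), A[:h, h:].copy()
        for quad in (A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]):
            recursive_transpose(quad)

    A = np.arange(16.0).reshape(4, 4)
    recursive_transpose(A)
    print(A)   # matches np.arange(16.0).reshape(4, 4).T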
39 Parallel Matrix Transpose - Hypercube
With cut-through routing, the timing improves slightly, to
    T_P = (t_s + t_h + t_w n^2/(2p)) log p,
with a total time of
    T_TOT = Θ(n^2 log p),
which is still suboptimal.
Using striped partitioning (a.k.a. the column blocked layout) and cut-through routing on a hypercube, the time becomes cost-optimal:
    T_P = n^2/(2p) + t_s (p-1) + t_w n^2/p + (1/2) t_h p log p,
    T_TOT = Θ(n^2).
Note that this type of partitioning may be cost-optimal for the transpose operation, but not necessarily for other matrix operations, such as LU factorization and matrix-matrix multiply.
More information