

"Quick" Implementation of Block LU Algorithms on the CM-200

Claus Bendtsen

Abstract

The CMSSL library includes only a limited number of mathematical algorithms. Hence, when writing code for the Connection Machine one often has to implement LAPACK-style routines already developed for other architectures, in the hope that acceptable performance can be obtained relatively quickly. Due to the massively parallel structure of the CM, algorithms with a serial or global structure, such as LU factorization, tend to yield poor performance (global here meaning that elements do not interact only with elements in their neighborhood). The purpose of this note is partly to show what performance one can typically obtain when implementing a global algorithm on the CM-200 in a relatively limited period of time, and partly to examine the pros and cons of using a block algorithm. The tests have been performed on the LU factorization in a normal as well as a blocked version. The implementation uses BLAS level 3 equivalent routines, and the results have been compared to the LU factorization in the CMSSL library. The implementation and optimization of both the normal and the blocked version were carried out within 14 days altogether. The obtained performance is very disappointing: only 4% of that of the CMSSL routine for large matrices.

1 LU Factorization and Solver

The LU factorization is implemented by walking down the diagonal and subtracting outer products along the way. The outer products are calculated by means of spreads in the unoptimized version and by means of a mask and scans in the optimized version. The bottleneck of the operation is undoubtedly the spread/scan operations, since these demand a tremendous amount of communication and are thus very time consuming, even though the communication is along a single axis. Pivoting and equilibration are not implemented.
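The "walk down the diagonal and subtract outer products" scheme can be sketched serially as follows. This is a minimal illustration in plain Python, not the CM Fortran code of this note: the function names `lu_outer_product` and `multiply_lu` are hypothetical, and the spreads, masks and scans of the CM-200 implementation are replaced by ordinary loops. As in the note, no pivoting is performed.

```python
def lu_outer_product(a):
    """Factor a square matrix in place as A = L U, without pivoting.

    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    """
    n = len(a)
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        # Scale the column below the diagonal to form the multipliers.
        for i in range(k + 1, n):
            a[i][k] /= pivot
        # Subtract the outer product of the multiplier column and the
        # pivot row from the trailing submatrix.
        for i in range(k + 1, n):
            l_ik = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= l_ik * a[k][j]
    return a


def multiply_lu(lu):
    """Rebuild A from the packed factors (used only for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                total += l_ik * lu[k][j]
            out[i][j] = total
    return out
```

On a data-parallel machine the two inner loops of the update correspond to a single elementwise operation on the trailing submatrix, which is why the cost is dominated by broadcasting the multiplier column and the pivot row.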
Since the functions need global communication, the best results have been obtained by choosing a :NEWS layout. The solver is implemented in a way similar to the factorization. The backward as well as the forward substitution is performed by means of spreads (unoptimized) and scans (optimized). In the optimized version care is taken not to create unnecessary temporaries.

Random test systems lead to the performance shown in Fig. 1. The timings shown are elapsed times measured on an 8K CM-200 using double precision. The timings show a complexity of $N^{2.6}$ for large $N$. For the factorization the optimized version is a little faster than the unoptimized one (typically by a factor of two).

UNI-C (The Danish Computer Center for Research and Education, DTH, DK-2800 Lyngby, Denmark), claus.bendtsen@uni-c.dk

Figure 1: Performance of normal LU factorization and solver. (Plot of time in seconds; curves: LU unopt., Solve unopt., LU opt., Solve opt.)

In contrast, the optimized solver is the slowest. The reason is that the use of masks in the solver is more complicated¹ than for the factorization, and the overhead is too large to outperform the spreads.

2 Block LU Factorization and Solver

The block factorization algorithm described in [1] and [2] reorganizes Gaussian elimination so that matrix multiplication becomes the dominant operation. Since matrix multiplication is implemented efficiently on the Connection Machine, this implies a gain when changing to the block algorithm. The implementation follows [1], except that the recursive structure has been changed to an iterative one due to the nature of CM-FORTRAN. The unoptimized version uses a :NEWS layout of the matrix and uses subscript triplets to identify the different blocks². The optimized version uses a :NEWS layout within each block and a :SERIAL layout across the blocks, which means that each block is spread across the entire machine and that identifying the different blocks is done without communication. Both the optimized and the unoptimized version use the optimized version of the normal factorization and the unoptimized version of the solver described in Section 1 for operations on the blocks. The implementation has been restricted to systems whose size is a multiple of the block size.

Random test systems lead to the performance shown in Figs. 2 and 3. The timings shown are elapsed times measured on an 8K CM-200 using double precision and various block sizes. For the optimized version, the time required for packing a normal layout into the combined :SERIAL and :NEWS layout has not been included in the timings, since most other calculations could be performed on the packed array using multiple-instance calls, which implies that the user could keep the data packed all the time³. Each version was tested with different block sizes.
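To make the structure of an iterative block factorization concrete, here is a minimal serial sketch in plain Python. It is an assumption-laden illustration: it shows the common partitioned (right-looking) block LU, which is not necessarily the exact formulation of [1], and the names `block_lu` and `reconstruct` are hypothetical. Each step factors a diagonal block with the point algorithm, performs triangular solves for the block row and block column, and then forms the Schur complement by a matrix multiplication. As in the note, there is no pivoting and the matrix size must be a multiple of the block size `nb`.

```python
def block_lu(a, nb):
    """In-place partitioned block LU without pivoting.

    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    if n % nb != 0:
        raise ValueError("matrix size must be a multiple of the block size")
    for k0 in range(0, n, nb):
        k1 = k0 + nb
        # 1. Point LU of the diagonal block: A11 = L11 U11.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # 2. Block row: solve L11 U12 = A12 by forward substitution.
        for j in range(k1, n):
            for i in range(k0, k1):
                for p in range(k0, i):
                    a[i][j] -= a[i][p] * a[p][j]
        # 3. Block column: solve L21 U11 = A21, one row at a time.
        for i in range(k1, n):
            for j in range(k0, k1):
                for p in range(k0, j):
                    a[i][j] -= a[i][p] * a[p][j]
                a[i][j] /= a[j][j]
        # 4. Schur complement: A22 <- A22 - L21 U12 (a matrix product).
        for i in range(k1, n):
            for j in range(k1, n):
                total = 0.0
                for p in range(k0, k1):
                    total += a[i][p] * a[p][j]
                a[i][j] -= total
    return a


def reconstruct(lu):
    """Multiply the packed factors back together (used only for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                out[i][j] += l_ik * lu[k][j]
    return out
```

With `nb = 1` the loop reduces to the point algorithm of Section 1; with larger `nb` most of the arithmetic moves into step 4, which is the part a fast matrix-multiplication routine can accelerate.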
For large $N$ the complexity is $N^{2.5}$ for the block factorization and $N^{2.4}$ for the block solver. Surprisingly, the "unoptimized" version of the factorization performs better than the optimized one, even though it involves large amounts of unneeded communication. It seems that a block size of 16 for small systems and a size of 32 or 64 for large ones yields the best performance on an 8K machine. I would have expected the block size of 32 to be the best, since a block size of 32 maps perfectly onto the machine, using a vector size of 4 for each PE.

The reason why the optimized version performs so poorly seems to be the time used calculating the Schur complement. At a block size of 16, 66% of the time is spent calculating the Schur complement; at a block size of 32 it is 63%, but at a block size of 64 it drops to 39%. This explains why the block size of 64 outperforms the smaller block sizes. It is also the main reason why the optimized version is slower than the unoptimized one. In the unoptimized version the Schur complement is calculated by one matrix-matrix multiplication, whereas in the optimized version the layout results in a multiple-instance matrix-matrix multiplication, each instance having a size equal to the block size⁴.

¹ Since two scans are needed, one scanning upward and one scanning downward along each axis, as opposed to only one scan along each axis.
² This leads to a lot of communication due to the creation of temporaries.
³ It is actually possible to avoid this packing by rearranging equations, but it requires exact knowledge of where the different elements live as well as a longer implementation time.
⁴ At a lower level than CMF, or with a different layout, this multiple-instance multiplication could be performed as a single matrix-matrix multiplication and thus reduce execution time significantly.

Figure 2: Performance of Block LU factorization. (Plot of time in seconds by type and block size: Opt. 2, 8, 16, 32, 64; Unopt. 2, 8, 16, 32, 64.)

Figure 3: Performance of Block LU solver. (Plot of time in seconds by type and block size: Opt. 2, 8, 16, 32, 64; Unopt. 2, 8, 16, 32, 64.)

3 The CMSSL LU Factorization

In the CMSSL library Gaussian elimination is implemented using a point algorithm and block cyclic ordering. The performance for test runs equal to those in Section 2 is shown in Figs. 4 and 5. The results have been obtained without pivoting and equilibration, using CMSSL Version 3.1 Beta 2. The complexity for large $N$ is $N^{1.8}$ for the factorization and $N^{1.5}$ for the solver.

Figure 4: Performance of CMSSL 3.1 LU factorization. (Plot of time in seconds by block size.)

4 Concluding Remarks

The best results of each of the sections above are shown in Fig. 6. Flop rates are displayed instead of timings, using the approximation of $(2/3)N^3$ floating-point operations for the factorization routines and $2N^2 \times$ (no. of right-hand sides) floating-point operations for the solver routines [3]. Regarding the CMSSL library, it is seen that the routines do well on large matrices and that our peak performance on large systems is only roughly 4% of the CMSSL peak performance. The reasons are discussed in Section 2. There seems to be a fair gain when using a block algorithm instead of a normal algorithm, and the implementation time needed to obtain this gain is relatively small.

Figure 5: Performance of CMSSL 3.1 LU solver. (Plot of time in seconds by block size.)

Figure 6: MFlop rates for different implementations of LU factorization and solver. (Plot of MFlops; curves: CMSSL LU fact., CMSSL LU solver, LU fact., LU solver, BLU fact., BLU solver.)
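The conversion from elapsed time to the MFlop rates plotted in Fig. 6 can be sketched as below, using the operation counts quoted in Section 4: $(2/3)N^3$ for a factorization and $2N^2$ per right-hand side for a solve. The function names are hypothetical, and the matrix size and timing in the usage lines are made-up illustrative values, not measurements from this note.

```python
def lu_flops(n):
    """Approximate flop count of an LU factorization of order n."""
    return 2.0 * n**3 / 3.0


def solve_flops(n, nrhs=1):
    """Approximate flop count of forward plus backward substitution."""
    return 2.0 * n**2 * nrhs


def mflops(flop_count, seconds):
    """Convert a flop count and an elapsed time into MFlops."""
    if seconds <= 0.0:
        raise ValueError("elapsed time must be positive")
    return flop_count / seconds / 1.0e6


# Hypothetical example: order 1024, 10 seconds elapsed (illustrative only).
n_example = 1024
t_example = 10.0
print("LU rate: %.2f MFlops" % mflops(lu_flops(n_example), t_example))
```

Because the same nominal count is used for every implementation, the resulting rates compare the implementations by speed on equal work, regardless of how many operations each one actually executes.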

The overall conclusion is that even though a "quick-and-dirty" implementation of block algorithms seems to work better than a similar implementation of point algorithms, it takes a long implementation time to obtain high performance. Since the block algorithm seems to map quite well onto the Connection Machine, it should be possible, with proper knowledge of the machine architecture, low-level programming and sufficient time, to obtain a much higher performance rate.

5 Further Work

If one wishes to obtain better performance on the LU factorization, it could be interesting to try the :BLOCK layout now supported by CMF. This would enable a single-instance matrix-matrix multiplication when computing the Schur complement while still maintaining the layout of the optimized block algorithm. Furthermore, one could probably optimize communication by doing multiple-instance scans rather than single-instance scans.

References

[1] Demmel, James W., Higham, Nicholas J., Schreiber, Robert S. Block LU Factorization. LAPACK Working Note 40, Febr
[2] Golub, Gene H., van Loan, Charles F. Matrix Computations. Second edition, The Johns Hopkins University Press,
[3] Thinking Machines Corporation. CMSSL Release Notes for the CM-200, Preliminary Documentation for Version 3.1 Beta 2. TMC, Cambridge, Massachusetts, February

Implementation of QR Up- and Downdating on a. Massively Parallel Computer. Hans Bruun Nielsen z Mustafa Pnar z. July 8, 1996.

Implementation of QR Up- and Downdating on a. Massively Parallel Computer. Hans Bruun Nielsen z Mustafa Pnar z. July 8, 1996. Implementation of QR Up- and Downdating on a Massively Parallel Computer Claus Btsen y Per Christian Hansen y Kaj Madsen z Hans Bruun Nielsen z Mustafa Pnar z July 8, 1996 Abstract We describe an implementation

More information

Parallelizing LU Factorization

Parallelizing LU Factorization Parallelizing LU Factorization Scott Ricketts December 3, 2006 Abstract Systems of linear equations can be represented by matrix equations of the form A x = b LU Factorization is a method for solving systems

More information

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8

Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Frequency Scaling and Energy Efficiency regarding the Gauss-Jordan Elimination Scheme on OpenPower 8 Martin Köhler Jens Saak 2 The Gauss-Jordan Elimination scheme is an alternative to the LU decomposition

More information

Parallel Implementations of Gaussian Elimination

Parallel Implementations of Gaussian Elimination s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n

More information

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition

Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Chapter 1 A New Parallel Algorithm for Computing the Singular Value Decomposition Nicholas J. Higham Pythagoras Papadimitriou Abstract A new method is described for computing the singular value decomposition

More information

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of New York Bualo, NY 14260 Abstract The Connection Machine

More information

A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields

A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields A Fast and Exact Simulation Algorithm for General Gaussian Markov Random Fields HÅVARD RUE DEPARTMENT OF MATHEMATICAL SCIENCES NTNU, NORWAY FIRST VERSION: FEBRUARY 23, 1999 REVISED: APRIL 23, 1999 SUMMARY

More information

Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System

Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System Mesh Decomposition and Communication Procedures for Finite Element Applications on the Connection Machine CM-5 System The Harvard community has made this article openly available. Please share how this

More information

Computational Methods CMSC/AMSC/MAPL 460. Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science

Computational Methods CMSC/AMSC/MAPL 460. Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science Computational Methods CMSC/AMSC/MAPL 460 Vectors, Matrices, Linear Systems, LU Decomposition, Ramani Duraiswami, Dept. of Computer Science Zero elements of first column below 1 st row multiplying 1 st

More information

Comments on A structure-preserving method for the quaternion LU decomposition in quaternionic quantum theory by Minghui Wang and Wenhao Ma

Comments on A structure-preserving method for the quaternion LU decomposition in quaternionic quantum theory by Minghui Wang and Wenhao Ma Comments on A structure-preserving method for the quaternion LU decomposition in quaternionic quantum theory by Minghui Wang and Wenhao Ma Stephen J. Sangwine a a School of Computer Science and Electronic

More information

F04EBFP.1. NAG Parallel Library Routine Document

F04EBFP.1. NAG Parallel Library Routine Document F04 Simultaneous Linear Equations F04EBFP NAG Parallel Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check for implementation-dependent

More information

A High Performance Sparse Cholesky Factorization Algorithm For. University of Minnesota. Abstract

A High Performance Sparse Cholesky Factorization Algorithm For. University of Minnesota. Abstract A High Performance Sparse holesky Factorization Algorithm For Scalable Parallel omputers George Karypis and Vipin Kumar Department of omputer Science University of Minnesota Minneapolis, MN 55455 Technical

More information

High-Performance Implementation of the Level-3 BLAS

High-Performance Implementation of the Level-3 BLAS High-Performance Implementation of the Level- BLAS KAZUSHIGE GOTO The University of Texas at Austin and ROBERT VAN DE GEIJN The University of Texas at Austin A simple but highly effective approach for

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 4: communication avoiding algorithms for LU factorization

CS 598: Communication Cost Analysis of Algorithms Lecture 4: communication avoiding algorithms for LU factorization CS 598: Communication Cost Analysis of Algorithms Lecture 4: communication avoiding algorithms for LU factorization Edgar Solomonik University of Illinois at Urbana-Champaign August 31, 2016 Review of

More information

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation

Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular

More information

Recursive Blocked Algorithms for Solving Triangular Systems Part I: One-Sided and Coupled Sylvester-Type Matrix Equations

Recursive Blocked Algorithms for Solving Triangular Systems Part I: One-Sided and Coupled Sylvester-Type Matrix Equations Recursive Blocked Algorithms for Solving Triangular Systems Part I: One-Sided and Coupled Sylvester-Type Matrix Equations ISAK JONSSON and BO KA GSTRÖM Umeå University Triangular matrix equations appear

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Computing the rank of big sparse matrices modulo p using gaussian elimination

Computing the rank of big sparse matrices modulo p using gaussian elimination Computing the rank of big sparse matrices modulo p using gaussian elimination Charles Bouillaguet 1 Claire Delaplace 2 12 CRIStAL, Université de Lille 2 IRISA, Université de Rennes 1 JNCF, 16 janvier 2017

More information

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012.

Blocked Schur Algorithms for Computing the Matrix Square Root. Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui. MIMS EPrint: 2012. Blocked Schur Algorithms for Computing the Matrix Square Root Deadman, Edvin and Higham, Nicholas J. and Ralha, Rui 2013 MIMS EPrint: 2012.26 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

Scan Primitives for GPU Computing

Scan Primitives for GPU Computing Scan Primitives for GPU Computing Shubho Sengupta, Mark Harris *, Yao Zhang, John Owens University of California Davis, *NVIDIA Corporation Motivation Raw compute power and bandwidth of GPUs increasing

More information

Parallelisation of Surface-Related Multiple Elimination

Parallelisation of Surface-Related Multiple Elimination Parallelisation of Surface-Related Multiple Elimination G. M. van Waveren High Performance Computing Centre, Groningen, The Netherlands and I.M. Godfrey Stern Computing Systems, Lyon,

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Fast Tridiagonal Solvers on GPU

Fast Tridiagonal Solvers on GPU Fast Tridiagonal Solvers on GPU Yao Zhang John Owens UC Davis Jonathan Cohen NVIDIA GPU Technology Conference 2009 Outline Introduction Algorithms Design algorithms for GPU architecture Performance Bottleneck-based

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.3 Triangular Linear Systems Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign

More information

High performance matrix inversion of SPD matrices on graphics processors

High performance matrix inversion of SPD matrices on graphics processors High performance matrix inversion of SPD matrices on graphics processors Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí and Alfredo Remón Max-Planck-Institute for Dynamics of Complex Technical Systems

More information

A GPU Sparse Direct Solver for AX=B

A GPU Sparse Direct Solver for AX=B 1 / 25 A GPU Sparse Direct Solver for AX=B Jonathan Hogg, Evgueni Ovtchinnikov, Jennifer Scott* STFC Rutherford Appleton Laboratory 26 March 2014 GPU Technology Conference San Jose, California * Thanks

More information

B(FOM) 2. Block full orthogonalization methods for functions of matrices. Kathryn Lund. December 12, 2017

B(FOM) 2. Block full orthogonalization methods for functions of matrices. Kathryn Lund. December 12, 2017 B(FOM) 2 Block full orthogonalization methods for functions of matrices Kathryn Lund December 12, 2017 The block full orthogonalization methods for functions of matrices (denoted B(FOM) 2, for short) are

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

Blocked Schur Algorithms for Computing the Matrix Square Root

Blocked Schur Algorithms for Computing the Matrix Square Root Blocked Schur Algorithms for Computing the Matrix Square Root Edvin Deadman 1, Nicholas J. Higham 2,andRuiRalha 3 1 Numerical Algorithms Group edvin.deadman@nag.co.uk 2 University of Manchester higham@maths.manchester.ac.uk

More information

Optimizations of BLIS Library for AMD ZEN Core

Optimizations of BLIS Library for AMD ZEN Core Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was

More information

Introduction to Analysis of Algorithms

Introduction to Analysis of Algorithms Introduction to Analysis of Algorithms Analysis of Algorithms To determine how efficient an algorithm is we compute the amount of time that the algorithm needs to solve a problem. Given two algorithms

More information

Exam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3

Exam Design and Analysis of Algorithms for Parallel Computer Systems 9 15 at ÖP3 UMEÅ UNIVERSITET Institutionen för datavetenskap Lars Karlsson, Bo Kågström och Mikael Rännar Design and Analysis of Algorithms for Parallel Computer Systems VT2009 June 2, 2009 Exam Design and Analysis

More information

Parallel Linear Algebra in Julia

Parallel Linear Algebra in Julia Parallel Linear Algebra in Julia Britni Crocker and Donglai Wei 18.337 Parallel Computing 12.17.2012 1 Table of Contents 1. Abstract... 2 2. Introduction... 3 3. Julia Implementation...7 4. Performance...

More information

Coarse-to-fine image registration

Coarse-to-fine image registration Today we will look at a few important topics in scale space in computer vision, in particular, coarseto-fine approaches, and the SIFT feature descriptor. I will present only the main ideas here to give

More information

PARALLEL COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION ON TREE ARCHITECTURES

PARALLEL COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION ON TREE ARCHITECTURES PARALLEL COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION ON TREE ARCHITECTURES Zhou B. B. and Brent R. P. Computer Sciences Laboratory Australian National University Canberra, ACT 000 Abstract We describe

More information

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra CS 294-73 Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?

More information

Sparse matrices, graphs, and tree elimination

Sparse matrices, graphs, and tree elimination Logistics Week 6: Friday, Oct 2 1. I will be out of town next Tuesday, October 6, and so will not have office hours on that day. I will be around on Monday, except during the SCAN seminar (1:25-2:15);

More information

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable

More information

THE application of advanced computer architecture and

THE application of advanced computer architecture and 544 IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, VOL. 45, NO. 3, MARCH 1997 Scalable Solutions to Integral-Equation and Finite-Element Simulations Tom Cwik, Senior Member, IEEE, Daniel S. Katz, Member,

More information

SCALABLE ALGORITHMS for solving large sparse linear systems of equations

SCALABLE ALGORITHMS for solving large sparse linear systems of equations SCALABLE ALGORITHMS for solving large sparse linear systems of equations CONTENTS Sparse direct solvers (multifrontal) Substructuring methods (hybrid solvers) Jacko Koster, Bergen Center for Computational

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 2 Don t you just invert

More information

Sparse Direct Solvers for Extreme-Scale Computing

Sparse Direct Solvers for Extreme-Scale Computing Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 5: Sparse Linear Systems and Factorization Methods Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 18 Sparse

More information

A High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin.

A High Performance Parallel Strassen Implementation. Brian Grayson. The University of Texas at Austin. A High Performance Parallel Strassen Implementation Brian Grayson Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 787 bgrayson@pineeceutexasedu Ajay Pankaj

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

COSC 311: ALGORITHMS HW1: SORTING

COSC 311: ALGORITHMS HW1: SORTING COSC 311: ALGORITHMS HW1: SORTIG Solutions 1) Theoretical predictions. Solution: On randomly ordered data, we expect the following ordering: Heapsort = Mergesort = Quicksort (deterministic or randomized)

More information

Automatic Tuning of Sparse Matrix Kernels

Automatic Tuning of Sparse Matrix Kernels Automatic Tuning of Sparse Matrix Kernels Kathy Yelick U.C. Berkeley and Lawrence Berkeley National Laboratory Richard Vuduc, Lawrence Livermore National Laboratory James Demmel, U.C. Berkeley Berkeley

More information

Lecture 9. Introduction to Numerical Techniques

Lecture 9. Introduction to Numerical Techniques Lecture 9. Introduction to Numerical Techniques Ivan Papusha CDS270 2: Mathematical Methods in Control and System Engineering May 27, 2015 1 / 25 Logistics hw8 (last one) due today. do an easy problem

More information

Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units

Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units Parallel Alternating Direction Implicit Solver for the Two-Dimensional Heat Diffusion Problem on Graphics Processing Units Khor Shu Heng Engineering Science Programme National University of Singapore Abstract

More information

Using PASSION System on LU Factorization

Using PASSION System on LU Factorization Syracuse University SURFACE Electrical Engineering and Computer Science Technical Reports College of Engineering and Computer Science 11-1995 Using PASSION System on LU Factorization Haluk Rahmi Topcuoglu

More information
