"Quick" Implementation of Block LU Algorithms on the CM-200

Claus Bendtsen

Abstract

The CMSSL library includes only a limited number of mathematical algorithms. Hence, when writing code for the Connection Machine, one often has to implement LAPACK-style routines already developed for other architectures, in the hope that acceptable performance can be obtained relatively quickly. Due to the massively parallel structure of the CM, algorithms with a serial or global structure, such as LU factorization, tend to yield poor performance (global here meaning that elements do not interact only with elements in their neighborhood). The purpose of this note is partly to show what performance one can typically obtain when implementing a global algorithm on the CM-200 in a relatively limited period of time, and partly to examine the pros and cons of using a block algorithm. The testing has been performed on the LU factorization in a normal as well as a blocked version. The implementation uses BLAS level 3 equivalent routines, and the results have been compared to the LU factorization present in the CMSSL library. The implementation and optimization of both the normal and the blocked version were carried out within 14 days altogether. The obtained performance is very disappointing: only 4% of that of the CMSSL routine for large matrices.

1 LU Factorization and Solver

The LU factorization is implemented by walking down the diagonal and subtracting outer products along the way. The outer products are calculated by means of spreads in the unoptimized version and by the use of a mask and scans in the optimized version. The bottleneck of the operation is undoubtedly the spread/scan operations, since these demand a tremendous amount of communication and are thus very time consuming, even though the communication is along a single axis. Pivoting and equilibration are not implemented.
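The outer-product scheme described above can be sketched serially as follows. This is only an illustration in Python, not the CM-FORTRAN code of the note: the function name `lu_outer_product` is hypothetical, and the explicit loops stand in for the spread and scan operations that broadcast the pivot row and column on the CM-200. As in the note, no pivoting is done.

```python
def lu_outer_product(a):
    """In-place LU factorization without pivoting (outer-product form).

    `a` is a square matrix stored as a list of lists of floats. On return,
    the strict lower triangle holds the multipliers of L (unit diagonal
    implied) and the upper triangle holds U.
    """
    n = len(a)
    for k in range(n):                      # walk down the diagonal
        pivot = a[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):           # scale the column below the pivot
            a[i][k] /= pivot
        for i in range(k + 1, n):           # subtract the outer product
            lik = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= lik * a[k][j]
    return a


if __name__ == "__main__":
    print(lu_outer_product([[4.0, 3.0], [6.0, 3.0]]))
```

Each pass over `k` touches the whole trailing submatrix, which is why the broadcast of row `k` and column `k` dominates the cost on a machine where that broadcast requires communication.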
Since the functions need global communication, the best results have been obtained by choosing a :NEWS layout.

The solver is implemented in a way similar to the factorization. The backward as well as the forward substitution is performed by means of spreads (unoptimized) and scans (optimized). In the optimized version care is taken not to create unnecessary temporaries.

Random test systems lead to the performance shown in Fig. 1. The timings shown are elapsed times computed on an 8K CM-200 using double precision. The timings show a complexity of N^2.6 for large N. For the factorization the optimized version is a little faster than the unoptimized one (typically a factor of two), whereas the optimized solver is the slowest. The reason is that the use of masks in the solver is more complicated (footnote 1) than for the factorization, and the overhead is too large to outperform the spreads.

UNI-C (The Danish Computer Center for Research and Education, DTH, DK-2800 Lyngby, Denmark), claus.bendtsen@uni-c.dk

Figure 1: Performance of normal LU factorization and solver (time in seconds; curves: LU unopt., Solve unopt., LU opt., Solve opt.).

2 Block LU Factorization and Solver

The block factorization algorithm described in [1] and [2] reorganizes the Gaussian elimination so that matrix multiplication becomes the dominant operation. Since matrix multiplication is implemented efficiently on the Connection Machine, this implies a gain when changing to the block algorithm.

The implementation is performed as described in [1], except that the recursive structure has been changed to an iterative one due to the nature of CM-FORTRAN. The unoptimized version uses a :NEWS layout of the matrix and uses subscript triplets to identify the different blocks (footnote 2). The optimized version uses a :NEWS layout within each block and a :SERIAL layout for all the blocks, which means that each block is spread across the entire machine and that identifying the different blocks is done without communication. Both the optimized and the unoptimized version use the optimized version of the normal factorization and the unoptimized version of the solver described in Section 1 for doing operations on the blocks. The implementation has been restricted to systems whose size is a multiple of the block size.

Random test systems lead to the performance shown in Figs. 2 and 3. The timings shown are elapsed times computed on an 8K CM-200 using double precision and various block sizes. For the optimized version, the time required for packing a normal layout into the combined :SERIAL and :NEWS layout has not been included in the timings, since most other calculations could be performed on the packed array using multiple instance calls, which implies that the user could keep the data packed all the time (footnote 3). Each version was tested with different block sizes.
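An iterative block elimination of the kind described above can be sketched as follows. This is a serial Python illustration under stated assumptions, not the author's CM-FORTRAN implementation: the name `block_lu` is hypothetical, the variant shown leaves the diagonal blocks of U unfactored (L has identity diagonal blocks), and the innermost triple loop stands in for the matrix-matrix multiplication that a BLAS level 3 style routine would perform. No pivoting is done, and the matrix size must be a multiple of the block size.

```python
def block_lu(a, b):
    """Iterative block LU without pivoting on an n-by-n list of lists.

    On return, every block (i, k) below the block diagonal holds
    L_ik = A_ik * inv(A_kk), and the block upper triangle (diagonal
    blocks included) holds U.
    """
    n = len(a)
    if n % b != 0:
        raise ValueError("matrix size must be a multiple of the block size")
    for k in range(0, n, b):
        kb = k + b
        # Point LU (no pivoting) of a copy of the diagonal block.
        d = [row[k:kb] for row in a[k:kb]]
        for p in range(b):
            for i in range(p + 1, b):
                d[i][p] /= d[p][p]
                for j in range(p + 1, b):
                    d[i][j] -= d[i][p] * d[p][j]
        # For each row below the block, solve x * (L U) = row.
        for i in range(kb, n):
            x = a[i][k:kb]
            for j in range(b):                  # y * U = row
                for p in range(j):
                    x[j] -= x[p] * d[p][j]
                x[j] /= d[j][j]
            for j in range(b - 1, -1, -1):      # x * L = y, L unit lower
                for p in range(j + 1, b):
                    x[j] -= x[p] * d[p][j]
            a[i][k:kb] = x
        # Schur complement: A22 <- A22 - L21 * A12 (matrix-matrix product).
        for i in range(kb, n):
            for j in range(kb, n):
                s = 0.0
                for p in range(k, kb):
                    s += a[i][p] * a[p][j]
                a[i][j] -= s
    return a
```

With a block size of 1 the sketch reduces to ordinary Gaussian elimination; with larger blocks, most of the arithmetic moves into the Schur complement update, which is the step that benefits from an efficient matrix multiplication.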
The complexity of the block factorization is N^2.5 for large N, and the complexity of the block solver is N^2.4 for large N. Surprisingly, the "unoptimized" version of the factorization performs better than the optimized one, even though it involves large amounts of unneeded communication. It seems that a block size of 16 for small systems and a size of 32 or 64 for large ones yields the best performance on an 8K machine. I would have expected the block size of 32 to be the best, since a block size of 32 maps perfectly onto the machine, using a vector size of 4 for each PE.

The reason why the optimized version performs so poorly seems to be the time spent calculating the Schur complement. At a block size of 16, 66% of the time is spent calculating the Schur complement; at a block size of 32 it is 63%, but at a block size of 64 it drops to 39%. This explains why the block size of 64 outperforms the smaller block sizes. It is also the main reason why the optimized version is slower than the unoptimized one. In the unoptimized version the Schur complement is calculated by a single matrix-matrix multiplication, as opposed to the optimized version, where the layout results in a multiple instance matrix-matrix multiplication, each instance with a size equal to the block size (footnote 4).

Footnote 1: Two scans are needed along each axis, one scanning upward and one scanning downward, as opposed to only one scan along each axis.
Footnote 2: This leads to a lot of communication due to the creation of temporaries.
Footnote 3: It is actually possible to avoid this packing by rearranging equations, but it requires exact knowledge of where the different elements live as well as a longer implementation time.
Footnote 4: At a lower level than CMF, or with a different layout, this multiple instance multiplication could be performed as a single matrix-matrix multiplication and thus improve execution speed significantly.
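The complexity exponents quoted in this section are slopes of elapsed time against N on a log-log scale. The note does not say how they were obtained; one common approach, sketched here as an assumption, is a least-squares fit of log(time) against log(N). The function name `fit_exponent` is hypothetical, and the timings in the example are synthetic values generated from a known power law, not measurements from the CM-200.

```python
import math


def fit_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic timings from t = 1e-6 * N**2.5 (illustration only).
    sizes = [256, 512, 1024, 2048]
    times = [1e-6 * n ** 2.5 for n in sizes]
    print(round(fit_exponent(sizes, times), 3))
```

Restricting the fit to the largest system sizes matches the "for large N" qualification, since start-up overheads flatten the curve for small systems.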
Figure 2: Performance of Block LU factorization (time in seconds; optimized and unoptimized versions, block sizes 2, 8, 16, 32 and 64).
Figure 3: Performance of Block LU solver (time in seconds; optimized and unoptimized versions, block sizes 2, 8, 16, 32 and 64).
3 The CMSSL LU Factorization

In the CMSSL library the Gaussian elimination is implemented by the use of a point algorithm and block cyclic ordering. The performance for test runs equal to the ones performed in Section 2 is shown in Figs. 4 and 5. The results have been obtained without the use of pivoting and equilibration and by the use of CMSSL Version 3.1 Beta 2. The complexity for large N is N^1.8 for the factorization and N^1.5 for the solver.

Figure 4: Performance of CMSSL 3.1 LU factorization (time in seconds, by block size).

Figure 5: Performance of CMSSL 3.1 LU solver (time in seconds, by block size).

4 Concluding Remarks

The best results of each of the sections above are shown in Fig. 6. Flop rates are displayed instead of timings, using the approximation of (2/3) N^3 floating-point operations for the factorization routines and 2 N^2 x (no. of right-hand sides) floating-point operations for the solver routines [3].

Regarding the CMSSL library, it is seen that the routines do well on large matrices and that our peak performance on large systems is only roughly 4% of the CMSSL peak performance. The reasons are discussed in Section 2.

There seems to be a fair gain when using a block algorithm instead of a normal algorithm, and the implementation time needed to obtain this gain is relatively small. The overall conclusion is that even though a "quick-and-dirty" implementation of block algorithms seems to work better than a similar implementation of point algorithms, it takes a long implementation time to obtain high performance. Since the block algorithm seems to map quite well onto the Connection Machine, it should be possible, with proper knowledge of the machine architecture, low-level programming and sufficient time, to obtain a much higher performance rate.

Figure 6: MFlop rates for different implementations of LU factorization and solver (curves: CMSSL LU fact., CMSSL LU solver, LU fact., LU solver, BLU fact., BLU solver).

5 Further Work

If one wishes to obtain better performance on the LU factorization, it could be interesting to try using the :BLOCK layout now supported by CMF. This would enable one to make a single instance matrix-matrix multiplication when computing the Schur complement and still maintain the layout of the optimized block algorithm. Furthermore, one could probably optimize communication by doing multiple instance scans rather than single instance scans.

References

[1] Demmel, James W., Higham, Nicholas J., Schreiber, Robert S. Block LU Factorization. LAPACK Working Note 40, Febr.
[2] Golub, Gene H., van Loan, Charles F. Matrix Computations. Second edition, The Johns Hopkins University Press.
[3] Thinking Machines Corporation. CMSSL Release Notes for the CM-200, Preliminary Documentation for Version 3.1 Beta 2. TMC, Cambridge, Massachusetts, February.
(Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 1 Don t you just invert
More informationNotes on efficient Matlab programming
Notes on efficient Matlab programming Dag Lindbo dag@csc.kth.se June 11, 2010 1 Introduction - is Matlab slow? It is a common view that Matlab is slow. This is a very crude statement that holds both some
More informationChapter 23. Linear Motion Motion of a Bug
Chapter 23 Linear Motion The simplest example of a parametrized curve arises when studying the motion of an object along a straight line in the plane We will start by studying this kind of motion when
More informationAn Evaluation of High Performance Fortran Compilers Using the HPFBench Benchmark Suite
An Evaluation of High Performance Fortran Compilers Using the HPFBench Benchmark Suite Guohua Jin and Y. Charlie Hu Department of Computer Science Rice University 61 Main Street, MS 132 Houston, TX 775
More informationImplementation of the BLAS Level 3 and LINPACK Benchmark on the AP1000
Implementation of the BLAS Level 3 and LINPACK Benchmark on the AP1000 Richard P. Brent and Peter E. Strazdins Computer Sciences Laboratory and Department of Computer Science Australian National University
More informationLeast-Squares Fitting of Data with B-Spline Curves
Least-Squares Fitting of Data with B-Spline Curves David Eberly, Geometric Tools, Redmond WA 98052 https://www.geometrictools.com/ This work is licensed under the Creative Commons Attribution 4.0 International
More informationExample 1: Give the coordinates of the points on the graph.
Ordered Pairs Often, to get an idea of the behavior of an equation, we will make a picture that represents the solutions to the equation. A graph gives us that picture. The rectangular coordinate plane,
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationRun Times. Efficiency Issues. Run Times cont d. More on O( ) notation
Comp2711 S1 2006 Correctness Oheads 1 Efficiency Issues Comp2711 S1 2006 Correctness Oheads 2 Run Times An implementation may be correct with respect to the Specification Pre- and Post-condition, but nevertheless
More informationLU FACTORIZATION AND THE LINPACK BENCHMARK ON THE INTEL PARAGON
LU FACTORIZATION AND THE LINPACK BENCHMARK ON THE INTEL PARAGON DAVID WOMBLE y, DAVID GREENBERG y, STEPHEN WHEAT y, AND ROLF RIESEN y Abstract. An implementation of the LINPACK benchmark is described which
More informationALMA Memo No An Imaging Study for ACA. Min S. Yun. University of Massachusetts. April 1, Abstract
ALMA Memo No. 368 An Imaging Study for ACA Min S. Yun University of Massachusetts April 1, 2001 Abstract 1 Introduction The ALMA Complementary Array (ACA) is one of the several new capabilities for ALMA
More informationParallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited LAPACK Working Note #208
Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited LAPACK Working Note #208 Hatem Ltaief 1, Jakub Kurzak 1, and Jack Dongarra 1,2,3 1 Department of Electrical
More informationNumerical considerations
Numerical considerations CHAPTER 6 CHAPTER OUTLINE 6.1 Floating-Point Data Representation...13 Normalized Representation of M...13 Excess Encoding of E...133 6. Representable Numbers...134 6.3 Special
More informationCS 770G - Parallel Algorithms in Scientific Computing
CS 770G - Parallel lgorithms in Scientific Computing Dense Matrix Computation II: Solving inear Systems May 28, 2001 ecture 6 References Introduction to Parallel Computing Kumar, Grama, Gupta, Karypis,
More informationA Parallel Implementation of the BDDC Method for Linear Elasticity
A Parallel Implementation of the BDDC Method for Linear Elasticity Jakub Šístek joint work with P. Burda, M. Čertíková, J. Mandel, J. Novotný, B. Sousedík Institute of Mathematics of the AS CR, Prague
More informationAlgorithms and Data Structures
Charles A. Wuethrich Bauhaus-University Weimar - CogVis/MMC June 22, 2017 1/51 Introduction Matrix based Transitive hull All shortest paths Gaussian elimination Random numbers Interpolation and Approximation
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationNAG Fortran Library Routine Document F07AAF (DGESV).1
NAG Fortran Library Routine Document Note: before using this routine, please read the Users Note for your implementation to check the interpretation of bold italicised terms and other implementation-dependent
More informationSIMULATING ARBITRARY-GEOMETRY ULTRASOUND TRANSDUCERS USING TRIANGLES
Jørgen Arendt Jensen 1 Paper presented at the IEEE International Ultrasonics Symposium, San Antonio, Texas, 1996: SIMULATING ARBITRARY-GEOMETRY ULTRASOUND TRANSDUCERS USING TRIANGLES Jørgen Arendt Jensen,
More informationOptimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
More information