Adaptive Parallel Exact Dense LU Factorization


1 Adaptive Parallel Exact Dense LU Factorization. Ziad Sultan, JNCF 2013, Université de Grenoble, May 15, 2013

2 Outline
1. Introduction
2. Exact Gaussian elimination: Gaussian elimination in numerical computation; exact Gaussian elimination; rank profile
3. Dense linear algebra: the optimized building block
4. Block generic full-rank matrices: tiled LU factorization; parallelization of block LU; speedup
5. Any rank profile: tiled CUP decomposition; parallelization with OpenMP; parallelization of block CUP with KAAPI
6. Perspectives
7. Conclusion


4 Gaussian elimination in dense computer algebra
Dense: benchmarking of supercomputers; the basis of linear algebra.
Sparse: large sparse matrix problems lead to smaller dense problems (still large!).
Sparse iterative methods: induce dense elimination on blocks of iterated vectors (Krylov, Lanczos, Smith normal form).
Sparse direct methods: switch to dense after fill-in [FGB].


6 Gaussian elimination in numerical computation
Pivoting strategies search for the best pivot:
- good numerical stability
- good data locality
Reducing the fill-in:
- reduces additional memory needs
- reduces induced computation costs

7 Applications of exact Gaussian elimination
- Exact rank: algebraic topology (Smith normal form)
- Rank profile: Gröbner basis computation [FGB], computational number theory [Stein]
- Characteristic polynomial: graph theory [G. Royle], coding theory, semi-fields

8 Rank profile
Row/column rank profile (definition): the lexicographically smallest sequence of r row/column indices such that the corresponding rows/columns of A are linearly independent.
Generic rank profile: if the first r leading principal minors of A are nonzero, then A has generic rank profile and the sequence {1, ..., r} is its row rank profile.
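A small worked example may help (an illustration added for this transcript, not from the slides). The matrix below has rank r = 2:

\[
A = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 2 \\ 1 & 0 & 0 \end{pmatrix}
\]

Rows 1 and 2 are linearly dependent, so the lexicographically smallest pair of independent row indices is (1, 3): the row rank profile is {1, 3}. Column 2 is zero, so the column rank profile is also {1, 3}. A does not have generic rank profile, since its 1 x 1 leading principal minor is 0.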


10 The optimized building block in dense linear algebra: matrix multiplication
Algorithmic complexity: Strassen O(n^{2.81}), ..., O(n^ω).
Optimized hardware implementation: pipelining, SSE, AVX, ...
Implementation: block versions with cache optimization reduce the dependency on bus speed, since computation is faster for blocks already loaded in cache; the blocking can be recursive, iterative, or cascading.
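As a concrete picture of such a block version, here is a minimal cache-blocked product C += A*B over Z/pZ (a sketch written for this transcript; FFLAS-FFPACK's fgemm instead delegates the tile products to numerical BLAS with delayed modular reductions):

#include <algorithm>
#include <cstddef>
#include <vector>

using Elt = unsigned long; // residues mod p; p^2 must fit in Elt

// C += A*B over Z/pZ on n x n row-major matrices. BS is a tuning
// parameter chosen so that three BS x BS tiles fit in cache.
void mul_blocked(std::size_t n, Elt p,
                 const std::vector<Elt>& A, const std::vector<Elt>& B,
                 std::vector<Elt>& C, std::size_t BS = 256) {
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // one tile product: the three tiles stay cache-resident
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        const Elt a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] = (C[i * n + j] + a * B[k * n + j]) % p;
                    }
}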

11 Gaussian elimination concerns
Same concerns as matrix multiplication, hence block versions:
- the implementation optimization benefits from fast matrix multiplication
- reduced dependency on bus speed (cache optimization)
- candidate versions adapted to parallel computing: tiled iterative implementation, block recursive implementation

12 Exact Gaussian elimination adapted for parallel computing: the block-version trade-off
Common point: fewer memory accesses when the block size fits the cache, about N^3/B memory accesses (N the matrix dimension, B the block size; see the count sketched below).
Trade-off: block recursive is more adaptive; tiled iterative needs fewer synchronizations.
Historically, it has been more difficult to parallelize recursive implementations with existing parallel programming models (OpenMP, ...).
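Where the N^3/B count comes from (a standard cache-blocking argument, spelled out here for the transcript): with B x B tiles, the elimination performs on the order of (N/B)^3 tile operations, and each tile operation touches O(B^2) entries that then stay in cache for their whole O(B^3) arithmetic work, so the memory traffic is

\[
\left(\frac{N}{B}\right)^{3} \cdot O(B^{2}) \;=\; O\!\left(\frac{N^{3}}{B}\right)
\]

memory accesses, against O(N^3) for the unblocked algorithm.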

13 State of the art
- Sequential exact: FFLAS-FFPACK, M4RI, within FGB
- Parallel numerical: ScaLAPACK, PLASMA-Quark
- Parallel exact: ?? (this work)


15 LU factorization of generic rank profile matrices: $A = L U P$
Applications of the LU decomposition:
- Solving a system: $A x = b$, i.e. $L (U P x) = b$; solve $L y = b$, then $U z = y$, then $x = P^{-1} z$.
- Rank: rank(A) is the number of rows of U.
- Inverse: $A^{-1} = P^{-1} U^{-1} L^{-1}$.
- Determinant: $\det(A) = \pm \det(U)$.
- Row or column rank profile: given by the positions of the row or column permutations.
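A small sketch of how the triangular factors are then used to solve $A x = b$ (an illustration over Z/pZ written for this transcript, for square nonsingular A with L unit lower triangular; not FFLAS-FFPACK code; inv_mod is a hypothetical helper):

#include <cstddef>
#include <vector>

using Elt = unsigned long; // residues mod p; p^2 must fit in Elt

Elt inv_mod(Elt p, Elt a); // hypothetical helper: a^{-1} mod p

// Solve A x = b for A = L*U*P over Z/pZ, with L unit lower triangular,
// U upper triangular, and P encoded so that (P*x)[i] = x[perm[i]].
std::vector<Elt> lup_solve(std::size_t n, Elt p,
                           const std::vector<Elt>& L,
                           const std::vector<Elt>& U,
                           const std::vector<std::size_t>& perm,
                           std::vector<Elt> b) {
    // forward substitution L y = b (unit diagonal: no division needed)
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            b[i] = (b[i] + p * p - L[i * n + j] * b[j] % p) % p;
    // back substitution U z = y
    for (std::size_t i = n; i-- > 0; ) {
        for (std::size_t j = i + 1; j < n; ++j)
            b[i] = (b[i] + p * p - U[i * n + j] * b[j] % p) % p;
        b[i] = b[i] * inv_mod(p, U[i * n + i]) % p;
    }
    // undo the permutation: x = P^{-1} z
    std::vector<Elt> x(n);
    for (std::size_t i = 0; i < n; ++i) x[perm[i]] = b[i];
    return x;
}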

16 Tiled iterative LU decomposition
LU decomposition on the first block: $A_{11} = L_1 U_1$.
Updates: $A_{21} \leftarrow A_{21} U_1^{-1}$; $A_{31} \leftarrow A_{31} U_1^{-1}$; $A_{12} \leftarrow L_1^{-1} A_{12}$; $A_{13} \leftarrow L_1^{-1} A_{13}$; $A_{22} \leftarrow A_{22} - A_{21} A_{12}$; ...

17 Tiled iterative LU decomposition
[Tile diagram: 3 x 3 grid of blocks A11 ... A33.]
Routines of the FFLAS-FFPACK library over a finite field Z/pZ (a toy rendering of the FTRSM update follows the diagram frames below):
- FTRSM (update blocks in the same column and the same row): $A_{ik} \leftarrow A_{ik} U_{kk}^{-1}$, $A_{ki} \leftarrow L_{kk}^{-1} A_{ki}$
- FGEMM (matrix multiplication, update the remaining blocks): $A_{ij} \leftarrow A_{ij} - A_{ik} A_{kj}$
- applyP (apply a permutation matrix): $A_{ik} \leftarrow A_{ik} P_{kk}^{-1}$

18 Tiled iterative LU decomposition
[Tile diagram after step 1: A11 factored into L1, U1; panels and trailing blocks updated to A'12, A'13, A'21, ..., A'33.]

19 Tiled iterative LU decomposition
[Tile diagram after step 2: L2, U2 on the second diagonal block; A''23, A''32, A''33 updated.]

20 Tiled iterative LU decomposition
[Tile diagram after step 3: L3, U3 complete the factorization.]
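For concreteness, here is a toy rendering of the FTRSM-style update $A_{ki} \leftarrow L_{kk}^{-1} A_{ki}$ over Z/pZ (a sketch written for this transcript, with the unit lower triangular factor stored in the strictly lower part of the diagonal tile; not the FFLAS-FFPACK kernel):

#include <cstddef>

using Elt = unsigned long;          // residues mod MOD; MOD^2 fits in Elt
static const Elt MOD = 1009;

// A_ki <- L_kk^{-1} A_ki by forward substitution, one column at a time.
// Tiles are bs x bs views into a row-major matrix of leading dimension lda.
void trsm_left_unit(const Elt* Lkk, Elt* Aki,
                    std::size_t bs, std::size_t lda) {
    for (std::size_t j = 0; j < bs; ++j)          // each column of A_ki
        for (std::size_t r = 0; r < bs; ++r)      // rows top to bottom
            for (std::size_t c = 0; c < r; ++c)   // subtract known rows
                Aki[r * lda + j] =
                    (Aki[r * lda + j] + MOD * MOD
                     - Lkk[r * lda + c] * Aki[c * lda + j] % MOD) % MOD;
}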

21 OpenMP parallel loops: synchronizations
[Fork-join execution timeline:]
LU(A11); then in parallel: applyP+FTRSM on A12, A21, A13, A31 and FGEMM on A22, A23, A32, A33; wait for all tasks (synchronization).
LU(A22); then in parallel: applyP+FTRSM on A32, A23 and FGEMM on A33; wait for all tasks (synchronization).
LU(A33).

22 OpenMP code (one elimination step per iteration of k):

for (k = 0; k < nBlocks; k++) {
    R = FFPACK::LUdivine(...);            // factor the diagonal block
    #pragma omp parallel shared(A, P)
    {
        #pragma omp for nowait
        for (i = k+1; i < nBlocks; i++)
            FFLAS::ftrsm(...);            // update blocks of row k
    }
    #pragma omp parallel for shared(A, P)
    for (i = k+1; i < nBlocks; i++) {
        FFPACK::applyP(...);              // apply the permutation of step k
        FFLAS::ftrsm(...);                // update blocks of column k
    }
    #pragma omp parallel for shared(A, P, T)
    for (i = k+1; i < nBlocks; i++)
        for (j = k+1; j < nBlocks; j++)
            FFLAS::fgemm(...);            // update the trailing blocks
}

23 KAAPI dataflow scheduling for the tiled LUP
[Dataflow execution timeline:] tasks start as soon as their inputs are ready, with no global barrier between elimination steps; for instance LU(A22) starts once applyP+FTRSM(A21), applyP+FTRSM(A12) and FGEMM(A22) have completed, while step-1 updates on other blocks (A31, A32, A33, ...) are still running.

24 KAAPI code (task annotations declare the data access modes):

for (int k = 0; k < nBlocks; k++) {
    #pragma kaapi task readwrite(&A) write(&P, &Q)
    R = FFPACK::LUdivine(...);            // factor the diagonal block
    for (int i = k+1; i < nBlocks; i++) {
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);                // update blocks of row k
    }
    for (int i = k+1; i < nBlocks; i++) {
        #pragma kaapi task readwrite(&A) read(&P)
        FFPACK::applyP(...);              // apply the permutation of step k
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);                // update blocks of column k
    }
    for (int i = k+1; i < nBlocks; i++)
        for (int j = k+1; j < nBlocks; j++) {
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::fgemm(...);            // update trailing block (i, j)
        }
}
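For comparison, the same dataflow pattern can be expressed today with OpenMP 4.0 task dependencies (a sketch written for this transcript, not part of the original talk; tileptr, factor_block, trsm_block and gemm_block are hypothetical wrappers around the block kernels):

#include <cstddef>
using Elt = unsigned long;

// hypothetical helpers, assumed implemented elsewhere:
Elt* tileptr(Elt* A, int i, int j);          // pointer to top-left of tile (i, j)
void factor_block(Elt* Akk);                 // LU of one diagonal tile
void trsm_block(const Elt* Akk, Elt* Aki);   // triangular update of one tile
void gemm_block(const Elt* Aik, const Elt* Akj, Elt* Aij); // Aij -= Aik*Akj

void tiled_lu_tasks(Elt* A, int nBlocks) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nBlocks; ++k) {
        Elt* Akk = tileptr(A, k, k);
        #pragma omp task depend(inout: Akk[0:1])
        factor_block(Akk);
        for (int i = k + 1; i < nBlocks; ++i) {
            Elt* Aki = tileptr(A, k, i);
            Elt* Aik = tileptr(A, i, k);
            #pragma omp task depend(in: Akk[0:1]) depend(inout: Aki[0:1])
            trsm_block(Akk, Aki);
            #pragma omp task depend(in: Akk[0:1]) depend(inout: Aik[0:1])
            trsm_block(Akk, Aik);
        }
        for (int i = k + 1; i < nBlocks; ++i)
            for (int j = k + 1; j < nBlocks; ++j) {
                Elt* Aik = tileptr(A, i, k);
                Elt* Akj = tileptr(A, k, j);
                Elt* Aij = tileptr(A, i, j);
                #pragma omp task depend(in: Aik[0:1], Akj[0:1]) depend(inout: Aij[0:1])
                gemm_block(Aik, Akj, Aij);
            }
    }   // the implicit barrier at the end of the single region waits for all tasks
}

As in the KAAPI version, each tile's top-left address serves as the dependency handle, so the runtime recovers the same task graph without global barriers.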

25 KAAPI vs OpenMP
Machine HPAC: Intel SandyBridge, 32 cores, 16384 KB L3 cache; field Z/1009Z.
[Plot: parallel overhead vs sequential for matrix dimension 10000 x 10000; timings (seconds) against number of cores for LUdivine (sequential), OpenMP LU BS=512, KAAPI LU BS=212, KAAPI LU BS=424.]

26 KAAPI version speed-up
[Plot: speed-up of the KAAPI and OpenMP versions against number of cores, matrix dimension 10000 x 10000; curves for KAAPI LU BS=212, KAAPI LU BS=424, OpenMP LU BS=512, and the ideal speed-up.]

27 Parallelization overhead on the LU algorithm
[Plot: timings (seconds) and gain factor 1 - KAAPI/OMP for KAAPI vs OpenMP on dense full-rank matrices (32 cores), matrix dimensions 2K to 20K; gain factor ranging from -30% to 100%.]


29 CUP decomposition (rank-deficient matrices): $A = C \cdot U \cdot P$
[Diagram: shapes of the factors C, U and P.]
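Sketched shapes, for r = rank(A) (dimension bookkeeping made explicit for this transcript, consistent with the slide's picture: C in column echelon form, U upper triangular, P a permutation):

\[
\underbrace{A}_{m \times n} \;=\; \underbrace{C}_{m \times r} \cdot \underbrace{U}_{r \times n} \cdot \underbrace{P}_{n \times n}
\]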

30 Block CUP decomposition
Block CUP offers less parallelism: some independent tasks are removed, and one big, costly sequential task remains.

31 Parallelization of block CUP with OpenMP
[Plot: speed-up against number of cores for CUP (n=10000, r=5000, block size 212) over Z/1009; OpenMP CUP speed-up compared with the ideal curve.]

32 Parallelization of block CUP with KAAPI: dynamic scheduling
The graph of task dependencies is computed at runtime. Dependencies between tasks are resolved according to the referent of each task; in this implementation the referent is the pointer to the block, i.e. the pointer to the upper-left element of each block.
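A minimal sketch of such a block referent (an illustration written for this transcript, not KAAPI's API), for a row-major matrix with leading dimension lda and block size bs:

#include <cstddef>
using Elt = unsigned long;

// Referent of tile (i, j): the address of its upper-left element.
// Two tasks touching the same tile get the same referent, which is
// what the dataflow runtime compares to build the dependency graph.
inline Elt* block_referent(Elt* A, std::size_t lda,
                           std::size_t i, std::size_t j, std::size_t bs) {
    return A + (i * bs) * lda + (j * bs);
}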

33 Parallelization of block CUP with KAAPI: static scheduling
The graph of task dependencies is precomputed before execution (faster).
[Diagram: a block X written by several tasks.] X is a task parameter, set in CW mode; CW mode for static scheduling is not yet available in the current KAAPI version.


35 Quad-recursive PLUQ algorithm
Recursive cutting along both rows and columns. The blocks can be permuted so that the row rank profile and the column rank profile are obtained at the same time, and in fact the rank profile of every leading submatrix. [Dumas, Pernet, Sultan, ISSAC 2013]
[Slides 36-42 repeat this slide, animating the recursive block permutations on an example.]


44 Conclusion
Parallelization in exact computation:
- Trade-off: tiled iterative vs block recursive <=> fewer synchronizations vs more adaptivity
- Specificity of exact vs numerical: rank and rank profile bring new issues and trade-offs compared with numerical and parallel numerical work
- Dataflow synchronization for LUP: better adaptivity, more parallelism
- Dynamic scheduling for CUP: dynamic block size, parallelism?
- PLUQ: a new algorithm to parallelize: recursive, tiled?

45 Thank you for your attention!
