Adaptive Parallel Exact dense LU factorization
1 Adaptive Parallel Exact dense LU factorization. Ziad SULTAN, 15 May 2013, JNCF 2013, Université de Grenoble
2 Outline
1 Introduction
2 Exact Gaussian elimination: Gaussian elimination in numerical computation; exact Gaussian elimination; rank profile
3 Dense linear algebra: optimized building blocks
4 Block generic full rank matrices: tiled LU factorization; parallelization of block LU; speedup
5 Any rank profile: tiled CUP decomposition; parallelization with OpenMP; parallelization of block CUP with KAAPI
6 Perspectives
7 Conclusion
4 Gaussian elimination in dense computer algebra
Dense: benchmarking of supercomputers; the basis of linear algebra.
Sparse: large sparse matrix problems reduce to smaller (still large!) dense problems.
Sparse iterative: induces dense elimination on blocks of iterated vectors (Krylov, Lanczos, Smith normal form).
Sparse direct: switch to dense after fill-in [FGB].
6 Gaussian elimination in numerical computation
Pivoting strategies: search for the best pivot, for good numerical stability and good data locality.
Reduce the fill-in: reduce additional memory needs and induced computation costs.
7 Applications of exact Gaussian elimination
Exact rank: algebraic topology (Smith normal form).
Rank profile: Gröbner basis computation [FGB]; computational number theory [Stein].
Characteristic polynomial: graph theory [G. Royle]; coding theory; semi-fields.
8 Rank profile
Definition (row/column rank profile): the lexicographically smallest sequence of r row/column indices such that the corresponding rows/columns of A are linearly independent.
A matrix has generic rank profile if its first r leading principal minors are nonzero; the row rank profile of a generic rank profile matrix is the sequence {1,...,r}.
10 Optimized building block in dense linear algebra: matrix multiplication
Algorithmic complexity: Strassen O(n^2.81), ..., O(n^ω).
Optimized hardware implementation: pipelining, SSE, AVX, ...
Implementation: block versions with cache optimization, reducing dependency on bus speed (faster computation for blocks loaded in cache); recursive, iterative, or cascading.
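As a concrete illustration of the sub-cubic complexity mentioned above, one level of Strassen's scheme multiplies 2x2 operands with 7 multiplications instead of 8, which recursively yields the O(n^2.81) bound. The scalar sketch below (hypothetical helper strassen2x2, not from any library) shows the classical formulas; real implementations apply them to matrix blocks and cascade to a classical base case below a threshold.

```cpp
#include <array>
#include <cassert>

// One level of Strassen's 2x2 scheme: 7 products p1..p7 instead of 8.
// Operands and result are row-major {a, b, c, d} for [a b; c d].
std::array<double, 4> strassen2x2(const std::array<double, 4>& A,
                                  const std::array<double, 4>& B) {
    double a = A[0], b = A[1], c = A[2], d = A[3];
    double e = B[0], f = B[1], g = B[2], h = B[3];
    double p1 = a * (f - h);
    double p2 = (a + b) * h;
    double p3 = (c + d) * e;
    double p4 = d * (g - e);
    double p5 = (a + d) * (e + h);
    double p6 = (b - d) * (g + h);
    double p7 = (a - c) * (e + f);
    return { p5 + p4 - p2 + p6,   // C11 = ae + bg
             p1 + p2,             // C12 = af + bh
             p3 + p4,             // C21 = ce + dg
             p1 + p5 - p3 - p7 }; // C22 = cf + dh
}
```

Applied recursively to half-size blocks, the 7 recursive products dominate, giving cost T(n) = 7 T(n/2) + O(n^2) = O(n^log2(7)).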
11 Gaussian elimination concerns
Same concerns as matrix multiplication, hence block versions: the implementation benefits from optimized matrix multiplication, reduces dependency on bus speed (cache optimization), and the best versions can be adapted for parallel computing.
Candidates: tiled iterative implementation; block recursive implementation.
12 Exact Gaussian elimination adapted for parallel computing: the block-version trade-off
Common point: fewer memory accesses when the block fits in cache, about N^3/B memory accesses (N the matrix dimension, B the block size).
Trade-off: block recursive is more adaptive; tiled iterative needs fewer synchronizations.
Historically, recursive implementations are harder to parallelize with existing parallel programming models (OpenMP, ...).
13 State of the art
Sequential exact: FFLAS-FFPACK, M4RI, within FGB.
Parallel numeric: ScaLAPACK, PLASMA-Quark.
Parallel exact: ?? (this work).
15 LU factorization of generic rank profile matrices
A = L.U.P
Applications of the LU decomposition:
Solving a system A.x = b: write L.(U.P.x) = b, solve L.y = b then U.P.x = y.
Rank: rank(A) is the number of rows of U.
Inverse of A: A^{-1} = P^{-1}.U^{-1}.L^{-1}.
Determinant: det(A) = ± det(U).
Row or column rank profile: given by the positions of the row or column permutations.
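The solving application above can be sketched in a few lines. The hypothetical helper lu_solve below (an illustration, not a library routine) factors A in place without pivoting, which is valid exactly for generic rank profile matrices since their leading principal minors are nonzero, then solves L.y = b followed by U.x = y. It uses doubles for simplicity; the exact routines do the same over Z/pZ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place LU without pivoting (A must have generic rank profile),
// then solve A x = b: forward substitution with unit-diagonal L,
// backward substitution with U. Result overwrites b.
void lu_solve(std::vector<std::vector<double>>& A, std::vector<double>& b) {
    size_t n = A.size();
    for (size_t k = 0; k < n; ++k)              // factor: A <- {L\U}
        for (size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];                 // multiplier, stored in L part
            for (size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];   // Schur complement update
        }
    for (size_t i = 0; i < n; ++i)              // forward: L y = b (unit L)
        for (size_t j = 0; j < i; ++j) b[i] -= A[i][j] * b[j];
    for (size_t i = n; i-- > 0; ) {             // backward: U x = y
        for (size_t j = i + 1; j < n; ++j) b[i] -= A[i][j] * b[j];
        b[i] /= A[i][i];
    }
}
```

The same factorization also delivers the other applications on the slide: the diagonal of U gives the determinant, and its number of nonzero rows the rank.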
16 Tiled iterative LU decomposition
LU decomposition of the first block: A11 = L1.U1.
Updates: A21 = A21.U1^{-1}; A31 = A31.U1^{-1}; A12 = L1^{-1}.A12; A13 = L1^{-1}.A13; A22 = A22 - A21.A12; ...
17 Tiled iterative LU decomposition
[Figure: 3x3 tiling A11 A12 A13 / A21 A22 A23 / A31 A32 A33]
Routines of the FFLAS-FFPACK library over a finite field Z/pZ:
FTRSM (update blocks in the same column and same row): Aik = Aik.Ukk^{-1}; Aki = Lkk^{-1}.Aki.
FGEMM (matrix multiplication, update remaining blocks): Aij = Aij - Aik.Akj.
applyP (apply the permutation matrix): Aik = Aik.Pkk^{-1}.
21 OpenMP parallel loop synchronizations
[Task timeline: LU(A11); then in parallel ApplyP+FTRSM(A12), ApplyP+FTRSM(A21), ApplyP+FTRSM(A13), ApplyP+FTRSM(A31), FGEMM(A32), FGEMM(A22), FGEMM(A23), FGEMM(A33); wait for all tasks (synchronization); LU(A22); ApplyP+FTRSM(A32), ApplyP+FTRSM(A23), FGEMM(A33); wait for all tasks (synchronization); LU(A33).]
22
for (k = 0; k < nblocks; k++) {
    R = FFPACK::LUdivine(...);
    #pragma omp parallel shared(A, P)
    {
        #pragma omp for nowait
        for (i = k+1; i < nblocks; i++)
            FFLAS::ftrsm(...);
    }
    #pragma omp parallel for shared(A, P)
    for (i = k+1; i < nblocks; i++) {
        FFPACK::applyP(...);
        FFLAS::ftrsm(...);
    }
    #pragma omp parallel for shared(A, P, T)
    for (i = k+1; i < nblocks; i++) {
        #pragma omp parallel for shared(A)
        for (j = k+1; j < nblocks; j++)
            FFLAS::fgemm(...);
    }
}
23 KAAPI dataflow scheduling for tiled LUP
[Task dependency graph, no global barriers: LU(A11) feeds ApplyP+FTRSM(A12), ApplyP+FTRSM(A13), ApplyP+FTRSM(A21), ApplyP+FTRSM(A31); these feed FGEMM(A22), FGEMM(A23), FGEMM(A32), FGEMM(A33); LU(A22) starts as soon as A22 is updated, and so on until LU(A33).]
24
for (int k = 0; k < nblocks; k++) {
    #pragma kaapi task readwrite(&A) write(&P, &Q)
    R = FFPACK::LUdivine(...);
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);
    }
    for (int i = k+1; i < nblocks; i++) {
        #pragma kaapi task readwrite(&A) read(&P)
        FFPACK::applyP(...);
        #pragma kaapi task readwrite(&A) read(&A)
        FFLAS::ftrsm(...);
    }
    for (int i = k+1; i < nblocks; i++) {
        for (int j = k+1; j < nblocks; j++) {
            #pragma kaapi task readwrite(&A) read(&A)
            FFLAS::fgemm(...);
        }
    }
}
25 KAAPI vs OpenMP
Machine HPAC: Intel SandyBridge E, 32 cores, L3 cache (16384 KB); field Z/1009Z.
[Plot: parallel overhead vs sequential for matrix dimension 10000x10000; timings (seconds) vs number of cores, for LUdivine (sequential), OpenMP LU BS=512, KAAPI LU BS=212, KAAPI LU BS=424.]
26 KAAPI version speed-up
[Plot: speed-up of KAAPI and OpenMP vs number of cores, matrix dimension 10000x10000; curves for KAAPI LU BS=212, KAAPI LU BS=424, OpenMP LU BS=512, and the ideal speed-up.]
27 Parallelization overhead of the LU algorithm
[Plot: timings (seconds) for OpenMP and KAAPI, and gain factor 1 - KAAPI/OpenMP, on dense full rank matrices (32 cores), matrix dimensions 2K to 20K; gain factor scale from -30% to 100%.]
29 CUP decomposition (rank-deficient matrices)
A = C.U.P [figure: block shapes of the factors C, U and the permutation P]
30 Block CUP decomposition
Block CUP offers less parallelism: some independent tasks are removed, and there is one big, costly sequential task.
31 Parallelization of block CUP with OpenMP
[Plot: speed-up vs number of cores for CUP (n=10000, R=5000, block size 212) over Z/1009Z; curves for the OpenMP CUP speed-up and the ideal.]
32 Parallelization of block CUP with KAAPI: dynamic scheduling
The task dependency graph is computed at runtime. Dependencies between tasks are determined by the referent of each task; in this implementation, the referent is the pointer to the block, i.e. the pointer to the upper-left corner of each block.
33 Parallelization of block CUP with KAAPI: static scheduling
The task dependency graph is precomputed before execution (faster). X is a task parameter, set in CW mode; CW mode for static scheduling is not yet available in the current KAAPI version.
35 Quad-recursive PLUQ algorithm
Recursive cutting along both rows and columns. The blocks can be permuted so that the row rank profile and the column rank profile are obtained at the same time, together with the rank profile of all leading sub-matrices [Dumas, Pernet, Sultan, ISSAC 2013].
44 Conclusion
Parallelization in exact computation. Trade-off: (tiled, block) <=> (adaptive, fewer synchronizations). Specificity of exact vs numeric: rank and rank profile.
New issues and trade-offs vs numeric and parallel numeric: dataflow synchronization; LUP: better adaptivity, more parallelism; PLUQ: dynamic scheduling; CUP: dynamic block size, parallelism?
New algorithms to parallelize: recursive, tiled?
45 Thank you for your attention!
More informationChallenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,
More informationOpenMP Doacross Loops Case Study
National Aeronautics and Space Administration OpenMP Doacross Loops Case Study November 14, 2017 Gabriele Jost and Henry Jin www.nasa.gov Background Outline - The OpenMP doacross concept LU-OMP implementations
More informationCOSC6365. Introduction to HPC. Lecture 21. Lennart Johnsson Department of Computer Science
Introduction to HPC Lecture 21 Department of Computer Science Most slides from UC Berkeley CS 267 Spring 2011, Lecture 12, Dense Linear Algebra (part 2), Parallel Gaussian Elimination. Jim Demmel Dense
More informationCase Study: Matrix Multiplication. 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017
Case Study: Matrix Multiplication 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017 1 4k-by-4k Matrix Multiplication Version Implementation Running time (s) GFLOPS Absolute
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationGPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com
GPU Acceleration of Matrix Algebra Dr. Ronald C. Young Multipath Corporation FMS Performance History Machine Year Flops DEC VAX 1978 97,000 FPS 164 1982 11,000,000 FPS 164-MAX 1985 341,000,000 DEC VAX
More informationParallelism paradigms
Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationCS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010
CS4961 Parallel Programming Lecture 14: Reasoning about Performance Administrative: What s Coming Programming assignment 2 due Friday, 11:59PM Homework assignment out on Tuesday, Oct. 19 and due Monday,
More informationGetting the most out of your CPUs Parallel computing strategies in R
Getting the most out of your CPUs Parallel computing strategies in R Stefan Theussl Department of Statistics and Mathematics Wirtschaftsuniversität Wien July 2, 2008 Outline Introduction Parallel Computing
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop
More informationAccelerating the Iterative Linear Solver for Reservoir Simulation
Accelerating the Iterative Linear Solver for Reservoir Simulation Wei Wu 1, Xiang Li 2, Lei He 1, Dongxiao Zhang 2 1 Electrical Engineering Department, UCLA 2 Department of Energy and Resources Engineering,
More informationGRAPH CENTERS USED FOR STABILIZATION OF MATRIX FACTORIZATIONS
Discussiones Mathematicae Graph Theory 30 (2010 ) 349 359 GRAPH CENTERS USED FOR STABILIZATION OF MATRIX FACTORIZATIONS Pavla Kabelíková Department of Applied Mathematics FEI, VSB Technical University
More informationA Static Cut-off for Task Parallel Programs
A Static Cut-off for Task Parallel Programs Shintaro Iwasaki, Kenjiro Taura Graduate School of Information Science and Technology The University of Tokyo September 12, 2016 @ PACT '16 1 Short Summary We
More informationOpenMP * Past, Present and Future
OpenMP * Past, Present and Future Tim Mattson Intel Corporation Microprocessor Technology Labs timothy.g.mattson@intel.com * The name OpenMP is the property of the OpenMP Architecture Review Board. 1 OpenMP
More informationBOOLEAN MATRIX FACTORISATIONS & DATA MINING. Pauli Miettinen 6 February 2013
BOOLEAN MATRIX FACTORISATIONS & DATA MINING Pauli Miettinen 6 February 2013 In the sleepy days when the provinces of France were still quietly provincial, matrices with Boolean entries were a favored occupation
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationswsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu
swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture
More informationEE382N (20): Computer Architecture - Parallelism and Locality Lecture 13 Parallelism in Software IV
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 13 Parallelism in Software IV Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality (c) Rodric Rabbah, Mattan
More informationCommunication-efficient parallel generic pairwise elimination
Communication-efficient parallel generic pairwise elimination Alexander Tiskin Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom Abstract The model of bulk-synchronous
More informationParallelization of Graph Isomorphism using OpenMP
Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to
More informationLARP / 2018 ACK : 1. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates
Triangular Factors and Row Exchanges LARP / 28 ACK :. Linear Algebra and Its Applications - Gilbert Strang 2. Autar Kaw, Transforming Numerical Methods Education for STEM Graduates Then there were three
More informationNUMERICAL PARALLEL COMPUTING
Lecture 4: More on OpenMP http://people.inf.ethz.ch/iyves/pnc11/ Peter Arbenz, Andreas Adelmann Computer Science Dept, ETH Zürich, E-mail: arbenz@inf.ethz.ch Paul Scherrer Institut, Villigen E-mail: andreas.adelmann@psi.ch
More informationShared Memory Programming Model
Shared Memory Programming Model Ahmed El-Mahdy and Waleed Lotfy What is a shared memory system? Activity! Consider the board as a shared memory Consider a sheet of paper in front of you as a local cache
More informationBrief notes on setting up semi-high performance computing environments. July 25, 2014
Brief notes on setting up semi-high performance computing environments July 25, 2014 1 We have two different computing environments for fitting demanding models to large space and/or time data sets. 1
More informationAutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming
AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming David Pfander, Malte Brunn, Dirk Pflüger University of Stuttgart, Germany May 25, 2018 Vancouver, Canada, iwapt18 May 25, 2018 Vancouver,
More informationCommunication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
ommunication-voiding Parallel Sparse-Dense Matrix-Matrix Multiplication Penporn Koanantakool 1,2, riful zad 2, ydin Buluç 2,Dmitriy Morozov 2, Sang-Yun Oh 2,3, Leonid Oliker 2, Katherine Yelick 1,2 1 omputer
More informationSparse Matrices and Graphs: There and Back Again
Sparse Matrices and Graphs: There and Back Again John R. Gilbert University of California, Santa Barbara Simons Institute Workshop on Parallel and Distributed Algorithms for Inference and Optimization
More informationMATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix.
MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix. Row echelon form A matrix is said to be in the row echelon form if the leading entries shift to the
More informationIntroduction to Parallel Computing
Introduction to Parallel Computing W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. ethz.ch P. Arbenz Institute for Scientific Computing Department Informatik,
More information