LAPACK: Linear Algebra PACKage. Janice Giudice, David Knezevic

Motivating Question
Recalling from last week:
Level 1 BLAS: vector ops
Level 2 BLAS: matrix-vector ops, O(n^2) flops on O(n^2) data
Level 3 BLAS: matrix-matrix ops, O(n^3) flops on O(n^2) data
Maximize the ratio of flops to memory references. Can we provide portable software for computations in dense linear algebra that is efficient on a wide range of modern high-performance computers?
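
To make the ratio concrete, here is a minimal free-form Fortran sketch contrasting a Level 1 call with a Level 3 call (the program name, sizes, and values are illustrative, and a BLAS library is assumed to be linked, e.g. with -lblas): DAXPY performs about 2n flops on O(n) data, while DGEMM performs about 2n^3 flops on O(n^2) data, so a blocked algorithm built on DGEMM reuses each matrix entry roughly n times.

program blas_levels
  implicit none
  integer, parameter :: n = 200
  double precision :: x(n), y(n), a(n,n), b(n,n), c(n,n)
  x = 1d0; y = 2d0; a = 1d0; b = 1d0; c = 0d0
  ! Level 1 BLAS: y := 2*x + y, about 2n flops on O(n) data
  call daxpy(n, 2d0, x, 1, y, 1)
  ! Level 3 BLAS: C := 1*A*B + 0*C, about 2n^3 flops on O(n^2) data
  call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 0d0, c, n)
  print *, y(1), c(1,1)   ! expect 4.0 and 200.0
end program blas_levels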

LINPACK and EISPACK
LINPACK (1970s and early 1980s) is a numerical subroutine library, written in Fortran, for solving linear equations and least-squares problems and for finding singular values. EISPACK (1972-1973) is a collection of double-precision Fortran subroutines that compute the eigenvalues and eigenvectors of several classes of matrices. MATLAB started its life in the late 1970s as an interactive calculator built on top of LINPACK and EISPACK, which were then state-of-the-art Fortran subroutine libraries for matrix computation. LINPACK and EISPACK are portable... but inefficient. Why?

LINPACK and EISPACK...
Because their memory access patterns disregard the multi-layered memory hierarchies of the machines (recall Stef's mountain), they spend too much time moving data instead of doing useful floating-point operations. That is, they are built on the vector operation kernels of the Level 1 BLAS.

What is LAPACK?
LAPACK, designed to supersede LINPACK and EISPACK, addresses their inefficiency by reorganizing the algorithms to use block matrix operations in the innermost loops, i.e. maximizing calls to the Level 3 BLAS. The aims: efficiency, functionality, improved algorithms; vectorization, reduced data movement, parallelism. LAPACK is a portable, modern (1990) library of Fortran 77 subroutines for solving the most commonly occurring problems in numerical linear algebra: linear systems of equations, eigenvalues and eigenvectors, linear least squares, and the singular value decomposition. It is freely available on netlib, with pre-built libraries for a variety of architectures.

The main point is... The (higher-level) BLAS are the building blocks of LAPACK.

Structure of LAPACK
Driver routines: solve standard types of problems, e.g. linear systems.
Computational routines: perform a distinct computational task, e.g. an LU factorisation.
Auxiliary routines: perform a subtask or common low-level computation, e.g. a matrix norm.

DRIVER ROUTINES
systems of linear equations
least-squares solutions of linear systems: overdetermined, underdetermined, rank deficient
eigenvalue problems: symmetric, nonsymmetric
singular value problems

COMPUTATIONAL ROUTINES
matrix factorizations: LU, Cholesky, QR (many variants: generalised, implicitly shifted, root free...), LQ, SVD, Schur, generalized Schur
back substitution
condition number estimation
error bound computation

COMPUTATIONAL ROUTINES
eigenvalues and eigenvectors: Cuppen's divide and conquer, relatively robust representation (RRR), bisection, inverse iteration
matrix balancing
invariant subspaces & condition numbers: solve the Sylvester matrix equation, condition numbers of eigenvalues/eigenvectors of a matrix in Schur form, move eigenvalues within a matrix in Schur form

Data Types and Precision
Real / complex: most computations have matching routines for both.
Single / double precision: all computations have matching routines for both.

Matrix Types
Dense and banded. But... not general sparse matrices.

Routine Naming: XYYZZZ
X, the data type: S single, D double, C complex, Z double complex
YY, the matrix type: DI diagonal, BD bidiagonal, GE general, OR orthogonal
ZZZ, the computation: SVD singular value decomposition, LLS linear least squares, CON condition number, TRF factorize
Example: DGESVD computes the singular value decomposition of a general matrix in double precision.

Driver Routines
Simple: performs the basic operation requested, i.e. solves a complete problem, like finding the eigenvalues of a matrix or solving a set of linear equations.
Expert (requires roughly twice the storage of the simple driver): additional options, e.g. condition number estimation, checks for near-singularity, iterative refinement, error bounds.

Documentation
Just like the BLAS routines, the LAPACK routines are virtually self-explanatory: details of the input and output parameters for any given subroutine are contained in the header section of the source file. David now takes you through the handout...

SUBROUTINE SGESV( N, NRHS, A, LDA, IPIV, B, LDB, INFO )

Benchmarks
Each LAPACK driver was benchmarked against its MATLAB equivalent:
Linear equations, dense: DGESV vs backslash (LU)
Linear equations, tridiagonal: DGTSV vs sparse backslash
Linear least squares, overdetermined: DGELS vs qr
Singular value decomposition, square matrices: DGESVD vs svd
Eigenproblem, nonsymmetric: DGEEV vs eig

Benchmark Procedure

for k = 2 : maxdim
    n = k^2
    loops = 0
    start timing
    while time < maxtime
        loops = loops + 1
        CALL PROCEDURE(n)
    end
    T(k) = time / loops
end

Linear Equations
Ax = b, where A is an n-by-n matrix, b is an n-by-1 vector, and x is an n-by-1 vector.
DGESV: LU factorisation.
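
A minimal free-form Fortran sketch of this driver (the program name, the 3-by-3 matrix, and the right-hand side are illustrative; link with e.g. -llapack): DGESV factorises A with partial pivoting and solves in one call, overwriting A with its LU factors and b with the solution x.

program solve_demo
  implicit none
  integer, parameter :: n = 3, nrhs = 1
  integer :: ipiv(n), info
  double precision :: a(n,n), b(n,nrhs)
  ! Illustrative symmetric tridiagonal test matrix (column-major fill).
  a = reshape([2d0, 1d0, 0d0,  &
               1d0, 3d0, 1d0,  &
               0d0, 1d0, 2d0], [n, n])
  b(:,1) = [1d0, 2d0, 3d0]
  call dgesv(n, nrhs, a, n, ipiv, b, n, info)
  if (info == 0) then
    print *, 'x =', b(:,1)
  else
    print *, 'DGESV failed, info =', info
  end if
end program solve_demo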

DGESV RESULTS

DGTSV RESULTS

Linear Least Squares
min_x ||b - Ax||_2, where A is an m-by-n matrix, b is an m-by-1 vector, and x is an n-by-1 vector. When m >= n and rank(A) = n, the system is overdetermined.
DGELS: QR or LQ factorisation.
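
A minimal sketch of the DGELS call for the overdetermined case (the 4-by-2 line-fitting data and the workspace size are illustrative): with TRANS = 'N' and m >= n, DGELS solves the least-squares problem via a QR factorisation of A and returns the solution in the first n rows of b.

program lls_demo
  implicit none
  integer, parameter :: m = 4, n = 2, nrhs = 1, lwork = 64
  integer :: info
  double precision :: a(m,n), b(m,nrhs), work(lwork)
  a(:,1) = [1d0, 1d0, 1d0, 1d0]          ! column of ones (intercept)
  a(:,2) = [1d0, 2d0, 3d0, 4d0]          ! abscissae (slope)
  b(:,1) = [2.1d0, 3.9d0, 6.1d0, 8.0d0]  ! illustrative observations
  call dgels('N', m, n, nrhs, a, m, b, m, work, lwork, info)
  print *, 'least-squares fit (intercept, slope):', b(1,1), b(2,1)
end program lls_demo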

DGELS RESULTS

Singular Value Decomposition
A = UΣV^T, where A is an m-by-n matrix, U is an m-by-m orthogonal matrix, V is an n-by-n orthogonal matrix, and Σ is an m-by-n diagonal matrix containing the singular values of A. Let m = n.
DGESVD.
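
A minimal sketch of the DGESVD call for a square matrix (the 3-by-3 matrix and the workspace size are illustrative): JOBU = JOBVT = 'A' requests all of U and V^T, A is destroyed, and the singular values come back in descending order in s.

program svd_demo
  implicit none
  integer, parameter :: n = 3, lwork = 128
  integer :: info
  double precision :: a(n,n), s(n), u(n,n), vt(n,n), work(lwork)
  a = reshape([1d0, 4d0, 7d0,  &
               2d0, 5d0, 8d0,  &
               3d0, 6d0, 10d0], [n, n])
  call dgesvd('A', 'A', n, n, a, n, s, u, n, vt, n, work, lwork, info)
  print *, 'singular values:', s
end program svd_demo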

DGESVD RESULTS

Nonsymmetric Eigenproblem
Au = λu, where A is an n-by-n matrix, u is an n-by-1 eigenvector, and λ is the corresponding eigenvalue.
DGEEV.
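
A minimal sketch of the DGEEV call (the test matrix is illustrative: a 3-by-3 cyclic permutation, whose eigenvalues are the cube roots of unity): since a real nonsymmetric matrix can have complex eigenvalues, they are returned as real/imaginary pairs (wr(i), wi(i)).

program eig_demo
  implicit none
  integer, parameter :: n = 3, lwork = 128
  integer :: info
  double precision :: a(n,n), wr(n), wi(n), vl(1,1), vr(n,n), work(lwork)
  ! Cyclic permutation: A*e1 = e2, A*e2 = e3, A*e3 = e1.
  a = reshape([0d0, 1d0, 0d0,  &
               0d0, 0d0, 1d0,  &
               1d0, 0d0, 0d0], [n, n])
  ! 'N','V': skip left eigenvectors, compute right eigenvectors.
  call dgeev('N', 'V', n, a, n, wr, wi, vl, 1, vr, n, work, lwork, info)
  print *, 'Re(lambda):', wr
  print *, 'Im(lambda):', wi
end program eig_demo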

DGEEV RESULTS

Computer Suitability
Designed to give high efficiency on: vector processors, high-performance superscalar workstations, shared-memory processors.
Satisfactory on all scalar machines: PCs, workstations, mainframes.

ScaLAPACK
A distributed-memory version of LAPACK for parallel architectures: massively parallel SIMD machines and distributed-memory machines.

CLAPACK
CLAPACK's goal is to provide LAPACK to users who do not have access to a Fortran compiler. The CLAPACK library was built using f2c, a Fortran-to-C conversion utility: the entire Fortran 77 LAPACK library is run through f2c to obtain C code, which is then modified to improve readability. Each routine becomes a C function, e.g. dgesv_, taking all of its arguments by pointer, mirroring Fortran's call-by-reference convention.

LAPACK Flop Counting
So far we have presented only timing results, but how do we measure flops and flops/s? Easy for some operations: e.g. for N-by-N matrices A and B, A*B requires 2N^3 flops, and LU factorisation requires 0.67N^3 flops. But the flop counts for eigenvalue computations or the SVD depend on the input data.

LAPACK Flop Counting
The best way to count flops would be to use hardware counters, a standard feature on modern processors for software profiling. Accessing these counters can be difficult: they are often poorly documented, unstable, or inaccessible to user-level software. Hence cross-platform software for hardware flop counting has traditionally been limited.

LAPACK Flop Counting
The PAPI project aims to define a standard interface to hardware counters to simplify profiling software; see "A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters", J. Dongarra et al. We did not do direct flop counting, as we could not find any simple software for LAPACK. E.g. MATLAB:

>> help flops
FLOPS Obsolete floating point operation count.
Earlier versions of MATLAB counted the number of floating point operations. With the incorporation of LAPACK in MATLAB 6, this is no longer practical.

Synthetic Flops
Alternative approach: use the standard flop counts for the various algorithms.
DGESV (LU factorise): 0.67N^3
DGEEV (eigenvalues and eigenvectors): 26.67N^3
DGESVD (SVD): 6.67N^3
See http://www.netlib.org/lapack/lug/node71.html
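
As a sketch of how these synthetic counts are used (the size and timing below are made-up numbers, not measured results): divide the standard count by the measured time per call to get a synthetic flop rate.

program flop_rate
  implicit none
  ! Hypothetical example: DGESV at n = 1000 timed at t seconds per call.
  double precision :: n, t
  n = 1000d0
  t = 0.05d0   ! illustrative timing, not a measured result
  print *, 'synthetic Gflop/s:', 0.67d0 * n**3 / t / 1d9
end program flop_rate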

DGESV

DGEEV

DGESVD

Spectral Methods
LAPACK capabilities are illustrated by a Fortran implementation of cheb, p13, p14, and p15 from Spectral Methods in MATLAB by L. N. Trefethen...