A Standard for Batching BLAS Operations

1 A Standard for Batching BLAS Operations. Jack Dongarra, University of Tennessee, Oak Ridge National Laboratory, University of Manchester. 5/8/16

2 API for Batching BLAS Operations. We are proposing, as a community standard, an API for Batched Basic Linear Algebra Operations. The focus is on multiple independent BLAS operations: think of many small matrices that are operated on in a single routine. The goal is to be more efficient and portable on multi/manycore and GPU hardware systems. We can show 2x speedup and 3x better energy efficiency.

3 Definition. Think of making one call to a function and having that operation performed in parallel across the collection. For the matrix multiply C = C + A * B, this means computing C_i = C_i + A_i * B_i for every matrix triple i in the collection.
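To make the definition concrete, here is a minimal C sketch of what one batched call computes. The routine name `batched_gemm_accumulate`, the square n-by-n shape, and the column-major layout are assumptions for illustration only; this is not the proposed API, and a real implementation would run the independent iterations of the outer loop in parallel rather than sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not the proposed API).
 * For every b in the batch, computes C[b] = C[b] + A[b] * B[b],
 * where all matrices are n-by-n and stored column-major. */
void batched_gemm_accumulate(int n, double **A, double **B, double **C,
                             int batch_count)
{
    assert(n >= 0 && batch_count >= 0);

    /* Each iteration of this loop touches only A[b], B[b], C[b], so the
     * iterations are independent and could be scheduled in parallel. */
    for (int b = 0; b < batch_count; ++b) {
        for (int j = 0; j < n; ++j) {
            for (int p = 0; p < n; ++p) {
                double bpj = B[b][p + (size_t)j * n];
                for (int i = 0; i < n; ++i)
                    C[b][i + (size_t)j * n] += A[b][i + (size_t)p * n] * bpj;
            }
        }
    }
}
```

The caller passes arrays of pointers, one pointer per matrix, so the matrices of a batch do not need to be contiguous in memory.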

4 Definition. The matrices in the collection can be of different sizes. Internally, the Batched BLAS (BBLAS) can then schedule the operations to optimize use of the hardware. NVIDIA is already doing a good job of this and has a subset of the BLAS implemented. Intel has an implementation of batched DGEMM only.

5 Dense Linear Algebra
Software/algorithms follow hardware evolution in time:
LINPACK (70s): vector operations. Relies on Level-1 BLAS operations.
LAPACK (80s): blocking, cache friendly. Relies on Level-3 BLAS operations.
ScaLAPACK (90s): distributed memory. Relies on PBLAS and message passing.
PLASMA (00s): new algorithms, many-core friendly. Relies on a DAG/scheduler, block data layout, and some extra kernels (BLAS on tiles + DAG scheduling).
MAGMA: hybrid algorithms, heterogeneity friendly. BLAS tasking + (CPU / GPU / Xeon Phi) hybrid scheduling.
Examples: classic fork-join BLAS model
    for (int i = 0; ) {
        // factor a panel (sequential LAPACK)
        DGETRF2( );
        // backward swap (sequential LAPACK)
        DLASWP( );
        // forward swap (sequential LAPACK)
        DLASWP( );
        // triangular solve (parallel BLAS)
        DTRSM( );
        // matrix multiply (parallel BLAS)
        DGEMM( );
    }
These models work well for large DLA problems.

6 Examples: Need for Batched Routines in Numerical LA
[e.g., sparse direct multifrontal methods, preconditioners for sparse iterative methods, tiled algorithms in dense linear algebra, etc.] [collaboration with Tim Davis et al., Texas A&M University]
Sparse/dense matrix system; DAG-based factorization.
To capture the main LA patterns needed in a numerical library for batched LA:
LU, QR, or Cholesky on small diagonal matrices
TRSMs, QRs, or LUs
TRSMs, TRMMs
Updates (Schur complement): GEMMs, SYRKs, TRMMs
Example matrix from quantum chromodynamics (nz = 6716), reordered and ready for a sparse direct multifrontal solver. Diagonal blocks can be handled in parallel through batched LU, QR, or Cholesky factorizations.

7 Machine Learning. Need for batched and/or tensor contraction routines in machine learning, e.g., Convolutional Neural Networks (CNNs) used in computer vision. The key computation is the convolution of a filter F_i (feature detector) with the input image D (data). [Figure: data D, filters F_n, output O_n.] Convolution operation: for every filter F_n and every channel, the computation of every pixel value O_{n,k} is a tensor contraction: $O_{n,k} = \sum_i D_{k,i} F_{n,i}$. There is plenty of parallelism, but the operations are small and must be batched. With a data reshape, the computation can be transformed into a batched GEMM (and hence efficiently implemented; among other approaches).
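The contraction over the shared index i can be read as a plain matrix product, O = F * D^T, which is why a reshape turns it into a GEMM. The C sketch below spells this out for a single channel. The function name `contract_filters`, the row-major layout, and the dimension names N (filters), K (output pixels), and I (contracted index) are assumptions for illustration, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: O[n][k] = sum over i of D[k][i] * F[n][i].
 * D is K-by-I, F is N-by-I, O is N-by-K, all row-major.
 * This is the matrix product O = F * D^T written as explicit loops. */
void contract_filters(int N, int K, int I,
                      const double *D, const double *F, double *O)
{
    assert(N >= 0 && K >= 0 && I >= 0);

    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            double sum = 0.0;
            for (int i = 0; i < I; ++i)
                sum += D[(size_t)k * I + i] * F[(size_t)n * I + i];
            O[(size_t)n * K + k] = sum;
        }
    }
}
```

In a CNN there would be one such small product per channel or image patch, and those independent products are what a batched GEMM groups into one call.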

8 Calling Sequence: C = αA*B + βC
Specification: Level 3 BLAS DGEMM calling sequence
dgemm ( char * transa, char * transb, integer * m, integer * n, integer * k, double * alpha, double * A, integer * lda, double * B, integer * ldb, double * beta, double * C, integer * ldc );

9 Calling Sequence: C = αA*B + βC
Specification: Level 3 BLAS DGEMM calling sequence
dgemm ( char * transa, char * transb, integer * m, integer * n, integer * k, double * alpha, double * A, integer * lda, double * B, integer * ldb, double * beta, double * C, integer * ldc );
Proposed specification: Batched Level 3 BLAS DGEMM calling sequence
dgemm_batch ( enum * transa, enum * transb, integer * m, integer * n, integer * k, double * alpha, double ** arraya, integer * lda, double ** arrayb, integer * ldb, double * beta, double ** arrayc, integer * ldc, integer batch_count, enum batch_opts, integer * info );
Arrays: arraya, arrayb, arrayc.
batch_count: number of matrices in the batch.
batch_opts: enumerated value specifying the style of the batched computation. Valid values are BATCH_FIXED or BATCH_VARIABLE, specifying fixed or variable size matrices, respectively.
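The difference between BATCH_FIXED and BATCH_VARIABLE can be illustrated with a small reference loop in C. This is a sketch under stated assumptions, not the proposed implementation: the name `sketch_dgemm_batch` and the type `batch_opts_t` are hypothetical, the transpose arguments are omitted (both operands are taken as not transposed), and it is assumed that in the fixed case each size and scalar array holds a single entry shared by the whole batch, while in the variable case it holds one entry per matrix. The `info` convention used here (0 on success, -1 for a negative batch_count) is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum type for the batch_opts argument. */
typedef enum { BATCH_FIXED, BATCH_VARIABLE } batch_opts_t;

/* Sketch: C_b = alpha * A_b * B_b + beta * C_b for each b in the batch.
 * Matrices are column-major; no transposition is handled. */
void sketch_dgemm_batch(const int *m, const int *n, const int *k,
                        const double *alpha, double **arraya, const int *lda,
                        double **arrayb, const int *ldb,
                        const double *beta, double **arrayc, const int *ldc,
                        int batch_count, batch_opts_t batch_opts, int *info)
{
    assert(info != NULL);
    if (batch_count < 0) {   /* assumed error convention */
        *info = -1;
        return;
    }

    for (int b = 0; b < batch_count; ++b) {
        /* Assumption: fixed batches read entry 0 of every parameter
         * array; variable batches read entry b. */
        int s = (batch_opts == BATCH_FIXED) ? 0 : b;
        const double *A = arraya[b];
        const double *B = arrayb[b];
        double *C = arrayc[b];

        for (int j = 0; j < n[s]; ++j) {
            for (int i = 0; i < m[s]; ++i) {
                double acc = 0.0;
                for (int p = 0; p < k[s]; ++p)
                    acc += A[i + (size_t)p * lda[s]] * B[p + (size_t)j * ldb[s]];
                size_t c = i + (size_t)j * ldc[s];
                C[c] = alpha[s] * acc + beta[s] * C[c];
            }
        }
    }
    *info = 0;
}
```

A tuned library would not loop sequentially like this; the point is only to show which parameters vary per matrix in each mode.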

10 Batched Level 3 BLAS DGEMM Example. [Plot: DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N, K = 32. Series: CPU nonbatched (loop around each DGEMM using the 16 cores).]

11 Batched Level 3 BLAS DGEMM Example. [Plot: DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N, K = 32. Series: CPU batched (each core does a DGEMM); CPU nonbatched (loop around each DGEMM using the 16 cores).]

12 Batched Level 3 BLAS DGEMM Example. [Plot: DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N, K = 32. Series: GPU nonbatched.]

13 Batched Level 3 BLAS DGEMM Example. [Plot: DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N, K = 32. Series: GPU batched; GPU nonbatched.]

14 Batched Level 2 BLAS DGEMV Example. [Plot: DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: CPU nonbatched.]

15 Batched Level 2 BLAS DGEMV Example. [Plot: DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: CPU batched; CPU nonbatched.]

16 Batched Level 2 BLAS DGEMV Example. [Plot: DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: GPU nonbatched.]

17 Batched Level 2 BLAS DGEMV Example. [Plot: DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: GPU batched; GPU nonbatched.]

18 Batched Level 1 BLAS DAXPY Example. [Plot: DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: CPU nonbatched.]

19 Batched Level 1 BLAS DAXPY Example. [Plot: DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: CPU batched; CPU nonbatched.]

20 Batched Level 1 BLAS DAXPY Example. [Plot: DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: GPU nonbatched.]

21 Batched Level 1 BLAS DAXPY Example. [Plot: DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. y-axis: Gflop/s; x-axis: matrix size M = N. Series: GPU batched; GPU nonbatched.]

22 Batched BLAS Performance. Batched Level 3 BLAS: DGEMM example. Batched Level 2 BLAS: DGEMV example. Batched Level 1 BLAS: DAXPY example. Versions designed and autotuned for various sizes and shapes. Reference: J. Dongarra, I. Duff, M. Gates, A. Haidar, S. Hammarling, N. Higham, J. Hogg, P. Lara, M. Zounon, S. Relton, and S. Tomov, A Proposed API for Batched Basic Linear Algebra Subprograms, UTK Computer Science Technical Report, (available at March

23 Batched BLAS Performance. Batched Level 3 BLAS: DGEMM example. Versions designed and autotuned for various sizes and shapes.

24 A Proposed API for Batched LAPACK. Batched LU example: batch_count is the number of matrices in the batch; batch_opts is an enumerated value specifying the style of the batched computation. Valid values are BATCH_FIXED or BATCH_VARIABLE, specifying fixed or variable size matrices, respectively.

25 Community Activity. We want to engage the community in developing this proposed standard. The specification is open for discussion, and we welcome input. Workshop on Batched, Reproducible, and Reduced Precision BLAS, May 18 & 19, University of Tennessee. See

26 Questions? What does the word Salishan mean?

27 Questions? What does the word Salishan mean? A group of languages of the Pacific Northwest. The Salishan language family consists of twenty-three languages.

28 Collaborators / Software / Support. Software: PLASMA; MAGMA; PaRSEC (Parallel Runtime Scheduling and Execution Control). Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver.

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification

Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification Jack Dongarra *, Iain Duff, Mark Gates *, Azzam Haidar *, Sven Hammarling, Nicholas J. Higham, Jonathan Hogg, Pedro Valero Lara, Piotr

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester

Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 11/20/13 1 Rank Site Computer Country Cores Rmax [Pflops] % of Peak Power [MW] MFlops /Watt 1 2 3 4 National

More information

A Comparison of Potential Interfaces for Batched BLAS Computations. Relton, Samuel D. and Valero-Lara, Pedro and Zounon, Mawussi. MIMS EPrint: 2016.

A Comparison of Potential Interfaces for Batched BLAS Computations. Relton, Samuel D. and Valero-Lara, Pedro and Zounon, Mawussi. MIMS EPrint: 2016. A Comparison of Potential Interfaces for Batched BLAS Computations Relton, Samuel D. and Valero-Lara, Pedro and Zounon, Mawussi 2016 MIMS EPrint: 2016.42 Manchester Institute for Mathematical Sciences

More information

H2020 FETHPC 2014: GA D7.3 Draft specification for Hybrid (Batched) BLAS

H2020 FETHPC 2014: GA D7.3 Draft specification for Hybrid (Batched) BLAS H2020 FETHPC 2014: GA 671633 D7.3 Draft specification for Hybrid (Batched) BLAS April 2016 Document information Scheduled delivery 2016-04-30 Actual delivery 2016-04-26 Version 0.1 Responsible partner

More information

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF

More information

Mixed Precision Methods

Mixed Precision Methods Mixed Precision Methods Mixed precision, use the lowest precision required to achieve a given accuracy outcome " Improves runtime, reduce power consumption, lower data movement " Reformulate to find correction

More information

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville MAGMA LAPACK for GPUs Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville Keeneland GPU Tutorial 2011, Atlanta, GA April 14-15,

More information

NEW ADVANCES IN GPU LINEAR ALGEBRA

NEW ADVANCES IN GPU LINEAR ALGEBRA GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear

More information

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel

MAGMA Library. version 0.1. S. Tomov J. Dongarra V. Volkov J. Demmel MAGMA Library version 0.1 S. Tomov J. Dongarra V. Volkov J. Demmel 2 -- MAGMA (version 0.1) -- Univ. of Tennessee, Knoxville Univ. of California, Berkeley Univ. of Colorado, Denver June 2009 MAGMA project

More information

C++ API for Batch BLAS

C++ API for Batch BLAS 4 C++ API for Batch BLAS Ahmad Abdelfattah ICL 1 Konstantin Arturov Intel 2 Cris Cecka NVIDIA 3 Jack Dongarra ICL Chip Freitag AMD 4 Mark Gates ICL Azzam Haidar ICL Jakub Kurzak ICL Piotr Luszczek ICL

More information

High Performance Linear Algebra

High Performance Linear Algebra High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control

More information

An Overview of High Performance Computing and Challenges for the Future

An Overview of High Performance Computing and Challenges for the Future An Overview of High Performance Computing and Challenges for the Future Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 6/15/2009 1 H. Meuer, H. Simon, E. Strohmaier,

More information

High-performance matrix-matrix multiplications of very small matrices

High-performance matrix-matrix multiplications of very small matrices High-performance matrix-matrix multiplications of very small matrices I. Masliah 2, A. Abdelfattah 1, A. Haidar 1, S. Tomov 1, M. Baboulin 2, J. Falcou 2, and J. Dongarra 1,3 1 Innovative Computing Laboratory,

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Toward a supernodal sparse direct solver over DAG runtimes

Toward a supernodal sparse direct solver over DAG runtimes Toward a supernodal sparse direct solver over DAG runtimes HOSCAR 2013, Bordeaux X. Lacoste Xavier LACOSTE HiePACS team Inria Bordeaux Sud-Ouest November 27, 2012 Guideline Context and goals About PaStiX

More information

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU April 4-7, 2016 Silicon Valley GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim Davis, April 6, 2016 OBJECTIVE Direct sparse methods are among the most widely

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information

Accelerating GPU Kernels for Dense Linear Algebra

Accelerating GPU Kernels for Dense Linear Algebra Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28

More information

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines

A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines Available online at www.sciencedirect.com Procedia Computer Science 9 (2012 ) 17 26 International Conference on Computational Science, ICCS 2012 A class of communication-avoiding algorithms for solving

More information

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore svmoore@utep.edu CPS5401 Fall 2012 svmoore.pbworks.com November 8, 2012 1 Learning ObjecNves AOer complenng this lesson, you

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

Faster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017

Faster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017 Faster Code for Free: Linear Algebra Libraries Advanced Research Compu;ng 22 Feb 2017 Outline Introduc;on Implementa;ons Using them Use on ARC systems Hands on session Conclusions Introduc;on 3 BLAS Level

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

High-performance Cholesky factorization for GPU-only execution

High-performance Cholesky factorization for GPU-only execution High-performance Cholesky factorization for GPU-only execution Azzam Haidar Ahmad Abdelfatah Stanimire Tomov University of Tennessee, U.S.A. haidar,ahmad,tomov@icl.utk.edu Jack Dongarra University of Tennessee,

More information

INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006

INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006 INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006 The Challenges of Multicore and Specialized Accelerators Jack Dongarra University of Tennessee

More information

High-Performance Matrix-Matrix Multiplications of Very Small Matrices

High-Performance Matrix-Matrix Multiplications of Very Small Matrices High-Performance Matrix-Matrix Multiplications of Very Small Matrices Ian Masliah 2(B), Ahmad Abdelfattah 1, A. Haidar 1, S. Tomov 1, Marc Baboulin 2, J. Falcou 2, and J. Dongarra 1,3 1 Innovative Computing

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems

The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 18C (217) 495 54 International Conference on Computational Science, ICCS 217, 12-14 June 217, Zurich, Switzerland The Design

More information

printf("\n\nx = "); for(i=0;i<5;i++) printf("\n%f %f", X[i],X[i+5]); printf("\n\ny = "); for(i=0;i<5;i++) printf("\n%f", Y[i]);

printf(\n\nx = ); for(i=0;i<5;i++) printf(\n%f %f, X[i],X[i+5]); printf(\n\ny = ); for(i=0;i<5;i++) printf(\n%f, Y[i]); OLS.c / / #include #include #include int main(){ int i,info, ipiv[2]; char trans = 't', notrans ='n'; double alpha = 1.0, beta=0.0; int ncol=2; int nrow=5; int

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Sept Profiling, DLA Algorithms ENAHPC / 6 Power Profiling of Cholesky and

More information

Dense Linear Algebra for Hybrid GPU-Based Systems. Stanimire Tomov Department of Electrical Engineering and Computer Science, University of Tennessee

Dense Linear Algebra for Hybrid GPU-Based Systems. Stanimire Tomov Department of Electrical Engineering and Computer Science, University of Tennessee Chapter 3 Dense Linear Algebra for Hybrid GPU-Based Systems Stanimire Tomov Department of Electrical Engineering and Computer Science, University of Tennessee Jack Dongarra Department of Electrical Engineering

More information

Some notes on efficient computing and high performance computing environments

Some notes on efficient computing and high performance computing environments Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public

More information

COMPUTATIONAL LINEAR ALGEBRA

COMPUTATIONAL LINEAR ALGEBRA COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim

More information

Hierarchical DAG Scheduling for Hybrid Distributed Systems

Hierarchical DAG Scheduling for Hybrid Distributed Systems June 16, 2015 Hierarchical DAG Scheduling for Hybrid Distributed Systems Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, Jack Dongarra IPDPS 2015 Outline! Introduction & Motivation! Hierarchical

More information

Parallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware

Parallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware Parallelism V HPC Profiling John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters

More information

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.

Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee. Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works

More information

Out Of Memory SVD Solver for Big Data

Out Of Memory SVD Solver for Big Data Out Of Memory SVD Solver for Big Data Azzam Haidar, Khairul Kabir, Diana Fayad, Stanimire Tomov, Jack Dongarra {haidar fayad tomov dongarra}@icl.utk.edu, Innovative Computing Laboratory (ICL), University

More information

Heterogenous Acceleration for Linear Algebra in Mulit-Coprocessor Environments

Heterogenous Acceleration for Linear Algebra in Mulit-Coprocessor Environments Heterogenous Acceleration for Linear Algebra in Mulit-Coprocessor Environments Azzam Haidar 1, Piotr Luszczek 1, Stanimire Tomov 1, and Jack Dongarra 1,2,3 1 University of Tennessee Knoxville, USA 2 Oak

More information

Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices

Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices This space is reserved for the Procedia header, do not use it Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices Tingxing Dong 1, Azzam Haidar 2, Stanimire Tomov 2, and Jack Dongarra

More information

Accelerating GPU kernels for dense linear algebra

Accelerating GPU kernels for dense linear algebra Accelerating GPU kernels for dense linear algebra Rajib Nath, Stanimire Tomov, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville {rnath1, tomov,

More information

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra. Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra. Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015 Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra Mark Gates, Stan Tomov, Azzam Haidar SIAM LA Oct 29, 2015 Overview Dense linear algebra algorithms Hybrid CPU GPU implementation

More information

BLAS. Christoph Ortner Stef Salvini

BLAS. Christoph Ortner Stef Salvini BLAS Christoph Ortner Stef Salvini The BLASics Basic Linear Algebra Subroutines Building blocks for more complex computations Very widely used Level means number of operations Level 1: vector-vector operations

More information

Resources for parallel computing

Resources for parallel computing Resources for parallel computing BLAS Basic linear algebra subprograms. Originally published in ACM Toms (1979) (Linpack Blas + Lapack). Implement matrix operations upto matrix-matrix multiplication and

More information

Heterogenous Acceleration for Linear Algebra in Mulit-Coprocessor Environments

Heterogenous Acceleration for Linear Algebra in Mulit-Coprocessor Environments Heterogenous Acceleration for Linear Algebra in Mulit-Coprocessor Environments Azzam Haidar 1, Piotr Luszczek 1, Stanimire Tomov 1, and Jack Dongarra 1,2,3 1 University of Tennessee Knoxville, USA 2 Oak

More information

Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators

Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators Hatem Ltaief 1, Stanimire Tomov 1, Rajib Nath 1, and Jack Dongarra 1,2,3 1 Department of Electrical Engineering and Computer Science,

More information

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.

The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations. TITLE Basic Linear Algebra Subprograms BYLINE Robert A. van de Geijn Department of Computer Science The University of Texas at Austin Austin, TX USA rvdg@cs.utexas.edu Kazushige Goto Texas Advanced Computing

More information

Sparse Direct Solvers for Extreme-Scale Computing

Sparse Direct Solvers for Extreme-Scale Computing Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization

Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra Innovative Computing Laboratory University

More information

Fast and reliable linear system solutions on new parallel architectures

Fast and reliable linear system solutions on new parallel architectures Fast and reliable linear system solutions on new parallel architectures Marc Baboulin Université Paris-Sud Chaire Inria Saclay Île-de-France Séminaire Aristote - Ecole Polytechnique 15 mai 2013 Marc Baboulin

More information

Optimization for Performance and Energy for Batched Matrix Computations on GPUs
