A Few Numerical Libraries for HPC


A Few Numerical Libraries for HPC
CPS343 Parallel and High Performance Computing
Spring 2016

Outline
1 HPC == numerical linear algebra?
    The role of linear algebra in HPC
    Numerical linear algebra
2 Dense Matrix Linear Algebra
    BLAS
    LAPACK
    ATLAS
3 Matrix-matrix products
    Rowwise matrix-matrix multiplication
    Recursive block-oriented matrix-matrix multiplication


HPC == numerical linear algebra?
Perhaps the largest single group of HPC applications is devoted to solving problems described by differential equations. In these problems there is typically a domain, which may be space, time, both of these, or something else altogether. The domain is discretized, resulting in a system of equations that must be solved. These equations are often linear, so techniques from numerical linear algebra are used to solve them. When the equations are nonlinear we may be able to linearize them (treat them as linear on a portion of the domain) and obtain the solution part by part.

HPC != numerical linear algebra?
Some of the problems that lead to differential equations and linear systems, as well as others that don't involve differential equations at all, can be approached in completely different ways. Two important examples:
1 simulation by Monte Carlo methods
2 dealing with data: searching and sorting

Numerical linear algebra libraries
As we've seen in the case of matrix-matrix multiplication, standard textbook algorithms are often not the most efficient implementations. In addition, they may not be the most correct, in the sense of keeping errors as small as possible. Since it's important to get it right, and to get it right fast, it's nearly always advisable to use numerical libraries of proven code to handle linear algebra operations. Today we'll focus on a few important libraries. There are many others, and you should always look for available, appropriate libraries that can help as part of tackling any programming problem.


Vector operations
Key operations yielding vectors include:
    copy one vector to another: u ← v
    swap two vectors: u ↔ v
    scale a vector by a constant: u ← αu
    add a multiple of one vector to another: αu + v
Key operations yielding scalars include:
    inner product: u · v = Σᵢ uᵢvᵢ
    1-norm of a vector: ‖u‖₁ = Σᵢ |uᵢ|
    2-norm of a vector: ‖u‖₂ = (Σᵢ uᵢ²)^(1/2)
    find maximal entry in a vector: ‖u‖∞ = maxᵢ |uᵢ|

Matrix-vector operations
Key operations between matrices and vectors include:
    general (dense) or sparse matrix-vector multiplication: Ax
    rank-1 update (adds a scalar multiple of xyᵀ to a matrix): A + αxyᵀ
    solution of matrix-vector equations: Ax = b

Matrix-matrix operations
Key operations between pairs of matrices include:
    general matrix-matrix multiplication
    symmetric matrix-matrix multiplication
    sparse matrix-matrix multiplication
Key operations on a single matrix include:
    factoring/decomposing a matrix: A = LU, A = QR, A = UΣVᵀ
    finding the eigenvalues and eigenvectors of a matrix


Basic Linear Algebra Subprograms
The BLAS are routines that provide standard building blocks for performing basic vector and matrix operations: the Level 1 BLAS perform scalar, vector, and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software. The BLAS homepage is http://www.netlib.org/blas/.

BLAS description
The name of each BLAS routine (usually) begins with a character that indicates the type of data it operates on:
    S  Real single precision.
    D  Real double precision.
    C  Complex single precision.
    Z  Complex double precision.
Level 1 BLAS handle operations with O(n) data and O(n) work.
Level 2 BLAS handle operations with O(n²) data and O(n²) work.
Level 3 BLAS handle operations with O(n²) data and O(n³) work.

Level 1 BLAS
Level 1 BLAS are designed for operations with O(n) data and O(n) work. Some BLAS 1 subprograms:
    xCOPY   copy one vector to another
    xSWAP   swap two vectors
    xSCAL   scale a vector by a constant
    xAXPY   add a multiple of one vector to another
    xDOT    inner product
    xASUM   1-norm of a vector
    xNRM2   2-norm of a vector
    IxAMAX  find maximal entry in a vector
Notice the exception to the naming convention in the last line. BLAS functions returning an integer have names starting with I, so the datatype indicator is displaced to the second position. BLAS 1 quick reference: http://www.netlib.org/blas/blasqr.ps.
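The semantics of these routines are easy to state in plain C. The sketch below reimplements the unit-stride core of DAXPY, DDOT, and IDAMAX (the real BLAS routines also take increment arguments INCX/INCY, and IDAMAX returns a 1-based index); the my_ names and 0-based result are illustrative, not the library's API.

```c
#include <assert.h>
#include <math.h>

/* y <- alpha*x + y  (unit-stride core of DAXPY) */
void my_daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* inner product (core of DDOT) */
double my_ddot(int n, const double *x, const double *y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* index of the entry with largest magnitude (core of IDAMAX, 0-based here) */
int my_idamax(int n, const double *x)
{
    int imax = 0;
    for (int i = 1; i < n; i++)
        if (fabs(x[i]) > fabs(x[imax]))
            imax = i;
    return imax;
}
```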

Level 2 BLAS
Level 2 BLAS are designed for operations with O(n²) data and O(n²) work. Some BLAS 2 subprograms:
    xGEMV   general matrix-vector multiplication
    xGER    general rank-1 update
    xSYR2   symmetric rank-2 update
    xTRSV   solve a triangular system of equations
A detailed description of BLAS 2 can be found at http://www.netlib.org/blas/blas2-paper.ps.
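As a sketch of what a Level 2 routine does internally, the function below mirrors the no-transpose case of DGEMV, y ← αAx + βy, for a column-major matrix with leading dimension lda. The name my_dgemv is illustrative; the real routine also takes transpose and increment arguments.

```c
#include <assert.h>

/* y <- alpha*A*x + beta*y for an m-by-n matrix A stored column-major
 * with leading dimension lda, as in DGEMV('N', ...). */
void my_dgemv(int m, int n, double alpha, const double *a, int lda,
              const double *x, double beta, double *y)
{
    for (int i = 0; i < m; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a[j * lda + i] * x[j];   /* A(i,j) = a[j*lda + i] */
        y[i] = alpha * s + beta * y[i];
    }
}
```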

Level 3 BLAS
Level 3 BLAS are designed for operations with O(n²) data and O(n³) work. Some BLAS 3 subprograms:
    xGEMM    general matrix-matrix multiplication
    xSYMM    symmetric matrix-matrix multiplication
    xSYRK    symmetric rank-k update
    xSYR2K   symmetric rank-2k update
A detailed description of BLAS 3 can be found at http://www.netlib.org/blas/blas3-paper.ps.

BLAS are written in FORTRAN
Calling BLAS routines from FORTRAN is relatively straightforward. FORTRAN stores two-dimensional arrays in column-major order. Many BLAS routines accept a parameter named LDA, which stands for the leading dimension of A; LDA is needed to handle the stride between elements in a single row of A. In contrast, C and C++ store two-dimensional arrays in row-major order: rather than the leading dimension, the trailing dimension specifies the stride between elements in a single column. Calling vector-only BLAS routines from C is relatively straightforward, but one must work with transposes of matrices in C/C++ programs to use the BLAS matrix-vector and matrix-matrix routines.
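A one-line helper (the name is mine, not a BLAS routine) makes the column-major/LDA addressing concrete: element (i, j) of a column-major matrix with leading dimension lda lives at a[j*lda + i], and because only lda appears in the formula, the same code addresses a submatrix embedded in a larger array, which is exactly what the LDA parameter enables.

```c
#include <assert.h>

/* 0-based element (i, j) of a column-major matrix with leading
 * dimension lda.  With lda greater than the submatrix height, this
 * same formula walks a submatrix embedded in a bigger array. */
double colmajor_get(const double *a, int lda, int i, int j)
{
    return a[j * lda + i];
}
```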

CBLAS: The BLAS for C/C++
A version of the BLAS has been written with routines that can be called directly from C/C++. The CBLAS routines have the same names as their FORTRAN counterparts, except that all subprogram and function names are lowercase and are prefixed with cblas_. The CBLAS matrix routines also take an additional parameter to indicate whether the matrix is stored in row-major or column-major order.

Calling Fortran BLAS from C/C++
Even with the matrix transpose issue, it's fairly easy to call the Fortran BLAS from C or C++.
1 Find out the calling sequence for the routine. Man pages for BLAS routines can be looked up on-line. You may also find the BLAS quick reference (http://www.netlib.org/blas/blasqr.pdf or http://www.netlib.org/lapack/lug/node145.html) helpful.
2 Create a prototype in your C/C++ code for the function. The function name should match the Fortran function name in lower-case and have an appended underscore.
3 Optionally create an interface function to make handling the differences between how C and Fortran treat arguments easier.

Example: calling BLAS routine DGEMM() from C
DGEMM() performs the operation C ← α·op(A)·op(B) + β·C, where op(A) can be either A or Aᵀ. The Fortran calling sequence is

    DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)

    TRANSA, TRANSB: (CHARACTER*1) 'N' for no transpose, 'T' or 'C' to transpose
    M: (INTEGER) number of rows in C and A
    N: (INTEGER) number of columns in C and B
    K: (INTEGER) number of columns in A and rows in B
    ALPHA, BETA: (DOUBLE PRECISION) scale factors for op(A)op(B) and C
    LDA, LDB, LDC: (INTEGER) leading dimensions of A, B, and C

Example: calling BLAS routine DGEMM() from C
The corresponding prototype in a C or C++ file would look like

#if defined(__cplusplus)
extern "C" {
#endif
void dgemm_(char *transa, char *transb, int *m, int *n, int *k,
            double *alpha, double *a, int *lda, double *b, int *ldb,
            double *beta, double *c, int *ldc);
#if defined(__cplusplus)
}
#endif

Each parameter is a pointer, since Fortran does not support pass-by-value. The extern "C" is necessary to prevent C++ name mangling. Using #if defined(__cplusplus) ensures this code will work with both C and C++.

Example: calling BLAS routine DGEMM() from C
Finally, it's often helpful to write an interface function that handles the different ways Fortran and C parameters are treated:

void dgemm(char transa, char transb, int m, int n, int k,
           double alpha, double *a, int lda, double *b, int ldb,
           double beta, double *c, int ldc)
{
    dgemm_(&transa, &transb, &m, &n, &k, &alpha, a, &lda,
           b, &ldb, &beta, c, &ldc);
}

Notice that this allows us to pass transa, transb, and the other scalar values as literals, whereas without this interface routine we would not be able to. Check out https://github.com/gordon-cs/cps343-hoe/blob/master/02-mem-hier/matrix_prod.c for a complete example.


LAPACK
The Linear Algebra PACKage is a successor to both LINPACK and EISPACK.
LINPACK is a library of FORTRAN routines for numerical linear algebra, written in the 1970s.
EISPACK is a library of FORTRAN routines for numerical computation of the eigenvalues and eigenvectors of matrices, also written in the 1970s.
One of the LINPACK authors, Cleve Moler, went on to write an interactive, user-friendly front end to these libraries called MATLAB.
LINPACK and EISPACK primarily make use of the level 1 BLAS. LAPACK largely replaces LINPACK and EISPACK but takes advantage of the level 2 and level 3 BLAS for more efficient operation on computers with hierarchical memory and on shared-memory multiprocessors. The LAPACK homepage is http://www.netlib.org/lapack/.

LAPACK for C/C++
LAPACK is written in FORTRAN 90, so calling it from C/C++ presents the same challenges discussed for the BLAS. A version called CLAPACK exists that can be called more easily from C/C++, but it still has the column-major matrix access issue.


ATLAS
ATLAS stands for Automatically Tuned Linear Algebra Software and consists of the BLAS and some routines from LAPACK. The key to getting high performance from the BLAS (and hence from LAPACK) is using BLAS routines that are tuned to each particular machine's architecture and compiler. Vendors of HPC equipment typically supply a BLAS library that has been optimized for their machines. The ATLAS project was created to provide similar support for HPC equipment built from commodity hardware (e.g. Beowulf clusters). The ATLAS homepage is http://math-atlas.sourceforge.net/.

ATLAS Performance Example
The graph below shows GFLOP/s vs. matrix dimension for matrix products using several BLAS implementations. The testing was done on a Lenovo ThinkStation E32 with 8 GB RAM running Ubuntu 14.04.
    The Reference BLAS are supplied with LAPACK and are not optimized.
    The System ATLAS BLAS are supplied with Ubuntu 14.04.
    The ATLAS BLAS are version 3.11.38 and were compiled on the host.

Multithreaded OpenBLAS Performance Example
The graph below shows GFLOP/s vs. matrix dimension for matrix products using several BLAS implementations. The testing was done on a Lenovo ThinkStation E32 with 4 cores and 8 GB RAM running Ubuntu 14.04.


Rowwise Matrix-Matrix Multiplication
The rowwise matrix-matrix product C = AB is computed with the pseudo-code below.

for i = 1 to n
    for j = 1 to n
        c(i,j) = 0
        for k = 1 to n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
        endfor
    endfor
endfor

Notice that both A and C are accessed by rows in the innermost loop, while B is accessed by columns.
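The pseudo-code translates directly to C. Here is a sketch for square n × n matrices stored row-major with 0-based indices (the function name is illustrative):

```c
#include <assert.h>

/* C = A*B for n-by-n matrices in row-major order: a direct translation
 * of the rowwise triple loop.  In the k loop, A is read along a row
 * (stride 1) while B is read down a column (stride n). */
void matmul_rowwise(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
```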

Effect of memory hierarchy sizes on performance
The graph below shows GFLOPS vs. matrix dimension for a rowwise matrix-matrix product. The drop-off in performance between N = 300 and N = 400 suggests that all of B can fit in the processor's cache when N = 300 but not when N = 400.
    A 300 × 300 matrix of double precision floating point values consumes approximately 700 KB.
    A 400 × 400 matrix needs at least 1250 KB.
    The cache on the machine used for this test is 2048 KB.
Note: this graph was generated on a previous-generation Minor Prophets machine.
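The arithmetic behind those sizes is just n² doubles at 8 bytes each; a tiny helper (the name is mine, for illustration) makes it checkable:

```c
#include <assert.h>
#include <stddef.h>

/* Storage, in KB (1 KB = 1024 bytes), for an n-by-n matrix of doubles. */
double matrix_kb(int n)
{
    return (double)n * (double)n * sizeof(double) / 1024.0;
}
```

matrix_kb(300) gives 703.125 KB (the "approximately 700 KB" above) and matrix_kb(400) gives exactly 1250 KB.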


Key idea: minimize cache misses
If we partition A and B into appropriately sized blocks

    A = [ A00  A01 ]      B = [ B00  B01 ]
        [ A10  A11 ]          [ B10  B11 ]

then the product C = AB can be computed as

    C = [ C00  C01 ] = [ A00*B00 + A01*B10    A00*B01 + A01*B11 ]
        [ C10  C11 ]   [ A10*B00 + A11*B10    A10*B01 + A11*B11 ]

where each of the matrix-matrix products involves matrices of roughly 1/4 the size of the full matrices.

Recursion
We can exploit this idea quite easily with a recursive algorithm.

matrix function C = mult(A, B)
    if size(B) > threshold then
        C00 = mult(A00, B00) + mult(A01, B10)
        C01 = mult(A00, B01) + mult(A01, B11)
        C10 = mult(A10, B00) + mult(A11, B10)
        C11 = mult(A10, B01) + mult(A11, B11)
    else
        C = AB
    endif
    return C
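In C, the recursion can be sketched as below for row-major matrices whose size is a power of two. Submatrix blocks are addressed with pointer offsets and a shared leading dimension, and THRESHOLD stands in for a tuned base-case size (all names here are illustrative):

```c
#include <assert.h>
#include <string.h>

#define THRESHOLD 8   /* illustrative base-case size; real codes tune this */

/* C += A*B for an n-by-n block inside row-major matrices that share
 * leading dimension ld; n is assumed to be a power of two. */
void mult_rec(int n, int ld, const double *a, const double *b, double *c)
{
    if (n <= THRESHOLD) {
        /* base case: plain triple loop on a block small enough to cache */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i * ld + j] += a[i * ld + k] * b[k * ld + j];
    } else {
        int h = n / 2;   /* block (bi,bj) starts at offset bi*h*ld + bj*h */
        mult_rec(h, ld, a,              b,              c);               /* C00 += A00*B00 */
        mult_rec(h, ld, a + h,          b + h * ld,     c);               /* C00 += A01*B10 */
        mult_rec(h, ld, a,              b + h,          c + h);           /* C01 += A00*B01 */
        mult_rec(h, ld, a + h,          b + h * ld + h, c + h);           /* C01 += A01*B11 */
        mult_rec(h, ld, a + h * ld,     b,              c + h * ld);      /* C10 += A10*B00 */
        mult_rec(h, ld, a + h * ld + h, b + h * ld,     c + h * ld);      /* C10 += A11*B10 */
        mult_rec(h, ld, a + h * ld,     b + h,          c + h * ld + h);  /* C11 += A10*B01 */
        mult_rec(h, ld, a + h * ld + h, b + h * ld + h, c + h * ld + h);  /* C11 += A11*B11 */
    }
}

/* C = A*B for n-by-n row-major matrices (n a power of two). */
void matmul_recursive(int n, const double *a, const double *b, double *c)
{
    memset(c, 0, (size_t)n * n * sizeof(double));
    mult_rec(n, n, a, b, c);
}
```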

Recursive block-oriented matrix-matrix product
The graph below shows GFLOPS vs. matrix dimension for both a rowwise product and a recursive block-oriented matrix-matrix product.
    Performance is improved across the board, including for smaller matrix sizes.
    Performance is relatively flat; there is no degradation as matrix sizes increase.

Recursive block-oriented matrix-matrix product
This graph shows data for matrix-matrix products but was generated on our 2016 workstation cluster. Notice the larger matrix dimension (now N = 2000) and the overall higher GFLOP/s rates. We would probably need to go to N = 3000 or more to show leveling-off. Note that increasing the block size indefinitely does not help; the sequential algorithm essentially uses the entire matrix (8N² bytes) as its block.