Introduction to Numerical Libraries for HPC. Bilel Hadri. Computational Scientist KAUST Supercomputing Lab.



1 Introduction to Numerical Libraries for HPC. Bilel Hadri, Computational Scientist, KAUST Supercomputing Lab

2 Numerical Libraries: Application Areas
Most used libraries/software in HPC!
- Linear algebra
  - Systems of equations: direct, iterative, and multigrid solvers; sparse and dense systems
  - Eigenvalue problems
  - Least squares
- Signal processing: FFT
- Numerical integration
- Random number generators

3 Numerical Libraries: Motivations
- Don't reinvent the wheel!
- Improve productivity!
- Get better performance: faster and better algorithms

4 Faster (Better Code)
Achieving the best performance requires very processor- and system-specific code.
Example: dense matrix-matrix multiply. Simple to express:
    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        enddo
      enddo
    enddo

5 Performance
How fast should this run?
- Our matrix-matrix multiply algorithm has $2n^3$ floating point operations: 3 nested loops, each with n iterations, and 1 multiply plus 1 add in each inner iteration.
- For n=100: $2 \times 10^6$ operations, about 1 msec on a 2 GHz processor.
- For n=1000: $2 \times 10^9$ operations, or about 1 sec.
Reality: N= ms, N=1000 6 s
→ The obvious expression of an algorithm does not translate into leading performance.
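The operation count on this slide can be checked with a few lines of Python. This is an illustrative sketch, not part of the original deck: `matmul_with_flop_count` and `ideal_seconds` are hypothetical names, and the default rate of 2e9 flop/s is an assumption standing in for "one floating point operation per cycle at 2 GHz".

```python
def matmul_with_flop_count(a, b):
    """Naive triple-loop product of two n x n matrices (lists of lists).

    Returns the product and the number of floating point operations
    performed: one multiply and one add per inner iteration.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                flops += 2  # 1 multiply + 1 add
    return c, flops


def ideal_seconds(n, flops_per_second=2.0e9):
    """Time for 2*n**3 operations at an assumed sustained rate.

    Assumption: 1 flop per cycle at 2 GHz, i.e. 2e9 flop/s.
    """
    return 2 * n**3 / flops_per_second


if __name__ == "__main__":
    m = [[1.0, 2.0], [3.0, 4.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    product, flops = matmul_with_flop_count(m, identity)
    print(product, flops)
    print(ideal_seconds(100), ideal_seconds(1000))
```

The counter confirms the $2n^3$ figure for any n, and the ideal times reproduce the slide's estimates of about 1 msec and 1 sec. Those are lower bounds under the stated assumption; the measured 6 s for N=1000 is the gap a tuned library is meant to close.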

6 Numerical Libraries: Packages
- Linear algebra: BLAS/LAPACK/ScaLAPACK/PLASMA/MAGMA/HiCMA, PETSc, HYPRE, TRILINOS
- Signal processing: FFTW
- Numerical integration: GSL
- Random number generators: SPRNG

7 Others
- MUMPS: MUMPS (MUltifrontal Massively Parallel sparse direct Solver) is a package of parallel, sparse, direct linear system solvers based on a multifrontal algorithm.
- SuperLU 4.3: SuperLU is a sequential version of SuperLU_dist and a sequential incomplete LU preconditioner that can accelerate the convergence of Krylov subspace iterative solvers.
- ParMETIS: ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering) is a library of routines that partition unstructured graphs and meshes and compute fill-reducing orderings of sparse matrices.

8 Others (continued)
- SUNDIALS (SUite of Nonlinear and DIfferential/ALgebraic equation Solvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. In addition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, and KINSOL that is called sundialsTB.
- Scotch: Scotch is a software package and set of libraries for sequential and parallel graph partitioning, static mapping, sparse matrix block ordering, and sequential mesh and hypergraph partitioning.
Note: On Shaheen, they are all grouped into cray-tpsl.
Freely Available Software for Linear Algebra

9 BLAS (Basic Linear Algebra Subprograms)
The BLAS functionality is divided into three levels:
- Level 1: vector operations
- Level 2: matrix-vector operations
- Level 3: matrix-matrix operations
Several implementations for different languages exist.
Reference implementation
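The formulas for the three levels did not survive in this transcript. As a hedged illustration, the sketch below uses the standard representative operation of each level ($y \leftarrow \alpha x + y$, $y \leftarrow \alpha A x + \beta y$, $C \leftarrow \alpha A B + \beta C$) in plain Python. The function names mirror the BLAS routine names, but these are not the BLAS interfaces, only a reference for what each level computes.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha*A*x + beta*y."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha*A*B + beta*C."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a row-column dot product
    return [[alpha * sum(aik * bkj for aik, bkj in zip(row, col)) + beta * cij
             for col, cij in zip(cols_b, c_row)]
            for row, c_row in zip(a, c)]


if __name__ == "__main__":
    print(axpy(2, [1, 2], [3, 4]))
    print(gemv(1, [[1, 2], [3, 4]], [1, 1], 0, [0, 0]))
    print(gemm(1, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 1, [[1, 1], [1, 1]]))
```

The levels differ in how much arithmetic they do per element of data touched: Level 1 and 2 perform O(1) operations per element, while Level 3 performs O(n), which is why optimized libraries gain the most on Level 3 routines.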

10 BLAS: naming conventions
Each routine has a name which specifies the operation, the type of matrices involved, and their precision. Names are in the form PMMOO.
Each operation is defined for four precisions (P):
  S  single real
  D  double real
  C  single complex
  Z  double complex
The types of matrices are (MM):
  GE  general
  GB  general band
  SY  symmetric
  SB  symmetric band
  SP  symmetric packed
  HE  hermitian
  HB  hermitian band
  HP  hermitian packed
  TR  triangular
  TB  triangular band
  TP  triangular packed
Some of the most common operations (OO):
  DOT   scalar product, x^T y
  AXPY  vector sum, α x + y
  MV    matrix-vector product, A x
  SV    matrix-vector solve, inv(A) x
  MM    matrix-matrix product, A B
  SM    matrix-matrix solve, inv(A) B
Examples:
  SGEMM stands for single-precision general matrix-matrix multiply.
  DGEMM stands for double-precision general matrix-matrix multiply.
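The PMMOO scheme is mechanical enough to decode in code. The following is a minimal sketch that uses only the tables listed on this slide; `decode_blas_name` is a hypothetical helper, not part of any BLAS distribution. It treats names with no matrix-type field (such as DAXPY or SDOT) as Level 1 routines and raises `KeyError` for anything outside the slide's tables.

```python
PRECISIONS = {
    "S": "single real", "D": "double real",
    "C": "single complex", "Z": "double complex",
}
MATRIX_TYPES = {
    "GE": "general", "GB": "general band",
    "SY": "symmetric", "SB": "symmetric band", "SP": "symmetric packed",
    "HE": "hermitian", "HB": "hermitian band", "HP": "hermitian packed",
    "TR": "triangular", "TB": "triangular band", "TP": "triangular packed",
}
OPERATIONS = {
    "DOT": "scalar product", "AXPY": "vector sum",
    "MV": "matrix-vector product", "SV": "matrix-vector solve",
    "MM": "matrix-matrix product", "SM": "matrix-matrix solve",
}


def decode_blas_name(name):
    """Split a BLAS routine name into (precision, matrix type, operation).

    The matrix type is None for Level 1 routines, whose names carry
    no MM field. Raises KeyError for names outside the tables above.
    """
    name = name.upper()
    precision = PRECISIONS[name[0]]
    rest = name[1:]
    if rest[:2] in MATRIX_TYPES and rest[2:] in OPERATIONS:
        return precision, MATRIX_TYPES[rest[:2]], OPERATIONS[rest[2:]]
    return precision, None, OPERATIONS[rest]


if __name__ == "__main__":
    for routine in ("SGEMM", "DGEMM", "DTRSV", "DAXPY"):
        print(routine, decode_blas_name(routine))
```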

11 BLAS Level 1 routines
- Vector operations (xROT, xSWAP, xCOPY etc.)
- Scalar dot products (xDOT etc.)
- Vector norms (IxAMAX etc.)

12 BLAS Level 2 routines
- Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV etc.)
- Rank-1 updates (xGER, xHER etc.)
- Solving Tx = y for x, where T is triangular (xTRSV etc.)

13 BLAS Level 3 routines
- Matrix-matrix operations (xGEMM etc.)
- Triangular matrices: multiply (xTRMM) and solve (xTRSM)
- Widely used matrix-matrix multiply (xSYMM, xGEMM)

14 LAPACK (Linear Algebra PACKage)
Provides routines for:
- Solving systems of simultaneous linear equations
- Least-squares solutions of linear systems of equations
- Eigenvalue problems
- Householder transformations to implement the QR decomposition of a matrix
- Singular value problems
Was initially designed to run efficiently on shared-memory vector machines.
Depends on BLAS.
Has been extended for distributed systems: ScaLAPACK (Scalable Linear Algebra PACKage).

15 LAPACK naming conventions
Very similar to BLAS: XYYZZZ
X: data type
  S: REAL
  D: DOUBLE PRECISION
  C: COMPLEX
  Z: COMPLEX*16 or DOUBLE COMPLEX
YY: matrix type
  BD: bidiagonal
  DI: diagonal
  GB: general band
  GE: general (i.e., unsymmetric, in some cases rectangular)
  GG: general matrices, generalized problem (i.e., a pair of general matrices)
  GT: general tridiagonal
  HB: (complex) Hermitian band
  HE: (complex) Hermitian
  HG: upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix)
  HP: (complex) Hermitian, packed storage
  HS: upper Hessenberg
  OP: (real) orthogonal, packed storage
  OR: (real) orthogonal
  PB: symmetric or Hermitian positive definite band
  PO: symmetric or Hermitian positive definite
  PP: symmetric or Hermitian positive definite, packed storage
  PT: symmetric or Hermitian positive definite tridiagonal
  SB: (real) symmetric band
  SP: symmetric, packed storage
  ST: (real) symmetric tridiagonal
  SY: symmetric
  TB: triangular band
  TG: triangular matrices, generalized problem (i.e., a pair of triangular matrices)
  TP: triangular, packed storage
  TR: triangular (or in some cases quasi-triangular)
  TZ: trapezoidal
  UN: (complex) unitary
  UP: (complex) unitary, packed storage
ZZZ: performed computation
  Linear systems, factorizations, eigenvalue problems, singular value decomposition, etc.
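As a worked illustration of the XYYZZZ scheme, the sketch below splits a routine name into its three fields. `split_lapack_name` is a hypothetical helper. The matrix-type table is a subset of the slide's list, and the small ZZZ table is an assumption added for illustration, since the slide names only categories of computation: it contains a few commonly seen suffixes (SV, TRF, EV, SVD, LS) and is far from complete.

```python
DATA_TYPES = {
    "S": "REAL", "D": "DOUBLE PRECISION",
    "C": "COMPLEX", "Z": "COMPLEX*16",
}
# Subset of the YY codes listed on the slide.
MATRIX_TYPES = {
    "GE": "general", "GB": "general band", "GT": "general tridiagonal",
    "HE": "Hermitian", "PO": "symmetric or Hermitian positive definite",
    "SY": "symmetric", "TR": "triangular",
}
# Assumption: a few common ZZZ suffixes, not taken from the slide.
COMPUTATIONS = {
    "SV": "solve a linear system",
    "TRF": "factorization",
    "EV": "eigenvalue problem",
    "SVD": "singular value decomposition",
    "LS": "least-squares solution",
}


def split_lapack_name(name):
    """Return (data type, matrix type, computation) for a name like DGESV.

    Fields missing from the tables above are returned as the raw code,
    so unknown routines still split cleanly.
    """
    name = name.upper()
    x, yy, zzz = name[0], name[1:3], name[3:]
    return (DATA_TYPES.get(x, x),
            MATRIX_TYPES.get(yy, yy),
            COMPUTATIONS.get(zzz, zzz))


if __name__ == "__main__":
    for routine in ("DGESV", "DPOTRF", "ZHEEV", "SGESVD"):
        print(routine, split_lapack_name(routine))
```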

16 LAPACK routines

17 Numerical Libraries: packages
Vendor libraries: implementations of BLAS, LAPACK, and ScaLAPACK optimized for the vendor's processors and platform. (Dongarra/ICL)

18 LAPACK & ScaLAPACK
ScaLAPACK is a library with a subset of LAPACK routines running on distributed-memory machines. ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports MPI or PVM.

19 Overview of ScaLAPACK

20 Why use LAPACK or ScaLAPACK?
Solving systems of:
- Linear equations: $Ax = b$
- Least squares: $\min \|Ax - b\|_2$
- Eigenvalue problem: $Ax = \lambda x$
- Singular value problem: $A = U S V^T$

21 Reference BLAS vs Tuned
- The reference BLAS and LAPACK libraries are reference implementations of the BLAS and LAPACK standards. They are neither optimised nor multi-threaded, so not much performance should be expected. These libraries are available for download.
- The Automatically Tuned Linear Algebra Software, ATLAS: at compile time, ATLAS automatically chooses the algorithms delivering the best performance. ATLAS does not contain all LAPACK functionality. It is available for download.
- The Goto BLAS: an implementation of the Level 3 BLAS aimed at high efficiency. The Goto BLAS is available for download.

22 Optimized vendor libraries for BLAS/LAPACK
- Highly efficient versions
- Hand-tuned assembly by hardware vendors
- Provide near-peak performance
Several vendors provide libraries optimized for their architecture (AMD, Fujitsu, IBM, Intel, NEC, ...):
  Intel → MKL
  Cray → LibSci
  AMD → ACML
  IBM → ESSL
USE them! (Speedups of 10x or more)

23 AMD / MKL
- ACML (AMD Core Math Library): LAPACK, BLAS, and extended BLAS (sparse), FFTs (single- and double-precision, real and complex data types). APIs for both Fortran and C.
- MKL (Math Kernel Library): LAPACK, BLAS, and extended BLAS (sparse), FFTs (single- and double-precision, real and complex data types). APIs for both Fortran and C.
Use the MKL advisory page to link your code with it:

24 Example with SGEMM

25 Fortran example
Source available at

26 Linking examples
Library / Compiler / Link flags

LibSci on Cray
  Cray (environment default): compile without adding flags
  GNU: compile without adding flags
  Intel: compile without adding flags

ACML
  GNU: /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a -fopenmp
  Intel: /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a -openmp -lpthread

MKL
  PGI: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -mp -lpthread
  GNU: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -L/opt/intel/Compiler/11.1/038/lib/intel64/ -liomp5
  Intel: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -openmp -lpthread

Use the MKL advisory linkline:

27 Compilation demos
Use -Wl,-ysgemm_ to check which optimized library is used.

On Shaheen:
  With cray-libsci:
    ftn -o exe_libsci test_sgemm.f90
  With Intel MKL (unload cray-libsci):
    ftn -o exe_libsci test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

On Ibex:
  With BLAS:
    module load blas/3.7.1/gnu
    gfortran test_sgemm.f90 -lblas
  With Intel:
    module load intel/2017
    ifort -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
  With ACML:
    module load acml/ gfortran64
    gfortran -o exe_acml test_sgemm.f90 -lacml_mp

28 Python fans!
You can speed up your Python scripts:
- By using the scientific libraries numpy and scipy
- By building Python with the vendor-optimized library
  - Available with python/ on Shaheen
  - You can build your own by following the instructions:
  - With cray-libsci: available next month on Shaheen (cray-python/17.09)
Check installation: import numpy as np; np.show_config()

29 Python Numpy: check installation
>>> import numpy as np
>>> np.show_config()
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

30 Python: Numpy and Scipy demo

31 Performance formulae
Performance is measured in floating point operations per second, FLOPS or FLOP/s. Current processors deliver an $R_{peak}$ in the GFLOPS ($10^9$ FLOPS) range.
The $R_{peak}$ of a system can be computed by:
  $R_{peak} = n_{CPU} \cdot n_{core} \cdot n_{FPU} \cdot f$
where
  $n_{CPU}$ is the number of CPUs in the system,
  $n_{core}$ is the number of computing cores per CPU,
  $n_{FPU}$ is the number of floating point units per core,
  $f$ is the clock frequency.
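The formula is a plain product, so a small helper makes it easy to try on a given system. This is a minimal sketch: `r_peak` is a hypothetical name, and the node described in the usage lines (2 CPUs, 16 cores each, 4 floating point units per core, 2 GHz) is made up for illustration and does not describe any system in this deck.

```python
def r_peak(n_cpu, n_core, n_fpu, f_hz):
    """Theoretical peak in FLOP/s: n_CPU * n_core * n_FPU * f.

    f_hz is the clock frequency in Hz. Each floating point unit is
    assumed to complete one operation per cycle.
    """
    return n_cpu * n_core * n_fpu * f_hz


if __name__ == "__main__":
    # Hypothetical node, chosen only to exercise the formula.
    peak = r_peak(n_cpu=2, n_core=16, n_fpu=4, f_hz=2.0e9)
    print(f"{peak / 1e9:.1f} GFLOPS")
```

For SIMD processors, one reasonable reading is to put the number of floating point operations a core can complete per cycle in place of $n_{FPU}$; that is an interpretation, not something the slide states.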

32 FLOPs counts for recent processor microarchitectures
Intel Core 2 and Nehalem:
  4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge:
  8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
  16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell:
  16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
Intel Skylake/Knight's Landing:
  AVX flops/cycle DP & 64 flops/cycle SP
AMD K10:
  4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
  8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer:
  8 DP FLOPs/cycle: 4-wide FMA
  16 SP FLOPs/cycle: 8-wide FMA
ARM Cortex-A15:
  2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
  8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add
IBM PowerPC A2 (Blue Gene/Q):
  8 DP FLOPs/cycle: 4-wide QPX FMA
  SP elements are extended to DP and processed on the same units
Intel MIC (Xeon Phi), per core (supports 4 hyperthreads):
  16 DP FLOPs/cycle: 8-wide FMA every cycle
  32 SP FLOPs/cycle: 16-wide FMA every cycle
Intel MIC (Xeon Phi), per thread:
  8 DP FLOPs/cycle: 8-wide FMA every other cycle
  SP FLOPs/cycle: 16-wide FMA every other cycle
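The entries above follow one pattern: vector width, times the number of vector instructions issued per cycle, times two when the instruction is a fused multiply-add. The sketch below encodes that pattern. `flops_per_cycle` is a hypothetical helper, and describing each microarchitecture with only these three parameters is a simplifying assumption.

```python
def flops_per_cycle(simd_width, units=1, fma=False):
    """Floating point operations per cycle for one core.

    simd_width: elements processed per vector instruction
    units:      vector instructions issued per cycle
    fma:        True if each instruction is a fused multiply-add,
                which counts as 2 operations per element
    """
    return simd_width * units * (2 if fma else 1)


if __name__ == "__main__":
    # Sandy Bridge DP: one 4-wide add plus one 4-wide multiply per cycle
    print("Sandy Bridge DP:", flops_per_cycle(4, units=2))
    # Haswell DP and SP: two FMA instructions per cycle
    print("Haswell DP:", flops_per_cycle(4, units=2, fma=True))
    print("Haswell SP:", flops_per_cycle(8, units=2, fma=True))
    # Bulldozer DP: one 4-wide FMA per cycle
    print("Bulldozer DP:", flops_per_cycle(4, fma=True))
```

The results (8, 16, 32, 8) agree with the corresponding entries in the list.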

33 Rooflines
Roofline is a performance model used to bound the performance of various numerical methods and operations running on processor architectures. (Lorena A. Barba, Rio Yokota)
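The slide does not spell out the bound. In its usual form, the roofline model says attainable performance is the smaller of the peak floating point rate and the memory bandwidth multiplied by the arithmetic intensity (operations per byte moved). A minimal sketch of that standard form follows; `roofline` is a hypothetical name and the machine numbers (1000 GFLOP/s peak, 100 GB/s bandwidth) are invented for illustration.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s under the roofline model.

    peak_gflops:   peak floating point rate (GFLOP/s)
    bandwidth_gbs: sustained memory bandwidth (GB/s)
    intensity:     arithmetic intensity (flops per byte moved)
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    for intensity in (0.25, 10.0, 50.0):
        print(intensity, roofline(1000.0, 100.0, intensity))
```

On this made-up machine, kernels below 10 flops/byte are limited by memory bandwidth, and kernels above it are limited by the peak rate.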

34 Best Practices
- Numerical Recipes books DO NOT provide optimized code. (Libraries can be 100x faster.)
- Don't reinvent the wheel. Use optimized libraries!
- It's not only for C++/C/Fortran: Python has an interface to BLAS (check with numpy/scipy), as do R, Matlab, and cython.
- Don't forget the environment variables!
- The efficient use of numerical libraries can yield significant performance benefits. It should be one of the first things to investigate when optimizing codes.
- The best library implementation often varies depending on the individual routine and possibly even the size of the input data.
- READ the manual and/or attend the tutorials/workshops!
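The slide does not say which environment variables it has in mind. One common case, shown here as an assumption, is the thread count of a multi-threaded BLAS: `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `OPENBLAS_NUM_THREADS` are read by OpenMP runtimes, Intel MKL, and OpenBLAS respectively, typically when the library is first loaded. The sketch sets them from Python; `set_blas_threads` is a hypothetical helper.

```python
import os


def set_blas_threads(n_threads):
    """Set common BLAS threading variables for this process.

    Call this before importing numpy or scipy, since the libraries
    usually read these variables when they are loaded.
    """
    value = str(n_threads)
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[name] = value
    return value


if __name__ == "__main__":
    set_blas_threads(4)
    print({k: v for k, v in os.environ.items() if k.endswith("NUM_THREADS")})
    # import numpy as np  # import only after the variables are set
```

In a batch job the same variables are more often exported in the job script; which ones matter depends on the library the code is actually linked against.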

35 THANKS

Achieve Better Performance with PEAK on XSEDE Resources

Achieve Better Performance with PEAK on XSEDE Resources Achieve Better Performance with PEAK on XSEDE Resources Haihang You, Bilel Hadri, Shirley Moore XSEDE 12 July 18 th 2012 Motivations FACTS ALTD ( Automatic Tracking Library Database ) ref Fahey, Jones,

More information

Mathematical Libraries and Application Software on JUQUEEN and JURECA

Mathematical Libraries and Application Software on JUQUEEN and JURECA Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUQUEEN and JURECA JSC Training Course May 2017 I.Gutheil Outline General Informations Sequential Libraries Parallel

More information

Mathematical Libraries and Application Software on JUROPA, JUGENE, and JUQUEEN. JSC Training Course

Mathematical Libraries and Application Software on JUROPA, JUGENE, and JUQUEEN. JSC Training Course Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUROPA, JUGENE, and JUQUEEN JSC Training Course May 22, 2012 Outline General Informations Sequential Libraries Parallel

More information

Mathematical Libraries and Application Software on JUQUEEN and JURECA

Mathematical Libraries and Application Software on JUQUEEN and JURECA Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUQUEEN and JURECA JSC Training Course November 2015 I.Gutheil Outline General Informations Sequential Libraries Parallel

More information

Scientific Computing. Some slides from James Lambers, Stanford

Scientific Computing. Some slides from James Lambers, Stanford Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

High-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen

High-Performance Libraries and Tools. HPC Fall 2012 Prof. Robert van Engelen High-Performance Libraries and Tools HPC Fall 2012 Prof. Robert van Engelen Overview Dense matrix BLAS (serial) ATLAS (serial/threaded) LAPACK (serial) Vendor-tuned LAPACK (shared memory parallel) ScaLAPACK/PLAPACK

More information

Some notes on efficient computing and high performance computing environments

Some notes on efficient computing and high performance computing environments Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public

More information

How to get Access to Shaheen2? Bilel Hadri Computational Scientist KAUST Supercomputing Core Lab

How to get Access to Shaheen2? Bilel Hadri Computational Scientist KAUST Supercomputing Core Lab How to get Access to Shaheen2? Bilel Hadri Computational Scientist KAUST Supercomputing Core Lab Live Survey Please login with your laptop/mobile h#p://' And type the code VF9SKGQ6

More information

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre

Linear Algebra libraries in Debian. DebConf 10 New York 05/08/2010 Sylvestre Linear Algebra libraries in Debian Who I am? Core developer of Scilab (daily job) Debian Developer Involved in Debian mainly in Science and Java aspects /

More information

Brief notes on setting up semi-high performance computing environments. July 25, 2014

Brief notes on setting up semi-high performance computing environments. July 25, 2014 Brief notes on setting up semi-high performance computing environments July 25, 2014 1 We have two different computing environments for fitting demanding models to large space and/or time data sets. 1

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

BLAS and LAPACK + Data Formats for Sparse Matrices. Part of the lecture Wissenschaftliches Rechnen. Hilmar Wobker

BLAS and LAPACK + Data Formats for Sparse Matrices. Part of the lecture Wissenschaftliches Rechnen. Hilmar Wobker BLAS and LAPACK + Data Formats for Sparse Matrices Part of the lecture Wissenschaftliches Rechnen Hilmar Wobker Institute of Applied Mathematics and Numerics, TU Dortmund email:

More information

Dense matrix algebra and libraries (and dealing with Fortran)

Dense matrix algebra and libraries (and dealing with Fortran) Dense matrix algebra and libraries (and dealing with Fortran) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Dense matrix algebra and libraries (and dealing with Fortran)

More information

A Few Numerical Libraries for HPC

A Few Numerical Libraries for HPC A Few Numerical Libraries for HPC CPS343 Parallel and High Performance Computing Spring 2016 CPS343 (Parallel and HPC) A Few Numerical Libraries for HPC Spring 2016 1 / 37 Outline 1 HPC == numerical linear

More information

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence

More information

How to compile Fortran program on application server

How to compile Fortran program on application server How to compile Fortran program on application server Center for Computational Materials Science, Institute for Materials Research, Tohoku University 2015.3 version 1.0 Contents 1. Compile... 1 1.1 How

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

BLAS: Basic Linear Algebra Subroutines I

BLAS: Basic Linear Algebra Subroutines I BLAS: Basic Linear Algebra Subroutines I Most numerical programs do similar operations 90% time is at 10% of the code If these 10% of the code is optimized, programs will be fast Frequently used subroutines

More information

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA

Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2012 November 8, 2012 1 Learning ObjecNves AOer complenng this lesson, you

More information

AMath 483/583 Lecture 22. Notes: Another Send/Receive example. Notes: Notes: Another Send/Receive example. Outline:

AMath 483/583 Lecture 22. Notes: Another Send/Receive example. Notes: Notes: Another Send/Receive example. Outline: AMath 483/583 Lecture 22 Outline: MPI Master Worker paradigm Linear algebra LAPACK and the BLAS References: $UWHPSC/codes/mpi class notes: MPI section class notes: Linear algebra Another Send/Receive example

More information

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville MAGMA LAPACK for GPUs Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville Keeneland GPU Tutorial 2011, Atlanta, GA April 14-15,

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from Additionally Intel MKL provides

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter)

More information

Intel Math Kernel Library (Intel MKL) Latest Features

Intel Math Kernel Library (Intel MKL) Latest Features Intel Math Kernel Library (Intel MKL) Latest Features Sridevi Allam Technical Consulting Engineer 1 Agenda - Introduction to Support on Intel Xeon Phi Coprocessors - Performance

More information

In 1986, I had degrees in math and engineering and found I wanted to compute things. What I ve mostly found is that:

In 1986, I had degrees in math and engineering and found I wanted to compute things. What I ve mostly found is that: Parallel Computing and Data Locality Gary Howell In 1986, I had degrees in math and engineering and found I wanted to compute things. What I ve mostly found is that: Real estate and efficient computation

More information

BLAS. Christoph Ortner Stef Salvini

BLAS. Christoph Ortner Stef Salvini BLAS Christoph Ortner Stef Salvini The BLASics Basic Linear Algebra Subroutines Building blocks for more complex computations Very widely used Level means number of operations Level 1: vector-vector operations

More information

Cray Scientific Libraries. Overview

Cray Scientific Libraries. Overview Cray Scientific Libraries Overview What are libraries for? Building blocks for writing scientific applications Historically allowed the first forms of code re-use Later became ways of running optimized

More information

Faster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017

Faster Code for Free: Linear Algebra Libraries. Advanced Research Compu;ng 22 Feb 2017 Faster Code for Free: Linear Algebra Libraries Advanced Research Compu;ng 22 Feb 2017 Outline Introduc;on Implementa;ons Using them Use on ARC systems Hands on session Conclusions Introduc;on 3 BLAS Level

More information

Advanced Numerical Techniques for Cluster Computing

Advanced Numerical Techniques for Cluster Computing Advanced Numerical Techniques for Cluster Computing Presented by Piotr Luszczek Presentation Outline Motivation hardware Dense matrix calculations Sparse direct solvers

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Linear Algebra for Modern Computers. Jack Dongarra

Linear Algebra for Modern Computers. Jack Dongarra Linear Algebra for Modern Computers Jack Dongarra Tuning for Caches 1. Preserve locality. 2. Reduce cache thrashing. 3. Loop blocking when out of cache. 4. Software pipelining. 2 Indirect Addressing d

More information

Cray Scientific Libraries: Overview and Performance. Cray XE6 Performance Workshop University of Reading Nov 2012

Cray Scientific Libraries: Overview and Performance. Cray XE6 Performance Workshop University of Reading Nov 2012 Cray Scientific Libraries: Overview and Performance Cray XE6 Performance Workshop University of Reading 20-22 Nov 2012 Contents LibSci overview and usage BFRAME / CrayBLAS LAPACK ScaLAPACK FFTW / CRAFFT

More information

HPC Numerical Libraries. Nicola Spallanzani SuperComputing Applications and Innovation Department

HPC Numerical Libraries. Nicola Spallanzani SuperComputing Applications and Innovation Department HPC Numerical Libraries Nicola Spallanzani SuperComputing Applications and Innovation Department Algorithms and Libraries Many numerical algorithms are well known and largely available.

More information

CUDA Accelerated Compute Libraries. M. Naumov

CUDA Accelerated Compute Libraries. M. Naumov CUDA Accelerated Compute Libraries M. Naumov Outline Motivation Why should you use libraries? CUDA Toolkit Libraries Overview of performance CUDA Proprietary Libraries Address specific markets Third Party

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy

Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear Systems) Julie Langou Piotr Luszczek Alfredo Buttari Julien Langou

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information

ATLAS (Automatically Tuned Linear Algebra Software),

ATLAS (Automatically Tuned Linear Algebra Software), LAPACK library I Scientists have developed a large library of numerical routines for linear algebra. These routines comprise the LAPACK package that can be obtained from

More information

Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview

Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview Workshop on High Performance Computing (HPC08) School of Physics, IPM February 16-21, 2008 HPC tools: an overview Stefano Cozzini CNR/INFM Democritos and SISSA/eLab Agenda Tools for

More information


GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved




On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Karl Rupp, Barry Smith Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

Introduction to Parallel Computing

W. P. Petersen Seminar for Applied Mathematics Department of Mathematics, ETHZ, Zurich wpp@math. P. Arbenz Institute for Scientific Computing Department Informatik,

Mathematical libraries at the CHPC

Presentation Mathematical libraries at the CHPC Martin Cuma Center for High Performance Computing University of Utah October 19, 2006 Overview What and what

How to perform HPL on CPU&GPU clusters. Draško Tomić

Draško Tomić email: Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

MAGMA: a New Generation

1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

NAG Library Chapter Introduction. F16 Further Linear Algebra Support Routines

Contents: 1 Scope of the Chapter; 2 Background to the Problems; 3 Recommendations on Choice and Use of Available Routines; 3.1 Naming Scheme; 3.1.1 NAGnames...

Introduction to High Performance Computing. Shaohao Chen Research Computing Services (RCS) Boston University

Outline What is HPC? Why computer cluster? Basic structure of a computer cluster Computer performance

Lecture 3: Intro to parallel machines and models

David Bindel 1 Sep 2011 Logistics Remember: Note: the entire class

Evaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li

X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,

CS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra

CS 294-73 Software Engineering for Scientific Computing Lecture 10: Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Available online at Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi


INTERNATIONAL ADVANCED RESEARCH WORKSHOP ON HIGH PERFORMANCE COMPUTING AND GRIDS Cetraro (Italy), July 3-6, 2006 The Challenges of Multicore and Specialized Accelerators Jack Dongarra University of Tennessee

Achieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017

Welcome Aims for the day You understand some of the critical features of Intel processors and other hardware

Ilya Lashuk, Merico Argentati, Evgenii Ovtchinnikov, Andrew Knyazev (speaker)

Department of Mathematics and Center for Computational Mathematics University of Colorado at Denver SIAM Conference on Parallel


COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen's algorithm Jim

Introduction to Parallel Programming & Cluster Computing

Scientific Libraries & I/O Libraries Joshua Alexander, U Oklahoma Ivan Babic, Earlham College Michial Green, Contra Costa College Mobeen Ludin,

Accelerating GPU Kernels for Dense Linear Algebra

Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28

Scaling Out Python* To HPC and Big Data

Sergey Maidanov Software Engineering Manager for Intel Distribution for Python* What Problems We Solve: Scalable Performance Make Python usable beyond prototyping

IBM Research. IBM Research Report

RC 24398 (W0711-017) November 5, 2007 (Last update: June 28, 2018) Computer Science/Mathematics IBM Research Report WSMP: Watson Sparse Matrix Package Part III iterative solution of sparse systems Version

HPC with Multicore and GPUs

Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

ARM High Performance Computing

Eric Van Hensbergen Distinguished Engineer, Director HPC Software & Large Scale Systems Research IDC HPC Users Group Meeting Austin, TX September 8, 2016 ARM 2016 An introduction

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory


GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear

Ranger Optimization Release 0.3

Drew Dolgert May 20, 2011 Contents 1 Introduction 1.1 Goals, Prerequisites, Resources 1.2 Optimization and Scalability

Sparse Direct Solvers for Extreme-Scale Computing

Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * Institute of Applied M


OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford 27th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC

2.7 Numerical Linear Algebra Software

In this section we will discuss three software packages for linear algebra operations: (i) Matlab, (ii) Basic Linear Algebra Subroutines (BLAS) and (iii) LAPACK. There

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF

Self Adapting Numerical Software (SANS-Effort)

Jack Dongarra Innovative Computing Laboratory University of Tennessee and Oak Ridge National Laboratory 1 Work on Self Adapting Software 1. Lapack For Clusters

Fast-multipole algorithms moving to Exascale

Numerical Algorithms for Extreme Computing Architectures Software Institute for Methodologies and Abstractions for Codes SIMAC 3 Fast-multipole algorithms moving to Exascale Lorena A. Barba The George

Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System

Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel

Intel Math Kernel Library 10.3

Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

Parallel Reduction from Block Hessenberg to Hessenberg using MPI

Viktor Jonsson May 24, 2013 Master's Thesis in Computing Science, 30 credits Supervisor at CS-UmU: Lars Karlsson Examiner: Fredrik Georgsson

Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation

Alexander Kalinkin Anton Anders Roman Anders 1 Legal Disclaimer INFORMATION IN

Matrix Multiplication

CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street

Parallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware

Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters

MUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA

The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

George Bosilca Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

*Yuta SAWA and Reiji SUDA The University of Tokyo

Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,

The Era of Heterogeneous Computing

EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model

BLAS. Basic Linear Algebra Subprograms

BLAS Basic operations with vectors and matrices dominate scientific computing programs. To achieve high efficiency and clean computer programs an effort has been made in the last few decades to standardize

Matrix Multiplication

CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

IBM Research. IBM Research Report

RC 21888 (98472) November 20, 2000 (Last update: September 17, 2018) Computer Science/Mathematics IBM Research Report WSMP: Watson Sparse Matrix Package Part II direct solution of general systems Version

GPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation.

FMS Performance History (Machine, Year, Flops): DEC VAX, 1978, 97,000; FPS 164, 1982, 11,000,000; FPS 164-MAX, 1985, 341,000,000; DEC VAX

SCALABLE ALGORITHMS for solving large sparse linear systems of equations

CONTENTS Sparse direct solvers (multifrontal) Substructuring methods (hybrid solvers) Jacko Koster, Bergen Center for Computational

Dan Stafford, Justine Bonnot

Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing

Enabling the ARM high performance computing (HPC) software ecosystem

Ashok Bhat Product manager, HPC and Server tools ARM Tech Symposia India December 7th 2016 Are these supercomputers? For example, the

Intel Math Kernel Library (Intel MKL) Sparse Solvers. Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager

Copyright 3, Intel Corporation. All rights reserved. Sparse

Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí

Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you

Outline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency

Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming

Optimising the Mantevo benchmark suite for multi- and many-core architectures

Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of

Automatic Development of Linear Algebra Libraries for the Tesla Series

Enrique S. Quintana-Ortí Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

Issues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM

Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 Outline 1. Cache and shared memory parallel computing concepts.