Introduction to Numerical Libraries for HPC. Bilel Hadri. Computational Scientist KAUST Supercomputing Lab.
2 Numerical Libraries: Application Areas
The most used libraries/software in HPC:
- Linear algebra: systems of equations (direct, iterative, and multigrid solvers; sparse and dense systems), eigenvalue problems, least squares
- Signal processing: FFT
- Numerical integration
- Random number generators
3 Numerical Libraries: Motivations
- Don't reinvent the wheel!
- Improve productivity!
- Get better performance: faster and better algorithms.
4 Faster (Better Code)
Achieving the best performance requires creating very processor- and system-specific code.
Example: dense matrix-matrix multiply. Simple to express:

do i = 1, n
  do j = 1, n
    do k = 1, n
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    enddo
  enddo
enddo
5 Performance
How fast should this run? Our matrix-matrix multiply algorithm has 2n^3 floating point operations: 3 nested loops, each with n iterations, and 1 multiply plus 1 add in each inner iteration.
- For n = 100: 2x10^6 operations, about 1 msec on a 2 GHz processor.
- For n = 1000: 2x10^9 operations, or about 1 sec.
Reality: for n = 1000 the naive loop takes about 6 s.
→ An obvious expression of an algorithm is not transformed into leading performance.
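The gap between the textbook loop and an optimized library can be reproduced in a few lines. This is an illustrative sketch (not from the slides) using Python/NumPy, where `a @ b` dispatches to whatever BLAS NumPy was built against; absolute timings are machine-dependent.

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook triple loop: 2*n^3 flops, no blocking, no vectorization."""
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i, k] * b[k, j]
            c[i, j] = s
    return c

n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c1 = naive_matmul(a, b)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
c2 = a @ b                      # dispatches to the underlying BLAS gemm
t_blas = time.perf_counter() - t0

flops = 2.0 * n**3
print(f"naive: {flops / t_naive / 1e9:.3f} GFLOPS")
print(f"BLAS : {flops / t_blas / 1e9:.3f} GFLOPS")
```

Both versions compute the same result; only the speed differs, usually by several orders of magnitude.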
6 Numerical Libraries Packages
- Linear algebra: BLAS / LAPACK / ScaLAPACK / PLASMA / MAGMA / HiCMA, PETSc, HYPRE, Trilinos
- Signal processing: FFTW
- Numerical integration: GSL
- Random number generators: SPRNG
7 Others
- MUMPS (MUltifrontal Massively Parallel sparse direct Solver) is a package of parallel, sparse, direct linear system solvers based on a multifrontal algorithm.
- SuperLU 4.3: SuperLU is a sequential version of SuperLU_DIST and a sequential incomplete LU preconditioner that can accelerate the convergence of Krylov subspace iterative solvers.
- ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering) is a library of routines that partition unstructured graphs and meshes and compute fill-reducing orderings of sparse matrices.
8 Others (continued)
- SUNDIALS (SUite of Nonlinear and DIfferential/Algebraic equation Solvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. In addition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, and KINSOL that is called sundialsTB.
- Scotch is a software package and libraries for sequential and parallel graph partitioning, static mapping, sparse matrix block ordering, and sequential mesh and hypergraph partitioning.
Note: on Shaheen, they are all grouped into cray-tpsl.
See also: Freely Available Software for Linear Algebra.
9 BLAS (Basic Linear Algebra Subprograms)
The BLAS functionality is divided into three levels:
- Level 1 contains vector operations of the form y ← αx + y.
- Level 2 contains matrix-vector operations of the form y ← αAx + βy.
- Level 3 contains matrix-matrix operations of the form C ← αAB + βC.
Several implementations for different languages exist, including a reference implementation.
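As an illustration, the three levels map directly onto SciPy's low-level BLAS wrappers (this sketch assumes SciPy is installed; the slides themselves do not use SciPy here):

```python
import numpy as np
from scipy.linalg import blas

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.eye(3) * 2.0
B = np.ones((3, 3))

# Level 1: vector-vector, y <- alpha*x + y (daxpy; note it updates y in place)
z = blas.daxpy(x, y, a=2.0)

# Level 2: matrix-vector, y <- alpha*A*x (dgemv)
v = blas.dgemv(1.0, A, x)

# Level 3: matrix-matrix, C <- alpha*A*B (dgemm)
C = blas.dgemm(1.0, A, B)
```

Each call goes straight to the compiled BLAS library NumPy/SciPy were linked against, so these are the same kernels a Fortran or C code would reach.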
10 BLAS: naming conventions
Each routine has a name which specifies the operation, the type of matrices involved, and their precision. Names are of the form PMMOO.
Each operation is defined for four precisions (P):
- S: single real
- D: double real
- C: single complex
- Z: double complex
The types of matrices are (MM):
- GE: general; GB: general band
- SY: symmetric; SB: symmetric band; SP: symmetric packed
- HE: hermitian; HB: hermitian band; HP: hermitian packed
- TR: triangular; TB: triangular band; TP: triangular packed
Some of the most common operations (OO):
- DOT: scalar product, x^T y
- AXPY: vector sum, αx + y
- MV: matrix-vector product, Ax
- SV: matrix-vector solve, inv(A) x
- MM: matrix-matrix product, AB
- SM: matrix-matrix solve, inv(A) B
Examples: SGEMM stands for single-precision general matrix-matrix multiply; DGEMM stands for double-precision general matrix-matrix multiply.
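The PMMOO scheme above is mechanical enough to decode in code. The following is a purely hypothetical helper (not part of any BLAS distribution) whose tables cover only the codes listed on this slide:

```python
# Hypothetical decoder for the BLAS PMMOO naming scheme described above.
PRECISION = {"S": "single real", "D": "double real",
             "C": "single complex", "Z": "double complex"}
MATRIX = {"GE": "general", "GB": "general band",
          "SY": "symmetric", "SB": "symmetric band", "SP": "symmetric packed",
          "HE": "hermitian", "HB": "hermitian band", "HP": "hermitian packed",
          "TR": "triangular", "TB": "triangular band", "TP": "triangular packed"}
OPERATION = {"DOT": "scalar product", "AXPY": "vector sum",
             "MV": "matrix-vector product", "SV": "matrix-vector solve",
             "MM": "matrix-matrix product", "SM": "matrix-matrix solve"}

def decode_blas_name(name):
    """Split a BLAS routine name into precision, matrix type, and operation."""
    name = name.upper()
    p, rest = name[0], name[1:]
    if rest in OPERATION:                 # e.g. DOT, AXPY have no matrix part
        return f"{PRECISION[p]} {OPERATION[rest]}"
    for mm, mdesc in MATRIX.items():
        op = rest[len(mm):]
        if rest.startswith(mm) and op in OPERATION:
            return f"{PRECISION[p]} {mdesc} {OPERATION[op]}"
    raise ValueError(f"cannot decode {name}")

print(decode_blas_name("DGEMM"))   # double real general matrix-matrix product
print(decode_blas_name("STRSV"))   # single real triangular matrix-vector solve
```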
11 BLAS Level 1 routines
- Vector operations (xROT, xSWAP, xCOPY, etc.)
- Scalar dot products (xDOT, etc.)
- Vector norms (IxAMAX, etc.)
12 BLAS Level 2 routines
- Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV, etc.)
- Solving Tx = y for x, where T is triangular (xTRSV, etc.)
- Rank-1 updates (xGER, xHER, etc.)
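Two of these Level 2 routines can be sketched through SciPy's wrappers (an assumption of this example; the slides do not show these calls):

```python
import numpy as np
from scipy.linalg import blas

T = np.array([[2.0, 1.0],
              [0.0, 4.0]])          # upper triangular
y = np.array([4.0, 8.0])

# xTRSV: solve T x = y for x, with T triangular (upper by default)
x = blas.dtrsv(T, y)

# xGER: rank-1 update A <- alpha * u v^T + A (A defaults to zero here)
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
A = blas.dger(1.0, u, v)
```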
13 BLAS Level 3 routines
- Matrix-matrix operations (xGEMM, etc.)
- Solving for triangular matrices (xTRSM; xTRMM is the triangular multiply)
- Widely used matrix-matrix multiply (xSYMM, xGEMM)
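A minimal sketch of the triangular solve xTRSM via SciPy (again an illustrative assumption, not from the slides): it solves A X = αB for a whole matrix of right-hand sides at once.

```python
import numpy as np
from scipy.linalg import blas

A = np.array([[2.0, 0.0],
              [1.0, 4.0]])          # lower triangular
B = np.array([[2.0, 4.0],
              [5.0, 14.0]])

# xTRSM: solve A X = alpha * B for the matrix X
X = blas.dtrsm(1.0, A, B, lower=1)
```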
14 LAPACK (Linear Algebra PACKage)
Provides routines for:
- solving systems of simultaneous linear equations,
- least-squares solutions of linear systems of equations,
- eigenvalue problems,
- Householder transformations to implement QR decomposition on a matrix, and
- singular value problems.
It was initially designed to run efficiently on shared-memory vector machines. It depends on BLAS, and has been extended for distributed systems: ScaLAPACK (Scalable Linear Algebra PACKage).
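A single LAPACK driver typically factorizes and solves in one call. As a sketch (assuming SciPy, whose `scipy.linalg.lapack` module wraps the underlying LAPACK), here is xGESV, the general linear-system solver:

```python
import numpy as np
from scipy.linalg import lapack

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([[9.0],
              [8.0]])

# dGEsv: D = double precision, GE = general matrix, SV = solve.
# LU-factorizes A with partial pivoting, then solves A x = b.
lu, piv, x, info = lapack.dgesv(A, b)
```

`info == 0` signals success; a positive value would indicate a singular matrix.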
15 LAPACK naming conventions
Very similar to BLAS: XYYZZZ.
X: data type
- S: REAL; D: DOUBLE PRECISION; C: COMPLEX; Z: COMPLEX*16 or DOUBLE COMPLEX
YY: matrix type
- BD: bidiagonal; DI: diagonal
- GB: general band; GE: general (i.e., unsymmetric, in some cases rectangular); GG: general matrices, generalized problem (i.e., a pair of general matrices); GT: general tridiagonal
- HB: (complex) Hermitian band; HE: (complex) Hermitian; HP: (complex) Hermitian, packed storage
- HG: upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix); HS: upper Hessenberg
- OP: (real) orthogonal, packed storage; OR: (real) orthogonal
- PB: symmetric or Hermitian positive definite band; PO: symmetric or Hermitian positive definite; PP: symmetric or Hermitian positive definite, packed storage; PT: symmetric or Hermitian positive definite tridiagonal
- SB: (real) symmetric band; SP: symmetric, packed storage; ST: (real) symmetric tridiagonal; SY: symmetric
- TB: triangular band; TG: triangular matrices, generalized problem (i.e., a pair of triangular matrices); TP: triangular, packed storage; TR: triangular (or in some cases quasi-triangular); TZ: trapezoidal
- UN: (complex) unitary; UP: (complex) unitary, packed storage
ZZZ: performed computation (linear systems, factorizations, eigenvalue problems, singular value decomposition, etc.)
16 LAPACK routines
17 Numerical Libraries packages
Vendor libraries provide optimized implementations of BLAS, LAPACK, and ScaLAPACK for their processors and platforms. (Figure: Dongarra/ICL.)
18 LAPACK & ScaLAPACK
ScaLAPACK is a library with a subset of LAPACK routines running on distributed-memory machines. ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports MPI or PVM.
19 Overview of ScaLAPACK
20 Why use LAPACK or ScaLAPACK?
Solving systems of:
- Linear equations: Ax = b
- Least squares: min ||Ax - b||_2
- Eigenvalue problems: Ax = λx
- Singular value problems: A = USV^T
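All four problem classes above have LAPACK-backed drivers in SciPy. A minimal sketch (assuming SciPy; names and data are illustrative only):

```python
import numpy as np
from scipy import linalg

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = linalg.solve(A, b)                      # linear equations: A x = b
x_ls, res, rank, sv = linalg.lstsq(A, b)    # least squares: min ||A x - b||_2
w, V = linalg.eig(A)                        # eigenvalue problem: A x = lambda x
U, s, Vt = linalg.svd(A)                    # singular values: A = U S V^T
```

Each call dispatches to the corresponding LAPACK driver (gesv, gelsd, geev, gesdd) in the library SciPy was built against.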
21 Reference BLAS vs Tuned
- The reference BLAS and LAPACK libraries are reference implementations of the BLAS and LAPACK standards. These are not optimized and not multi-threaded, so not much performance should be expected. They are available for download from Netlib.
- The Automatically Tuned Linear Algebra Software, ATLAS: at compile time, ATLAS automatically chooses the algorithms delivering the best performance. ATLAS does not contain all LAPACK functionality.
- The Goto BLAS: an implementation of the level 3 BLAS aimed at high efficiency.
22 Optimized vendor libraries for BLAS/LAPACK
Highly efficient versions: hand-tuned assembly by hardware vendors, providing near-peak performance. Several vendors provide libraries optimized for their architecture (AMD, Fujitsu, IBM, Intel, NEC, ...):
- Intel → MKL
- Cray → LibSci
- AMD → ACML
- IBM → ESSL
USE them! (Speedups up to 10x or more.)
23 AMD / MKL
- ACML (AMD Core Math Library): LAPACK, BLAS, and extended BLAS (sparse), FFTs (single- and double-precision, real and complex data types). APIs for both Fortran and C.
- MKL (Math Kernel Library): LAPACK, BLAS, and extended BLAS (sparse), FFTs (single- and double-precision, real and complex data types). APIs for both Fortran and C. Use the MKL advisory page to link your code with it.
24 Example with SGEMM
25 Fortran example. Source available at:
26 Linking examples
Library / Compiler / Link flags:
- LibSci on Cray (Cray environment by default; also GNU and Intel): compile without adding flags.
- ACML, GNU: /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a -fopenmp
- ACML, Intel: /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a -openmp -lpthread
- MKL, PGI: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -mp -lpthread
- MKL, GNU: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -L/opt/intel/Compiler/11.1/038/lib/intel64/ -liomp5
- MKL, Intel: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -openmp -lpthread
Use the MKL advisory link line.
27 Compilation demos
On Shaheen (use -Wl,-ysgemm_ to check which optimized library is used):
- with Cray LibSci: ftn -o exe_libsci test_sgemm.f90
- with Intel MKL: unload cray-libsci, then ftn -o exe_libsci test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
On Ibex:
- with reference BLAS: module load blas/3.7.1/gnu; gfortran test_sgemm.f90 -lblas
- with Intel MKL: module load intel/2017; ifort -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
- with ACML (gfortran64 build): module load acml/ gfortran64; gfortran -o exe_acml test_sgemm.f90 -lacml_mp
28 Python fans!
You can speed up your Python scripts by using the scientific libraries NumPy and SciPy built against a vendor-optimized library:
- available with python/ on Shaheen;
- you can build your own by following the instructions;
- with cray-libsci: available next month on Shaheen (cray-python/17.09).
Check the installation with: import numpy as np; np.show_config()
29 Python NumPy check installation
>>> import numpy as np
>>> np.show_config()
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info, blas_mkl_info, mkl_info:
    (same MKL libraries, library_dirs, define_macros, and include_dirs as above)
30 Python: NumPy and SciPy demo
31 Performance formulae
Performance is measured in floating point operations per second: FLOPS, or FLOP/s. Current processors deliver an R_peak in the GFLOPS (10^9 FLOPS) range. The R_peak of a system can be computed by:
R_peak = n_CPU x n_core x n_FPU x f
where n_CPU is the number of CPUs in the system, n_core is the number of computing cores per CPU, n_FPU is the number of floating point units per core, and f is the clock frequency.
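The slide's formula can be sketched directly in code. The node configuration below is purely illustrative (not a specific machine), and the per-core factor is expressed as flops per cycle, which absorbs the number of FP units and their width:

```python
# Peak-performance estimate from the slide's formula:
#   R_peak = n_CPU * n_core * n_FPU * f
def r_peak(n_cpu, n_core, flops_per_cycle, freq_ghz):
    """Return theoretical peak performance in GFLOPS."""
    return n_cpu * n_core * flops_per_cycle * freq_ghz

# Hypothetical 2-socket node: 16 cores/CPU, 16 DP flops/cycle, 2.3 GHz
print(r_peak(2, 16, 16, 2.3), "GFLOPS")
```

For real hardware, the flops-per-cycle factor comes from tables like the one on the next slide.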
32 FLOPs counts for recent processor microarchitectures
- Intel Core 2 and Nehalem: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication); 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
- Intel Sandy Bridge/Ivy Bridge: 8 DP FLOPs/cycle (4-wide AVX addition + 4-wide AVX multiplication); 16 SP FLOPs/cycle (8-wide AVX addition + 8-wide AVX multiplication)
- Intel Haswell: 16 DP FLOPs/cycle (two 4-wide FMA, fused multiply-add, instructions); 32 SP FLOPs/cycle (two 8-wide FMA instructions)
- Intel Skylake / Knights Landing (AVX-512): 32 FLOPs/cycle DP and 64 FLOPs/cycle SP
- AMD K10: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication); 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
- AMD Bulldozer: 8 DP FLOPs/cycle (4-wide FMA); 16 SP FLOPs/cycle (8-wide FMA)
- ARM Cortex-A15: 2 DP FLOPs/cycle (scalar FMA or scalar multiply-add); 8 SP FLOPs/cycle (4-wide NEONv2 FMA or 4-wide NEON multiply-add)
- IBM PowerPC A2 (Blue Gene/Q): 8 DP FLOPs/cycle (4-wide QPX FMA); SP elements are extended to DP and processed on the same units
- Intel MIC (Xeon Phi), per core (supports 4 hyperthreads): 16 DP FLOPs/cycle (8-wide FMA every cycle); 32 SP FLOPs/cycle (16-wide FMA every cycle)
- Intel MIC (Xeon Phi), per thread: 8 DP FLOPs/cycle (8-wide FMA every other cycle); 16 SP FLOPs/cycle (16-wide FMA every other cycle)
33 Rooflines
The roofline is a performance model used to bound the performance of various numerical methods and operations running on processor architectures. (Figure: Lorena A. Barba, Rio Yokota.)
34 Best Practices
- Numerical Recipes books DO NOT provide optimized code (libraries can be 100x faster).
- Don't reinvent the wheel: use optimized libraries!
- It's not only for C++/C/Fortran: Python has an interface with BLAS (check with NumPy/SciPy), as do R, MATLAB, and Cython.
- Don't forget the environment variables!
- The efficient use of numerical libraries can yield significant performance benefits, and should be one of the first things to investigate when optimizing codes.
- The best library implementation often varies depending on the individual routine, and possibly even the size of the input data.
- READ the manual and/or attend the tutorials/workshops!
35 THANKS
More informationAccelerating GPU Kernels for Dense Linear Algebra
Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28
More informationScaling Out Python* To HPC and Big Data
Scaling Out Python* To HPC and Big Data Sergey Maidanov Software Engineering Manager for Intel Distribution for Python* What Problems We Solve: Scalable Performance Make Python usable beyond prototyping
More informationIBM Research. IBM Research Report
RC 24398 (W0711-017) November 5, 2007 (Last update: June 28, 2018) Computer Science/Mathematics IBM Research Report WSMP: Watson Sparse Matrix Package Part III iterative solution of sparse systems Version
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationARM High Performance Computing
ARM High Performance Computing Eric Van Hensbergen Distinguished Engineer, Director HPC Software & Large Scale Systems Research IDC HPC Users Group Meeting Austin, TX September 8, 2016 ARM 2016 An introduction
More informationOutline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationRanger Optimization Release 0.3
Ranger Optimization Release 0.3 Drew Dolgert May 20, 2011 Contents 1 Introduction i 1.1 Goals, Prerequisites, Resources...................................... i 1.2 Optimization and Scalability.......................................
More informationSparse Direct Solvers for Extreme-Scale Computing
Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More information2.7 Numerical Linear Algebra Software
2.7 Numerical Linear Algebra Software In this section we will discuss three software packages for linear algebra operations: (i) (ii) (iii) Matlab, Basic Linear Algebra Subroutines (BLAS) and LAPACK. There
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationSelf Adapting Numerical Software (SANS-Effort)
Self Adapting Numerical Software (SANS-Effort) Jack Dongarra Innovative Computing Laboratory University of Tennessee and Oak Ridge National Laboratory 1 Work on Self Adapting Software 1. Lapack For Clusters
More informationFast-multipole algorithms moving to Exascale
Numerical Algorithms for Extreme Computing Architectures Software Institute for Methodologies and Abstractions for Codes SIMAC 3 Fast-multipole algorithms moving to Exascale Lorena A. Barba The George
More informationUsing Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System
Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationParallel Reduction from Block Hessenberg to Hessenberg using MPI
Parallel Reduction from Block Hessenberg to Hessenberg using MPI Viktor Jonsson May 24, 2013 Master s Thesis in Computing Science, 30 credits Supervisor at CS-UmU: Lars Karlsson Examiner: Fredrik Georgsson
More informationIntel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation
Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation Alexander Kalinkin Anton Anders Roman Anders 1 Legal Disclaimer INFORMATION IN
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationAdvanced School in High Performance and GRID Computing November Mathematical Libraries. Part I
1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street
More informationParallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware
Parallelism V HPC Profiling John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters
More informationMUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA
The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More information*Yuta SAWA and Reiji SUDA The University of Tokyo
Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationBLAS. Basic Linear Algebra Subprograms
BLAS Basic opera+ons with vectors and matrices dominates scien+fic compu+ng programs To achieve high efficiency and clean computer programs an effort has been made in the last few decades to standardize
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationIBM Research. IBM Research Report
RC 21888 (98472) November 20, 2000 (Last update: September 17, 2018) Computer Science/Mathematics IBM Research Report WSMP: Watson Sparse Matrix Package Part II direct solution of general systems Version
More informationGPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com
GPU Acceleration of Matrix Algebra Dr. Ronald C. Young Multipath Corporation FMS Performance History Machine Year Flops DEC VAX 1978 97,000 FPS 164 1982 11,000,000 FPS 164-MAX 1985 341,000,000 DEC VAX
More informationSCALABLE ALGORITHMS for solving large sparse linear systems of equations
SCALABLE ALGORITHMS for solving large sparse linear systems of equations CONTENTS Sparse direct solvers (multifrontal) Substructuring methods (hybrid solvers) Jacko Koster, Bergen Center for Computational
More informationDan Stafford, Justine Bonnot
Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing
More informationEnabling the ARM high performance computing (HPC) software ecosystem
Enabling the ARM high performance computing (HPC) software ecosystem Ashok Bhat Product manager, HPC and Server tools ARM Tech Symposia India December 7th 2016 Are these supercomputers? For example, the
More informationIntel Math Kernel Library (Intel MKL) Sparse Solvers. Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) Sparse Solvers Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager Copyright 3, Intel Corporation. All rights reserved. Sparse
More informationOptimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí
Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More information