Introduction to Numerical Libraries for HPC. Bilel Hadri. Computational Scientist KAUST Supercomputing Lab.
2 Numerical Libraries: Application Areas
The most used libraries/software in HPC:
- Linear algebra: systems of equations (direct, iterative, and multigrid solvers; sparse and dense systems), eigenvalue problems, least squares
- Signal processing: FFT
- Numerical integration
- Random number generators
3 Numerical Libraries: Motivations
- Don't reinvent the wheel!
- Improve productivity!
- Get better performance: faster and better algorithms.
4 Faster (Better Code)
Achieving the best performance requires creating very processor- and system-specific code.
Example: dense matrix-matrix multiply. Simple to express:

do i = 1, n
  do j = 1, n
    do k = 1, n
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    enddo
  enddo
enddo
5 Performance
How fast should this run? Our matrix-matrix multiply algorithm has 2n^3 floating point operations: 3 nested loops, each with n iterations, and 1 multiply plus 1 add in each inner iteration.
- For n = 100: 2x10^6 operations, about 1 msec on a 2 GHz processor.
- For n = 1000: 2x10^9 operations, or about 1 sec.
Reality: for n = 1000 the naive loop takes about 6 s.
→ An obvious expression of an algorithm is not transformed into leading performance.
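The gap between the textbook loop and an optimized library can be reproduced in a few lines. This is an illustrative sketch (not from the slides) using Python/NumPy, where `a @ b` dispatches to whatever BLAS NumPy was built against; absolute timings are machine-dependent.

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook triple loop: 2*n^3 flops, no blocking, no vectorization."""
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i, k] * b[k, j]
            c[i, j] = s
    return c

n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c1 = naive_matmul(a, b)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
c2 = a @ b                      # dispatches to the underlying BLAS gemm
t_blas = time.perf_counter() - t0

flops = 2.0 * n**3
print(f"naive: {flops / t_naive / 1e9:.3f} GFLOPS")
print(f"BLAS : {flops / t_blas / 1e9:.3f} GFLOPS")
```

Both versions compute the same result; only the speed differs, usually by several orders of magnitude.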
6 Numerical Libraries Packages
- Linear algebra: BLAS / LAPACK / ScaLAPACK / PLASMA / MAGMA / HiCMA, PETSc, HYPRE, Trilinos
- Signal processing: FFTW
- Numerical integration: GSL
- Random number generators: SPRNG
7 Others
- MUMPS (MUltifrontal Massively Parallel sparse direct Solver) is a package of parallel, sparse, direct linear system solvers based on a multifrontal algorithm.
- SuperLU 4.3: SuperLU is a sequential version of SuperLU_DIST and a sequential incomplete LU preconditioner that can accelerate the convergence of Krylov subspace iterative solvers.
- ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering) is a library of routines that partition unstructured graphs and meshes and compute fill-reducing orderings of sparse matrices.
8 Others (continued)
- SUNDIALS (SUite of Nonlinear and DIfferential/Algebraic equation Solvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. In addition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, and KINSOL that is called sundialsTB.
- Scotch is a software package and libraries for sequential and parallel graph partitioning, static mapping, sparse matrix block ordering, and sequential mesh and hypergraph partitioning.
Note: on Shaheen, they are all grouped into cray-tpsl.
See also: Freely Available Software for Linear Algebra.
9 BLAS (Basic Linear Algebra Subprograms)
The BLAS functionality is divided into three levels:
- Level 1 contains vector operations of the form y ← αx + y.
- Level 2 contains matrix-vector operations of the form y ← αAx + βy.
- Level 3 contains matrix-matrix operations of the form C ← αAB + βC.
Several implementations for different languages exist, including a reference implementation.
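As an illustration, the three levels map directly onto SciPy's low-level BLAS wrappers (this sketch assumes SciPy is installed; the slides themselves do not use SciPy here):

```python
import numpy as np
from scipy.linalg import blas

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.eye(3) * 2.0
B = np.ones((3, 3))

# Level 1: vector-vector, y <- alpha*x + y (daxpy; note it updates y in place)
z = blas.daxpy(x, y, a=2.0)

# Level 2: matrix-vector, y <- alpha*A*x (dgemv)
v = blas.dgemv(1.0, A, x)

# Level 3: matrix-matrix, C <- alpha*A*B (dgemm)
C = blas.dgemm(1.0, A, B)
```

Each call goes straight to the compiled BLAS library NumPy/SciPy were linked against, so these are the same kernels a Fortran or C code would reach.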
10 BLAS: naming conventions
Each routine has a name which specifies the operation, the type of matrices involved, and their precision. Names are of the form PMMOO.
Each operation is defined for four precisions (P):
- S: single real
- D: double real
- C: single complex
- Z: double complex
The types of matrices are (MM):
- GE: general; GB: general band
- SY: symmetric; SB: symmetric band; SP: symmetric packed
- HE: hermitian; HB: hermitian band; HP: hermitian packed
- TR: triangular; TB: triangular band; TP: triangular packed
Some of the most common operations (OO):
- DOT: scalar product, x^T y
- AXPY: vector sum, αx + y
- MV: matrix-vector product, Ax
- SV: matrix-vector solve, inv(A) x
- MM: matrix-matrix product, AB
- SM: matrix-matrix solve, inv(A) B
Examples: SGEMM stands for single-precision general matrix-matrix multiply; DGEMM stands for double-precision general matrix-matrix multiply.
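The PMMOO scheme above is mechanical enough to decode in code. The following is a purely hypothetical helper (not part of any BLAS distribution) whose tables cover only the codes listed on this slide:

```python
# Hypothetical decoder for the BLAS PMMOO naming scheme described above.
PRECISION = {"S": "single real", "D": "double real",
             "C": "single complex", "Z": "double complex"}
MATRIX = {"GE": "general", "GB": "general band",
          "SY": "symmetric", "SB": "symmetric band", "SP": "symmetric packed",
          "HE": "hermitian", "HB": "hermitian band", "HP": "hermitian packed",
          "TR": "triangular", "TB": "triangular band", "TP": "triangular packed"}
OPERATION = {"DOT": "scalar product", "AXPY": "vector sum",
             "MV": "matrix-vector product", "SV": "matrix-vector solve",
             "MM": "matrix-matrix product", "SM": "matrix-matrix solve"}

def decode_blas_name(name):
    """Split a BLAS routine name into precision, matrix type, and operation."""
    name = name.upper()
    p, rest = name[0], name[1:]
    if rest in OPERATION:                 # e.g. DOT, AXPY have no matrix part
        return f"{PRECISION[p]} {OPERATION[rest]}"
    for mm, mdesc in MATRIX.items():
        op = rest[len(mm):]
        if rest.startswith(mm) and op in OPERATION:
            return f"{PRECISION[p]} {mdesc} {OPERATION[op]}"
    raise ValueError(f"cannot decode {name}")

print(decode_blas_name("DGEMM"))   # double real general matrix-matrix product
print(decode_blas_name("STRSV"))   # single real triangular matrix-vector solve
```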
11 BLAS Level 1 routines
- Vector operations (xROT, xSWAP, xCOPY, etc.)
- Scalar dot products (xDOT, etc.)
- Vector norms (IxAMAX, etc.)
12 BLAS Level 2 routines
- Matrix-vector operations (xGEMV, xGBMV, xHEMV, xHBMV, etc.)
- Solving Tx = y for x, where T is triangular (xTRSV, etc.)
- Rank-1 updates (xGER, xHER, etc.)
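Two of these Level 2 routines can be sketched through SciPy's wrappers (an assumption of this example; the slides do not show these calls):

```python
import numpy as np
from scipy.linalg import blas

T = np.array([[2.0, 1.0],
              [0.0, 4.0]])          # upper triangular
y = np.array([4.0, 8.0])

# xTRSV: solve T x = y for x, with T triangular (upper by default)
x = blas.dtrsv(T, y)

# xGER: rank-1 update A <- alpha * u v^T + A (A defaults to zero here)
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
A = blas.dger(1.0, u, v)
```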
13 BLAS Level 3 routines
- Matrix-matrix operations (xGEMM, etc.)
- Solving for triangular matrices (xTRSM; xTRMM is the triangular multiply)
- Widely used matrix-matrix multiply (xSYMM, xGEMM)
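A minimal sketch of the triangular solve xTRSM via SciPy (again an illustrative assumption, not from the slides): it solves A X = αB for a whole matrix of right-hand sides at once.

```python
import numpy as np
from scipy.linalg import blas

A = np.array([[2.0, 0.0],
              [1.0, 4.0]])          # lower triangular
B = np.array([[2.0, 4.0],
              [5.0, 14.0]])

# xTRSM: solve A X = alpha * B for the matrix X
X = blas.dtrsm(1.0, A, B, lower=1)
```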
14 LAPACK (Linear Algebra PACKage)
Provides routines for:
- solving systems of simultaneous linear equations,
- least-squares solutions of linear systems of equations,
- eigenvalue problems,
- Householder transformations to implement QR decomposition on a matrix, and
- singular value problems.
It was initially designed to run efficiently on shared-memory vector machines. It depends on BLAS, and has been extended for distributed systems: ScaLAPACK (Scalable Linear Algebra PACKage).
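A single LAPACK driver typically factorizes and solves in one call. As a sketch (assuming SciPy, whose `scipy.linalg.lapack` module wraps the underlying LAPACK), here is xGESV, the general linear-system solver:

```python
import numpy as np
from scipy.linalg import lapack

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([[9.0],
              [8.0]])

# dGEsv: D = double precision, GE = general matrix, SV = solve.
# LU-factorizes A with partial pivoting, then solves A x = b.
lu, piv, x, info = lapack.dgesv(A, b)
```

`info == 0` signals success; a positive value would indicate a singular matrix.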
15 LAPACK naming conventions
Very similar to BLAS: XYYZZZ.
X: data type
- S: REAL; D: DOUBLE PRECISION; C: COMPLEX; Z: COMPLEX*16 or DOUBLE COMPLEX
YY: matrix type
- BD: bidiagonal; DI: diagonal
- GB: general band; GE: general (i.e., unsymmetric, in some cases rectangular); GG: general matrices, generalized problem (i.e., a pair of general matrices); GT: general tridiagonal
- HB: (complex) Hermitian band; HE: (complex) Hermitian; HP: (complex) Hermitian, packed storage
- HG: upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix); HS: upper Hessenberg
- OP: (real) orthogonal, packed storage; OR: (real) orthogonal
- PB: symmetric or Hermitian positive definite band; PO: symmetric or Hermitian positive definite; PP: symmetric or Hermitian positive definite, packed storage; PT: symmetric or Hermitian positive definite tridiagonal
- SB: (real) symmetric band; SP: symmetric, packed storage; ST: (real) symmetric tridiagonal; SY: symmetric
- TB: triangular band; TG: triangular matrices, generalized problem (i.e., a pair of triangular matrices); TP: triangular, packed storage; TR: triangular (or in some cases quasi-triangular); TZ: trapezoidal
- UN: (complex) unitary; UP: (complex) unitary, packed storage
ZZZ: performed computation (linear systems, factorizations, eigenvalue problems, singular value decomposition, etc.)
16 LAPACK routines
17 Numerical Libraries packages
Vendor libraries provide optimized implementations of BLAS, LAPACK, and ScaLAPACK for their processors and platforms. (Figure: Dongarra/ICL.)
18 LAPACK & ScaLAPACK
ScaLAPACK is a library with a subset of LAPACK routines running on distributed-memory machines. ScaLAPACK is designed for heterogeneous computing and is portable to any computer that supports MPI or PVM.
19 Overview of ScaLAPACK
20 Why use LAPACK or ScaLAPACK?
Solving systems of:
- Linear equations: Ax = b
- Least squares: min ||Ax - b||_2
- Eigenvalue problems: Ax = λx
- Singular value problems: A = USV^T
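All four problem classes above have LAPACK-backed drivers in SciPy. A minimal sketch (assuming SciPy; names and data are illustrative only):

```python
import numpy as np
from scipy import linalg

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = linalg.solve(A, b)                      # linear equations: A x = b
x_ls, res, rank, sv = linalg.lstsq(A, b)    # least squares: min ||A x - b||_2
w, V = linalg.eig(A)                        # eigenvalue problem: A x = lambda x
U, s, Vt = linalg.svd(A)                    # singular values: A = U S V^T
```

Each call dispatches to the corresponding LAPACK driver (gesv, gelsd, geev, gesdd) in the library SciPy was built against.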
21 Reference BLAS vs Tuned
- The reference BLAS and LAPACK libraries are reference implementations of the BLAS and LAPACK standards. These are not optimized and not multi-threaded, so not much performance should be expected. They are available for download from Netlib.
- The Automatically Tuned Linear Algebra Software, ATLAS: at compile time, ATLAS automatically chooses the algorithms delivering the best performance. ATLAS does not contain all LAPACK functionality.
- The Goto BLAS: an implementation of the level 3 BLAS aimed at high efficiency.
22 Optimized vendor libraries for BLAS/LAPACK
Highly efficient versions: hand-tuned assembly by hardware vendors, providing near-peak performance. Several vendors provide libraries optimized for their architecture (AMD, Fujitsu, IBM, Intel, NEC, ...):
- Intel → MKL
- Cray → LibSci
- AMD → ACML
- IBM → ESSL
USE them! (Speedups up to 10x or more.)
23 AMD / MKL
- ACML (AMD Core Math Library): LAPACK, BLAS, and extended BLAS (sparse), FFTs (single- and double-precision, real and complex data types). APIs for both Fortran and C.
- MKL (Math Kernel Library): LAPACK, BLAS, and extended BLAS (sparse), FFTs (single- and double-precision, real and complex data types). APIs for both Fortran and C. Use the MKL advisory page to link your code with it.
24 Example with SGEMM
25 Fortran example. Source available at:
26 Linking examples
Library / Compiler / Link flags:
- LibSci on Cray (Cray environment by default; also GNU and Intel): compile without adding flags.
- ACML, GNU: /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a -fopenmp
- ACML, Intel: /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a -openmp -lpthread
- MKL, PGI: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -mp -lpthread
- MKL, GNU: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -L/opt/intel/Compiler/11.1/038/lib/intel64/ -liomp5
- MKL, Intel: -Wl,--start-group /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a /opt/intel/compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group -openmp -lpthread
Use the MKL advisory link line.
27 Compilation demos
On Shaheen (use -Wl,-ysgemm_ to check which optimized library is used):
- with Cray LibSci: ftn -o exe_libsci test_sgemm.f90
- with Intel MKL: unload cray-libsci, then ftn -o exe_libsci test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
On Ibex:
- with reference BLAS: module load blas/3.7.1/gnu; gfortran test_sgemm.f90 -lblas
- with Intel MKL: module load intel/2017; ifort -o exe_mkl test_sgemm.f90 -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
- with ACML (gfortran64 build): module load acml/ gfortran64; gfortran -o exe_acml test_sgemm.f90 -lacml_mp
28 Python fans!
You can speed up your Python scripts by using the scientific libraries NumPy and SciPy built against a vendor-optimized library:
- available with python/ on Shaheen;
- you can build your own by following the instructions;
- with cray-libsci: available next month on Shaheen (cray-python/17.09).
Check the installation with: import numpy as np; np.show_config()
29 Python NumPy check installation
>>> import numpy as np
>>> np.show_config()
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/opt/intel/composer_xe_ /mkl/lib/intel64']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info, blas_mkl_info, mkl_info:
    (same MKL libraries, library_dirs, define_macros, and include_dirs as above)
30 Python: NumPy and SciPy demo
31 Performance formulae
Performance is measured in floating point operations per second: FLOPS, or FLOP/s. Current processors deliver an R_peak in the GFLOPS (10^9 FLOPS) range. The R_peak of a system can be computed by:
R_peak = n_CPU x n_core x n_FPU x f
where n_CPU is the number of CPUs in the system, n_core is the number of computing cores per CPU, n_FPU is the number of floating point units per core, and f is the clock frequency.
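The slide's formula can be sketched directly in code. The node configuration below is purely illustrative (not a specific machine), and the per-core factor is expressed as flops per cycle, which absorbs the number of FP units and their width:

```python
# Peak-performance estimate from the slide's formula:
#   R_peak = n_CPU * n_core * n_FPU * f
def r_peak(n_cpu, n_core, flops_per_cycle, freq_ghz):
    """Return theoretical peak performance in GFLOPS."""
    return n_cpu * n_core * flops_per_cycle * freq_ghz

# Hypothetical 2-socket node: 16 cores/CPU, 16 DP flops/cycle, 2.3 GHz
print(r_peak(2, 16, 16, 2.3), "GFLOPS")
```

For real hardware, the flops-per-cycle factor comes from tables like the one on the next slide.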
32 FLOPs counts for recent processor microarchitectures
- Intel Core 2 and Nehalem: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication); 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
- Intel Sandy Bridge/Ivy Bridge: 8 DP FLOPs/cycle (4-wide AVX addition + 4-wide AVX multiplication); 16 SP FLOPs/cycle (8-wide AVX addition + 8-wide AVX multiplication)
- Intel Haswell: 16 DP FLOPs/cycle (two 4-wide FMA, fused multiply-add, instructions); 32 SP FLOPs/cycle (two 8-wide FMA instructions)
- Intel Skylake / Knights Landing (AVX-512): 32 FLOPs/cycle DP and 64 FLOPs/cycle SP
- AMD K10: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication); 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
- AMD Bulldozer: 8 DP FLOPs/cycle (4-wide FMA); 16 SP FLOPs/cycle (8-wide FMA)
- ARM Cortex-A15: 2 DP FLOPs/cycle (scalar FMA or scalar multiply-add); 8 SP FLOPs/cycle (4-wide NEONv2 FMA or 4-wide NEON multiply-add)
- IBM PowerPC A2 (Blue Gene/Q): 8 DP FLOPs/cycle (4-wide QPX FMA); SP elements are extended to DP and processed on the same units
- Intel MIC (Xeon Phi), per core (supports 4 hyperthreads): 16 DP FLOPs/cycle (8-wide FMA every cycle); 32 SP FLOPs/cycle (16-wide FMA every cycle)
- Intel MIC (Xeon Phi), per thread: 8 DP FLOPs/cycle (8-wide FMA every other cycle); 16 SP FLOPs/cycle (16-wide FMA every other cycle)
33 Rooflines
The roofline is a performance model used to bound the performance of various numerical methods and operations running on processor architectures. (Figure: Lorena A. Barba, Rio Yokota.)
34 Best Practices
- Numerical Recipes books DO NOT provide optimized code (libraries can be 100x faster).
- Don't reinvent the wheel: use optimized libraries!
- It's not only for C++/C/Fortran: Python has an interface with BLAS (check with NumPy/SciPy), as do R, MATLAB, and Cython.
- Don't forget the environment variables!
- The efficient use of numerical libraries can yield significant performance benefits, and should be one of the first things to investigate when optimizing codes.
- The best library implementation often varies depending on the individual routine, and possibly even the size of the input data.
- READ the manual and/or attend the tutorials/workshops!
35 THANKS
More informationAccelerating GPU Kernels for Dense Linear Algebra
Accelerating GPU Kernels for Dense Linear Algebra Rajib Nath, Stan Tomov, and Jack Dongarra Innovative Computing Lab University of Tennessee, Knoxville July 9, 21 xgemm performance of CUBLAS-2.3 on GTX28
More informationScaling Out Python* To HPC and Big Data
Scaling Out Python* To HPC and Big Data Sergey Maidanov Software Engineering Manager for Intel Distribution for Python* What Problems We Solve: Scalable Performance Make Python usable beyond prototyping
More informationIBM Research. IBM Research Report
RC 24398 (W0711-017) November 5, 2007 (Last update: June 28, 2018) Computer Science/Mathematics IBM Research Report WSMP: Watson Sparse Matrix Package Part III iterative solution of sparse systems Version
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationARM High Performance Computing
ARM High Performance Computing Eric Van Hensbergen Distinguished Engineer, Director HPC Software & Large Scale Systems Research IDC HPC Users Group Meeting Austin, TX September 8, 2016 ARM 2016 An introduction
More informationOutline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationNEW ADVANCES IN GPU LINEAR ALGEBRA
GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear
More informationRanger Optimization Release 0.3
Ranger Optimization Release 0.3 Drew Dolgert May 20, 2011 Contents 1 Introduction i 1.1 Goals, Prerequisites, Resources...................................... i 1.2 Optimization and Scalability.......................................
More informationSparse Direct Solvers for Extreme-Scale Computing
Sparse Direct Solvers for Extreme-Scale Computing Iain Duff Joint work with Florent Lopez and Jonathan Hogg STFC Rutherford Appleton Laboratory SIAM Conference on Computational Science and Engineering
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationOP2 FOR MANY-CORE ARCHITECTURES
OP2 FOR MANY-CORE ARCHITECTURES G.R. Mudalige, M.B. Giles, Oxford e-research Centre, University of Oxford gihan.mudalige@oerc.ox.ac.uk 27 th Jan 2012 1 AGENDA OP2 Current Progress Future work for OP2 EPSRC
More information2.7 Numerical Linear Algebra Software
2.7 Numerical Linear Algebra Software In this section we will discuss three software packages for linear algebra operations: (i) (ii) (iii) Matlab, Basic Linear Algebra Subroutines (BLAS) and LAPACK. There
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationSelf Adapting Numerical Software (SANS-Effort)
Self Adapting Numerical Software (SANS-Effort) Jack Dongarra Innovative Computing Laboratory University of Tennessee and Oak Ridge National Laboratory 1 Work on Self Adapting Software 1. Lapack For Clusters
More informationFast-multipole algorithms moving to Exascale
Numerical Algorithms for Extreme Computing Architectures Software Institute for Methodologies and Abstractions for Codes SIMAC 3 Fast-multipole algorithms moving to Exascale Lorena A. Barba The George
More informationUsing Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System
Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationParallel Reduction from Block Hessenberg to Hessenberg using MPI
Parallel Reduction from Block Hessenberg to Hessenberg using MPI Viktor Jonsson May 24, 2013 Master s Thesis in Computing Science, 30 credits Supervisor at CS-UmU: Lars Karlsson Examiner: Fredrik Georgsson
More informationIntel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation
Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation Alexander Kalinkin Anton Anders Roman Anders 1 Legal Disclaimer INFORMATION IN
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationAdvanced School in High Performance and GRID Computing November Mathematical Libraries. Part I
1967-10 Advanced School in High Performance and GRID Computing 3-14 November 2008 Mathematical Libraries. Part I KOHLMEYER Axel University of Pennsylvania Department of Chemistry 231 South 34th Street
More informationParallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware
Parallelism V HPC Profiling John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters
More informationMUMPS. The MUMPS library. Abdou Guermouche and MUMPS team, June 22-24, Univ. Bordeaux 1 and INRIA
The MUMPS library Abdou Guermouche and MUMPS team, Univ. Bordeaux 1 and INRIA June 22-24, 2010 MUMPS Outline MUMPS status Recently added features MUMPS and multicores? Memory issues GPU computing Future
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More information*Yuta SAWA and Reiji SUDA The University of Tokyo
Auto Tuning Method for Deciding Block Size Parameters in Dynamically Load-Balanced BLAS *Yuta SAWA and Reiji SUDA The University of Tokyo iwapt 29 October 1-2 *Now in Central Research Laboratory, Hitachi,
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationBLAS. Basic Linear Algebra Subprograms
BLAS Basic opera+ons with vectors and matrices dominates scien+fic compu+ng programs To achieve high efficiency and clean computer programs an effort has been made in the last few decades to standardize
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationIBM Research. IBM Research Report
RC 21888 (98472) November 20, 2000 (Last update: September 17, 2018) Computer Science/Mathematics IBM Research Report WSMP: Watson Sparse Matrix Package Part II direct solution of general systems Version
More informationGPU Acceleration of Matrix Algebra. Dr. Ronald C. Young Multipath Corporation. fmslib.com
GPU Acceleration of Matrix Algebra Dr. Ronald C. Young Multipath Corporation FMS Performance History Machine Year Flops DEC VAX 1978 97,000 FPS 164 1982 11,000,000 FPS 164-MAX 1985 341,000,000 DEC VAX
More informationSCALABLE ALGORITHMS for solving large sparse linear systems of equations
SCALABLE ALGORITHMS for solving large sparse linear systems of equations CONTENTS Sparse direct solvers (multifrontal) Substructuring methods (hybrid solvers) Jacko Koster, Bergen Center for Computational
More informationDan Stafford, Justine Bonnot
Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing
More informationEnabling the ARM high performance computing (HPC) software ecosystem
Enabling the ARM high performance computing (HPC) software ecosystem Ashok Bhat Product manager, HPC and Server tools ARM Tech Symposia India December 7th 2016 Are these supercomputers? For example, the
More informationIntel Math Kernel Library (Intel MKL) Sparse Solvers. Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) Sparse Solvers Alexander Kalinkin Intel MKL developer, Victor Kostin Intel MKL Dense Solvers team manager Copyright 3, Intel Corporation. All rights reserved. Sparse
More informationOptimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators. Enrique S. Quintana-Ortí
Optimization of Dense Linear Systems on Platforms with Multiple Hardware Accelerators Enrique S. Quintana-Ortí Disclaimer Not a course on how to program dense linear algebra kernels on s Where have you
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationIssues In Implementing The Primal-Dual Method for SDP. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM
Issues In Implementing The Primal-Dual Method for SDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Outline 1. Cache and shared memory parallel computing concepts.
More information