HPC Libraries. Hartmut Kaiser PhD. High Performance Computing: Concepts, Methods & Means

1 High Performance Computing: Concepts, Methods & Means HPC Libraries Hartmut Kaiser PhD Center for Computation & Technology Louisiana State University April 19th, 2007

2 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 2

4 Puzzle of the Day

    #include <stdio.h>

    int main()
    {
        int a = 10;
        switch (a)
        {
            case '1': printf("one\n"); break;
            case '2': printf("two\n"); break;
            defa1ut: printf("none\n");
        }
        return 0;
    }

If you expect the output of the above program to be "none", I would request you to check it out!

5 Application domains Linear algebra BLAS, ATLAS, LAPACK, ScaLAPACK, Slatec, pim Ordinary and partial Differential Equations PETSc Mesh manipulation and Load Balancing METIS, ParMETIS, CHACO, JOSTLE, PARTY Graph manipulation Boost.Graph library Vector/Signal/Image processing VSIPL, PSSL. General parallelization MPI, pthreads Other domain specific libraries NAMD, NWChem, Fluent, Gaussian, LS-DYNA 5

6 Application Domain Overview Linear Algebra Libraries: Provide optimized methods for constructing sets of linear equations, performing operations on them (matrix-matrix products, matrix-vector products) and solving them (factoring, forward & backward substitution). Commonly used libraries include BLAS, ATLAS, LAPACK, ScaLAPACK, PLAPACK PDE Solvers: General-purpose, parallel numerical PDE libraries. Usual toolsets include manipulation of sparse data structures, iterative linear system solvers, preconditioners, nonlinear solvers and time-stepping methods. Commonly used libraries for solving PDEs include SAMRAI, PETSc, PARASOL, Overture, among others. 6

7 Application Domain Overview Mesh manipulation and Load Balancing These libraries help in partitioning meshes in roughly equal sizes across processors, thereby balancing the workload while minimizing size of separators and communication costs. Commonly used libraries for this purpose include METIS, ParMetis, Chaco, JOSTLE among others. Other packages: FFTW: features highly optimized Fourier transform package including both real and complex multidimensional transforms in sequential, multithreaded, and parallel versions. NAMD: molecular dynamics library available for Unix/Linux, Windows, OS X Fluent: computational fluid dynamics package, used for such applications as environment control systems, propulsion, reactor modeling etc. 7

8 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 8

9 BLAS (Updated set of) Basic Linear Algebra Subprograms The BLAS functionality is divided into three levels: Level 1: contains vector operations of the form $y \leftarrow \alpha x + y$, as well as scalar dot products and vector norms Level 2: contains matrix-vector operations of the form $y \leftarrow \alpha A x + \beta y$, as well as solving $Tx = y$ for x with T being triangular Level 3: contains matrix-matrix operations of the form $C \leftarrow \alpha A B + \beta C$, as well as solving $B \leftarrow \alpha T^{-1} B$ for triangular matrices T. This level contains the widely used General Matrix Multiply (GEMM) operation. 9

10 BLAS Several implementations for different languages exist Reference implementation (F77 and C) ATLAS, highly optimized for particular processor architectures A generic C++ template class library providing BLAS functionality: ublas Several vendors provide libraries optimized for their architecture (AMD, HP, IBM, Intel, NEC, NVIDIA, Sun) 10

11 BLAS: F77 naming conventions 11

12 BLAS: C naming conventions F77 routine name is changed to lowercase and prefixed with cblas_ All routines which accept two-dimensional arrays have an additional first parameter specifying the matrix memory layout (row major or column major) Character parameters are replaced by corresponding enum values Input arguments are declared const Non-complex scalar input parameters are passed by value Complex scalar input arguments are passed using a void* Arrays are passed by address Output scalar arguments are passed by address Complex functions become subroutines which return the result via an additional last parameter (void*), appending _sub to the name 12

13 BLAS Level 1 routines Vector operations (xrot, xswap, xcopy etc.) Scalar dot products (xdot etc.) Vector norms (xnrm2, xasum etc.) and index of the largest-magnitude element (IxAMAX) 13

14 BLAS Level 2 routines Matrix-vector operations (xgemv, xgbmv, xhemv, xhbmv etc.) Rank-1 updates (xger, xher etc.) Solving Tx = y for x, where T is triangular (xtrsv, xtbsv, xtpsv) 14

15 BLAS Level 3 routines Matrix-matrix operations (xgemm, xsymm, xtrmm etc.) Solving for triangular matrices (xtrsm) Widely used general matrix-matrix multiply (xgemm) 15
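To make the three BLAS levels concrete, here is a minimal plain-C sketch of one representative operation per level: y <- alpha*x + y (Level 1), y <- alpha*A*x + beta*y (Level 2) and C <- alpha*A*B + beta*C (Level 3). The function names ref_axpy, ref_gemv and ref_gemm are hypothetical and are not part of any BLAS interface; the loops assume dense row-major storage and make no attempt at the blocking or vectorization that optimized libraries rely on.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha*x + y, the operation xAXPY performs. */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha*A*x + beta*y.
 * A is m-by-n, stored row major (an assumption of this sketch). */
void ref_gemv(size_t m, size_t n, double alpha, const double *A,
              const double *x, double beta, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}

/* Level 3 (matrix-matrix): C <- alpha*A*B + beta*C.
 * A is m-by-k, B is k-by-n, C is m-by-n, all row major. */
void ref_gemm(size_t m, size_t n, size_t k, double alpha, const double *A,
              const double *B, double beta, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
    }
}
```

The work per data item grows with the level: O(n) operations on O(n) data for Level 1, O(n^2) on O(n^2) for Level 2, and O(n^3) on O(n^2) for Level 3, which is why Level 3 routines benefit most from cache-aware tuning.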

16 Demo 1 Shows solving a matrix multiplication problem using BLAS expressed in FORTRAN, C, and C++ Shows genericity of ublas, by comparing generic and banded matrix versions Shows newmat, a C++ matrix library which uses operator overloading 16

17 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 17

18 LAPACK Linear Algebra PACKage Written in F77 Provides routines for Solving systems of simultaneous linear equations, Least-squares solutions of linear systems of equations, Eigenvalue problems, Householder transformation to implement QR decomposition on a matrix and Singular value problems Was initially designed to run efficiently on shared memory vector machines Depends on BLAS Has been extended for distributed-memory systems (ScaLAPACK and PLAPACK) 18
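As an illustration of the first item above (solving a system of simultaneous linear equations), the following is a minimal sketch of Gaussian elimination with partial pivoting followed by back substitution. It is not LAPACK code: solve_dense is a hypothetical name, the matrix is assumed to be small, dense and stored row major, and no blocking or calls into BLAS are used.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, written out to keep the sketch free of libm. */
static double abs_val(double v) { return v < 0.0 ? -v : v; }

/* Solves A x = b for a dense n-by-n row-major matrix A.
 * A is overwritten by its eliminated form and b by the solution x.
 * Returns 0 on success, -1 if an exactly zero pivot is found. */
int solve_dense(size_t n, double *A, double *b)
{
    assert(A != NULL && b != NULL && n > 0);

    for (size_t col = 0; col < n; ++col) {
        /* Partial pivoting: pick the row with the largest entry in this column. */
        size_t piv = col;
        for (size_t r = col + 1; r < n; ++r)
            if (abs_val(A[r * n + col]) > abs_val(A[piv * n + col]))
                piv = r;
        if (A[piv * n + col] == 0.0)
            return -1;

        if (piv != col) {
            for (size_t j = 0; j < n; ++j) {
                double tmp = A[col * n + j];
                A[col * n + j] = A[piv * n + j];
                A[piv * n + j] = tmp;
            }
            double tmp = b[col];
            b[col] = b[piv];
            b[piv] = tmp;
        }

        /* Eliminate the entries below the pivot. */
        for (size_t r = col + 1; r < n; ++r) {
            double f = A[r * n + col] / A[col * n + col];
            for (size_t j = col; j < n; ++j)
                A[r * n + j] -= f * A[col * n + j];
            b[r] -= f * b[col];
        }
    }

    /* Back substitution on the upper triangular system. */
    for (size_t i = n; i-- > 0;) {
        double sum = b[i];
        for (size_t j = i + 1; j < n; ++j)
            sum -= A[i * n + j] * b[j];
        b[i] = sum / A[i * n + i];
    }
    return 0;
}
```

A production library would additionally report near-singular pivots and reorganize the loops into blocked Level 3 BLAS calls; this sketch only shows the underlying arithmetic.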

19 LAPACK (Architecture) 19

20 LAPACK naming conventions 20

21 Demo 2 Shows how using a library might speed up the computation considerably 21

22 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 22

23 PETSc (pronounced PET-see) Portable, Extensible Toolkit for Scientific Computation Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations (PDEs) Employs the MPI standard for all message-passing communication Intended for use in large-scale application projects Includes a large suite of parallel linear and nonlinear equation solvers Easily used in application codes written in C, C++, Fortran and Python Good introduction: 23

24 PETSc (general features) Features include: Parallel vectors Scatters (handles communicating ghost point information) Gathers Parallel matrices Several sparse storage formats Easy, efficient assembly. Scalable parallel preconditioners Krylov subspace methods Parallel Newton-based nonlinear solvers Parallel time stepping (ODE) solvers 24

25 PETSc (Architecture) PETSc: Module architecture and layers of abstraction 25

26 PETSc: Component details Vector operations (Vec): Provides the vector operations required for setting up and solving large-scale linear and nonlinear problems. Includes easy-to-use parallel scatter and gather operations, as well as special-purpose code for handling ghost points for regular data structures. Matrix operations (Mat): A large suite of data structures and code for the manipulation of parallel sparse matrices. Includes four different parallel matrix data structures, each appropriate for a different class of problems. Preconditioners (PC): A collection of sequential and parallel preconditioners, including (sequential) ILU(k) (incomplete factorization), LU (lower/upper decomposition), both sequential and parallel block Jacobi, overlapping additive Schwarz methods Time stepping ODE solvers (TS): Code for the time evolution of solutions of PDEs. In addition, provides pseudo-transient continuation techniques for computing steady-state solutions. 26

27 PETSc: Component details Krylov subspace solvers (KSP): Parallel implementations of many popular Krylov subspace iterative methods, including GMRES (Generalized Minimal Residual method), CG (Conjugate Gradient), CGS (Conjugate Gradient Squared), Bi-CG-Stab (BiConjugate Gradient Stabilized), two variants of TFQMR (transpose free QMR), CR (Conjugate Residuals), LSQR (Least Squares QR). All are coded so that they are immediately usable with any preconditioners and any matrix data structures, including matrix-free methods. Non-linear solvers (SNES): Data-structure-neutral implementations of Newton-like methods for nonlinear systems. Includes both line search and trust region techniques with a single interface. Employs by default the above data structures and linear solvers. Users can set custom monitoring routines, convergence criteria, etc. 27
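To show what a Krylov subspace method does at its core, here is a minimal serial sketch of the unpreconditioned Conjugate Gradient (CG) iteration for a small dense symmetric positive definite matrix. It does not use PETSc: cg_solve and its parameters are hypothetical names, the matrix is stored densely in row-major order, and the stopping test compares the squared residual norm against tol2 so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* y <- A*x for a dense n-by-n row-major matrix. */
static void matvec(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = dot(n, &A[i * n], x);
}

/* Conjugate Gradient for symmetric positive definite A.
 * x holds the initial guess on entry and the approximate solution on exit.
 * Iterates until r.r <= tol2 or max_iter steps have been taken.
 * Returns the number of iterations, or -1 if allocation fails. */
int cg_solve(size_t n, const double *A, const double *b, double *x,
             double tol2, int max_iter)
{
    assert(A != NULL && b != NULL && x != NULL && n > 0);

    double *r = malloc(n * sizeof *r);
    double *p = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    int it = 0;

    if (r == NULL || p == NULL || Ap == NULL) {
        free(r); free(p); free(Ap);
        return -1;
    }

    /* Initial residual r = b - A*x and first search direction p = r. */
    matvec(n, A, x, Ap);
    for (size_t i = 0; i < n; ++i) {
        r[i] = b[i] - Ap[i];
        p[i] = r[i];
    }
    double rr = dot(n, r, r);

    while (it < max_iter && rr > tol2) {
        matvec(n, A, p, Ap);
        double alpha = rr / dot(n, p, Ap);      /* step length along p */
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;              /* keeps directions A-conjugate */
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++it;
    }

    free(r); free(p); free(Ap);
    return it;
}
```

The only place the matrix appears is inside matvec, which is what makes matrix-free variants possible: any routine that applies the operator to a vector can stand in for the stored matrix.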

28 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 28

29 Mesh libraries Introduction Structured/unstructured meshes Examples Mesh decomposition 29

30 Introduction to Meshes and Grids Mesh/Grid : 2D or 3D representation of the computational domain. Common 2D meshes are composed of triangular or quadrilateral elements Common 3D meshes are composed of hexahedral, tetrahedral or pyramidal elements Quadrilateral Triangle 2D Mesh elements Hexahedron Prism Tetrahedron 3D Mesh elements 30

31 Structured/Unstructured Meshes Structured Grids (Meshes) Cartesian grids, logically rectangular grids Mesh info accessed implicitly using grid point indices Efficient in both computation and storage Typically use finite difference discretization Unstructured Meshes Mesh connectivity information must be stored Incurs additional memory and computational cost Handles complex geometries and grid adaptivity Typically use finite volume or finite element discretization Mesh quality becomes a concern 31
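The storage difference between the two mesh kinds can be sketched in a few lines of C. In the structured case a grid point is located by index arithmetic alone; in the unstructured case the connectivity has to be stored and searched. The names grid_index, Triangle and triangles_at_vertex are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Structured grid: point (i, j) on an nx-by-ny grid maps to one linear
 * index. Neighbours are implicit: east is index + 1, north is index + nx. */
size_t grid_index(size_t i, size_t j, size_t nx, size_t ny)
{
    assert(i < nx && j < ny);
    return j * nx + i;
}

/* Unstructured mesh: each triangle explicitly lists its three vertices. */
typedef struct {
    size_t v[3];
} Triangle;

/* Counts the triangles touching a vertex. Without extra adjacency tables
 * this requires a scan over the stored connectivity. */
size_t triangles_at_vertex(const Triangle *tris, size_t ntris, size_t vertex)
{
    size_t count = 0;
    assert(tris != NULL);
    for (size_t t = 0; t < ntris; ++t) {
        for (int k = 0; k < 3; ++k) {
            if (tris[t].v[k] == vertex) {
                ++count;
                break;
            }
        }
    }
    return count;
}
```

Real unstructured codes usually precompute vertex-to-element adjacency to avoid the scan, which is part of the additional memory cost the slide mentions.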

32 Mesh examples 32

33 Meshes are used for Computation 33

34 Mesh Decomposition Goal is to maximize interior while minimizing connections between subdomains. That is, minimize communication. Such decomposition problems have been studied in load balancing for parallel computation. Lots of choices: METIS, ParMETIS -- University of Minnesota. PARTI -- University of Maryland, CHACO -- Sandia National Laboratories, JOSTLE -- University of Greenwich, PARTY -- University of Paderborn, SCOTCH -- Université Bordeaux, TOP/DOMDEC -- NAS at NASA Ames Research Center. 34

35 Mesh Decomposition Load balancing Distribute elements evenly across processors. Each processor should have equal share of work. Communication costs should be minimized. Minimize sub-domain boundary elements. Minimize number of neighboring domains. Distribution should reflect machine architecture. Communication versus calculation. Bandwidth versus latency. Note that optimizing load balance and communication cost simultaneously is an NP-hard problem. 35
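The two competing goals above can be measured for any given partition. The sketch below computes the edge cut (the number of mesh edges whose endpoints lie on different processors, a common proxy for communication volume) and the load imbalance (largest part size divided by the ideal size). The names Edge, edge_cut and load_imbalance are hypothetical; the partitioning libraries named in this lecture use their own data structures and more refined cost models.

```c
#include <assert.h>
#include <stddef.h>

/* An undirected edge between vertices a and b of the mesh graph. */
typedef struct {
    size_t a, b;
} Edge;

/* Number of edges whose endpoints are assigned to different parts.
 * part[v] is the part (processor) owning vertex v. */
size_t edge_cut(const Edge *edges, size_t nedges, const int *part)
{
    size_t cut = 0;
    assert(edges != NULL && part != NULL);
    for (size_t e = 0; e < nedges; ++e)
        if (part[edges[e].a] != part[edges[e].b])
            ++cut;
    return cut;
}

/* Size of the largest part divided by the ideal size n / nparts.
 * 1.0 means perfectly balanced; larger values mean more imbalance. */
double load_imbalance(const int *part, size_t n, int nparts)
{
    size_t largest = 0;
    assert(part != NULL && n > 0 && nparts > 0);
    for (int p = 0; p < nparts; ++p) {
        size_t count = 0;
        for (size_t v = 0; v < n; ++v)
            if (part[v] == p)
                ++count;
        if (count > largest)
            largest = count;
    }
    return (double)largest * (double)nparts / (double)n;
}
```

For a 2-by-2 grid of vertices split into two halves, both metrics are easy to check by hand: two edges are cut and each part holds exactly half of the vertices.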

36 Mesh decomposition

37 Static and Dynamic Meshes Static Grids (Meshes) Decomposition need only be carried out once Static decomposition may therefore be carried out as a preprocessing step, often done in serial Dynamic Meshes Decomposition must be adapted as underlying mesh or processor load changes. Dynamic decomposition therefore becomes part of the calculation itself and cannot be carried out solely as a pre-processing step. 37

38 HP J CPU Solve Time: 13:26 Baseline Time src : Amy Apon, 38

39 Linux Cluster 2 CPUs Solve Time: 5:20 Speed-Up: 2.5X src: Amy Apon, 39

40 Linux Cluster 4 CPUs Solve Time: 3:07 Speed-Up: 4.3X src: Amy Apon, 40

41 Linux Cluster 8 CPUs Solve Time: 1:51 Speed-Up: 7.3X src: Amy Apon, 41

42 Linux Cluster 16 CPUs Solve Time: 1:03 Speed-Up: 12.8X src: Amy Apon, 42
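The speed-up figures on these slides follow directly from the solve times: speed-up is the baseline time divided by the parallel time. The sketch below reproduces that arithmetic and adds the per-CPU efficiency (speed-up divided by CPU count). The function names are hypothetical, and because the baseline was measured on a different machine than the cluster, the efficiency values are only indicative.

```c
#include <assert.h>

/* Converts a solve time given as minutes and seconds to seconds. */
int to_seconds(int minutes, int seconds)
{
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

/* Speed-up: baseline time divided by parallel time. */
double speedup(int baseline_s, int parallel_s)
{
    assert(baseline_s > 0 && parallel_s > 0);
    return (double)baseline_s / (double)parallel_s;
}

/* Efficiency: speed-up per CPU; 1.0 would be ideal linear scaling. */
double efficiency(int baseline_s, int parallel_s, int ncpus)
{
    assert(ncpus > 0);
    return speedup(baseline_s, parallel_s) / (double)ncpus;
}
```

With the baseline of 13:26 (806 s), the 2-CPU run of 5:20 (320 s) gives 806 / 320, about 2.5, and the 16-CPU run of 1:03 (63 s) gives about 12.8, matching the slides; the corresponding efficiency at 16 CPUs is roughly 0.8.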

43 Speedup due to decomposition [Figure: run-times (s) versus # CPUs]

44 Jostle and Metis

45 Jostle

46 Jostle

47 Jostle

48 Metis

49 ParMetis

50 Metis (serial)

51 Comparison

52 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 52

53 FFTW Fastest Fourier Transform in the West Portable C subroutine library for computing discrete cosine/sine transform (DCT/DST) Computes arbitrary size discrete Fourier and Hartley transforms on real or complex data, in one or more dimensions Optimized for speed through application of special-purpose compiler genfft (codelet generator), originally written in OCaml; performance comparable even with vendor optimized libraries Free software, distributed under GPL; also available under a commercial license from MIT Developed at MIT by Matteo Frigo and Steven G. Johnson Won J. H. Wilkinson Prize for Numerical Software in 1999 Most recent stable version is ( 53

54 Main FFTW Features C and FORTRAN interfaces, C++ wrappers available Speed, including support for SSE, SSE2, 3DNow! and AltiVec Arbitrary size transforms with complexity of O(n log(n)) (sizes whose only prime factors are 2, 3, 5 and 7 are most efficient by default, but custom code can also be generated for other sizes if required) Even/odd data (DCT/DST), types I-IV Can produce pure real output, or process pure real input data Efficient handling of multiple, strided transforms (e.g. transformation of multiple arrays at once; one dimension of multi-dimensional array; one field of multi-component array) Parallel code supporting Cilk, SMP platforms with threads, or MPI Ability to save and restore plans optimized for a given platform (through wisdom mechanism) Portable to any platform with a working C compiler 54

55 FFTW Sample Code Computing 1-D complex DFT

    #include <fftw3.h>
    ...
    {
        fftw_complex *in, *out;
        fftw_plan p;
        ...
        /* allocate input and output arrays of N complex values */
        in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
        out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
        /* populate in[] with input data */
        /* create a plan for a forward transform of size N */
        p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
        ...
        fftw_execute(p); /* repeat as needed */
        /* transform now available in out[] */
        ...
        /* release the plan and both arrays */
        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
    }

Source:

56 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 56

57 The Boost Libraries What's Boost What's important Other stuff 57

58 What is Boost? Data Structures, Containers, Iterators, and Algorithms String and Text Processing Function Objects and Higher-Order Programming Generic Programming and Template Metaprogramming Math and Numerics Input/Output Miscellaneous Mostly header only 58

59 What's important OS abstraction Thread: OS independent kernel level thread interface Asio: asynchronous input output Filesystem: file system operations such as file copy, delete, directory create, file path handling System: OS error code abstraction and handling Program options: handling of command line arguments and parameters Streams: build your own C++ streams DateTime: Handling of dates, times and time periods Timer: simple timer object 59

60 What's important Data types, Container types, all extending STL Pointer containers: allow for pointers in STL containers: vector<char *> becomes ptr_vector<char> Multi index: data structures with multiple indices Constant sized arrays: array<char, 10>, acts like vector or plain C array Any: can hold values of any type (if you need polymorphism) Variant: can hold values of any of the types specified at compile time (C equivalent is discriminated union) Optional: can hold a value or nothing Tuple: like a vector or array, but every element may have a different type (similar to plain struct) Graph library: very sophisticated collection of graph related data structures and algorithms Parallel version exists (using MPI) 60
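The slide names the discriminated union as the C equivalent of Variant. A minimal sketch of that C idiom follows; the type and function names (Value, ValueKind, make_int and so on) are hypothetical. Unlike the Boost class, nothing here is enforced by the compiler: the tag has to be kept in sync with the stored member by hand, and the check in get_int only fires at run time.

```c
#include <assert.h>
#include <stddef.h>

/* The tag records which member of the union is currently valid. */
typedef enum { VAL_INT, VAL_DOUBLE, VAL_TEXT } ValueKind;

typedef struct {
    ValueKind kind;
    union {
        int i;
        double d;
        const char *s;
    } as;
} Value;

/* Constructors set the tag and the matching member together. */
Value make_int(int i)
{
    Value v;
    v.kind = VAL_INT;
    v.as.i = i;
    return v;
}

Value make_double(double d)
{
    Value v;
    v.kind = VAL_DOUBLE;
    v.as.d = d;
    return v;
}

Value make_text(const char *s)
{
    Value v;
    v.kind = VAL_TEXT;
    v.as.s = s;
    return v;
}

/* Accessor with a run-time check that the requested type is the stored one. */
int get_int(const Value *v)
{
    assert(v != NULL && v->kind == VAL_INT);
    return v->as.i;
}

/* Dispatch on the tag, the manual counterpart of visiting a variant. */
const char *kind_name(const Value *v)
{
    assert(v != NULL);
    switch (v->kind) {
    case VAL_INT:    return "int";
    case VAL_DOUBLE: return "double";
    case VAL_TEXT:   return "text";
    }
    return "unknown";
}
```

The main thing the C++ class adds over this pattern is that the set of alternatives is fixed at compile time and type errors are caught by the compiler rather than by an assertion.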

61 What's important Helper classes Smart pointers: working with pointers without having to worry about memory management Memory pools: specialized memory allocation for containers Iterator library: write your own iterator classes with ease (non trivial otherwise) 61

62 Other stuff in Boost String and Text processing Regex, parsing, format, conversion etc. Algorithms String algos, FOR_EACH, minmax etc. Math and numerics Conversion, interval, random, octonion, quaternion, special functions, rational, ublas Functional and higher order programming Bind, lambda, function, ref, signals etc. Generic and template metaprogramming Proto, mpl, fusion, phoenix, enable_if etc. Testing Unit tests, concept checks, static_assert 62

63 Conclusion Look at Boost first if you need something not available in the Standard Library Even if it's not in Boost, look around: there are a lot of libraries in preparation for Boost (Boost Sandbox, File Vault) 63

64 Links Boost, current release V Web: CVS: Boost Sandbox CVS: File Vault: Boost mailing lists 64

65 Outlook Elliptic PDE discretized by Finite Volume Functional specification with a Domain Specific Embedded Language (DSEL) equation = sum<vertex_edge> [ sumf<edge_vertex>(0.0, _e) [ pot * orient(_e, _1) ] * A / d * eps ] - V * rho References: [1] 65

66 References 1. Rene Heinzl, Modern Application Design using Modern Programming Paradigms and a Library-Centric Software Approach, OOPSLA 2006, Workshop on Library Centric Software Design, Portland, Oregon, October

67 Outline Introduction to High Performance Libraries Linear Algebra Libraries (BLAS, LAPACK) PDE Solvers (PETSc) Mesh manipulation and load balancing (METIS/ParMETIS, JOSTLE) Special purpose libraries (FFTW) General purpose libraries (C++: Boost) Summary Materials for test 67

68 Summary Material for the Test High performance libraries: 5, 6, 7 Linear algebra libraries: BLAS: 9, 11, 12 Linear algebra libraries: LAPACK: 18 PDE Solvers: 23, 24, 26, 27 Mesh decomposition & load balancing: 30, 31, 34, 35, 37, 44, 45, 46, 48, 49 FFTW: 53, 54 Boost: 58, 59, 60, 61, 62
