Scientific Software Ecosystems


1 Scientific Software Ecosystems Michael A. Heroux Sandia National Laboratories Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL

2 My Background
1998 to now: Staff member at Sandia National Labs. Lead these projects:
Trilinos: collection of scientific libraries (more later), trilinos.org.
xsdk: meta-collection of scientific libraries (more later), xsdk.info.
ECP libraries: oversight of 15 Exascale Computing Project libraries for DOE.
Mantevo: miniapps project for HPC co-design, mantevo.org.
IDEAS Productivity: scientific software productivity, ideas-productivity.org.
HPCG Benchmark: complementary benchmark for the Top 500, hpcg-benchmark.org.
Concerned with scalable algorithms for HPC.
Concurrent: Scientist in Residence, St. John's University, MN, USA.
Before Sandia: staff member at Cray Research. '88 to '93: math libraries developer (sparse solvers, LAPACK, BLAS); then application analyst in the computational engineering group; then scalable systems applications specialist.
Always: interested in numerical linear algebra for HPC. 2

3 Goals for this presentation Motivate the need for and value of reusable scientific software. Understand the sparse linear algebra ecosystem. Look ahead to next-generation systems. 3

4 Basic Concepts
Framework: APIs plus working software (defaults); inversion of control; extensibility. Scope: big, ubiquitous.
Toolkit: plug-and-play libraries, insertable. Scope: small, local.
Lightweight framework: goal is the best of framework and toolkit.
Ecosystem: everything. 4

5 Modern Scientific App Design Goal
Classic approach: Develop an application. The app has its own framework, with no reuse intended; it makes some use of libraries (toolkit components).
Desired approach: Compose the application within an ecosystem. Adapt lightweight framework elements; for example, use CMake, Doxygen, and unit testing frameworks. Integrate and tune libraries: load balancing, solvers, etc. 5

6 Extreme-scale Science Application (MyApp)
Domain component interfaces: data mediator interactions; hierarchical organization; multiscale/multiphysics coupling.
Shared data objects: meshes; matrices, vectors.
Library interfaces: parameter lists; interface adapters; function calls.
Native code & data objects: single-use code; coordinated component use; application specific.
Documentation content: source markup; embedded examples.
Testing content: unit tests; test fixtures.
Build content: rules; parameters.
Extreme-Scale Scientific Software Ecosystem / Extreme-Scale Scientific SW Dev Kit (xsdk):
Domain components: reacting flow, etc. Reusable.
Libraries: solvers, etc. Interoperable.
Frameworks & tools: doc generators; test and build frameworks.
SW engineering: productivity tools; models, processes. 6

7 MyApp: Small fraction of total lines of code. Orchestrates use of a large body of existing software. (Same application/ecosystem diagram as slide 6.) 7

8 MyApp: Small fraction of total lines of code. Orchestrates use of a large body of existing software. (Same diagram as slide 6.) 8
xsdk: Large collection of modular, parametrizable, reusable software components, tools, and policies. State of the art, always improving.

9 Some Popular Ecosystems (Frameworks) Cactus: FEniCS: Charm++: PETSc: 9

10 Matlab
Matrix Laboratory: industrial-quality technical computing platform. Many toolboxes for important problem domains. A very rational productivity option on a single compute node. Some distributed parallel support, but more complicated.
Solve Ax = b in Matlab? x = A\b;
The backslash operator represents a complex decision tree. Considerations: size, sparsity, condition number, ... Tim Davis: the backslash guy.
If Matlab works for your problem sizes, use it. In other cases it makes a great prototyping environment. 10

11 Problem Solving Environments
Many productivity-enhancing environments: NumPy, Julia, others.
Python wrappers: SWIG-based and others. Wrap high-performance libraries underneath. Example: PyTrilinos.
Can be the right tool (and compete with Matlab), especially for exploration, but even for production settings.
Not generally used on supercomputers, although always discussed. 11

12 Why Use Math Libraries 12

13 A farmer had chickens and pigs. There was a total of 60 heads and 200 feet. How many chickens and how many pigs did the farmer have?
Let x be the number of chickens, y the number of pigs. Then:
x + y = 60
2x + 4y = 200
From the first equation x = 60 - y, so replace x in the second equation: 2(60 - y) + 4y = 200.
Solve for y: 120 - 2y + 4y = 200, so 2y = 80, y = 40.
Solve for x: x = 60 - 40 = 20.
The farmer has 20 chickens and 40 pigs. 13

14 A restaurant owner purchased one box of frozen chicken and another box of frozen pork for $60. Later the owner purchased 2 boxes of chicken and 4 boxes of pork for $200. What is the cost of a box of frozen chicken and a box of frozen pork?
Let x be the price of a box of chicken, y the price of a box of pork. Then:
x + y = 60
2x + 4y = 200
From the first equation x = 60 - y, so replace x in the second equation: 2(60 - y) + 4y = 200.
Solve for y: 120 - 2y + 4y = 200, so 2y = 80, y = 40.
Solve for x: x = 60 - 40 = 20.
A box of chicken costs $20 and a box of pork costs $40. 14

15 Problem Statement: A restaurant owner purchased one box of frozen chicken and another box of frozen pork for $60. Later the owner purchased 2 boxes of chicken and 4 boxes of pork for $200. What is the cost of a box of frozen chicken and a box of frozen pork?
Variables: Let x be the price of a box of chicken, y the price of a box of pork.
Problem Setup:
x + y = 60
2x + 4y = 200
Solution Method: From the first equation x = 60 - y, so replace x in the second equation: 2(60 - y) + 4y = 200. Solve for y: 120 - 2y + 4y = 200, so 2y = 80, y = 40. Solve for x: x = 60 - 40 = 20.
Translate Back: A box of chicken costs $20. A box of pork costs $40. 15

16 Why Math Libraries?
Many types of problems. Similar mathematics.
App separation of concerns:
- Problem statement.
- Translation to math.
- Set up problem.
- Solve problem (e.g., hand off to SuperLU).
- Translate back. 16
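To make the "solve problem" step concrete, here is a minimal sketch of handing the 2x2 system above to a library instead of solving it by hand. It uses the LAPACKE C interface to LAPACK's dgesv; the availability of lapacke.h and the -llapacke link flag are assumptions about your installation.

#include <cstdio>
#include <lapacke.h>

int main() {
  // Row-major A = [1 1; 2 4] and right-hand side b = [60; 200].
  double A[4] = {1.0, 1.0,
                 2.0, 4.0};
  double b[2] = {60.0, 200.0};
  lapack_int ipiv[2]; // pivot indices from the LU factorization

  // dgesv factors A = LU and solves in place: on exit b holds x.
  lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
  if (info != 0) return 1; // info > 0 means A was singular

  std::printf("chicken = %g, pork = %g\n", b[0], b[1]); // expect 20 and 40
  return 0;
}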

17 Importance of Math Libraries
Computer solution of math problems is hard:
Floating point arithmetic is not exact: 1 + ε = 1 for small ε > 0; (a + b) + c is not always equal to a + (b + c).
High fidelity leads to large problems: 1M to 10B equations.
Clusters require coordinated solution across 100 to 1M processors.
Sophisticated solution algorithms and libraries are leveraged:
Solver expertise is highly specialized and expensive.
Write code once, use it in many settings.
Maintenance cost of successful software? 70 to 80% of the total. Roll-your-own is not just about writing the code; use of libraries spreads that cost. 17
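Both floating-point facts above are easy to observe. A small self-contained C++ demonstration, with values chosen for IEEE double precision:

#include <iostream>

int main() {
  // 1 + eps == 1 for small eps > 0: eps below machine epsilon is absorbed.
  double eps = 1.0e-17; // double's machine epsilon is about 2.2e-16
  std::cout << ((1.0 + eps) == 1.0) << "\n"; // prints 1 (true)

  // (a + b) + c need not equal a + (b + c).
  double a = 1.0e16, b = -1.0e16, c = 1.0;
  std::cout << ((a + b) + c) << "\n"; // prints 1
  std::cout << (a + (b + c)) << "\n"; // prints 0: c is lost when added to b
  return 0;
}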

18 Your Turn: List reasons to use or not use libraries 18

19 Your Turn: List reasons to use or not use libraries
Leverage investment across more than one app.
Separate the concerns of using functionality from providing it.
Cost of owning software is development + maintenance.
Risk if the library team stops development. Mitigate by having your own API, with adapters for multiple libraries.
Lack of availability on all systems (including emerging ones). Mitigate by having your own (inferior) portable capability.
Complexity increase from library dependencies. Mitigate by using libraries with similar complexity. 19

20 Sparse Direct Methods
Construct L and U, lower and upper triangular respectively, such that LU = A.
Solve Ax = b in two steps: 1. Ly = b; 2. Ux = y.
Symmetric versions: LL^T = A, LDL^T = A.
When are direct methods effective?
1D: Always, even on many, many processors.
2D: Almost always, except on many, many processors.
2.5D: Most of the time.
3D: Only for small/medium problems on small/medium processor counts.
Bottom line: direct sparse solvers should always be in your toolbox. 20
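The two triangular solves are simple once the factors exist. The sketch below shows them for a dense, row-major matrix; a real sparse direct solver performs the same operations on the sparse factored structure.

#include <vector>

// Forward substitution: solve L y = b, L lower triangular, nonzero diagonal.
std::vector<double> forwardSolve(const std::vector<double>& L,
                                 const std::vector<double>& b, int n) {
  std::vector<double> y(n);
  for (int i = 0; i < n; ++i) {
    double sum = b[i];
    for (int j = 0; j < i; ++j) sum -= L[i*n + j] * y[j];
    y[i] = sum / L[i*n + i];
  }
  return y;
}

// Back substitution: solve U x = y, U upper triangular, nonzero diagonal.
std::vector<double> backSolve(const std::vector<double>& U,
                              const std::vector<double>& y, int n) {
  std::vector<double> x(n);
  for (int i = n - 1; i >= 0; --i) {
    double sum = y[i];
    for (int j = i + 1; j < n; ++j) sum -= U[i*n + j] * x[j];
    x[i] = sum / U[i*n + i];
  }
  return x;
}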

21 Sparse Direct Solver Packages
HSL, MUMPS, Pardiso, PaStiX, SuiteSparse, SuperLU, UMFPACK, WSMP, Trilinos/Amesos/Amesos2.
Notes:
All have threaded parallelism.
All but SuiteSparse and UMFPACK have distributed-memory (MPI) parallelism.
MUMPS, PaStiX, SuiteSparse, SuperLU, Trilinos, and UMFPACK are freely available.
HSL, Pardiso, and WSMP are available freely, with restrictions.
Some research efforts on GPUs; I am unaware of any products.
Emerging hybrid packages:
STRUMPACK: Sherry Li, Pieter Ghysels.
HIPS: Gaidamour, Henon.
Trilinos/ShyLU: Rajamanickam et al. 21

22 Other Sparse Direct Solver Packages
Legacy packages that are open source but not under active development today: TAUCS, PSPASES, BCSLib.
Eigen: newer and active, but sequential only (for sparse solvers). Sparse Cholesky (including LDL^T), sparse LU, sparse QR. Wrappers to quite a few third-party sparse direct solvers. 22

23 Emerging Trend in Sparse Direct
New work on low-rank approximations to off-diagonal blocks.
Typically: off-diagonal blocks in the factorization are stored as dense matrices.
New: these blocks have low rank (up to the accuracy needed for solution), so they can be represented by an approximate SVD.
Still uncertain how broad the impact will be. Will the off-diagonal blocks continue to have low rank for hard problems?
Potential: could be a breakthrough for extending sparse direct methods to much larger 3D problems. 23

24 Iterative Methods
Given an initial guess for x, called x_0 (x_0 = 0 is acceptable), compute a sequence x_k, k = 1, 2, ..., such that each x_k is closer to x.
Definition of "close": Suppose x_k = x exactly for some value of k. Then r_k = b - A x_k = 0 (the vector of all zeros), and norm(r_k) = sqrt(dot(r_k, r_k)) = 0 (a number).
For any x_k, let r_k = b - A x_k. If norm(r_k) = sqrt(dot(r_k, r_k)) is small (< 1.0e-6, say), then we say that x_k is close to x.
The vector r_k is called the residual vector. 24
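The residual test above is exactly the stopping criterion used in practice. As an illustration (not one of the production solvers discussed later), here is a compact conjugate gradient iteration for symmetric positive definite A, written against a generic matrix-vector product:

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& u, const Vec& v) {
  double s = 0.0;
  for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
  return s;
}

// Conjugate gradient for SPD A, starting from x_0 = 0.
Vec cg(const std::function<Vec(const Vec&)>& matvec, const Vec& b,
       double tol = 1.0e-6, int maxIters = 1000) {
  Vec x(b.size(), 0.0); // x_0 = 0 is an acceptable initial guess
  Vec r = b;            // r_0 = b - A x_0 = b
  Vec p = r;
  double rr = dot(r, r);
  for (int k = 0; k < maxIters && std::sqrt(rr) >= tol; ++k) {
    Vec Ap = matvec(p);
    double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i]; // update iterate
      r[i] -= alpha * Ap[i]; // update residual
    }
    double rrNew = dot(r, r);
    double beta = rrNew / rr;
    rr = rrNew;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
  }
  return x; // final residual norm is sqrt(rr)
}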

25 Sparse Iterative Solver Packages
PETSc, hypre, Trilinos, Paralution (manycore; GPL/commercial license), HSL (academic/commercial license), Eigen (sequential CG, BiCGSTAB, ILUT/Sparskit), Sparskit.
Notes:
There are many other efforts, but I am unaware of any that have a broad user base like hypre, PETSc, and Trilinos.
Sparskit, and other software by Yousef Saad, is not a product with a large official user base, but these codes appear as embedded (serial) source code in many applications.
PETSc and Trilinos support threading, distributed memory (MPI), and growing functionality for accelerators.
Many of the direct solver packages support some kind of iteration, if only iterative refinement. 25

26 Which Type of Solver to Use?
Dimension | Type | Notes
1D | Direct | Often tridiagonal (Thomas algorithm; periodic version).
2D, very easy | Iterative | If you have a good initial guess, e.g., a transient simulation.
2D, otherwise | Direct | Almost always better than iterative.
2.5D | Direct | Example: shell problems. Good ordering can keep fill low.
3D, smooth | Direct? | Emerging methods for low-rank SVD representation.
3D, easy | Iterative | Simple preconditioners: diagonal scaling. CG or BiCGSTAB.
3D, harder | Iterative | Preconditioners: IC, ILU (with domain decomposition if in parallel).
3D, hard | Iterative | Use GMRES (without restart if possible).
3D + large | Iterative | Add multigrid, geometric or algebraic. 26

27 Details about Sparse Matrices 27

28 General Sparse Matrix Example:

A = [ a11  0    0    0    0    a16 ]
    [ 0    a22  a23  0    0    0   ]
    [ 0    a32  a33  0    a35  0   ]
    [ 0    0    0    a44  0    0   ]
    [ 0    0    a53  0    a55  a56 ]
    [ a61  0    0    0    a65  a66 ] 28

29 Compressed Row Storage (CRS) Format
AKA CSR format. Idea: Create
1 length-nnz array of nonzero values,
1 length-nnz array of column indices,
1 length-(m+1) array of row pointers:

double * values      = new double[nnz];
int    * colindices  = new int[nnz];
int    * rowpointers = new int[m+1];

nnz: number of nonzero terms in the matrix. m: matrix dimension. 29

30 Compressed Row Storage (CRS) Format
Fill the arrays as follows:

rowpointers[0] = 0;
double * curvalueptr = values;
int * curindicesptr = colindices;
for (int i = 0; i < m; ++i) { // for each row
  int numrowentries = /* number of nonzero entries in row i */;
  rowpointers[i+1] = rowpointers[i] + numrowentries;
  for (int j = 0; j < numrowentries; ++j) { // for each entry in row i
    *curvalueptr++   = /* value of j-th nonzero entry in row i */;
    *curindicesptr++ = /* column index of j-th nonzero entry in row i */;
  }
} 30

31 CRS Example

A = [ 4  0  0  1 ]
    [ 0  3  0  2 ]
    [ 0  0  6  0 ]
    [ 5  0  9  8 ]

values = {4, 1, 3, 2, 6, 5, 9, 8}
colindices = {0, 3, 1, 3, 2, 0, 2, 3}
rowpointers = {0, 2, 4, 5, 8} 31

32 Your turn: CRS Example

A = [ 4  2  0  0 ]
    [ 0  3  0  2 ]
    [ 0  0  0  0 ]
    [ 5  7  9  8 ]

values =
colindices =
rowpointers = 32

33 Your turn: CRS Example

A = [ 4  2  0  0 ]
    [ 0  3  0  2 ]
    [ 0  0  0  0 ]
    [ 5  7  9  8 ]

values = {4, 2, 3, 2, 5, 7, 9, 8}
colindices = {0, 1, 1, 3, 0, 1, 2, 3}
rowpointers = {0, 2, 4, 4, 8} 33

34 Serial Sparse MV

int sparsemv(int m, double * values, int * colindices, int * rowpointers,
             double * x, double * y) {
  for (int i = 0; i < m; ++i) {
    double sum = 0.0;
    int curnumentries = rowpointers[i+1] - rowpointers[i];
    double * curvals = &values[rowpointers[i]];
    int * curinds = &colindices[rowpointers[i]];
    for (int j = 0; j < curnumentries; ++j)
      sum += curvals[j] * x[curinds[j]];
    y[i] = sum;
  }
  return 0;
} 34
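Applying sparsemv to the CRS arrays from the slide-31 example, with x the vector of all ones (this assumes the sparsemv function above is in scope):

int main() {
  double values[]      = {4, 1, 3, 2, 6, 5, 9, 8};
  int    colindices[]  = {0, 3, 1, 3, 2, 0, 2, 3};
  int    rowpointers[] = {0, 2, 4, 5, 8};
  double x[4] = {1.0, 1.0, 1.0, 1.0};
  double y[4];
  sparsemv(4, values, colindices, rowpointers, x, y);
  // Row sums of A: y = {4+1, 3+2, 6, 5+9+8} = {5, 5, 6, 22}.
  return 0;
}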

35 Many, Many Sparse Formats
ELLPACK: Determine the max number of nonzeros per row. Make the nonzero count constant for all rows; pad with zeros for rows with fewer. Eliminates the rowpointers array. Works well if the nonzero count is nearly uniform. (A minimal SpMV sketch for this layout follows.)
Hybrid CRS/ELLPACK, and many more variations.
CCS: column version. Has a transpose relationship with CRS. More common for sparse direct solvers. 35
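For contrast with the CRS sparsemv above, a minimal sketch of the same kernel in ELLPACK layout; the row-major values/colindices arrays of size m*maxnnz, with zero-padding and valid column indices in padded slots, replace the rowpointers array:

int ellpackmv(int m, int maxnnz, const double * values, const int * colindices,
              const double * x, double * y) {
  for (int i = 0; i < m; ++i) {
    double sum = 0.0;
    for (int j = 0; j < maxnnz; ++j) {
      // Entry j of row i lives at i*maxnnz + j; padded zeros add nothing.
      sum += values[i*maxnnz + j] * x[colindices[i*maxnnz + j]];
    }
    y[i] = sum;
  }
  return 0;
}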

36 Trilinos Overview 36

37 What is Trilinos? Object-oriented software framework for Solving big complex science & engineering problems. Large collection of reusable scientific capabilities. More like LEGO bricks than Matlab. 37

38 Background/Motivation 38

39 Optimal Kernels to Optimal Solutions:
- Geometry, meshing.
- Discretizations, load balancing.
- Scalable linear, nonlinear, eigen, transient, optimization, UQ solvers.
- Scalable I/O, GPU, manycore.
- 60+ packages.
- Other distributions: Cray LIBSCI; public repo.
- Thousands of users. Worldwide distribution. Laptops to leadership systems.

40 Trilinos Strategic Goals
Algorithmic goals:
Scalable Computations: As problem size and processor counts increase, the cost of the computation will remain nearly fixed.
Hardened Computations: Never fail unless the problem is essentially intractable, in which case we diagnose and inform the user why the problem fails and provide a reliable measure of error.
Full Vertical Coverage: Provide leading-edge enabling technologies through the entire technical application software stack: from problem construction through solution, analysis, and optimization.
Software goals:
Universal Interoperability: All Trilinos packages, and important external packages, will be interoperable, so that any combination of packages and external software (e.g., PETSc, hypre) that makes sense algorithmically will be possible within Trilinos.
Universal Accessibility: All Trilinos capabilities will be available to users of major computing environments: C++, Fortran, Python, and the Web, and from the desktop to the latest scalable systems.
Universal Solver RAS: Trilinos will be Reliable (leading-edge hardened, scalable solutions for each of these applications), Available (integrated into every major application at Sandia), and Serviceable (self-sustaining).

41 Product Leaders: Layer of Proactive Leadership
Products: Framework (J. Willenbring); Data Services (K. Devine); Linear Solvers (S. Rajamanickam); Nonlinear Solvers (R. Pawlowski); Discretizations (M. Perego).
Product focus: new, stronger leadership model. Focus on published APIs, high cohesion within a product, low coupling across products, and deliberate product-level upstream planning & design. 41

42 Unique Features of Trilinos
Huge library of algorithms: linear and nonlinear solvers, preconditioners; optimization, transients, sensitivities, uncertainty, ...
Growing support for multicore & hybrid CPU/GPU: built into the new Tpetra linear algebra objects, and therefore into iterative solvers with zero effort! Unified intranode programming model: Kokkos. Spreading into the whole stack: multigrid, sparse factorizations, element assembly.
Support for mixed and arbitrary precisions: don't have to rebuild Trilinos to use it.
Support for flexible 2D sparse partitioning: useful for graph analytics and other data science apps.
Support for huge (> 2B unknowns) problems. 42

43 Trilinos Access
Trilinos current: 58 packages.
Website:
GitHub (preferred): 43

44 Trilinos software organization 44

45 Trilinos Package Summary
Objective | Package(s)
Discretizations:
Meshing & Discretizations | STK, Intrepid, Pamgen, Sundance, ITAPS, Mesquite
Time Integration | Rythmos
Methods:
Automatic Differentiation | Sacado
Mortar Methods | Moertel
Services:
Linear algebra objects | Epetra, Tpetra, Kokkos, Xpetra
Interfaces | Thyra, Stratimikos, RTOp, FEI, Shards
Load Balancing | Zoltan, Isorropia, Zoltan2
Skins | PyTrilinos, WebTrilinos, ForTrilinos, CTrilinos, Optika
C++ utilities, I/O, thread API | Teuchos, EpetraExt, Kokkos, Triutils, ThreadPool, Phalanx, Trios
Solvers:
Iterative linear solvers | AztecOO, Belos, Komplex
Direct sparse linear solvers | Amesos, Amesos2, ShyLU
Direct dense linear solvers | Epetra, Teuchos, Pliris
Iterative eigenvalue solvers | Anasazi, RBGen
ILU-type preconditioners | AztecOO, IFPACK, Ifpack2, ShyLU
Multilevel preconditioners | ML, CLAPS, MueLu
Block preconditioners | Meros, Teko
Nonlinear system solvers | NOX, LOCA, Piro
Optimization (SAND) | MOOCHO, Aristos, TriKota, GlobiPack, OptiPack
Stochastic PDEs | Stokhos

46 Interoperability vs. Dependence ("Can Use" vs. "Depends On")
Although most Trilinos packages have no explicit dependence, packages often must interact with some other packages:
- NOX needs operator, vector, and linear solver objects.
- AztecOO needs preconditioner, matrix, operator, and vector objects.
Interoperability is enabled at configure time. The Trilinos CMake system is the vehicle for establishing interoperability of Trilinos components without compromising individual package autonomy (the Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES option).
The architecture supports simultaneous development on many fronts. 46

47 Trilinos is Made of Packages
Not a monolithic piece of software: like LEGO bricks, not Matlab.
Each package:
- Has its own development team and management.
- Makes its own decisions about algorithms, coding style, etc.
- May or may not depend on other Trilinos packages.
Trilinos is not indivisible:
- You don't need all of Trilinos to get things done.
- Any subset of packages can be combined and distributed.
- The current public release contains ~50 of the 55+ Trilinos packages.
Trilinos top-layer framework:
- Not a large amount of source code: ~1.5%.
- Manages package dependencies, like a GNU/Linux package manager.
- Runs package tests nightly, and on every check-in.
The package model supports multifrontal development. New effort to create apps by gluing Trilinos together: Albany. 47

48 48 Software Development and Delivery

49 Are C++ Templates Safe? No, but they are good.
Templates and sanity upon a shifting foundation: compile-time polymorphism.
Software delivery is an essential activity. How can we: implement mixed-precision algorithms? Implement generic fine-grain parallelism? Support hybrid CPU/GPU computations? Support extended precision? Explore redundant computations? Prepare for both exascale swim lanes? C++ templates are the only sane way, for now.
Template benefits: compile-time polymorphism; true generic programming; no runtime performance hit; strong typing for mixed precision; support for extended precision; many more.
Template drawbacks: huge compile-time performance hit (but good use of multicore :) and eliminated for common data types); complex notation, especially for Fortran & C programmers (can insulate to some extent). 49

50 Solver Software Stack (Phase I packages: SPMD, int/double; Phase II packages: templated)
From bottom to top:
Teuchos: common utilities.
Epetra: distributed linear algebra (matrix/graph and vector problems).
AztecOO: linear equations. Anasazi: eigenproblems. Ifpack, ML, etc.: preconditioners.
NOX: nonlinear problems.
LOCA: bifurcation analysis.
Tempus: transient problems (DAEs/ODEs).
ROL: optimization (unconstrained and constrained).
Sacado: sensitivities (automatic differentiation). 50

51 Solver Software Stack (Phase I and II packages, plus Phase III packages: manycore*, templated)
Each Phase I/II layer gains a templated, manycore-capable counterpart:
Kokkos*: on-node parallelism foundation.
Epetra, now also Tpetra*: distributed linear algebra.
AztecOO, now also Belos*: linear equations. Anasazi: eigenproblems. Ifpack, ML, etc., now also Ifpack2*, MueLu*, etc.: preconditioners.
NOX / T-NOX: nonlinear problems. LOCA / T-LOCA: bifurcation analysis.
Tempus: transient problems (DAEs/ODEs). ROL: optimization. Sacado: sensitivities (automatic differentiation). Teuchos: common utilities. 51

52 52 Using Trilinos Linear Solvers

53 Trilinos Package Summary
Objective | Package(s)
Discretizations:
Meshing & Discretizations | STKMesh, Intrepid, Pamgen, Sundance, Mesquite
Time Integration | Rythmos
Methods:
Automatic Differentiation | Sacado
Mortar Methods | Moertel
Services:
Linear algebra objects | Epetra, Tpetra
Interfaces | Xpetra, Thyra, Stratimikos, RTOp, FEI, Shards
Load Balancing | Zoltan, Isorropia, Zoltan2
Skins | PyTrilinos, WebTrilinos, ForTrilinos, CTrilinos, Optika
Utilities, I/O, thread API | Teuchos, EpetraExt, Kokkos, Triutils, ThreadPool, Phalanx
Solvers:
Iterative linear solvers | AztecOO, Belos, Komplex
Direct sparse linear solvers | Amesos, Amesos2, ShyLU
Incomplete factorizations | AztecOO, IFPACK, Ifpack2
Multilevel preconditioners | ML, CLAPS, MueLu
Direct dense linear solvers | Epetra, Teuchos, Pliris
Iterative eigenvalue solvers | Anasazi
Block preconditioners | Meros, Teko
Nonlinear solvers | NOX, LOCA
Optimization | MOOCHO, Aristos, TriKota, GlobiPack, OptiPack
Stochastic PDEs | Stokhos

54 AztecOO
Iterative linear solvers: CG, GMRES, BiCGSTAB, ...
Incomplete factorization preconditioners.
Aztec was Sandia's workhorse solver: extracted from the MPSalsa reacting flow code; installed in dozens of Sandia apps; many external licenses.
AztecOO improves on Aztec by: using Epetra objects for defining matrices and vectors; providing more preconditioners & scalings; using C++ class design to enable more sophisticated use.
The AztecOO interface allows continued use of Aztec functionality and the introduction of new solver capabilities outside of Aztec. 54

55 Belos
Next-generation linear iterative solvers.
Decouples algorithms from linear algebra objects: the linear algebra library has full control over data layout and kernels. An improvement over AztecOO, which controlled vector & matrix layout. Essential for hybrid (MPI+X) parallelism.
Solves problems that apps really want to solve, faster:
Multiple right-hand sides: AX = B.
Sequences of related systems: (A + ΔA_k) X_k = B + ΔB_k.
Many advanced methods for these types of systems: block & pseudoblock solvers (GMRES & CG); recycling solvers (GCRO-DR (GMRES) & CG); seed solvers (hybrid GMRES); block orthogonalizations (TSQR).
Supports arbitrary & mixed precision, complex, ...
If you have a choice, pick Belos over AztecOO. 55
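A hedged sketch of a typical Belos solve on Tpetra objects follows; the class and parameter names (Belos::LinearProblem, Belos::PseudoBlockGmresSolMgr, "Maximum Iterations", "Convergence Tolerance") reflect the Belos API as commonly documented, but check your Trilinos version for exact details.

#include <BelosLinearProblem.hpp>
#include <BelosPseudoBlockGmresSolMgr.hpp>
#include <Teuchos_ParameterList.hpp>
#include <Tpetra_CrsMatrix.hpp>

using SC = double;
using MV = Tpetra::MultiVector<SC>;
using OP = Tpetra::Operator<SC>;

void solveWithBelos(Teuchos::RCP<const OP> A, Teuchos::RCP<MV> X,
                    Teuchos::RCP<const MV> B) {
  // Belos sees only abstract operator/multivector interfaces.
  auto problem = Teuchos::rcp(new Belos::LinearProblem<SC, MV, OP>(A, X, B));
  problem->setProblem(); // must be called once A, X, B are set

  auto params = Teuchos::parameterList();
  params->set("Maximum Iterations", 500);
  params->set("Convergence Tolerance", 1.0e-8);

  Belos::PseudoBlockGmresSolMgr<SC, MV, OP> solver(problem, params);
  Belos::ReturnType result = solver.solve(); // Belos::Converged on success
  (void)result;
}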

56 Ifpack(2): Algebraic Preconditioners
Preconditioners: overlapping domain decomposition; incomplete factorizations (within an MPI process); (block) relaxations & Chebyshev.
Accepts the user matrix via an abstract matrix interface; uses {E,T}petra for basic matrix/vector calculations.
Perturbation stabilizations & condition estimation.
Can be used by all other Trilinos solver packages.
Ifpack2: the Tpetra version of Ifpack. Supports arbitrary precision & complex arithmetic. Path forward to hybrid-parallel factorizations. 56
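The lifecycle that makes Ifpack(2) usable from any solver package is setParameters / initialize / compute. A hedged sketch for Ifpack2; the factory call, template defaults, and the ILUT option name are assumptions about your Trilinos version:

#include <Ifpack2_Factory.hpp>
#include <Tpetra_CrsMatrix.hpp>

using row_matrix_type = Tpetra::RowMatrix<double>;

auto makeIlut(Teuchos::RCP<const row_matrix_type> A) {
  // The factory picks the preconditioner class by name.
  auto prec = Ifpack2::Factory::create("ILUT", A);

  Teuchos::ParameterList params;
  params.set("fact: ilut level-of-fill", 2.0); // assumed ILUT option name
  prec->setParameters(params);

  prec->initialize(); // symbolic setup: matrix structure only
  prec->compute();    // numeric setup: uses current matrix values
  return prec;        // ready for apply() inside a Krylov solver
}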

57 Amesos2
Direct solver interface for the Tpetra stack.
Typical usage: preOrdering(), symbolicFactorization(), numericFactorization(), solve().
Easy to support new solvers (current support for all the SuperLU variants).
Easy to support new multivectors and sparse matrices.
Can support third-party solver-specific parameters with little change.
Available in the current release of Trilinos.
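In code, the typical usage pattern above looks roughly like this hedged sketch; "SuperLU" must be an enabled third-party solver in your Trilinos build, and exact headers/typedefs may differ by version:

#include <Amesos2.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_MultiVector.hpp>

using MAT = Tpetra::CrsMatrix<double>;
using MV  = Tpetra::MultiVector<double>;

void directSolve(Teuchos::RCP<const MAT> A, Teuchos::RCP<MV> X,
                 Teuchos::RCP<const MV> B) {
  auto solver = Amesos2::create<MAT, MV>("SuperLU", A, X, B);
  solver->preOrdering();            // optional fill-reducing ordering
  solver->symbolicFactorization();  // analyze structure once
  solver->numericFactorization();   // factor values (repeat if values change)
  solver->solve();                  // triangular solves for this B
}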

58 ML: Multi-level Preconditioners
Smoothed aggregation, multigrid, & domain decomposition.
Critical technology for scalable performance of many apps.
ML is compatible with other Trilinos packages: accepts Epetra sparse matrices & dense vectors; ML preconditioners can be used by AztecOO, Belos, & Anasazi. Can also be used independent of other Trilinos packages.
Next-generation version of ML: MueLu. Works with Epetra or Tpetra objects (via the Xpetra interface). 58

59 MueLu: Next-Gen Algebraic Multigrid
Motivation for replacing ML:
Improve maintainability & ease development of new algorithms.
Decouple computational kernels from algorithms: ML is mostly monolithic (~50K lines of code); MueLu relies more on other Trilinos packages.
Exploit Tpetra features: MPI+X (the Kokkos programming model mitigates risk); 64-bit global indices (to solve problems with >2B unknowns); arbitrary Scalar types (Tramonto runs MueLu with double-double). Works with Epetra or Tpetra (via the Xpetra common interface).
Facilitate algorithm development: energy minimization methods; geometric or classic algebraic multigrid; mix methods together.
Better support for preconditioner reuse: explore options between "blow it away" & "reuse without change". 59

60 ShyLU and Subdomain Solvers: Overview
Stack: Amesos2 and Ifpack2 sit above ShyLU, which builds on KLU2, Basker, Tacho, Fast-ILU, and KokkosKernels (SGS, triangular solve (HTS)).
MPI+X based subdomain solvers. Decouple the notion of one MPI rank as one subdomain: subdomains can span multiple MPI ranks, each with its own subdomain solver using X or MPI+X.
Subpackages of ShyLU, with multiple Kokkos-based options for on-node parallelism:
Basker: LU or ILU(t) factorization.
Tacho: incomplete Cholesky, IC(k).
Fast-ILU: fast ILU factorization for GPUs.
KokkosKernels: coloring-based Gauss-Seidel (M. Deveci), triangular solves (A. Bradley), lots more work (with Christian Trott).
Under active development. Jointly funded by ASC, ATDM, FASTMath, LDRD.

61 Abstract solver interfaces & applications 61

62 Stratimikos Package
Uniform run-time interface to many different packages:
Linear solvers: Amesos, AztecOO, Belos, ...
Preconditioners: Ifpack, ML, ...
Defines a common interface to create and use linear solvers.
Reads in options through a Teuchos::ParameterList. Can change the solver and its options at run time. Can validate options, & read them from a string or XML file.
Accepts any linear system objects that provide an E/Tpetra_Operator / E/Tpetra_RowMatrix view of the matrix, and vector views (e.g., E/Tpetra_MultiVector) for the right-hand side and initial guess.

63 Stratimikos Parameter List and Sublists

<ParameterList name="Stratimikos">
  <Parameter name="Linear Solver Type" type="string" value="AztecOO"/>
  <Parameter name="Preconditioner Type" type="string" value="Ifpack"/>
  <ParameterList name="Linear Solver Types">
    <ParameterList name="Amesos">
      <Parameter name="Solver Type" type="string" value="Klu"/>
      <ParameterList name="Amesos Settings">
        <Parameter name="MatrixProperty" type="string" value="general"/>
        ...
        <ParameterList name="Mumps"> ... </ParameterList>
        <ParameterList name="Superludist"> ... </ParameterList>
      </ParameterList>
    </ParameterList>
    <ParameterList name="AztecOO">
      <ParameterList name="Forward Solve">
        <Parameter name="Max Iterations" type="int" value="400"/>
        <Parameter name="Tolerance" type="double" value="1e-06"/>
        <ParameterList name="AztecOO Settings">
          <Parameter name="Aztec Solver" type="string" value="GMRES"/>
          ...
        </ParameterList>
      </ParameterList>
      ...
    </ParameterList>
    <ParameterList name="Belos"> ... (details omitted) ... </ParameterList>
  </ParameterList>
  <ParameterList name="Preconditioner Types">
    <ParameterList name="Ifpack">
      <Parameter name="Prec Type" type="string" value="ILU"/>
      <Parameter name="Overlap" type="int" value="0"/>
      <ParameterList name="Ifpack Settings">
        <Parameter name="fact: level-of-fill" type="int" value="0"/>
        ...
      </ParameterList>
    </ParameterList>
    <ParameterList name="ML"> ... (details omitted) ... </ParameterList>
  </ParameterList>
</ParameterList>

Top-level parameters select the linear solver and preconditioner; sublists are passed on to package code. Every parameter and sublist is handled by Thyra code and is fully validated.

64 Stratimikos Parameter List and Sublists (continued)
The same parameter list as the previous slide, but with:
<Parameter name="Linear Solver Type" type="string" value="Belos"/>
<Parameter name="Preconditioner Type" type="string" value="ML"/>
The solver/preconditioner is changed by a single argument. The parameter list is standard XML and can be read from the command line, a file, or a string, or hand-coded.

65 Stratimikos Details
Stratimikos has just one primary class: Stratimikos::DefaultLinearSolverBuilder. An instance of this class accepts a parameter list that defines:
Linear solver: Amesos, AztecOO, Belos.
Preconditioner: Ifpack, ML, AztecOO.
Albany and other apps access solvers through Stratimikos.
The parameter list is standard XML and can be read from the command line, read from a file, passed in as a string, defined interactively, or hand-coded in source code. 65
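A hedged sketch of the C++ side: build the one primary class, hand it the parameter list (here read from an XML file named stratimikos.xml, a hypothetical file name), and ask for a solver strategy:

#include <Stratimikos_DefaultLinearSolverBuilder.hpp>
#include <Teuchos_XMLParameterListHelpers.hpp>

void buildSolverFactory() {
  Stratimikos::DefaultLinearSolverBuilder builder;

  // Any of the sources listed above works; here, an XML file.
  auto params = Teuchos::getParametersFromXmlFile("stratimikos.xml");
  builder.setParameterList(params);

  // Creates the solver strategy named by "Linear Solver Type".
  Teuchos::RCP<Thyra::LinearOpWithSolveFactoryBase<double>> lowsFactory =
      builder.createLinearSolveStrategy("");
  (void)lowsFactory; // use with Thyra operator/vector objects
}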

66 A Glimpse at Current and Future Work

67 Node-level concurrency

68 Trilinos Linear Solvers
Sparse linear algebra (Kokkos/KokkosKernels/Tpetra): threaded construction; sparse graphs, (block) sparse matrices, dense vectors; parallel solve kernels; parallel communication & redistribution.
Iterative (Krylov) solvers (Belos): CG, GMRES, TFQMR, recycling methods.
Sparse direct solvers (Amesos2).
Algebraic iterative methods (Ifpack2): Jacobi, SOR, polynomial, incomplete factorizations, additive Schwarz.
Shared-memory factorizations (ShyLU): LU, ILU(k), ILUt, IC(k), iterative ILU(k); direct+iterative preconditioners.
Segregated block solvers (Teko).
Algebraic multigrid (MueLu).
Foundation: KokkosKernels. 68

69 Must Support >3 Architectures
Coming systems to support: Trinity (Intel Haswell & KNL); Sierra (NVIDIA GPUs + IBM multicore CPUs); plus everything else.
3 different architectures: multicore CPUs (big cores), manycore CPUs (small cores), GPUs (highly parallel).
MPI only, & MPI + threads. Threads don't always pay on non-GPU architectures today; porting to threads must not slow down the MPI-only case. 69

70 Kokkos: Performance, Portability, & Productivity
(Diagram: LAMMPS, Trilinos, Sierra, and Albany all sit on top of Kokkos, which targets diverse memory systems: HBM, DDR, ...) 70

71 Kokkos: Common C++-based programming model for thread parallelism on GPUs, CPUs, ...
Parallel {for, reduce, scan} with custom user code.
Exposes different levels of parallelism: flat [0,N), or hierarchical (team, thread, vector). Experimental task parallelism too!
Different memory & execution spaces: control where data live & code executes; enable hybrid (host + GPU) parallelism.
Multidimensional arrays (Kokkos::View) with slices: decouple array layout (row/column-major, tiled, ...) from the app; the default layout is optimized for the architecture (SoA / AoS); unified interface to shared memory, texture fetch, atomic access, ...
Goal: write code once, run well on many different back-ends. 71
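A minimal sketch of these constructs: a Kokkos::View plus flat parallel_for and parallel_reduce, which compile unchanged for the serial, OpenMP, or CUDA back-ends:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000000;
    Kokkos::View<double*> x("x", N), y("y", N); // layout chosen per back-end

    // Flat [0,N) parallelism; the same code runs on CPU threads or a GPU.
    Kokkos::parallel_for("fill", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    double result = 0.0;
    Kokkos::parallel_reduce("dot", N, KOKKOS_LAMBDA(const int i, double& sum) {
      sum += x(i) * y(i);
    }, result);
  }
  Kokkos::finalize();
  return 0;
}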

72 Kokkos protects us against:
Hardware divergence.
Programming model diversity.
No threads at all: the Kokkos::Serial back-end.
Kokkos semantics require vectorizable (ivdep) loops: expose parallelism to exploit later.
The hierarchical parallelism model encourages exploiting locality.
Kokkos protects our HUGE time investment in porting Trilinos. Kokkos is our hedge. 72

73 Foundational Technology: KokkosKernels
Provides BLAS (1, 2, 3), sparse, graph, and tensor kernels.
Kokkos-based: performance portable. Interfaces to vendor libraries where applicable (MKL, cuSPARSE, ...).
Goal: provide kernels for all levels of the node hierarchy:
Socket | thread teams, shared L3 | e.g., full solve.
Core | thread parallelism, shared L1/L2 | e.g., subdomain solve.
Hyperthread | vector parallelism, synch-free | e.g., matrix row x vector.
Vector lane | elemental functions, serial | e.g., 3x3 DGEMM. 73
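A hedged sketch of calling the KokkosKernels sparse matrix-vector kernel (y = alpha*A*x + beta*y); the header and overload names follow KokkosSparse as commonly documented, but verify against your KokkosKernels version:

#include <KokkosSparse_CrsMatrix.hpp>
#include <KokkosSparse_spmv.hpp>

using device_type = Kokkos::DefaultExecutionSpace::device_type;
using matrix_type = KokkosSparse::CrsMatrix<double, int, device_type>;

void applyA(const matrix_type& A, const Kokkos::View<double*>& x,
            const Kokkos::View<double*>& y) {
  const double alpha = 1.0, beta = 0.0;
  // "N" means no transpose; the kernel dispatches to a vendor library
  // (e.g., MKL or cuSPARSE) when one is enabled, else to Kokkos code.
  KokkosSparse::spmv("N", alpha, A, x, beta, y);
}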

74 Performance Portability Example: SpMV
A single implementation with parameterized hierarchical parallelism beats vendor libraries for relevant matrix sizes. (Chart: normalized time of Kokkos vs. MKL and Kokkos vs. cuSPARSE, for all matrices and for matrices > 7 MB.) 74

75 Nalu: a CFD Application as Prototype
Nalu production (master): patterns and data structures shared with IC production apps; uses the Trilinos solver stack; scaling demonstrated to >500k cores; 2D/3D sliding/overset mesh; multiphysics CHT.
Migrating Nalu with Kokkos: prototype Kokkos assembly; used to harden the threaded solver stack; get Nalu ready for the Trinity Phase II demo. 75

76 Making Matrix Assembly Thread Scalable
Simple heat conduction problem, 2M elements, strong scaling. (Chart: a portability gap of roughly 2.5x and 1.5x between architectures.) 76

77 Containers

78 Trilinos Usage via Docker
WebTrilinos tutorial:

docker pull johntfoster/trilinos
docker pull johntfoster/peridigm
docker run --name peridigm0 -d -v `pwd`:/output johntfoster/peridigm \
    Peridigm fragmenting_cylinder.peridigm

Etc.
Containerization technology. Think: virtual machine, but lightweight and native. Composable, layered. Driven by the data science community.

79 First Docker MPI Results (Sean Deal)
SJU cluster, Epetra basic performance test: MatVec, lower solve, Norm2, Dot, Update. Harmonic mean of 5 tests, 4M equations per process, 48 MPI ranks. (Chart compares native vs. Docker performance.) 79

80 Future Docker/Shifter Efforts
Standard developer environment: enables reproducibility (error states).
Contains all third-party libraries: Trilinos can use dozens of TPLs (SuperLU, MUMPS, ParMETIS, etc.; MKL? Licensing?).
Contains one or a few compiler versions: enables uniform error detection/correction; no ambiguity about the build/test environment.
Productivity improvement: libraries pre-built (a full build can take hours).
Specialized containers: Shifter, Docker for supercomputing; easy access to GPUs. 80

81 The Extreme-Scale Scientific Software Development Kit (xsdk)

Extreme-scale Science Applications and the Extreme-scale Scientific Software Ecosystem (the same application/ecosystem diagram as slide 6). Focus of key accomplishments: xsdk foundations.

83 Building the foundation of a highly effective extreme-scale scientific software ecosystem 83
Focus: increasing the functionality, quality, and interoperability of important scientific libraries, domain components, and development tools.
xsdk release 0.2.0: April 2017 (soon). Website: xsdk.info.
Spack package installation: ./spack install xsdk
Package interoperability:
Numerical libraries: hypre, PETSc, SuperLU, Trilinos.
Domain components: Alquimia, PFLOTRAN.
xsdk community policies: address challenges in interoperability and sustainability of software developed by diverse groups at different institutions.
Impact: improved code quality, usability, access, sustainability. Inform potential users that an xsdk member package can be easily used with other xsdk packages. Foundation for work on performance portability and deeper levels of package interoperability.

84 xsdk Community Installation Policies: GNU Autoconf and CMake Options
Motivation: obtaining, configuring, and installing multiple independent software packages is tedious and error prone. Need consistency of compiler (+version, options), third-party packages, etc.
Approach: define xsdk community installation policies to which all xsdk packages will subscribe and be tested. Do not require all packages to use the same installation software, merely that they follow the same interface; maximum flexibility for each package to choose its own toolchain.
Policies: a standard subset of configure and CMake options for xsdk and other HPC packages, to make configuration and installation as efficient as possible on common platforms, including standard Linux distributions and Mac OS X, as well as target machines at DOE computing facilities (ALCF, NERSC, and OLCF).
Topics include: selecting compilers and compiler flags, creating packages with debugging information, building shared libraries, building interfaces for a particular language, determining precision, determining index size, setting the location of BLAS and LAPACK, and determining other packages and include directories.

85 xsdk Community Policies (Draft 0.3, Dec)
xsdk compatible package: must satisfy the mandatory xsdk policies:
M1. Support xsdk community GNU Autoconf or CMake options.
M2. Provide a comprehensive test suite.
M3. Employ a user-provided MPI communicator.
M4. Give best effort at portability to key architectures.
M5. Provide a documented, reliable way to contact the development team.
M6. Respect system resources and settings made by other previously called packages.
M7. Come with an open source license.
M8. Provide a runtime API to return the current version number of the software.
M9. Use a limited and well-defined symbol, macro, library, and include-file name space.
M10. Provide an accessible repository (not necessarily publicly available).
M11. Have no hardwired print or I/O statements.
M12. Allow installing, building, and linking against an outside copy of external software.
M13. Install headers and libraries under <prefix>/include/ and <prefix>/lib/.
M14. Be buildable using 64-bit pointers; 32-bit is optional.
Recommended policies, currently encouraged but not required:
R1. Have a public repository.
R2. Be possible to run the test suite under valgrind in order to test for memory corruption issues.
R3. Adopt and document a consistent system for error conditions/exceptions.
R4. Free all system resources as soon as they are no longer needed.
R5. Provide a mechanism to export an ordered list of library dependencies.
xsdk member package: an xsdk-compatible package that uses or can be used by another package in the xsdk, where the connecting interface is regularly tested for regressions.

86 xsdk Package Interoperability (Release 0.2.0, April 2017 (soon))
xsdk numerical libraries:
hypre: high-performance preconditioners for the solution of large, sparse linear systems, featuring a semi-structured interface.
PETSc: suite of data structures and routines for the scalable solution of PDE-based applications. Includes linear solvers, preconditioners, nonlinear solvers, ODE integrators, and optimization solvers (TAO).
SuperLU: direct solvers for large, sparse, nonsymmetric linear systems on distributed-memory systems with hybrid node architectures.
Trilinos: collection of approximately 60 packages of reusable scientific software. Capabilities include geometry, meshing, discretization, partitioning and load balancing; parallel construction of distributed matrices, graphs, and vectors; parallel, scalable solution of large linear, nonlinear, and transient systems of equations; and solution of embedded optimization and UQ problems.
xsdk domain components:
Alquimia: biogeochemistry interface and wrapper.
PFLOTRAN: reactive flow and transport modeling for surface and subsurface processes.
Package installation:
Spack: package management tool designed to support multiple versions and configurations of software on a wide variety of platforms and environments.
Levels of package interoperability:
Level 1: both packages can be used (side by side) in an application.
Level 2: the libraries can exchange data (or control data) with each other.
Level 3: each library can call the other library to perform unique computations.
Great multi-institutional teamwork! Release lead: Jim Willenbring. 86

87 xsdk release 0.2.0: packages can be readily used in combination by multiphysics, multiscale apps.
Notation: A -> B means A can use B to provide functionality on behalf of A.
(Diagram: Applications A and B and a multiphysics Application C sit on top of the xsdk packages Alquimia, PFLOTRAN, hypre, PETSc, SuperLU, and Trilinos, plus external software such as HDF5 and BLAS, with room for more contributed domain components and libraries.)
xsdk functionality, February 2017: tested on key machines at ALCF, NERSC, and OLCF, and also on Linux and Mac OS X. 87

88 More xsdk Info
Paper: "xsdk Foundations: Toward an Extreme-scale Scientific Software Development Kit," R. Bartlett, I. Demeshko, T. Gamblin, G. Hammond, M. Heroux, J. Johnson, A. Klinvex, X. Li, L.C. McInnes, D. Osei-Kuffuor, J. Sarich, B. Smith, J. Willenbring, U.M. Yang. To appear in Supercomputing Frontiers and Innovations, 2017.
CSE17 posters: "xsdk: Working toward a Community Software Ecosystem"; "Managing the Software Ecosystem with Spack". 88

89 xsdk: Next Steps
xsdk4ecp: enhancements needed for exascale applications: coordinated use of on-node resources; integrated execution; control inversion and adaptive execution strategies; coordinated and sustainable documentation, testing, packaging, and deployment.
Packages working toward xsdk compatibility:
Chombo: software for adaptive solution of PDEs; compatible with all xsdk community policies.
ALExa: Accelerated Libraries for Exascale (AMP, DTK, TASMANIAN).
Dense linear algebra packages: MAGMA, PLASMA, DPLASMA, ScaLAPACK, LAPACK.
SUNDIALS: CVODE(S) and IDA(S) multistep ODE and DAE time integrators (with sensitivities), ARKode multistage IMEX integrator, KINSOL nonlinear solver. 89

90 Your Turn: Collaboration Strategies Scenario: You are part of a new development group, coming together from diverse backgrounds. What specific activities and goals could help the team succeed? 90

91 Your Turn: Collaboration Strategies
Scenario: You are part of a new development group, coming together from diverse backgrounds. What specific activities and goals could help the team succeed?
Don't assume everything will go fine just because you are all great people.
Catalog existing approaches for development practices and processes.
Identify maturity gaps that may lead to frustration. Example: someone does not use source management, others do.
Develop policies for the new team. Declare expected behaviors and practices. Be rigorous and specific.
Establish a plan for all team members to conform to the new team policies. Track progress. 91

92 Final Take-Away Points
Some knowledge of data structures is important; they serve as a reference for other data structures. Example: compressed row storage.
Trilinos is investing heavily in algorithms and software for next-generation systems. We are fortunate to have good funding to explore and implement next-gen. Your investment in next-gen capabilities should wait until you are ready; prepare by learning fundamental concepts while library software matures.
Trilinos, and now the xsdk, are large collections of scientific libraries. Lots to learn, but lots of capabilities, with integrated support as your modeling and simulation matures.
There's a library for that: most linear algebra problems can be solved using a robust, parallel library. Many other libraries exist: become a software archeologist first! Even if you need to write your own, you can use libraries for some steps. 92

93 After You Leave Visit the Trilinos Tutorial site: OnTutorial Use the web portal: /c++/index.html Visit the xsdk site: 93


Ifpack2 User s Guide 1.0 (Trilinos version 12.6) SANDIA REPORT SAND2016-5338 Unlimited Release Printed June 2016 Ifpack2 User s Guide 1.0 (Trilinos version 12.6) Andrey Prokopenko, Christopher M. Siefert, Jonathan J. Hu, Mark Hoemmen, Alicia Klinvex

More information

PETSc Satish Balay, Kris Buschelman, Bill Gropp, Dinesh Kaushik, Lois McInnes, Barry Smith

PETSc   Satish Balay, Kris Buschelman, Bill Gropp, Dinesh Kaushik, Lois McInnes, Barry Smith PETSc http://www.mcs.anl.gov/petsc Satish Balay, Kris Buschelman, Bill Gropp, Dinesh Kaushik, Lois McInnes, Barry Smith PDE Application Codes PETSc PDE Application Codes! ODE Integrators! Nonlinear Solvers,!

More information

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,

More information

Sampling Using GPU Accelerated Sparse Hierarchical Models

Sampling Using GPU Accelerated Sparse Hierarchical Models Sampling Using GPU Accelerated Sparse Hierarchical Models Miroslav Stoyanov Oak Ridge National Laboratory supported by Exascale Computing Project (ECP) exascaleproject.org April 9, 28 Miroslav Stoyanov

More information

Parallel resolution of sparse linear systems by mixing direct and iterative methods

Parallel resolution of sparse linear systems by mixing direct and iterative methods Parallel resolution of sparse linear systems by mixing direct and iterative methods Phyleas Meeting, Bordeaux J. Gaidamour, P. Hénon, J. Roman, Y. Saad LaBRI and INRIA Bordeaux - Sud-Ouest (ScAlApplix

More information

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Speedup Altair RADIOSS Solvers Using NVIDIA GPU Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development

Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Performance Analysis of BLAS Libraries in SuperLU_DIST for SuperLU_MCDT (Multi Core Distributed) Development M. Serdar Celebi

More information

Install your scientific software stack easily with Spack

Install your scientific software stack easily with Spack Install your scientific software stack easily with Spack Les mardis du développement technologique Florent Pruvost (SED) Outline 1. Context 2. Features overview 3. In practice 4. Some feedback Florent

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices CSE 599 I Accelerated Computing - Programming GPUS Parallel Pattern: Sparse Matrices Objective Learn about various sparse matrix representations Consider how input data affects run-time performance of

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Supercomputing and Science An Introduction to High Performance Computing

Supercomputing and Science An Introduction to High Performance Computing Supercomputing and Science An Introduction to High Performance Computing Part VII: Scientific Computing Henry Neeman, Director OU Supercomputing Center for Education & Research Outline Scientific Computing

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 15 Numerically solve a 2D boundary value problem Example:

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

ML 3.1 Smoothed Aggregation User s Guide

ML 3.1 Smoothed Aggregation User s Guide SAND2004 4819 Unlimited Release Printed September 2004 ML 3.1 Smoothed Aggregation User s Guide Marzio Sala Computational Math & Algorithms Sandia National Laboratories P.O. Box 5800 Albuquerque, NM 87185-1110

More information

Early Experiences with Trinity - The First Advanced Technology Platform for the ASC Program

Early Experiences with Trinity - The First Advanced Technology Platform for the ASC Program Early Experiences with Trinity - The First Advanced Technology Platform for the ASC Program C.T. Vaughan, D.C. Dinge, P.T. Lin, S.D. Hammond, J. Cook, C. R. Trott, A.M. Agelastos, D.M. Pase, R.E. Benner,

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes.

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes. DARMA Janine C. Bennett, Jonathan Lifflander, David S. Hollman, Jeremiah Wilke, Hemanth Kolla, Aram Markosyan, Nicole Slattengren, Robert L. Clay (PM) PSAAP-WEST February 22, 2017 Sandia National Laboratories

More information

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers

Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Towards a complete FEM-based simulation toolkit on GPUs: Geometric Multigrid solvers Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

MueLu - AMG Design and Extensibility

MueLu - AMG Design and Extensibility MueLu - AMG Design and Extensibility Tobias Wiesner Andrey Prokopenko Jonathan Hu Sandia National Labs March 3, 2015 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia

More information

Report of Linear Solver Implementation on GPU

Report of Linear Solver Implementation on GPU Report of Linear Solver Implementation on GPU XIANG LI Abstract As the development of technology and the linear equation solver is used in many aspects such as smart grid, aviation and chemical engineering,

More information

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM

Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Efficient Multi-GPU CUDA Linear Solvers for OpenFOAM Alexander Monakov, amonakov@ispras.ru Institute for System Programming of Russian Academy of Sciences March 20, 2013 1 / 17 Problem Statement In OpenFOAM,

More information

Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations

Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations Accelerating the Conjugate Gradient Algorithm with GPUs in CFD Simulations Hartwig Anzt 1, Marc Baboulin 2, Jack Dongarra 1, Yvan Fournier 3, Frank Hulsemann 3, Amal Khabou 2, and Yushan Wang 2 1 University

More information

Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics

Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics Moysey Brio & Paul Dostert July 4, 2009 1 / 18 Sparse Matrices In many areas of applied mathematics and modeling, one

More information

Advances in Parallel Partitioning, Load Balancing and Matrix Ordering for Scientific Computing

Advances in Parallel Partitioning, Load Balancing and Matrix Ordering for Scientific Computing Advances in Parallel Partitioning, Load Balancing and Matrix Ordering for Scientific Computing Erik G. Boman 1, Umit V. Catalyurek 2, Cédric Chevalier 1, Karen D. Devine 1, Ilya Safro 3, Michael M. Wolf

More information

PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures

PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures PARDISO - PARallel DIrect SOlver to solve SLAE on shared memory architectures Solovev S. A, Pudov S.G sergey.a.solovev@intel.com, sergey.g.pudov@intel.com Intel Xeon, Intel Core 2 Duo are trademarks of

More information

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager

Intel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides

More information

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs

Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,

More information

Chris Baker. Mike Heroux Mike Parks Heidi Thornquist (Lead)

Chris Baker. Mike Heroux Mike Parks Heidi Thornquist (Lead) Belos: Next-Generation Iterative e Solvers 2009 Trilinos User Group Meeting November 4, 2009 Chris Baker David Day Mike Heroux Mike Parks Heidi Thornquist (Lead) SAND 2009-7105P Sandia is a multiprogram

More information

Parallel Implementations of Gaussian Elimination

Parallel Implementations of Gaussian Elimination s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n

More information

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville

MAGMA. LAPACK for GPUs. Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville MAGMA LAPACK for GPUs Stan Tomov Research Director Innovative Computing Laboratory Department of Computer Science University of Tennessee, Knoxville Keeneland GPU Tutorial 2011, Atlanta, GA April 14-15,

More information

Lecture 9. Introduction to Numerical Techniques

Lecture 9. Introduction to Numerical Techniques Lecture 9. Introduction to Numerical Techniques Ivan Papusha CDS270 2: Mathematical Methods in Control and System Engineering May 27, 2015 1 / 25 Logistics hw8 (last one) due today. do an easy problem

More information

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1

LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic 1 LAPACK Linear Algebra PACKage 1 Janice Giudice David Knezevic 1 Motivating Question Recalling from last week... Level 1 BLAS: vectors ops Level 2 BLAS: matrix-vectors ops 2 2 O( n ) flops on O( n ) data

More information

Ilya Lashuk, Merico Argentati, Evgenii Ovtchinnikov, Andrew Knyazev (speaker)

Ilya Lashuk, Merico Argentati, Evgenii Ovtchinnikov, Andrew Knyazev (speaker) Ilya Lashuk, Merico Argentati, Evgenii Ovtchinnikov, Andrew Knyazev (speaker) Department of Mathematics and Center for Computational Mathematics University of Colorado at Denver SIAM Conference on Parallel

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision

Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision Cisco Unified Computing System Delivering on Cisco's Unified Computing Vision At-A-Glance Unified Computing Realized Today, IT organizations assemble their data center environments from individual components.

More information

Practical High Performance Computing

Practical High Performance Computing Practical High Performance Computing Donour Sizemore July 21, 2005 2005 ICE Purpose of This Talk Define High Performance computing Illustrate how to get started 2005 ICE 1 Preliminaries What is high performance

More information

Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn

Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn Computational Aspects and Recent Improvements in the Open-Source Multibody Analysis Software MBDyn Pierangelo Masarati, Marco Morandini, Giuseppe Quaranta and Paolo Mantegazza Dipartimento di Ingegneria

More information

Solvers and partitioners in the Bacchus project

Solvers and partitioners in the Bacchus project 1 Solvers and partitioners in the Bacchus project 11/06/2009 François Pellegrini INRIA-UIUC joint laboratory The Bacchus team 2 Purpose Develop and validate numerical methods and tools adapted to problems

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Mathematical Libraries and Application Software on JUQUEEN and JURECA

Mathematical Libraries and Application Software on JUQUEEN and JURECA Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUQUEEN and JURECA JSC Training Course May 2017 I.Gutheil Outline General Informations Sequential Libraries Parallel

More information

PARALUTION - a Library for Iterative Sparse Methods on CPU and GPU

PARALUTION - a Library for Iterative Sparse Methods on CPU and GPU - a Library for Iterative Sparse Methods on CPU and GPU Dimitar Lukarski Division of Scientific Computing Department of Information Technology Uppsala Programming for Multicore Architectures Research Center

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new

More information

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC

Algorithms, System and Data Centre Optimisation for Energy Efficient HPC 2015-09-14 Algorithms, System and Data Centre Optimisation for Energy Efficient HPC Vincent Heuveline URZ Computing Centre of Heidelberg University EMCL Engineering Mathematics and Computing Lab 1 Energy

More information

Mathematical Libraries and Application Software on JUQUEEN and JURECA

Mathematical Libraries and Application Software on JUQUEEN and JURECA Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUQUEEN and JURECA JSC Training Course November 2015 I.Gutheil Outline General Informations Sequential Libraries Parallel

More information

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications

Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Evaluation of Intel Memory Drive Technology Performance for Scientific Applications Vladimir Mironov, Andrey Kudryavtsev, Yuri Alexeev, Alexander Moskovsky, Igor Kulikov, and Igor Chernykh Introducing

More information

Optimising the Mantevo benchmark suite for multi- and many-core architectures

Optimising the Mantevo benchmark suite for multi- and many-core architectures Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of

More information

Solving Sparse Linear Systems. Forward and backward substitution for solving lower or upper triangular systems

Solving Sparse Linear Systems. Forward and backward substitution for solving lower or upper triangular systems AMSC 6 /CMSC 76 Advanced Linear Numerical Analysis Fall 7 Direct Solution of Sparse Linear Systems and Eigenproblems Dianne P. O Leary c 7 Solving Sparse Linear Systems Assumed background: Gauss elimination

More information

Enabling Next-Generation Parallel Circuit Simulation with Trilinos

Enabling Next-Generation Parallel Circuit Simulation with Trilinos Enabling Next-Generation Parallel Circuit Simulation with Trilinos Chris Baker 1, Erik Boman 2, Mike Heroux 2, Eric Keiter 2, Siva Rajamanickam 2, Rich Schiek 2, and Heidi Thornquist 2 1 Oak Ridge National

More information

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs

Distributed NVAMG. Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Distributed NVAMG Design and Implementation of a Scalable Algebraic Multigrid Framework for a Cluster of GPUs Istvan Reguly (istvan.reguly at oerc.ox.ac.uk) Oxford e-research Centre NVIDIA Summer Internship

More information

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster

First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster First Steps of YALES2 Code Towards GPU Acceleration on Standard and Prototype Cluster YALES2: Semi-industrial code for turbulent combustion and flows Jean-Matthieu Etancelin, ROMEO, NVIDIA GPU Application

More information

High Performance Computing Software Development Kit For Mac OS X In Depth Product Information

High Performance Computing Software Development Kit For Mac OS X In Depth Product Information High Performance Computing Software Development Kit For Mac OS X In Depth Product Information 2781 Bond Street Rochester Hills, MI 48309 U.S.A. Tel (248) 853-0095 Fax (248) 853-0108 support@absoft.com

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 1 Don t you just invert

More information