Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equations
Alexander Kalinkin, Anton Anders, Roman Anders
Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vpro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vpro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2013. Intel Corporation. http://intel.com/software/products 2
Agenda
- Problem statement
- Algorithm
- Experiments
- Conclusion
Problem statement
Cons:
- No extra information available about the matrix, only some generic properties (positive definite, Hermitian, etc.)
- Huge size
Pros:
- Clusters with modern Intel CPUs
- Intel Math Kernel Library (Intel MKL) with optimized BLAS, LAPACK, and PARDISO functionality
Algorithm (Ax = b)
Input: matrix A, vector b, special parameters.
1. Matrix reordering and symbolic factorization: reorder matrix A to reduce fill-in in the factor L; create a dependency-tree representation of matrix A.
2. Numeric factorization: compute the decomposition A = LL^T, LDL^T, or LU. This is the most time-consuming part.
3. Forward and backward substitution: solve Ly = b (forward step), Dz = y (diagonal step), then L^T x = z (backward step).
Output: vector x.
Factorization step
[Figure: matrix A after reordering (example with 4 leaves/processes), non-zero blocks A-G; an L-block updates an R-block (the right block depends on the left one). Tree representation of matrix A after reordering, with blocks distributed over processes 0-3.]
- Both tree and tree-node parallelization are used.
- All computations within a node are based on functionality from Intel MKL.
- Computation of the leaves and the updates of a block proceed independently on each process.
- Data is distributed uniformly between processes.
Implementation of LU decomposition within a node
[Figure: block G distributed over processes 0-3.]
Selecting one thread per process allows us to mask data transfers behind computations.
Current status/interface
- Shipped as 2 additional libraries for Linux* and Microsoft Windows*, 64-bit only.
- Supports different MPI implementations via a user-compiled wrapper.

C:
    /* shared-memory solver */
    PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
            &nrhs, iparm, &msglvl, b, x, &error);

    /* cluster solver: the same argument list plus an MPI communicator */
    comm = MPI_Comm_c2f(MPI_COMM_WORLD);
    CPARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum,
             &nrhs, iparm, &msglvl, b, x, comm, &error);

Fortran:
    call PARDISO(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, nrhs, iparm, msglvl, b, x, error)
    call CPARDISO(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, nrhs, iparm, msglvl, b, x, comm, error)
Experiments (scalability of time)
Additional processes reduce computational time.
[Chart**; standard performance disclaimer and Optimization Notice apply, see the Legal Disclaimer slide.]
**Here and further, test matrices are from The University of Florida Sparse Matrix Collection: T. A. Davis and Y. Hu, ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1:1-1:25. http://www.cise.ufl.edu/research/sparse/matrices
Experiments (scalability of time)
[Chart: 3Dspectralwave**, a material problem; time in seconds for the reorder, factorization, and solve steps on 2 to 64 nodes, each node using 16 threads.]
Factorization and solving steps scale well in terms of memory and performance. Parallelization of the reordering step might lead to a worse reordering that affects the overall time; deeper investigation is needed here.
Experiments (balancing)
[Charts: ratio of time for ga41as41h72** and Long_coup_dt6** with balancing approaches 1-4, on 16 to 256 MPI processes (2 threads per process).]
In the case of a non-uniform tree, there are several approaches to dividing the nodes of the tree between computational nodes. No single approach is always best, so to achieve good performance we switch between them at the reordering step.
Experiments (scalability of memory)
Additional processes decrease the memory size per host.
[Charts: absolute memory per node in GB (lower is better) for NDOF=398K, NNZ=15.7M and for NDOF=1.7M, NNZ=12M, on 1 to 16 MPI processes (1 per hardware node).]
Conclusion
Intel Direct Sparse Solver for Clusters, built on Intel MKL functionality, achieves:
- Good scaling of computational time
- Good scaling of memory per node
A lot of work and investigation still needs to be completed.
Automatic Offload support in Intel Math Kernel Library for Intel Xeon Phi coprocessors
Alexander Kalinkin, Nikita Shustrov
Algorithm
The matrix factorization method is based on a panel factorization approach. It shows advantages over communication-avoiding and tiled methods and their combinations:
- no additional computational cost
- no additional memory consumption
The panel factorization approach has the same computational cost and memory usage as the classic LAPACK algorithms. The implementation preserves the standard LAPACK interfaces, and the algorithm can be applied to any matrix. The implementation is DAG-based and uses panel factorization kernels that were redesigned and rewritten for Intel Xeon Phi coprocessors.
Algorithm: Features
- Adaptive on-the-fly data/task distribution between CPUs and coprocessors with good load balancing
- Efficient utilization of all available computational units in heterogeneous systems
- Support for heterogeneous systems with a large number of coprocessors
- Scalability: a host with 1 coprocessor can show a >2x performance improvement, and a host with 2 coprocessors a >3x speed-up
- No limitations on matrix sizes
The algorithm provides a high degree of parallelism while minimizing synchronization and communication.
Performance results
The algorithm is implemented within the Intel MKL LU, QR, and Cholesky factorization routines. The implemented routines detect the presence of Intel Xeon Phi coprocessors and automatically offload the computations that benefit from the additional computational resources. This usage model ensures ease of use by hiding the complexity of heterogeneous systems from the user and providing the same API as the standard Intel MKL routines: users do not need to change their application code.
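As a usage illustration (a hedged sketch; the exact environment variable and the set of eligible routines should be checked against the MKL documentation for the installed version), Automatic Offload can typically be enabled without touching the application:

```shell
# Enable MKL Automatic Offload; eligible factorization calls
# (e.g. ?getrf/?potrf/?geqrf on large matrices) may then run
# partly on the coprocessor with no source changes.
export MKL_MIC_ENABLE=1
./my_app    # hypothetical application binary, unchanged
```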
Intel Math Kernel Library Extended Eigensolver for Solving Symmetric Eigenvalue Problems
Alexander Kalinkin, Sergey Kuznetsov
Intel MKL Extended Eigensolver Functionality
Extended Eigensolver computes all the eigenvalues within a given search interval [λ_min, λ_max] and the respective eigenvectors.
Standard eigenvalue problem: Ax = λx, A = A^H.
Generalized eigenvalue problem: Ax = λBx, A = A^H, B = B^H, B > 0 (A = A^T if A is real).
Supported matrix storage: sparse, banded, and dense. The analogous LAPACK routines are named ?evx and ?gvx.
The FEAST algorithm is inspired by the contour integration technique in quantum mechanics. It deviates fundamentally from the traditional techniques based on Krylov subspace iteration (the Arnoldi and Lanczos algorithms) and from Davidson-Jacobi techniques.
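As background on the contour-integration idea (an addition, not from the slide): FEAST builds an approximate spectral projector onto the eigenspace of the eigenvalues inside a contour C enclosing the search interval, and evaluates it by numerical quadrature, each quadrature node requiring one linear solve:

```latex
% Spectral projector and its quadrature approximation (background sketch):
\rho = \frac{1}{2\pi i} \oint_{\mathcal{C}} (zB - A)^{-1} B \, dz ,
\qquad
\rho \approx \sum_{j} \omega_j \, (z_j B - A)^{-1} B .
```

Applying ρ to a block of trial vectors and solving the small projected eigenproblem then yields the eigenpairs inside C, which is why the interfaces below revolve around linear solvers.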
Extended Eigensolver Interfaces
Reverse communication interfaces (RCI):
- Users provide their own solver for the complex linear systems and their own matrix-matrix multiply.
- More flexible for specific applications.
Interfaces for predefined formats/solvers:
- Easy (plug-and-play).
- Intended for the following matrix formats: sparse (CSR format, 1-based indexing only), banded (LAPACK storage), dense (LAPACK storage).
- More than 90% of the computations are done by PARDISO (as the sparse solver), SPIKE (as the banded solver), or LAPACK (as the dense solver).
All data types are supported (real, real*8, complex, complex*16).
Scalability of Intel MKL Extended Eigensolver
[Chart, 7/5/2013: OpenMP* scalability of dfeast_scsrev on an Intel Xeon CPU E5-2680 (2.7 GHz, 32 GB RAM); speedup on 2, 4, 8, and 16 threads for matrix sizes from 27000 to 274625. Standard performance disclaimer and Optimization Notice apply, see the Legal Disclaimer slide.]
Intel MKL Sparse BLAS: performance optimizations on modern architectures
Alexander Kalinkin, Sergey Pudov
Intel MKL Sparse BLAS: Introduction
Intel MKL Sparse BLAS supports 6 sparse formats: CSR, CSC, BSR, DIA, COO, and SKY. It is primarily designed for applications where the computations are performed only a few times. Every function calculates the result in a single call, which includes a simple matrix analysis step and an execution step. A deep investigation of the sparse matrix pattern is not performed, because it is a time-consuming operation that would hurt single-call performance.
Two-step Computations Approach
It is known that, for best performance, the computational kernels and the workload balancing algorithm should depend on the structure of the matrix. When multiple calls are expected with a particular sparse matrix pattern, it is better to organize computations in two steps:
1. Analysis, which chooses the best kernel and workload balancing algorithm for a given computer architecture.
2. Execution, where the information from the previous step is used to achieve high performance.
This approach pays off when the time required for the single analysis step is less than the overall performance benefit accumulated over multiple execution steps. The limitations of the single-call Sparse BLAS interfaces are most visible on modern architectures with many cores, where even a small workload imbalance may result in a significant performance loss.
Experimental Library
An experimental library for the Intel Xeon Phi coprocessor contains some SpMV functionality with a two-step interface, to investigate the performance benefits of this approach. The library supports:
- Two formats: CSR and ESB
- A couple of workload balancing algorithms **
The goal of the experiment is to collect early feedback. The library will be available on request to intel.mkl@intel.com after release.
Q & A
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804