Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equations


Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equations. Alexander Kalinkin, Anton Anders, Roman Anders

Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vpro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vpro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2013. Intel Corporation. http://intel.com/software/products 2

Agenda
- Problem statement
- Algorithm
- Experiments
- Conclusion

Problem statement
Cons:
- No extra information available about the matrix, only some generic properties (positive definite, Hermitian, etc.)
- Huge size
Pros:
- Clusters with modern Intel CPUs
- Intel Math Kernel Library (Intel MKL) with optimized BLAS, LAPACK, and PARDISO functionality

Algorithm (Ax = b)
Input: matrix A, vector b, special parameters.
1. Matrix reordering and symbolic factorization: reorder matrix A to reduce fill-in in the factor L; create a dependency-tree representation of matrix A.
2. Numeric factorization: compute the decomposition A = LL^T, LDL^T, or LU. This is the most time-consuming part.
3. Forward and backward substitution: solve Ly = b (forward step), Dz = y (diagonal step), then L^T x = z (backward step).
Output: vector x.

Factorization step
[Figure: matrix A after reordering (example with 4 leaves/processes), showing non-zero blocks A-G and the tree representation of matrix A after reordering, with leaves distributed across processes 0-3. An L-block updates the corresponding R-block (the right block depends on the left one).]
- Both tree and tree-node parallelization are used
- All computations within a node are based on functionality from Intel MKL
- Computation of leaves and updates of a block are independent on each process
- Data is distributed uniformly between processes

Implementation of LU decomposition within a node
[Figure: block layout of the LU decomposition within a node, distributed across processes 0-3.]
Selecting one thread per process allows us to mask data transfers behind computations.

Current status/interface
Supported as 2 additional libraries for Linux* and Microsoft Windows*, 64-bit only. Supports different MPI implementations via a user-compiled wrapper.

C:
    PARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, &error);

    comm = MPI_Comm_c2f(MPI_COMM_WORLD);
    CPARDISO(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, comm, &error);

Fortran:
    call PARDISO(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, nrhs, iparm, msglvl, b, x, error)
    call CPARDISO(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, nrhs, iparm, msglvl, b, x, comm, error)

Experiments (scalability of time)
** Here and further: The University of Florida Sparse Matrix Collection. T. A. Davis and Y. Hu, ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1:1-1:25. http://www.cise.ufl.edu/research/sparse/matrices

Experiments (scalability of time)
Additional processes reduce computational time.

Experiments (scalability of time)
[Figure: time in seconds (0-400) for the reorder, factorization, and solve steps of the 3Dspectralwave material problem on 2-64 nodes, 16 threads per node.]
Factorization and solving steps scale well in terms of memory and performance. Parallelization of the reordering step might lead to a worse reordering that affects overall time; deeper investigation is needed here.

Experiments (balancing)
[Figure: ratio of time for four node-distribution approaches (approach 1-4) on the ga41as41h72 and Long_coup_dt6 matrices, for 16-256 MPI processes (2 threads per process).]
In the case of a non-uniform tree, there are a few approaches to dividing nodes of the tree between computational nodes. No single approach is best, so to achieve good performance we switch between them at the reordering step.

Experiments (scalability of memory)
[Figure: absolute memory per node in GB (lower is better) for two matrices, NDOF=398K/NNZ=15.7M and NDOF=1.7M/NNZ=12M, on 1-16 MPI processes (1 per hardware node).]

Experiments (scalability of memory)
Additional processes decrease memory size per host.

Conclusion
Intel Direct Sparse Solver for Clusters, based on Intel MKL functionality, delivers:
- Good scaling of computational time
- Good scaling of memory per node
A lot of work and investigation still needs to be completed.

Automatic Offload support in Intel Math Kernel Library for Intel Xeon Phi coprocessors. Alexander Kalinkin, Nikita Shustrov

Algorithm
The matrix factorization method is based on a panel factorization approach. It shows advantages over communication-avoiding and tail methods, or their combinations:
- No additional computational cost
- No additional memory consumption
The panel factorization approach has the same computational cost and memory usage as classic LAPACK algorithms. The implementation preserves the standard LAPACK interfaces, and the algorithm can be applied to any matrix. The implementation is DAG-based and uses panel factorization kernels that were redesigned and rewritten for Intel Xeon Phi coprocessors.

Algorithm: Features
- Adaptable on-the-fly data/task distribution between CPUs and coprocessors with good load balancing
- Efficient utilization of all available computational units in heterogeneous systems
- Support for heterogeneous systems with a large number of coprocessors
- Scalability: a host with 1 coprocessor can show >2x performance improvement, and a host with 2 coprocessors can show >3x speed-up
- No limitations on matrix sizes
The algorithm provides a high degree of parallelism while minimizing synchronization and communication.

Performance results
The algorithm is implemented within the Intel MKL LU, QR, and Cholesky factorization routines. The implemented routines detect the presence of Intel Xeon Phi coprocessors and automatically offload the computations that benefit from the additional computational resources. The usage model ensures ease of use by hiding the complexity of heterogeneous systems from the user and providing the same API as standard Intel MKL routines. Users do not need to change their application code.

Intel Math Kernel Library Extended Eigensolver for Solving Symmetric Eigenvalue Problems. Alexander Kalinkin, Sergey Kuznetsov

Intel MKL Extended Eigensolver Functionality
Extended Eigensolver computes all the eigenvalues within a given search interval [λmin, λmax] and the respective eigenvectors.
- Standard eigenvalue problem: Ax = λx, with A = A^H (A^T if A is real)
- Generalized eigenvalue problem: Ax = λBx, with A = A^H, B = B^H, B > 0
Supported matrix storages: sparse, banded, and dense. The analogous LAPACK routines are named *evx and *gvx.
The FEAST algorithm is inspired by the contour integration technique in quantum mechanics. It deviates fundamentally from the traditional techniques based on Krylov subspace iteration (Arnoldi and Lanczos algorithms) or other Davidson-Jacobi techniques.

Extended Eigensolver Interfaces
Reverse communication interfaces (RCI):
- Users provide their own solver for the complex linear systems and their own matrix-matrix multiply
- More flexible for specific applications
Interfaces for predefined formats/solvers:
- Easy (plug-and-play)
- Intended for the following matrix formats: sparse (CSR format, 1-based indexing only), banded (LAPACK storage), dense (LAPACK storage)
- More than 90% of the computations are done by PARDISO (as the sparse solver), Spike (as the banded solver), or LAPACK (as the dense solver)
All data types are supported (real, real*8, complex, complex*16).

Scalability of Intel MKL Extended Eigensolver
[Figure: OpenMP* scalability of dfeast_scsrev on an Intel Xeon CPU E5-2680 (2.7 GHz, 32 GB RAM): speedup on 2, 4, 16, and 16 threads for matrix sizes from 27000 to 274625. 7/5/2013]

Intel MKL Sparse BLAS: performance optimizations on modern architectures. Alexander Kalinkin, Sergey Pudov

Intel MKL Sparse BLAS: Introduction
Intel MKL Sparse BLAS supports 6 sparse formats: CSR, CSC, BSR, DIA, COO, and SKY. It is primarily designed for applications where the computations are performed only a few times. Every function calculates the result in a single call, which includes simple matrix analysis and execution steps. Deep investigation of the sparse matrix pattern is not performed, because it is a time-consuming operation that would affect performance.

Two-step Computations Approach
It is known that for the best performance, the computational kernels and the workload balancing algorithm should depend on the structure of the matrix. When multiple calls are expected with a particular sparse matrix pattern, it is better to organize computations in two steps:
- Analysis, which chooses the best kernel and workload balancing algorithm for a given computer architecture.
- Execution, where the information from the previous step is used to get high performance.
This approach pays off when the time required for the single analysis step is less than the overall performance benefit from multiple execution steps. The limitations imposed by the single-call Sparse BLAS interfaces are most visible on modern architectures with many cores, where even a small workload imbalance may result in a significant performance deficiency.

Experimental Library
An experimental library for the Intel Xeon Phi coprocessor contains some SpMV functionality with a two-step interface to investigate the performance benefits of this approach. The library supports:
- Two formats: CSR and ESB
- A couple of workload balancing algorithms
The goal of the experiment is to collect early feedback. The library will be available via request to intel.mkl@intel.com after release.

Q & A

Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
