1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
Jack Dongarra, T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki
University of Tennessee, Knoxville

MAGMA: LAPACK for GPUs
MAGMA (Matrix Algebra on GPU and Multicore Architectures) aims to provide LAPACK/ScaLAPACK functionality on hybrid architectures: http://icl.cs.utk.edu/magma/
MAGMA 1.3:
- For NVIDIA CUDA GPUs on shared-memory systems
- 8+ hybrid algorithms have been developed (32+ routines in total)
- One-sided factorizations and linear system solvers
- Two-sided factorizations and eigenproblem solvers
- A subset of BLAS and auxiliary routines in CUDA
MAGMA developers and collaborators: UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia). A community effort, similar to LAPACK/ScaLAPACK.
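As a point of reference for how these hybrid routines are exposed to applications, the following is a minimal sketch of calling the hybrid LU factorization through MAGMA's CPU interface. It is not taken from the slides; the header name (magma_v2.h) and the exact signatures of magma_init, magma_dgetrf, and magma_finalize reflect later MAGMA releases and may differ in MAGMA 1.3, so consult the magma.h of the release in use.

    /* Minimal sketch: hybrid LU factorization via MAGMA's CPU interface.
     * Header and signatures assumed from later MAGMA releases; check the
     * installed version. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "magma_v2.h"   /* "magma.h" in older releases */

    int main(void)
    {
        magma_init();
        magma_int_t n = 4096, lda = n, info = 0;
        magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
        double *A = malloc((size_t)lda * n * sizeof(double));
        for (magma_int_t i = 0; i < lda * n; ++i)     /* random test matrix */
            A[i] = rand() / (double)RAND_MAX;

        /* Hybrid LU: panel factorizations on the CPU, trailing updates on the GPU. */
        magma_dgetrf(n, n, A, lda, ipiv, &info);
        if (info != 0)
            printf("magma_dgetrf returned info = %lld\n", (long long)info);

        free(A); free(ipiv);
        magma_finalize();
        return 0;
    }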

Key Features of MAGMA 1.3
- High performance
- Multiple precision support (Z, C, D, S, and MP, i.e., mixed precision)
- Hybrid algorithms
- Out-of-GPU-memory algorithms
- Multi-GPU support

MAGMA Software Stack
(Diagram of the layered software stack, spanning CPU, hybrid, and GPU/coprocessor components, from single device to multi-GPU to distributed:)
- Distributed: hybrid LAPACK/ScaLAPACK and tile algorithms on top of StarPU / DAGuE
- Multicore + multi-GPU: MAGMA 1.3 hybrid tile (PLASMA) algorithms, using PLASMA/QUARK and the StarPU run-time system
- Single device: MAGMA 1.3 hybrid LAPACK and tile kernels, alongside LAPACK
- Packages: MAGMA 1.3, clMAGMA 1.0, MAGMA MIC 0.3, MAGMA SPARSE, and MAGMA BLAS, built on BLAS and CUDA / OpenCL / MIC
Support: Linux, Windows, Mac OS X; C/C++, Fortran; Matlab, Python.

Multiple precision support
(Chart: performance of the LU factorization in various precisions (CGETRF_GPU, SGETRF_GPU, ZGETRF_GPU, DGETRF_GPU) in Gflop/s versus matrix size, compared against the CPU DP peak of 134 GFlop/s. System: Keeneland, GPU M2090 (14 MP @ 1.3 GHz, peak 583 GFlop/s), CPU Intel Xeon X5660 (2 x 6 cores @ 2.8 GHz).)

Methodology overview
A methodology to use all available resources. MAGMA uses a hybridization methodology based on:
- Representing linear algebra algorithms as collections of tasks and data dependencies among them
- Properly scheduling the tasks' execution over the multicore and GPU hardware components
Hybrid CPU+GPU algorithms (small tasks for the multicore CPUs and large tasks for the GPUs) have been successfully applied to fundamental linear algebra algorithms:
- One- and two-sided factorizations and solvers
- Iterative linear solvers and eigensolvers
Productivity: (1) high level; (2) leveraging prior developments; (3) exceeding the performance of homogeneous solutions.

A Hybrid Algorithm Example
Left-looking hybrid Cholesky factorization in MAGMA. The difference from the LAPACK code is the four additional lines shown in red on the slide: line 8 (executed on the CPU) is overlapped with the work on the GPU launched at line 6.
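Since the slide's listing is not reproduced in this transcription, the following is a CPU-only sketch of the same left-looking blocked Cholesky structure, written with LAPACKE/CBLAS rather than MAGMA's GPU interface. It is an illustration, not MAGMA code; the comments mark which calls MAGMA runs on the GPU and which on the CPU.

    /* Left-looking blocked Cholesky (lower triangular), illustrating the
     * structure MAGMA hybridizes.  Link against LAPACKE and CBLAS. */
    #include <cblas.h>
    #include <lapacke.h>

    /* Factor the n x n SPD matrix A (column-major, leading dimension lda)
     * in place as A = L * L^T; nb is the block size.  Returns 0 on success,
     * or the dpotrf info of the failing diagonal block. */
    int chol_left_looking(int n, double *A, int lda, int nb)
    {
        for (int j = 0; j < n; j += nb) {
            int jb = (nb < n - j) ? nb : n - j;

            /* Update the diagonal block with previously factored columns
             * (GPU work in MAGMA):
             * A(j:j+jb, j:j+jb) -= A(j:j+jb, 0:j) * A(j:j+jb, 0:j)^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, jb, j,
                        -1.0, &A[j], lda, 1.0, &A[j + (size_t)j*lda], lda);

            /* Update the panel below the diagonal block (GPU work in MAGMA);
             * independent of the dpotrf below, hence the two can overlap. */
            if (j + jb < n)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                            n - j - jb, jb, j,
                            -1.0, &A[j + jb], lda, &A[j], lda,
                             1.0, &A[(j + jb) + (size_t)j*lda], lda);

            /* Factor the small diagonal block (CPU work in MAGMA, overlapped
             * with the GEMM above). */
            int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb,
                                      &A[j + (size_t)j*lda], lda);
            if (info != 0) return info;

            /* Triangular solve to finish the panel (GPU work in MAGMA). */
            if (j + jb < n)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower,
                            CblasTrans, CblasNonUnit, n - j - jb, jb,
                            1.0, &A[j + (size_t)j*lda], lda,
                            &A[(j + jb) + (size_t)j*lda], lda);
        }
        return 0;
    }

In MAGMA proper, the diagonal block is transferred to the host before the factorization and back afterwards, and asynchronous CUDA streams provide the CPU/GPU overlap the slide refers to; that data-movement bookkeeping is omitted here.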

Mixed precision iterative refinement
(Chart: solving general dense linear systems using mixed precision iterative refinement. Curves for SP solve, DP solve via mixed-precision iterative refinement, and DP solve, in Gflop/s versus matrix size. System: Keeneland, GPU M2090 (14 MP @ 1.3 GHz, peak 583 GFlop/s), CPU Intel Xeon X5660 (2 x 6 cores @ 2.8 GHz).)
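To make the idea concrete, here is a minimal host-only sketch of mixed precision iterative refinement for Ax = b using LAPACKE/CBLAS: the factorization and triangular solves are done in single precision, while residuals and the solution update are kept in double precision. It is an illustration of the technique, with an assumed fixed iteration count and no convergence test, not MAGMA's GPU-resident implementation.

    /* Mixed precision iterative refinement sketch: factor in single precision,
     * refine the solution in double precision.  Link against LAPACKE and CBLAS. */
    #include <stdlib.h>
    #include <cblas.h>
    #include <lapacke.h>

    /* Solve A x = b (A is n x n, column-major, lda >= n).  A and b are double
     * precision inputs; x receives the refined double precision solution. */
    int dsgesv_sketch(int n, const double *A, int lda, const double *b, double *x)
    {
        float      *As   = malloc((size_t)lda * n * sizeof(float));
        float      *ws   = malloc((size_t)n * sizeof(float));   /* SP work vector */
        double     *r    = malloc((size_t)n * sizeof(double));  /* DP residual    */
        lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));

        /* Round A to single precision and factor it: As = P * L * U. */
        for (size_t j = 0; j < (size_t)n; ++j)
            for (size_t i = 0; i < (size_t)n; ++i)
                As[i + j*lda] = (float)A[i + j*lda];
        int info = LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, lda, ipiv);
        if (info != 0) goto done;

        /* Initial solve in single precision: x = As^{-1} b. */
        for (int i = 0; i < n; ++i) ws[i] = (float)b[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, lda, ipiv, ws, n);
        for (int i = 0; i < n; ++i) x[i] = (double)ws[i];

        /* Refinement: residual in double precision, correction in single. */
        for (int iter = 0; iter < 3; ++iter) {        /* fixed count, for brevity */
            for (int i = 0; i < n; ++i) r[i] = b[i];
            cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                        -1.0, A, lda, x, 1, 1.0, r, 1);   /* r = b - A*x */
            for (int i = 0; i < n; ++i) ws[i] = (float)r[i];
            LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, lda, ipiv, ws, n);
            for (int i = 0; i < n; ++i) x[i] += (double)ws[i];  /* x += correction */
        }
    done:
        free(As); free(ws); free(r); free(ipiv);
        return info;
    }

A production solver (for example LAPACK's dsgesv) checks the residual norm and falls back to a full double precision solve when refinement does not converge; the fixed three iterations here are only for brevity.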

Out of GPU Memory Algorithms
(Chart: DGETRF performance in Gflop/s versus matrix size for matrices that do not fit in a specified GPU memory, showing that the out-of-GPU-memory algorithms can now solve such large problems. System: Keeneland, GPU M2090 (14 MP @ 1.3 GHz, peak 583 GFlop/s), CPU Intel Xeon X5660 (2 x 6 cores @ 2.8 GHz).)

Out of GPU Memory Algorithms
Perform left-looking factorizations on sub-matrices that fit in the GPU memory (reusing existing algorithms); the rest of the matrix stays on the CPU. Left-looking versions minimize writing back to the CPU. The matrix is viewed as the factored sub-matrix A1 on the CPU, the sub-matrix A2 to be factored on the GPU, and the untouched part of the matrix. For each A2 (see the sizing sketch below):
1) Copy A2 to the GPU
2) Update A2 using A1 (a panel of A1 at a time)
3) Factor the updated A2 using the existing hybrid code
4) Copy the factored A2 back to the CPU
This extends trivially to multiple GPUs: A2 is larger, with a 1-D block-cyclic distribution, again reusing existing algorithms.
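As a small worked example of the "fits in the GPU memory" constraint, the sketch below computes how wide the A2 block of an n x n double precision matrix can be for a given amount of usable GPU memory, and therefore how many passes of the four steps above are needed. The sizing rule (whole columns of height n, minus a fixed reserve for workspace) is an assumption for illustration, not MAGMA's actual heuristic.

    /* Choose the width of the A2 sub-matrix for the out-of-GPU-memory scheme:
     * whole columns of an n x n double precision matrix must fit in the usable
     * GPU memory.  The workspace reserve is an assumed, illustrative value. */
    #include <stdio.h>

    int main(void)
    {
        long long n         = 60000;                      /* matrix dimension    */
        long long gpu_bytes = 5LL * 1024 * 1024 * 1024;   /* 5 GB usable on GPU  */
        long long reserve   = 512LL * 1024 * 1024;        /* workspace reserve   */
        long long col_bytes = n * (long long)sizeof(double);

        long long width  = (gpu_bytes - reserve) / col_bytes;  /* columns per A2 */
        long long passes = (n + width - 1) / width;             /* number of A2 blocks */

        printf("A2 width: %lld columns (%.2f GB per block), %lld passes over the matrix\n",
               width, width * col_bytes / 1e9, passes);
        return 0;
    }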

Multi-GPU Support
Data distribution: 1-D block-cyclic distribution with block width nb (column blocks dealt out to GPU 0, GPU 1, GPU 2, GPU 0, ...).
Algorithm:
- The GPU holding the current panel sends it to the CPU
- All updates are done in parallel on the GPUs
- Look-ahead is done with the GPU holding the next panel
A small sketch of the block-to-GPU mapping follows.
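A minimal sketch of the 1-D block-cyclic mapping assumed above: column block j of width nb is owned by GPU j mod ngpu and is the (j / ngpu)-th block in that GPU's local storage. This is the standard block-cyclic rule, shown for illustration rather than copied from MAGMA's sources.

    /* 1-D block-cyclic distribution of column blocks over ngpu GPUs. */
    #include <stdio.h>

    int main(void)
    {
        int n = 10000, nb = 512, ngpu = 3;
        int nblocks = (n + nb - 1) / nb;

        for (int j = 0; j < nblocks; ++j) {
            int owner = j % ngpu;          /* GPU that stores column block j    */
            int local = j / ngpu;          /* position of the block on that GPU */
            int width = (j == nblocks - 1) ? n - j * nb : nb;  /* last block may be narrower */
            printf("block %2d (cols %5d..%5d, width %3d) -> GPU %d, local block %d\n",
                   j, j * nb, j * nb + width - 1, width, owner, local);
        }
        return 0;
    }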

LU Factorization on Multiple GPUs and on Kepler in DP
(Charts, progressively adding curves: double precision LU factorization performance in Gflop/s versus matrix size for the CPU (MKL), 1 GPU, 2 GPUs, 3 GPUs, and a Kepler (K20X) GPU. System: Keeneland, GPU M2090 (14 MP @ 1.3 GHz, peak 583 GFlop/s), CPU Intel Xeon X5660 (2 x 6 cores @ 2.8 GHz).)

LU Scalability on Kepler in DP
(Charts: double precision LU performance in Gflop/s versus matrix size for MKL on the host, MAGMA with host + 1 GPU, and MAGMA with host + 2 GPUs. CPU: Intel Xeon E5-2670 (16 cores @ 2.6 GHz); GPU: K20X (14 MP x 192 cores @ 0.73 GHz).)

Eigenproblem Solvers in MAGMA
A x = λ x
Applications:
- Quantum mechanics (Schrödinger equation), quantum chemistry
- Principal component analysis (in data mining)
- Vibration analysis (of mechanical structures)
- Image processing, compression, face recognition
- Eigenvalues of graphs, e.g., in Google's PageRank
- ...
These problems need to be solved fast. Current MAGMA results: MAGMA with 1 GPU can be 12x faster than vendor libraries on state-of-the-art multicore systems.
T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, "Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems," ICL Technical Report, 3/2012.
J. Dongarra, A. Haidar, T. Schulthess, R. Solca, and S. Tomov, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory-aware tasks," ICL Technical Report, 3/2012.

Toward a Fast Eigensolver
(Charts: Gflop/s for the tridiagonal reduction DSYTRD versus matrix size (2k to 30k), progressively comparing DSYTRD_MKL, DSYTRD with 1, 2, and 3 GPUs, and the two-stage DSYTRD_2stages with 1, 2, and 3 GPUs. Flops formula: n^3/3 divided by time; higher is faster. Keeneland system, one node: 3 NVIDIA M2090 GPUs (@ 1.1 GHz, 5.4 GB each) and 2 x 6 Intel Xeon X5660 cores (@ 2.8 GHz, 23 GB).)
Characteristics:
- Stage 1: BLAS-3, increasing computational intensity
- Stage 2: BLAS-1.5, new cache-friendly kernel, 4x/12x faster than the standard approach
- Bottleneck: if all eigenvectors are required, there is the extra cost of one additional back transformation
(The slide also shows sparsity patterns of the matrix after the first stage, band form, and after the second stage, tridiagonal form.)
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory-aware tasks," ICL Technical Report, 3/2012.

Current and Future Directions
Synchronization-avoiding algorithms using dynamic runtime systems.
(Execution trace on 48 cores of POTRF, TRTRI, and LAUUM; the matrix is 4000 x 4000 and the tile size is 200 x 200.)

Current and Future Directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution.

    for (k = 0; k < min(MT, NT); k++) {
        zgeqrt(A[k][k], ...);
        for (n = k+1; n < NT; n++)
            zunmqr(A[k][k], A[k][n], ...);
        for (m = k+1; m < MT; m++) {
            ztsqrt(A[k][k], A[m][k], ...);
            for (n = k+1; n < NT; n++)
                ztsmqr(A[m][k], A[k][n], A[m][n], ...);
        }
    }

Current and Future Directions
High productivity with dynamic runtime systems: the same loop nest, expressed as task insertions.

    for (k = 0; k < min(MT, NT); k++) {
        starpu_insert_task(&cl_zgeqrt, k, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_insert_task(&cl_zunmqr, k, n, ...);
        for (m = k+1; m < MT; m++) {
            starpu_insert_task(&cl_ztsqrt, m, k, ...);
            for (n = k+1; n < NT; n++)
                starpu_insert_task(&cl_ztsmqr, m, n, k, ...);
        }
    }

The StarPU runtime system is used to exploit heterogeneous architectures (multicore + GPUs), or QUARK if only multicore is targeted.

Matrices Over Runtime Systems @ Exascale (MORSE)
Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems."
- Runtime challenges due to the ever-growing hardware complexity
- Algorithmic challenges to exploit the hardware capabilities to the fullest
- Integrated into the MAGMA software stack; MAGMA 1.3 includes Level 3 BLAS, linear and least squares solvers

Collaborators and Support
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners:
- University of Tennessee, Knoxville
- University of California, Berkeley
- University of Colorado, Denver
- INRIA, France (StarPU team)
- KAUST, Saudi Arabia