MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures. Stan Tomov, Innovative Computing Laboratory, University of Tennessee, Knoxville. OLCF Seminar Series, ORNL, June 16, 2010. 1/24

Outline
- Challenges related to GPU and Multicore
- The Hybridization Methodology of MAGMA
- One-sided factorizations and solvers
- Two-sided factorizations and solvers
- MAGMA and the Fermi GPU
- Iterative linear and eigen-solvers
- Future work
2/24

Challenges
- Increase in parallelism: Tesla C2050 (Fermi) has 448 CUDA cores @ 1.15 GHz; SP peak is 1030 GFlop/s, DP peak is 515 GFlop/s
- Increase in communication cost [vs computation]: processor speed improves by ~59% / year, memory bandwidth by only ~23% / year
- Heterogeneity
3/24

Matrix Algebra on GPU and Multicore Architectures (MAGMA)
MAGMA: a new generation of linear algebra (LA) libraries to achieve the fastest possible time to an accurate solution on hybrid/heterogeneous architectures, starting with current multicore + multi-GPU systems. Homepage: http://icl.cs.utk.edu/magma/
MAGMA & LAPACK
- MAGMA is based on LAPACK and extended for hybrid systems (multi-GPU + multicore systems).
- MAGMA is designed to be similar to LAPACK in functionality, data storage and interface, in order to allow scientists to effortlessly port any of their LAPACK-relying software components to take advantage of the new architectures.
- MAGMA leverages years of experience in developing open source LA software packages and systems like LAPACK, ScaLAPACK, BLAS, ATLAS, as well as the newest LA developments (e.g., communication-avoiding algorithms) and experiences on homogeneous multicores (e.g., PLASMA).
Support: NSF, Microsoft, NVIDIA [now a CUDA Center of Excellence at UTK on the development of Linear Algebra Libraries for CUDA-based Hybrid Architectures]
MAGMA developers: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver
4/24

A New Generation of Algorithms
- Delayed update to organize successive Level 2 BLAS as a single Level 3 BLAS
- Localized (over tiles) elementary transformations
MAGMA Hybrid Algorithms (heterogeneity friendly) rely on:
- a hybrid scheduler (of DAGs)
- hybrid kernels (for nested parallelism)
- existing software infrastructure
5/24

Hybridization methodology
MAGMA uses a HYBRIDIZATION methodology based on:
- representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them
- properly SCHEDULING the tasks' execution over the multicore and the GPU hardware components
Successfully applied to fundamental linear algebra algorithms:
- hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)
- one- and two-sided factorizations and solvers
- iterative linear and eigen-solvers
Faster, cheaper, better?
- high level
- leveraging prior developments
- exceeding homogeneous solutions in performance (and sometimes accuracy)
6/24

MAGMA Status
MAGMA 0.2
- LU, QR, Cholesky (S, C, D, Z)
- Linear solvers: in working precision, based on LU, QR, and Cholesky; mixed-precision iterative refinement
- CPU and GPU interfaces (see the interface sketch below)
- Two-sided factorizations: reduction to upper Hessenberg form for the general eigenvalue problem
- MAGMA BLAS: routines critical for MAGMA (GEMM, SYRK, TRSM, GEMV, SYMV, etc.)
Unreleased
- Bidiagonal two-sided reduction for SVD
- Tridiagonal two-sided reduction for the symmetric eigenvalue problem
- Divide & Conquer for the symmetric eigenvalue problem
- GEMM for FERMI
- Cholesky and QR for multi-GPUs on MAGNUM tiles
- Hybrid kernels (building blocks) for tile algorithms (e.g., dynamically scheduled)
- GMRES and PCG
7/24
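To make the CPU vs GPU interface distinction concrete, here is a minimal sketch of calling LU in both styles. It is written against the naming of recent MAGMA releases (magma_v2.h, magma_init, magma_dgetrf, magma_dgetrf_gpu), so the exact 0.2-era spelling may differ; the point is that the CPU interface mirrors LAPACK's dgetrf, while the GPU interface operates on a matrix already resident in device memory.

    /* Sketch only: LU via MAGMA's CPU and GPU interfaces, using the naming of
     * recent MAGMA releases; the MAGMA 0.2-era API above may differ in detail. */
    #include <magma_v2.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        magma_int_t n = 1024, lda = n, ldda = n, info;
        magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
        double *A = calloc((size_t)lda * n, sizeof(double));
        for (magma_int_t i = 0; i < n; i++) A[i + i*lda] = 2.0;   /* simple test matrix */

        magma_init();

        /* CPU interface: same arguments as LAPACK dgetrf; the matrix starts and
         * ends in host memory, MAGMA handles the GPU transfers internally.      */
        magma_dgetrf(n, n, A, lda, ipiv, &info);

        /* GPU interface: the matrix is already resident in device memory. */
        double *dA;
        cudaMalloc((void **)&dA, (size_t)ldda * n * sizeof(double));
        cudaMemcpy(dA, A, (size_t)lda * n * sizeof(double), cudaMemcpyHostToDevice);
        magma_dgetrf_gpu(n, n, dA, ldda, ipiv, &info);

        cudaFree(dA);
        free(A); free(ipiv);
        magma_finalize();
        return 0;
    }

Linked against MAGMA and the CUDA runtime, the CPU-interface call is intended as a near drop-in replacement for LAPACK's dgetrf in existing codes.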

MAGMA Software Stack 8/24

Statically Scheduled One-Sided Factorizations (LU, QR, and Cholesky)
Hybridization notes:
- Panels (Level 2 BLAS) are factored on the CPU using LAPACK
- Trailing matrix updates (Level 3 BLAS) are done on the GPU, using look-ahead (see the loop sketch below)
- Panels are memory bound but are only O(N^2) flops and can be overlapped with the O(N^3) flops of the updates
- In effect, the GPU is used only for the high-performance Level 3 BLAS updates, i.e., no low-performance Level 2 BLAS is scheduled on the GPU
9/24
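A schematic (not MAGMA source) of the loop structure just described, with hypothetical helpers panel_factor_cpu, update_on_gpu, and the host/device panel copies standing in for the LAPACK panel factorization, the GPU Level 3 BLAS, and the CUDA copies; in the real code the copies and GPU updates are asynchronous, so factoring the next panel on the CPU overlaps the large trailing update on the GPU.

    /* Schematic hybrid one-sided factorization with look-ahead (illustration only).
     * Hypothetical helpers:
     *   panel_factor_cpu(j)    - LAPACK panel factorization on the CPU (Level 2 BLAS)
     *   update_on_gpu(j, cols) - Level 3 BLAS update of 'cols' trailing columns on the GPU
     *   get/send_panel(j)      - host <-> device panel copies (asynchronous in practice) */
    #include <stdio.h>

    #define N  1024   /* matrix size */
    #define NB 128    /* panel width */

    static void panel_factor_cpu(int j)         { printf("CPU: factor panel at column %d\n", j); }
    static void update_on_gpu(int j, int ncols) { printf("GPU: update %d columns after panel %d\n", ncols, j); }
    static void get_panel(int j)                { (void)j; /* device -> host copy of panel j */ }
    static void send_panel(int j)               { (void)j; /* host -> device copy of factored panel j */ }

    int main(void) {
        for (int j = 0; j < N; j += NB) {
            get_panel(j);               /* panel j was already updated by the previous step */
            panel_factor_cpu(j);        /* O(N^2) flops, memory bound, kept on the CPU      */
            send_panel(j);

            int trailing = N - j - NB;  /* columns to the right of panel j */
            if (trailing > 0) {
                int next = trailing < NB ? trailing : NB;
                update_on_gpu(j, next);                  /* look-ahead: next panel first */
                if (trailing > NB)
                    update_on_gpu(j, trailing - NB);     /* bulk O(N^3) update; overlaps the
                                                            CPU factoring the next panel   */
            }
        }
        return 0;
    }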

Performance of the one-sided statically scheduled hybrid factorizations
QR factorization in single precision arithmetic, CPU interface.
[Charts: (left) performance of MAGMA vs MKL - GFlop/s vs matrix size (x 1000), curves for MAGMA, MKL 8 cores, MKL 1 core; (right) MAGMA QR time breakdown - percentage of time vs matrix size (x 1000), split into Overhead, CPU, CPU+GPU, GPU.]
GPU: NVIDIA GeForce GTX 280 (240 cores @ 1.30 GHz); GPU BLAS: CUBLAS 2.2, sgemm peak: 375 GFlop/s
CPU: Intel Xeon dual socket quad-core (8 cores @ 2.33 GHz); CPU BLAS: MKL 10.0, sgemm peak: 128 GFlop/s
[For more performance data, see http://icl.cs.utk.edu/magma]
10/24

Linear Solvers
Solving Ax = b using LU factorization.
[Chart: GFlop/s vs matrix size on an Intel Xeon E5410 @ 2.33 GHz (8 cores) + GTX 280 @ 1.30 GHz (240 cores); curves for SP factorization, SP solve, MP solve, DP factorization, DP solve.]
Direct solvers: factor and do the triangular solves in the same, working precision.
Mixed-precision iterative refinement: factor in single precision (i.e., the bulk of the computation in fast arithmetic) and use the factorization as a preconditioner in a simple double precision iteration (a toy version follows below), e.g.,
x_{i+1} = x_i + (LU_{SP})^{-1} P (b - A x_i)
11/24
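A self-contained toy version of that iteration (not MAGMA code), using a diagonal matrix so the "LU factor" is trivial: the solves use single precision, while the residual and the correction are accumulated in double precision, which is the structure of the mixed-precision solvers above.

    /* Toy mixed-precision iterative refinement on a diagonal system (illustration only):
     * "factor"/solve in single precision, residual and update in double precision. */
    #include <stdio.h>
    #define N 4

    int main(void) {
        double A[N] = {4.0, 3.0, 2.0, 5.0};    /* diagonal of A */
        double b[N] = {8.0, 9.0, 6.0, 10.0};
        float  Af[N];                          /* single-precision "factorization" */
        double x[N];

        for (int i = 0; i < N; i++) Af[i] = (float)A[i];

        /* initial solve entirely in single precision */
        for (int i = 0; i < N; i++) x[i] = (double)((float)b[i] / Af[i]);

        /* refinement: r = b - A*x in double; correction solved with the single-precision factor */
        for (int it = 0; it < 3; it++) {
            for (int i = 0; i < N; i++) {
                double r = b[i] - A[i] * x[i];
                x[i] += (double)((float)r / Af[i]);
            }
        }
        for (int i = 0; i < N; i++) printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }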

MultiGPU and Multicore
MAGNUM tiles
- Tasks are hybrid, GPU BLAS-based, or multicore kernels
- Scheduling:
  - using PLASMA with customized extensions to reduce communication, and with hybrid MAGMA kernels; demonstrated scalability for one-sided factorizations; highly optimized, used as a benchmark to compare with dynamic schedulers [e.g., performance on 4 C1060 GPUs for Cholesky is up to 1200 GFlop/s, for QR up to 830 GFlop/s in SP]
  - using StarPU to schedule hybrid, GPU and multicore kernels (from PLASMA and MAGMA), http://runtime.bordeaux.inria.fr/starpu/ [in collaboration with INRIA's StarPU team] (see the codelet sketch below)
Rectangular tiles
- Tiles of variable sizes, to account for the heterogeneity of the system
- To experiment with communication-avoiding algorithms
PLASMA tile algorithms
- A single GPU kernel processing multiple tile tasks in parallel (Jakub) [vs only one, but magnum, tile at a time]
- Using the DPLASMA scheduler
12/24
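For context, the StarPU model referenced here wraps each kernel in a "codelet" that can carry both CPU and CUDA implementations, and the runtime picks a device per task. Below is a minimal CPU-only sketch (the classic vector-scaling example, written against the StarPU 1.x API as I understand it, not MAGMA code); in MAGMA's usage the codelet would also list a .cuda_funcs entry so the same task can be dispatched to a GPU.

    /* Minimal StarPU task sketch (StarPU 1.x API): one codelet, one task.
     * Adding a .cuda_funcs entry to the codelet is what makes a task "hybrid". */
    #include <starpu.h>
    #include <stdint.h>
    #include <stdio.h>
    #define NX 16

    static void scal_cpu(void *buffers[], void *cl_arg) {
        float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        float factor;
        starpu_codelet_unpack_args(cl_arg, &factor);
        for (unsigned i = 0; i < n; i++) v[i] *= factor;
    }

    static struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu },   /* a .cuda_funcs entry would add a GPU version */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void) {
        float v[NX], factor = 3.14f;
        for (int i = 0; i < NX; i++) v[i] = (float)i;

        starpu_init(NULL);
        starpu_data_handle_t h;
        starpu_vector_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)v, NX, sizeof(float));

        /* the scheduler decides where (and when) this task runs */
        starpu_task_insert(&scal_cl, STARPU_RW, h,
                           STARPU_VALUE, &factor, sizeof(factor), 0);

        starpu_data_unregister(h);   /* waits for the task and fetches the data back */
        starpu_shutdown();
        printf("v[1] = %f\n", v[1]);
        return 0;
    }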

Scheduling with PLASMA Cholesky factorization in SP HOST: 4x AMD Opteron core @1.8GHz GPUs: 4x C1060 (240 cores each @1.44GHz) 13/24

Scheduling with StarPU
Task distribution per codelet (Cholesky factorization in SP, 3x Quadro FX 5800 + 5 CPU cores):

codelet   total    CUDA 0         CUDA 1         CUDA 2         CPU 0         CPU 1         CPU 2         CPU 3         CPU 4
spotrf      36     3 (8.33%)      1 (2.78%)      3 (8.33%)      6 (16.67%)    9 (25.00%)    4 (11.11%)    6 (16.67%)    4 (11.11%)
ssyrk      630     41 (6.51%)     40 (6.35%)     49 (7.78%)     105 (16.67%)  85 (13.49%)   105 (16.67%)  102 (16.19%)  103 (16.35%)
strsm      630     125 (19.84%)   127 (20.16%)   122 (19.37%)   50 (7.94%)    52 (8.25%)    52 (8.25%)    54 (8.57%)    48 (7.62%)
sgemm     7140     2258 (31.62%)  2182 (30.56%)  2261 (31.67%)  87 (1.22%)    94 (1.32%)    85 (1.19%)    85 (1.19%)    88 (1.23%)

Kernel performance (Cholesky factorization in SP):
SGEMM   gpu: 333.04 GFlop/s   cpu: 20.06 GFlop/s
STRSM   gpu:  59.46 GFlop/s   cpu: 18.96 GFlop/s
SSYRK   gpu: 298.74 GFlop/s   cpu: 19.50 GFlop/s
SPOTRF  gpu:  57.51 GFlop/s   cpu: 17.45 GFlop/s
14/24

Scheduling with StarPU 15/24

Scheduling with DPLASMA
[Chart: performance vs matrix size (1024 to 21504); curves for "1 core + C1060 (MAGMA)" and "8 cores + C1060 + 8600GTS"; vertical axis 0-400.]
16/24

Two-sided matrix factorizations
- Used in singular-value and eigenvalue problems
- LAPACK-based two-sided factorizations are rich in Level 2 BLAS and therefore cannot be properly accelerated on multicore CPUs
- We developed hybrid algorithms exploiting the GPUs' high bandwidth
[Chart: GPU vs CPU GEMV - GFlop/s vs matrix size; curves for GPU SGEMV, GPU DGEMV, CPU SGEMV, CPU DGEMV.]
High-performance CUDA kernels were developed for various matrix-vector products [e.g., ssymv reaching up to 102 GFlop/s for the symmetric eigenvalue problem].
GPU: GTX 280 (240 cores @ 1.30 GHz, 141 GB/s); CPU: 2 x 4-core Intel Xeon @ 2.33 GHz, 10.4 GB/s
17/24
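As a rough back-of-the-envelope check (not from the slides): GEMV is bandwidth bound, streaming about 4n^2 bytes of A for 2n^2 flops in single precision, so the quoted memory bandwidths essentially dictate the GEMV curves above.

    % Bandwidth bound for SGEMV (y = A x): ~0.5 flop per byte of A traffic.
    \[
      \text{GEMV rate} \;\approx\; \frac{2n^2\ \text{flops}}{4n^2\ \text{bytes}} \times BW
      \;=\; 0.5\,\frac{\text{flop}}{\text{byte}} \times BW
    \]
    % GPU: 0.5 * 141  GB/s ~ 70 GFlop/s  (close to the GPU SGEMV curve)
    % CPU: 0.5 * 10.4 GB/s ~  5 GFlop/s
    % DGEMV moves 8 bytes per element, so the bound roughly halves.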

Two-sided factorizations (performance in single precision arithmetic)
[Charts: (left) GPU performance - GFlop/s vs matrix size (1024 to 13024) for Hessenberg reduction (HR), tridiagonalization, and bidiagonalization, annotated 26x, 12x, and 22x respectively; (right) multicore performance - GFlop/s vs matrix size (1024 to 10112) for Hessenberg, tridiagonalization, and bidiagonalization.]
GPU: NVIDIA GeForce GTX 280 (240 cores @ 1.30 GHz); GPU BLAS: CUBLAS 2.3, dgemm peak: 75 GFlop/s
CPU: Intel Xeon dual socket quad-core (8 cores @ 2.33 GHz); CPU BLAS: MKL 10.0, dgemm peak: 65 GFlop/s
18/24

FERMI
What has changed regarding MAGMA algorithms?
- MAGMA is coded at a high level, extracting its performance from the performance of low-level kernels, i.e., everything works for FERMI and nothing has changed at the high level.
- We have relied on being able to develop the very high-performance low-level kernels needed; as GPUs become more complex, this has become more difficult.
- Auto-tuning has become more important.
19/24

GEMM for FERMI (Tesla C2050: 448 CUDA cores @ 1.15 GHz; theoretical SP peak is 1.03 TFlop/s, DP peak 515 GFlop/s)
[Charts: SGEMM and DGEMM performance - GFlop/s vs matrix size (512 to 4608), MAGMA vs CUBLAS 3.1.]
Kernels have to be redesigned and auto-tuned for Fermi, e.g., inner-most blocking sizes have to be increased, register blocking added, etc. They may even need to be written in assembly.
20/24

LU factorization in double precision
[Chart: GFlop/s (0-240) vs matrix size (1024 to 9088) for FERMI MAGMA, ISTANBUL PLASMA, and GTX280 MAGMA.]
FERMI: Tesla C2050, 448 CUDA cores @ 1.15 GHz; SP/DP peak is 1030 / 515 GFlop/s
ISTANBUL: AMD, 8 sockets x 6 cores (48 cores) @ 2.8 GHz; SP/DP peak is 1075 / 538 GFlop/s
21/24

LU, Cholesky, and QR on Fermi in DP 22/24

Sparse Linear Algebra
- The hybridization approach naturally works [e.g., Richardson iteration in mixed-precision iterative refinement solvers, Krylov space iterative solvers and eigen-solvers]
- Observed better accuracy (compared to CPU)
- A fast sparse matrix-vector product on Fermi does not have to use texture memory
- Explore ideas to reduce communication [e.g., mixed precision, reduced storage for the integer indexing, etc.]
- Need high bandwidth
23/24
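To ground the sparse matrix-vector product discussion, a plain-C CSR SpMV (illustration only, not the MAGMA/Fermi kernel): its arithmetic intensity is low, roughly one multiply-add per matrix entry read, which is why the slide stresses high bandwidth and compact (e.g., 32-bit) index storage.

    /* Minimal CSR sparse matrix-vector product y = A*x (illustration only).
     * The 32-bit index arrays are the kind of "reduced storage for the indexing"
     * that cuts memory traffic for bandwidth-bound SpMV. */
    #include <stdio.h>

    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[colind[k]];
            y[i] = sum;
        }
    }

    int main(void) {
        /* 3x3 example: [2 0 1; 0 3 0; 4 0 5] */
        int    rowptr[] = {0, 2, 3, 5};
        int    colind[] = {0, 2, 1, 0, 2};
        double val[]    = {2, 1, 3, 4, 5};
        double x[]      = {1, 1, 1}, y[3];

        spmv_csr(3, rowptr, colind, val, x, y);
        for (int i = 0; i < 3; i++) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }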

MAGMA Future Plans
- Experimentation with various schedulers (StarPU, PLASMA, DPLASMA) and improvements for multicore with multi-GPUs
- Communication-avoiding algorithms for systems of multicore CPUs and GPUs
- Kernels for FERMI (GEMM, SYRK, TRSM, SpMV)
- Auto-tuning framework
- Port [or facilitate the port] to Windows, Python and Matlab
- Krylov space iterative linear solvers and block eigensolvers
24/24

Collaborators / Support
MAGMA - Matrix Algebra on GPU and Multicore Architectures [ICL team: J. Dongarra, S. Tomov, R. Nath, H. Ltaief] http://icl.cs.utk.edu/magma/
PLASMA - Parallel Linear Algebra for Scalable Multicore Architectures http://icl.cs.utk.edu/plasma
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; University of Coimbra, Portugal; INRIA, France (StarPU team)