MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA: Matrix Algebra on GPU and Multicore Architectures
Innovative Computing Laboratory, Electrical Engineering and Computer Science, University of Tennessee
Piotr Luszczek (presenter)
web.eecs.utk.edu/~luszczek/conf/

MAGMA: LAPACK for GPUs
MAGMA - Matrix Algebra on GPU and Multicore Architectures
- Goal: provide LAPACK/ScaLAPACK functionality on hybrid architectures
- http://icl.cs.utk.edu/magma/
MAGMA BLAS
- A subset of BLAS for GPUs, highly optimized for NVIDIA GPUs
- Fast GEMM for Fermi
MAGMA developers and collaborators
- UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
- Community effort, similar to LAPACK/ScaLAPACK
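
Because MAGMA mirrors the LAPACK calling sequences, moving an existing code to the hybrid routines is largely a matter of linking against MAGMA and calling the magma_-prefixed equivalents. A minimal sketch, assuming the MAGMA 1.x C interface and its magma.h header (later releases use magma_v2.h); error handling is omitted:

    #include <stdio.h>
    #include <stdlib.h>
    #include "magma.h"   /* MAGMA 1.x header; an assumption for this sketch */

    int main(void)
    {
        magma_init();

        magma_int_t n = 4096, lda = n, info = 0;
        double      *A;
        magma_int_t *ipiv;

        /* MAGMA's CPU interface takes the matrix in host memory,
           exactly like LAPACK's dgetrf; pinned memory speeds up transfers. */
        magma_dmalloc_pinned(&A, (size_t)lda * n);
        ipiv = (magma_int_t*) malloc(n * sizeof(magma_int_t));

        /* Fill A with some test data. */
        for (size_t i = 0; i < (size_t)lda * n; ++i)
            A[i] = rand() / (double)RAND_MAX;

        /* Hybrid CPU+GPU LU factorization; same calling sequence as LAPACK. */
        magma_dgetrf(n, n, A, lda, ipiv, &info);
        if (info != 0)
            printf("magma_dgetrf returned info = %d\n", (int) info);

        magma_free_pinned(A);
        free(ipiv);
        magma_finalize();
        return 0;
    }

Compiling and linking requires the MAGMA and CUDA libraries plus a host BLAS/LAPACK, as described in the MAGMA documentation.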

MAGMA Software Stack
The stack spans CPU, hybrid, and GPU layers, from single GPU to multi-GPU to distributed configurations:
- Distributed: hybrid LAPACK/ScaLAPACK and tile algorithms / MORSE / PaRSEC
- Multi-GPU: MAGMA 1.3 hybrid tile (PLASMA) algorithms, PLASMA / QUARK, StarPU run-time system
- Single GPU: MAGMA 1.3 hybrid LAPACK and tile kernels, MAGMA SPARSE, MAGMA BLAS
- Built on LAPACK and BLAS on the CPU side and CUDA / OpenCL / MIC on the accelerator side
- Supported on Linux, Windows, Mac OS X, with interfaces for C/C++, Fortran, Matlab, Python

MAGMA Functionality
- 80+ hybrid algorithms have been developed (a total of 320+ routines)
- Every algorithm is provided in 4 precisions (s/c/d/z)
- There are 3 mixed-precision algorithms (zc and ds)
- These are hybrid algorithms, expressed in terms of BLAS
- MAGMA BLAS: a subset of GPU BLAS, optimized for Tesla and Fermi GPUs

MAGMA Methodology Overview
A methodology to use all available resources:
- MAGMA uses a hybridization methodology based on
  - representing linear algebra algorithms as collections of tasks and the data dependencies among them
  - properly scheduling the tasks' execution over the multicore and GPU hardware components
- Successfully applied to fundamental linear algebra algorithms
  - one- and two-sided factorizations and solvers
  - iterative linear solvers and eigensolvers
- Productivity
  - use a high-level description; the low level is hidden behind proper abstractions
  - leverage prior efforts
  - exceed the performance of homogeneous solutions
- Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)

Hybrid Algorithms
Use case: one-sided factorizations (LU, QR, Cholesky)
Hybridization procedure:
- Panels are factored on the CPU using LAPACK (or equivalent), since panel factorization is slow on the GPU; it is off-loaded from the GPU to the CPU
- Trailing-matrix updates are done on the GPU
- Look-ahead helps hide the communication and the panel factorization
[Slide figure: matrix partitioned into the current panel and the trailing matrix]

A Hybrid Algorithm Example
- Left-looking hybrid Cholesky factorization in MAGMA
- The difference from LAPACK: the 4 additional lines, shown in red on the slide
- Line 8 of the slide's listing (done on the CPU) is overlapped with work on the GPU (from line 6)
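
The transcript does not reproduce the slide's listing, but the pattern it describes can be sketched as follows. This is an illustrative left-looking blocked Cholesky (lower triangular case) written with cuBLAS and the standard LAPACK symbol directly rather than MAGMA's internal wrappers, so its line numbers do not match the slide; the key point is that the CPU dpotrf of the diagonal block overlaps with the GPU GEMM that updates the panel below it.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Standard Fortran LAPACK Cholesky of the diagonal block, run on the CPU. */
    extern void dpotrf_(const char *uplo, const int *n, double *A,
                        const int *lda, int *info);

    /* Sketch of the hybrid pattern, not MAGMA's actual code: dA is an n-by-n
       column-major matrix resident in GPU memory with leading dimension ldda;
       work is an nb-by-nb pinned host buffer. */
    void hybrid_potrf_lower(cublasHandle_t handle, int n, int nb,
                            double *dA, int ldda, double *work)
    {
        const double one = 1.0, mone = -1.0;
        cudaStream_t compute, copy;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);
        cublasSetStream(handle, compute);

        for (int j = 0; j < n; j += nb) {
            int jb = (nb < n - j) ? nb : n - j;
            int info = 0;

            /* Update the diagonal block with the already factored columns. */
            cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                        &mone, dA + j, ldda, &one, dA + j + j*ldda, ldda);
            cudaStreamSynchronize(compute);

            /* Start moving the diagonal block to the CPU ... */
            cublasGetMatrixAsync(jb, jb, sizeof(double),
                                 dA + j + j*ldda, ldda, work, jb, copy);

            /* ... while the GPU updates the panel below it; this GEMM is the
               GPU work that overlaps with the CPU factorization below. */
            if (j + jb < n)
                cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                            &mone, dA + j + jb, ldda, dA + j, ldda,
                            &one,  dA + j + jb + j*ldda, ldda);

            /* Factor the diagonal block on the CPU (overlapped with the GEMM),
               then send it back to the GPU. */
            cudaStreamSynchronize(copy);
            dpotrf_("L", &jb, work, &jb, &info);   /* info should be 0 */
            cublasSetMatrixAsync(jb, jb, sizeof(double),
                                 work, jb, dA + j + j*ldda, ldda, copy);
            cudaStreamSynchronize(copy);

            /* Finish the panel with a triangular solve on the GPU. */
            if (j + jb < n)
                cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                            CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n - j - jb, jb,
                            &one, dA + j + j*ldda, ldda,
                            dA + j + jb + j*ldda, ldda);
        }
        cudaStreamDestroy(compute);
        cudaStreamDestroy(copy);
    }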

LU Factorization (single GPU)

From Single to Multi-GPU Support
Data distribution:
- 1-D block-cyclic distribution of block columns of width nb over the GPUs (GPU 0, GPU 1, GPU 2, GPU 0, ...)
Algorithm:
- The GPU holding the current panel sends it to the CPU
- All trailing-matrix updates are done in parallel on the GPUs
- Look-ahead is done with the GPU holding the next panel
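
As a concrete illustration of the 1-D block-cyclic layout (a minimal sketch; the helper name is hypothetical), the GPU owning a given block column is determined by a simple modular formula:

    /* Hypothetical helper: owner GPU of block column j under a 1-D
       block-cyclic distribution with block width nb over ngpu GPUs. */
    static int owner_of_block_column(int j, int nb, int ngpu)
    {
        /* columns 0..nb-1 -> GPU 0, nb..2*nb-1 -> GPU 1, and so on, cycling */
        return (j / nb) % ngpu;
    }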

LU Factorization: Multiple GPUs
(matrix too large for a single GPU's memory)

Out of GPU Memory Algorithms
- Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using the existing algorithms)
- The rest of the matrix stays on the CPU
- Left-looking versions minimize writing back to the CPU
Matrix layout: the factored sub-matrix A1 on the CPU, the sub-matrix A2 to be factored on the GPU, and the untouched part of the matrix.
1) Copy A2 to the GPU
2) Update A2 using A1 (a panel of A1 at a time)
3) Factor the updated A2 using the existing hybrid code
4) Copy the factored A2 back to the CPU
Trivially extended to multiple GPUs: A2 is larger, with a 1-D block-cyclic distribution, again reusing the existing algorithms.
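
The four steps can be read as the following driver loop. This is a sketch only; every helper name is a hypothetical placeholder standing in for the existing MAGMA building blocks, and the matrix itself is assumed to live in CPU memory.

    /* Hypothetical placeholders for the existing MAGMA building blocks. */
    void copy_block_to_gpu(int p);
    void apply_factored_panel(int q, int p);   /* update block p with panel q */
    void factor_block_hybrid(int p);           /* existing hybrid CPU+GPU code */
    void copy_block_to_cpu(int p);

    /* Sketch of the out-of-GPU-memory, left-looking factorization driver. */
    void oom_factorize(int num_blocks)
    {
        for (int p = 0; p < num_blocks; ++p) {
            copy_block_to_gpu(p);              /* 1) copy A2 to the GPU          */
            for (int q = 0; q < p; ++q)        /* 2) update A2 using the already */
                apply_factored_panel(q, p);    /*    factored A1, panel by panel */
            factor_block_hybrid(p);            /* 3) factor the updated A2       */
            copy_block_to_cpu(p);              /* 4) copy the factored A2 back   */
        }
    }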

A New Generation of DLA Software

Package    | Era       | Features / Concept                                                  | Abstractions
LINPACK    | 70's      | Vector operations                                                   | Level-1 BLAS
LAPACK     | 80's      | Blocking, cache friendly                                            | Level-3 BLAS
ScaLAPACK  | 90's      | Distributed memory                                                  | PBLAS, MPI
PLASMA     | mid 00's  | Multicore, manycore                                                 | DAG scheduler, tile data layout, extra kernels
MAGMA      | late 00's | Accelerated multicore, hybrid algorithms (heterogeneity friendly)   | Hybrid scheduler, hybrid kernels

Hybrid Algorithms: One-Sided Transformations
- One-sided factorizations: LU, QR, and Cholesky
- Hybridization:
  - panels (Level 2 BLAS) are factored on the CPU using LAPACK
  - trailing-matrix updates (Level 3 BLAS) are done on the GPU, using look-ahead

Hybrid Algorithms: Two-Sided Transformations
- Two-sided factorizations:
  - bidiagonal (singular values)
  - tridiagonal (symmetric/generalized eigenvalues)
  - upper Hessenberg (non-symmetric eigenvalues)
- Hybridization:
  - trailing-matrix updates (Level 3 BLAS) are done on the GPU, similar to the one-sided factorizations
  - panels (Level 2 BLAS) are hybrid: operations with a memory footprint restricted to the panel are done on the CPU, while the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU
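
In the tridiagonal reduction, for example, the dominant panel operation is a symmetric matrix-vector product with the trailing matrix. A minimal sketch of off-loading that product, using plain cuBLAS rather than MAGMA's own optimized SYMV kernels discussed on the following slides:

    #include <cublas_v2.h>

    /* y = A*x for the symmetric trailing matrix (lower storage) kept on the GPU.
       dA, dx, dy are device pointers; n is the trailing-matrix size.
       MAGMA ships faster, specialized SYMV kernels; cuBLAS is shown for brevity. */
    void trailing_symv(cublasHandle_t handle, int n,
                       const double *dA, int ldda,
                       const double *dx, double *dy)
    {
        const double one = 1.0, zero = 0.0;
        cublasDsymv(handle, CUBLAS_FILL_MODE_LOWER, n,
                    &one, dA, ldda, dx, 1, &zero, dy, 1);
    }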

Additional 4x Speedup from Faster GPU BLAS
DSYTRD (symmetric tridiagonal reduction) on the Keeneland system, using one node: 3 NVIDIA Fermi M2070 GPUs (1.1 GHz, 5.4 GB each) and 2 x 6 Intel Xeon X5660 cores (2.8 GHz, 23 GB).
[Performance plot: Gflop/s vs. matrix size up to 30,000 for DSYTRD with MKL on 12 cores and MAGMA on 1, 2, and 3 GPUs, plus the 2-stage algorithm on 1 GPU.]
- 12x speedup over the 12 Intel cores
- A two-stage approach leads to increased computational intensity
A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical Report, 03/2012.

Multi-GPU Two-Sided Factorizations
- Need high-performance multi-GPU Level 2 BLAS (e.g., 50% of the flops in the tridiagonal reduction are in DSYMV)
[Performance plot: DSYMV Gflop/s vs. matrix size up to 25,000 on M2090 GPUs, comparing CUBLAS with MAGMA on 1, 2, and 3 GPUs.]
T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, "Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems," ICL Technical Report, 03/2012.

Hybrid Two-Sided Factorizations

From Fast BLAS to Fast Tridiagonalization
Performance of MAGMA DSYTRD on multiple M2090 GPUs:
- 50% of the flops are in SYMV
- SYMV is memory bound, i.e., it does not scale well on multicore CPUs
- Use the GPU's high memory bandwidth and the optimized SYMV
- 8x speedup over the 12 Intel cores (X5660 at 2.8 GHz)
Keeneland system, using one node: 3 NVIDIA M2070 GPUs (1.1 GHz, 5.4 GB each) and 2 x 6 Intel cores (X5660 at 2.8 GHz, 23 GB).
T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, "Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems," ICL Technical Report, 03/2012.

From Static to Dynamic Scheduling
- Static scheduling may stall in situations where work is available
- Hand-tuned optimizations
- Hardware heterogeneity
- Kernel heterogeneity
- Separation of concerns
=> Dynamic runtime system

Matrices Over Runtime Systems at Exascale (MORSE)
Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems."
- Runtime challenges due to the ever-growing hardware complexity
- Algorithmic challenges to exploit the hardware capabilities to the fullest
- Integrated into the MAGMA software stack

MAGMA-MORSE: x86 + Multiple GPUs
- Lessons learned from PLASMA
- New high-performance numerical kernels
- StarPU runtime system (Augonnet et al., INRIA Bordeaux)
- Use of both x86 CPUs and GPUs leads to hybrid computations
- Similar to LAPACK in functionality

High Productivity: Sequential Code
From sequential nested-loop code to parallel execution. The tile QR factorization written as sequential nested loops:

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k][k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k][k], A[k][n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k][k], A[m][k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m][k], A[k][n], A[m][n], ...);
    }
}

High Productivity: Parallel Code
The same loop nest, with each kernel call replaced by a task insertion into the StarPU runtime system:

for (k = 0; k < min(MT, NT); k++) {
    starpu_insert_task(&cl_zgeqrt, A, k, k, ...);
    for (n = k+1; n < NT; n++)
        starpu_insert_task(&cl_zunmqr, A, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starpu_insert_task(&cl_ztsqrt, A, m, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_insert_task(&cl_ztsmqr, A, m, n, k, ...);
    }
}

Contact Information and Generous Sponsors
Stan Tomov, tomov@eecs.utk.edu
MAGMA team: http://icl.cs.utk.edu/magma/
PLASMA team: http://icl.cs.utk.edu/plasma/
Collaborating partners:
- University of Tennessee, Knoxville
- University of California, Berkeley
- University of Colorado, Denver
- INRIA, France (StarPU team)
- KAUST, Saudi Arabia