A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection. Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory. 11/24/2009

[Figure: LAPACK LU performance. DGETRF on an Intel64 Xeon quad-socket quad-core system (16 cores), theoretical peak 153.6 Gflop/s. Gflop/s versus matrix size (0 to 14000) for DGEMM and LAPACK.]

Major Changes to Software. We must rethink the design of our software; this is another disruptive technology, similar to what happened with cluster computing and message passing. Rethink and rewrite the applications, algorithms, and software. Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this.

LAPACK LU: Step 1, Step 2, Step 3, Step 4, ... Fork-join, bulk synchronous processing.

Parallel Tasks in LU: break the computation into smaller tasks and remove dependencies.

PLASMA (Redesign of LAPACK/ScaLAPACK): Parallel Linear Algebra Software for Multicore Architectures. Asynchronicity: avoid fork-join (bulk-synchronous design). Dynamic scheduling: out-of-order execution. Fine granularity: independent block operations. Locality of reference: data stored in a block data layout. Led by Tennessee and Berkeley, similar to LAPACK/ScaLAPACK, as a community effort.
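To illustrate the block data layout point, here is a minimal sketch that copies a column-major (LAPACK-style) matrix into contiguous nb-by-nb tiles. The function name lapack_to_tile_layout and the tile ordering are illustrative assumptions, not PLASMA's internal routine.

/* Hedged sketch: convert a column-major n-by-n matrix A (leading dimension lda)
   into contiguous nb-by-nb tiles stored one after another in the buffer T.
   Tile ordering and names are illustrative only. */
#include <stddef.h>

void lapack_to_tile_layout(int n, int nb, const double *A, int lda, double *T)
{
    int nt = (n + nb - 1) / nb;                              /* tiles per dimension */
    for (int jt = 0; jt < nt; jt++) {                        /* tile column */
        for (int it = 0; it < nt; it++) {                    /* tile row    */
            double *tile = T + ((size_t)jt * nt + it) * nb * nb;
            int rows = (it == nt - 1) ? n - it * nb : nb;    /* ragged edge tiles */
            int cols = (jt == nt - 1) ? n - jt * nb : nb;
            for (int j = 0; j < cols; j++)                   /* copy one tile, column-major */
                for (int i = 0; i < rows; i++)
                    tile[(size_t)j * nb + i] = A[(size_t)(jt * nb + j) * lda + (it * nb + i)];
        }
    }
}

Each kernel then touches one contiguous nb-by-nb block, which is what gives the fine-grained tasks their cache locality.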

[Figure: LU on Intel64, 16 cores. DGETRF on an Intel64 Xeon quad-socket quad-core system (16 cores), theoretical peak 153.6 Gflop/s. Gflop/s versus matrix size (0 to 14000) for DGEMM, PLASMA, and LAPACK.]

Future Computer Systems. Most likely a hybrid design: think standard multicore chips plus accelerators (GPUs). Today accelerators are attached; the next generation will be more integrated. Intel's Larrabee in 2010: 8, 16, 32, or 64 x86 cores. AMD's Fusion in 2011: multicore with embedded ATI graphics. NVIDIA's plans? [Image: Intel Larrabee]

Hardware to Software Trends. The MAGMA Project [Matrix Algebra on GPU and Multicore Architectures]:
- create LA libraries for the next generation of highly parallel and heterogeneous processors
- be similar to LAPACK in functionality, data storage, and interface
- allow effortless porting of LAPACK-relying software to take advantage of the new hardware
- run on homogeneous x86-based multicores and take advantage of GPUs (if available)

Hybrid Computing. Match algorithmic requirements to the architectural strengths of the hybrid components: the multicore handles small tasks/tiles, the accelerator handles large data-parallel tasks. Represent algorithms as DAGs. Current hybrid CPU+GPU algorithms use small tasks/tiles for the multicore and large tasks for the GPU: e.g., split the computation into tasks, define a critical path that clears the way for the other large data-parallel tasks, and properly schedule the tasks' execution. Design algorithms with a well-defined search space to facilitate auto-tuning.
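As a schematic illustration of that split, the toy example below overlaps a small critical-path "panel" task with a large "trailing update" task using OpenMP sections. The two functions are placeholders, not MAGMA kernels, and the one-step lookahead shown is a simplification of the real scheduling.

/* Hedged sketch: while the CPU factors panel k (small task on the critical path),
   the "GPU" applies the large trailing-matrix update from panel k-1.
   cpu_factor_panel and gpu_update_trailing are placeholder stand-ins. */
#include <stdio.h>

static void cpu_factor_panel(int k)    { printf("CPU: factor panel %d\n", k); }
static void gpu_update_trailing(int k) { printf("GPU: apply trailing update from panel %d\n", k); }

int main(void)
{
    int nsteps = 4;
    for (int k = 0; k < nsteps; k++) {
        #pragma omp parallel sections
        {
            #pragma omp section
            { cpu_factor_panel(k); }                     /* small task, critical path      */
            #pragma omp section
            { if (k > 0) gpu_update_trailing(k - 1); }   /* large data-parallel task       */
        }
    }
    return 0;
}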

NVIDIA GeForce GTX 280. NVIDIA-speak: 240 stream processors. Generic speak: 30 processing cores, 8 SIMD functional units per core, 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/cycle). Best case theoretically: 240 mul-adds + 240 muls per cycle at a 1.3 GHz clock, i.e. 30 * 8 * (2 + 1) * 1.3 GHz, roughly 933 Gflop/s peak. Best case in reality: 240 mul-adds per clock; just the mul-add, so 2/3 of that, or about 624 Gflop/s. All this is single precision; double precision is 78 Gflop/s peak (a factor of 8 from SP; exploit mixed precision). 141 GB/s memory bus, 1 GB memory. 4 GB/s via PCIe (we see: T = 11 us + Bytes / 3.3 GB/s). In SP, SGEMM performance is 375 Gflop/s.
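A quick back-of-the-envelope check of those peak rates; the 1.296 GHz shader clock used here (rather than the rounded 1.3 GHz) is an assumption about where the quoted 933 Gflop/s comes from.

/* Sanity check of the GTX 280 peak rates quoted above.
   Assumption: the 933 Gflop/s figure uses the 1.296 GHz shader clock. */
#include <stdio.h>

int main(void)
{
    const double cores = 30, simd_units = 8, clock_ghz = 1.296;
    double sp_peak = cores * simd_units * 3 * clock_ghz;   /* mul-add + mul: 3 flops/cycle */
    double sp_real = cores * simd_units * 2 * clock_ghz;   /* mul-add only:  2 flops/cycle */
    double dp_peak = sp_real / 8.0;                        /* one DP mul-add unit per core */
    printf("SP peak (theory): %6.1f Gflop/s\n", sp_peak);  /* ~933 */
    printf("SP peak (real)  : %6.1f Gflop/s\n", sp_real);  /* ~622 */
    printf("DP peak         : %6.1f Gflop/s\n", dp_peak);  /* ~78  */
    return 0;
}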

MAGMA version 0.1 performance: QR factorization in single precision arithmetic, CPU interface. [Figure: performance of MAGMA vs. MKL (MAGMA, MKL 8 cores, MKL 1 core), GFlop/s versus matrix size x 1000; and MAGMA QR time breakdown (overhead, CPU, CPU+GPU, GPU) as a percentage of time versus matrix size x 1000.] GPU: NVIDIA GeForce GTX 280 (240 cores @ 1.30 GHz); GPU BLAS: CUBLAS 2.2, sgemm peak 375 GFlop/s. CPU: Intel Xeon dual-socket quad-core (8 cores @ 2.33 GHz); CPU BLAS: MKL 10.0, sgemm peak 128 GFlop/s. [For more performance data, see http://icl.cs.utk.edu/magma]

Single Core and GTX-280. [Figure: performance of the LU, QR, and LL^T (Cholesky) factorizations in single and double precision.] GPU: NVIDIA GeForce GTX 280; GPU BLAS: CUBLAS 2.2, s/d gemm peak 375 / 75 GFlop/s. CPU: Intel Xeon dual-socket quad-core @ 2.33 GHz; CPU BLAS: MKL 10.0, s/d gemm peak 17.5 / 8.6 GFlop/s.

MAGMA Software. Available through MAGMA's homepage, http://icl.cs.utk.edu/magma/ Included are the 3 one-sided matrix factorizations and a mixed-precision iterative refinement algorithm, with standard (LAPACK) data layout and accuracy. Two LAPACK-style interfaces: a CPU interface, where both input and output are on the CPU, and a GPU interface, where both input and output are on the GPU. This release is intended for a single GPU.

Cholesky on multicore + multi-GPUs. Hardware: HOST: two-socket dual-core AMD Opteron 1.8 GHz, 2 GB memory. DEVICE: 4 GPUs (Tesla S1070 @ 1.44 GHz), 240 computing cores per GPU, 4 GB memory per GPU; single precision floating point performance (NVIDIA peak): 4.14 Tflop/s; memory bandwidth: 408 GB/s; system interface: PCI Express. Preliminary results in single precision; RAM limitations on the HOST prevented runs at larger sizes.

[Figure: Single precision Cholesky on multicore + multi-GPUs. Parallel performance of the hybrid SPOTRF (4 Opteron cores @ 1.8 GHz and 4 Tesla C1060-class GPUs @ 1.44 GHz) for 1 CPU-1 GPU through 4 CPUs-4 GPUs; Gflop/s (axis to 1200) versus matrix size (up to 25000).]

[Figure: Double precision Cholesky with multiple GPUs. Parallel performance of the hybrid DPOTRF (4 Opteron cores @ 1.8 GHz and 4 Tesla C1060-class GPUs @ 1.44 GHz) for 1 CPU-1 GPU through 4 CPUs-4 GPUs; Gflop/s (axis to 300) versus matrix size (up to 20000).]

Performance of Single Precision on Conventional Processors and GPUs. We have a similar situation on our commodity processors: SP is 2x as fast as DP on many systems. The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle in DP, 4 flops/cycle in SP. The IBM PowerPC has AltiVec: 8 flops/cycle in SP, 4 flops/cycle in DP, but no DP on AltiVec itself. NVIDIA Tesla, best case in reality: 240 mul-adds per clock (just the mul-add), so 2/3 of theoretical peak, or 624 Gflop/s, all in single precision; double precision is 78 Gflop/s peak (a factor of 8 from SP; exploit mixed precision). Single precision is faster because the operations are faster, data motion is reduced, and larger blocks give higher locality in cache.

The Idea Goes Something Like This. Exploit 32-bit floating point as much as possible, especially for the bulk of the computation. Correct or update the solution with selective use of 64-bit floating point to provide a refined result. Intuitively: compute a 32-bit result, calculate a correction to the 32-bit result using selected higher precision, and perform the update of the 32-bit result with the correction using high precision.

Mixed-Precision Iterative Refinement. Iterative refinement for dense systems, Ax = b, can work this way:
    L U = lu(A)          SINGLE  O(n^3)
    x = L\(U\b)          SINGLE  O(n^2)
    r = b - Ax           DOUBLE  O(n^2)
    WHILE r not small enough
        z = L\(U\r)      SINGLE  O(n^2)
        x = x + z        DOUBLE  O(n)
        r = b - Ax       DOUBLE  O(n^2)
    END
Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating-point results when using DP floating point. It can be shown that using this approach we can compute the solution to 64-bit floating point precision. It requires extra storage (1.5 times the normal total); the O(n^3) work is done in lower precision and the O(n^2) work is done in high precision. Problems arise if the matrix is ill-conditioned in SP, i.e. a condition number around 10^8.
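To make the loop above concrete, here is a minimal C sketch of the same scheme against the standard Fortran LAPACK/BLAS interface (sgetrf/sgetrs for the single-precision factorization and solves, dgemv for the double-precision residual). The function name, stopping test, and iteration cap are illustrative assumptions, not MAGMA's implementation.

/* Hedged sketch of mixed-precision iterative refinement for A x = b
   (A is n-by-n, column-major, double precision on input/output). */
#include <math.h>
#include <stdlib.h>

/* Standard reference LAPACK/BLAS (Fortran) routines. */
extern void sgetrf_(const int*, const int*, float*, const int*, int*, int*);
extern void sgetrs_(const char*, const int*, const int*, const float*, const int*,
                    const int*, float*, const int*, int*);
extern void dgemv_(const char*, const int*, const int*, const double*, const double*,
                   const int*, const double*, const int*, const double*, double*, const int*);

int mixed_precision_solve(int n, const double *A, const double *b,
                          double *x, double tol, int maxit)
{
    int one = 1, info = 0;
    float  *As   = malloc((size_t)n * n * sizeof *As);   /* SP copy of A       */
    float  *ws   = malloc((size_t)n * sizeof *ws);        /* SP work vector     */
    double *r    = malloc((size_t)n * sizeof *r);         /* DP residual        */
    int    *ipiv = malloc((size_t)n * sizeof *ipiv);

    for (size_t i = 0; i < (size_t)n * n; i++) As[i] = (float)A[i];
    sgetrf_(&n, &n, As, &n, ipiv, &info);                 /* LU = lu(A)   SINGLE O(n^3) */

    for (int i = 0; i < n; i++) ws[i] = (float)b[i];
    sgetrs_("N", &n, &one, As, &n, ipiv, ws, &n, &info);  /* x = L\(U\b)  SINGLE O(n^2) */
    for (int i = 0; i < n; i++) x[i] = ws[i];

    for (int it = 0; it < maxit; it++) {
        double minus1 = -1.0, plus1 = 1.0, rnorm = 0.0;
        for (int i = 0; i < n; i++) r[i] = b[i];
        dgemv_("N", &n, &n, &minus1, A, &n, x, &one, &plus1, r, &one); /* r = b - Ax  DOUBLE */
        for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
        if (rnorm < tol) break;                            /* "r small enough"           */
        for (int i = 0; i < n; i++) ws[i] = (float)r[i];
        sgetrs_("N", &n, &one, As, &n, ipiv, ws, &n, &info); /* z = L\(U\r) SINGLE O(n^2) */
        for (int i = 0; i < n; i++) x[i] += ws[i];           /* x = x + z   DOUBLE O(n)   */
    }
    free(As); free(ws); free(r); free(ipiv);
    return info;
}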

[Figure: General (LU) solver on an Intel Xeon E5410 @ 2.33 GHz (8 cores) plus a GeForce GTX 280 @ 1.3 GHz (240 cores): GFlop/s versus dimension (up to 8000) for the single and double precision solvers.] GPU: NVIDIA GeForce GTX 280; GPU BLAS: CUBLAS 2.2, s/d gemm peak 375 / 75 GFlop/s. CPU: Intel Xeon dual-socket quad-core @ 2.33 GHz; CPU BLAS: MKL 10.0, s/d gemm peak 17.5 / 8.6 GFlop/s.

[Figure: the same general (LU) solver plot with an added curve for the mixed-precision iterative refinement solver, labeled "Iter Refine 4x" (about 4x the double-precision rate).] Same GPU/CPU and BLAS configuration as the previous slide.

Exascale Computing. Exascale systems are likely feasible by 2017 plus or minus 2. 10-100 million processing elements (cores or mini-cores), with chips perhaps as dense as 1,000 cores per socket; clock rates will grow more slowly. 3D packaging is likely, along with large-scale optics-based interconnects. 10-100 PB of aggregate memory. Hardware- and software-based fault management. Heterogeneous cores. Performance-per-watt stretch goal: 100 GF/watt of sustained performance, which still implies a 10-100 MW exascale system. Power, area, and capital costs will be significantly higher than for today's fastest systems. (Google: exascale computing study.)

Major Changes to Software. We must rethink the design of our software; this is another disruptive technology, similar to what happened with cluster computing and message passing. Rethink and rewrite the applications, algorithms, and software.

Conclusions. For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware. This strategy needs to be rebalanced: barriers to progress are increasingly on the software side. Moreover, the return on investment is more favorable for software; hardware has a half-life measured in years, while software has a half-life measured in decades. The high-performance ecosystem is out of balance across hardware, OS, compilers, software, algorithms, and applications. There is no Moore's Law for software, algorithms, and applications.

Collaborators / Support. Employment opportunities for post-docs in the PLASMA/MAGMA projects. PLASMA: Parallel Linear Algebra Software for Multicore Architectures, http://icl.cs.utk.edu/plasma/ MAGMA: Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma/ Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julie & Julien Langou, Hatem Ltaief, Piotr Luszczek, Stan Tomov.


From LAPACK (multithreaded), to PLASMA (multicores), to MAGMA (1 CPU - 1 GPU), to multicores + multi-GPUs:

From LAPACK (multithreaded):
// Matrix initialization
DLARNV( &IONE, ISEED, &N, D );
DLAGSY( &N, &Nminus1, D, A, &N, ISEED, WORK, &INFO );
// Make A positive definite
for (int i = 0; i < N; i++) A(i, i) = A(i, i) + N;
// Cholesky factorization
dpotrf( &UPLO, &N, A, &LDA, &INFO );

To PLASMA (multicores):
// Matrix initialization
DLARNV( &IONE, ISEED, &N, D );
DLAGSY( &N, &Nminus1, D, A, &N, ISEED, WORK, &INFO );
// Make A positive definite
for (int i = 0; i < N; i++) A(i, i) = A(i, i) + N;
// Initialize PLASMA
INFO = PLASMA_Init( CORES );
// Cholesky factorization
INFO = PLASMA_dpotrf( UPLO, N, A, LDA );
// Finalize PLASMA
INFO = PLASMA_Finalize( );

To MAGMA (1 CPU - 1 GPU):
// CUDA initialization
cublasInit();
// Matrix allocation on the HOST
cudaMallocHost( (void**)&h_A, N*N*sizeof(double) );
// Matrix allocation on the DEVICE
cublasAlloc( N*N, sizeof(double), (void**)&d_A );
// Matrix initialization
DLARNV( &IONE, ISEED, &N, D );
DLAGSY( &N, &Nminus1, D, h_A, &N, ISEED, WORK, &INFO );
// Make h_A positive definite
for (int i = 0; i < N; i++) h_A(i, i) = h_A(i, i) + N;
// Cholesky factorization
magma_dpotrf( UPLO, &N, h_A, &LDA, d_A, INFO );

To multicores + multi-GPUs:
// CUDA initialization
cublasInit();
// Matrix allocation on the HOST
cudaMallocHost( (void**)&h_A, N*N*sizeof(double) );
// Matrix allocation on the DEVICE
cublasAlloc( N*N, sizeof(double), (void**)&d_A );
// Matrix initialization
DLARNV( &IONE, ISEED, &N, D );
DLAGSY( &N, &Nminus1, D, h_A, &N, ISEED, WORK, &INFO );
// Make h_A positive definite
for (int i = 0; i < N; i++) h_A(i, i) = h_A(i, i) + N;
// Attach a DEVICE to a core
cudaSetDevice( my_core_id );
// Cholesky factorization
PLASMA_dpotrf( UPLO, &N, h_A, &LDA, d_A, INFO );

Coding for an Abstract Multicore. Parallel software for multicores should have two characteristics. Fine granularity: a high level of parallelism is needed, and cores will probably be associated with relatively small local memories; this requires splitting an operation into tasks that operate on small portions of data in order to reduce bus traffic and improve data locality. Asynchronicity: as the degree of thread-level parallelism grows and the granularity of the operations becomes smaller, the presence of synchronization points in a parallel execution seriously affects the efficiency of an algorithm.
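Both properties show up in this minimal, self-contained sketch using OpenMP 4.0 task dependences; it illustrates the style only (it is not the PLASMA runtime), and factor_tile and update_tile are placeholder kernels.

/* Hedged sketch: fine-grained tasks with explicit dependences, standing in
   for a dynamic runtime. Tile kernels are placeholders. */
#include <stdio.h>

#define NT 8      /* number of tiles (illustrative) */
#define TS 256    /* tile size (illustrative)       */

static double tiles[NT][TS];

static void factor_tile(double *t)                  { t[0] += 1.0; }
static void update_tile(double *t, const double *p) { t[0] += p[0]; }

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        /* "panel" task for tile k */
        #pragma omp task depend(inout: tiles[k][0])
        factor_tile(tiles[k]);

        /* update tasks start as soon as tile k is ready, in any order,
           with no barrier between the k iterations */
        for (int j = k + 1; j < NT; j++) {
            #pragma omp task depend(in: tiles[k][0]) depend(inout: tiles[j][0])
            update_tile(tiles[j], tiles[k]);
        }
    }   /* implicit barrier at the end of the parallel region */
    printf("tiles[NT-1][0] = %f\n", tiles[NT - 1][0]);
    return 0;
}

Each update task executes as soon as its input tile is ready, with no global fork-join barrier between outer iterations, which is exactly the out-of-order, dataflow behavior described above.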

Steps in the LAPACK LU: DGETF2 (factor a panel) - LAPACK; DLASWP (backward swap) - LAPACK; DLASWP (forward swap) - LAPACK; DTRSM (triangular solve) - BLAS; DGEMM (matrix multiply) - BLAS.
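For reference, here is a minimal sketch of the blocked right-looking loop those steps come from, written against the standard Fortran LAPACK/BLAS interface. It follows the structure of DGETRF but is a simplified rendering (square matrix, lda = n, no error handling), not LAPACK's actual source.

/* Hedged sketch of a blocked right-looking LU: panel factorization (DGETF2),
   pivot swaps (DLASWP), triangular solve (DTRSM), trailing update (DGEMM).
   Column-major storage, 0-based C indexing, 1-based Fortran pivot values. */
extern void dgetf2_(const int*, const int*, double*, const int*, int*, int*);
extern void dlaswp_(const int*, double*, const int*, const int*, const int*,
                    const int*, const int*);
extern void dtrsm_(const char*, const char*, const char*, const char*,
                   const int*, const int*, const double*, const double*, const int*,
                   double*, const int*);
extern void dgemm_(const char*, const char*, const int*, const int*, const int*,
                   const double*, const double*, const int*, const double*, const int*,
                   const double*, double*, const int*);

void blocked_lu(int n, int nb, double *A, int *ipiv)
{
    const double one = 1.0, mone = -1.0;
    const int ione = 1;
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;          /* current panel width      */
        int mj = n - j;                              /* rows in the panel        */
        int info;

        dgetf2_(&mj, &jb, &A[j + j * n], &n, &ipiv[j], &info);   /* factor the panel     */
        for (int i = j; i < j + jb; i++) ipiv[i] += j;           /* globalize pivot rows */

        int k1 = j + 1, k2 = j + jb;                 /* Fortran 1-based pivot range */
        dlaswp_(&j, A, &n, &k1, &k2, ipiv, &ione);   /* swaps in columns left of the panel */

        int ntrail = n - j - jb;
        if (ntrail > 0) {
            dlaswp_(&ntrail, &A[(j + jb) * n], &n, &k1, &k2, ipiv, &ione); /* swaps in trailing columns */
            dtrsm_("Left", "Lower", "NoTrans", "Unit", &jb, &ntrail, &one,
                   &A[j + j * n], &n, &A[j + (j + jb) * n], &n);           /* U12 = L11^{-1} A12 */
            dgemm_("NoTrans", "NoTrans", &ntrail, &ntrail, &jb, &mone,
                   &A[(j + jb) + j * n], &n,
                   &A[j + (j + jb) * n], &n, &one,
                   &A[(j + jb) + (j + jb) * n], &n);                        /* A22 -= L21 * U12 */
        }
    }
}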

[Figure: LU timing profile on a 16-core system, no lookahead; per-thread time for each component (DGETF2, DLASWP left, DLASWP right, DTRSM, DGEMM), showing the bulk-synchronous phases between the DGETF2/DLASWP/DTRSM steps and the DGEMM updates.]

Adaptive Lookahead: Dynamic, Event-Driven Multithreading. The ideas are not new; many papers use the DAG approach. The work here is in reorganizing the algorithms to use this approach.

Moore's Law Reinterpreted. The number of cores per chip doubles every two years, while clock speed decreases (rather than increases). We need to deal with systems with millions of concurrent threads, and future generations will have billions of threads. We need to be able to easily replace inter-chip parallelism with intra-chip parallelism. The number of threads of execution doubles every two years.