Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators


Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators
Francisco D. Igual, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí (Universidad Jaime I de Castellón, Spain)
Robert A. van de Geijn (The University of Texas at Austin)
PPoPP'09, February 2009
http://www.cs.utexas.edu/users/flame/

Joint work with:
Universidad Jaime I (Spain): Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, Rafael Rubio
UT-Austin: Ernie Chan, Robert A. van de Geijn, Field G. Van Zee
Supported by: National Science Foundation (NSF), Spanish Office of Science, NVIDIA, and an anonymous corporation

Motivation: The sky is falling
Parallelism is thrust upon the masses.
Popular libraries like LAPACK must be completely rewritten.
Give us money or the world as we know it will come to an end.

Motivation: Evolution vs. intelligent design
Parallelism is thrust upon the masses.
Popular libraries like LAPACK must be completely rewritten (end of an evolutionary path).
Great, let's finally toss them and start over. Cheaper than trying to evolve?

Motivation: LAPACK (Linear Algebra Package)
Fortran-77 codes.
One routine (algorithm) per operation in the library.
Storage in column-major order.
Parallelism extracted from calls to multithreaded BLAS.
Problems:
Extracting parallelism this way increases synchronization and thus limits performance.
Column-major order hurts data locality.

Motivation: FLAME (Formal Linear Algebra Methods Environment)
Notation for expressing algorithms.
Systematic derivation procedure (automated using Mathematica).
Families of algorithms for each operation.
APIs to transform algorithms into code.
Storage and algorithm are independent.
We will show that FLAME elegantly supports:
Storage-by-blocks.
Parallelism with data dependencies.
High performance, even on exotic architectures like multi-GPU platforms.

A preview...
[Figure: Performance of the Cholesky factorization on GPU/CPU. Curves: MKL 10.0 spotrf on two Intel Xeon QuadCore (2.2 GHz); algorithm-by-blocks on Tesla S870; algorithm-by-blocks on Tesla S1070. Axes: GFLOPS (0-700) vs. matrix size (0-20000).]

Outline 1 Motivation 2 Cholesky factorization (Overview of FLAME) 3 Parallelization 4 Targeting platforms with multiple accelerators 5 Experimental results 6 Concluding remarks

The Cholesky Factorization
Definition: given an n x n symmetric positive definite matrix A, compute the factorization A = L L^T, where L is an n x n lower triangular matrix.
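To make the operation concrete before introducing the FLAME notation, here is a minimal sketch of an unblocked (right-looking) Cholesky factorization in plain C. It is not taken from the talk; column-major storage with leading dimension lda is assumed, and no check for positive definiteness is performed.

#include <math.h>

/* Overwrite the lower triangle of the n x n column-major matrix A
   (leading dimension lda) with its Cholesky factor L, so that A = L * L^T. */
void chol_unb( int n, float *A, int lda )
{
  for ( int j = 0; j < n; j++ ) {
    A[ j + j*lda ] = sqrtf( A[ j + j*lda ] );       /* alpha11 := sqrt( alpha11 ) */
    for ( int i = j+1; i < n; i++ )
      A[ i + j*lda ] /= A[ j + j*lda ];             /* a21 := a21 / alpha11       */
    for ( int k = j+1; k < n; k++ )                 /* A22 := A22 - a21 * a21^T   */
      for ( int i = k; i < n; i++ )
        A[ i + k*lda ] -= A[ i + j*lda ] * A[ k + j*lda ];
  }
}

The three updates inside the loop are exactly the ones that reappear below, first on the whiteboard and then in FLAME notation and FLAME/C code.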

Algorithms should ideally be represented in a way that captures how we reason about them.

The Cholesky Factorization: On the Whiteboard
[Figure: the matrix A, with the parts already factored marked "done" and the rest partially updated. The current step exposes the scalar α11, the column a21, and the trailing submatrix A22, and performs:
α11 := sqrt(α11)
a21 := a21 / α11
A22 := A22 - a21 a21^T ]

FLAME Notation
[Figure: the matrix A with the already-factored part marked "done" and the rest partially updated.]
Repartition
( ATL  ATR )      ( A00   a01   A02  )
( ABL  ABR )  ->  ( a10T  α11   a12T )
                  ( A20   a21   A22  )
where α11 is a scalar

FLAME Notation
Algorithm: [A] := Chol_unb(A)
Partition A -> ( ATL ATR ; ABL ABR ), where ATL is 0 x 0
while n(ABR) != 0 do
   Repartition
   ( ATL  ATR )      ( A00   a01   A02  )
   ( ABL  ABR )  ->  ( a10T  α11   a12T )
                     ( A20   a21   A22  )
   where α11 is a scalar

   α11 := sqrt(α11)
   a21 := a21 / α11
   A22 := A22 - a21 a21^T

   Continue with
   ( ATL  ATR )      ( A00   a01   A02  )
   ( ABL  ABR )  <-  ( a10T  α11   a12T )
                     ( A20   a21   A22  )
endwhile

FLAME/C Code
From algorithm to code...
FLAME notation:
   Repartition
   ( ATL  ATR )      ( A00   a01   A02  )
   ( ABL  ABR )  ->  ( a10T  α11   a12T )
                     ( A20   a21   A22  )
   where α11 is a scalar
FLAME/C representation in code:
   FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00,  /**/ &a01,     &A02,
                       /* ************* */  /* ************************** */
                                              &a10t, /**/ &alpha11, &a12t,
                          ABL, /**/ ABR,      &A20,  /**/ &a21,     &A22,
                          1, 1, FLA_BR );

(Unblocked) FLAME/C Code
int FLA_Cholesky_unb( FLA_Obj A )
{
  /* ... FLA_Part_2x2( ... ); ... */
  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00,  /**/ &a01,     &A02,
                        /* ************* */  /* ************************** */
                                               &a10t, /**/ &alpha11, &a12t,
                           ABL, /**/ ABR,      &A20,  /**/ &a21,     &A22,
                           1, 1, FLA_BR );
    /*------------------------------------------------------------*/
    FLA_Sqrt( alpha11 );                    /* alpha11 := sqrt( alpha11 ) */
    FLA_Inv_Scal( alpha11, a21 );           /* a21 := a21 / alpha11       */
    FLA_Syr( FLA_MINUS_ONE, a21, A22 );     /* A22 := A22 - a21 * a21t    */
    /*------------------------------------------------------------*/
    /* ... FLA_Cont_with_3x3_to_2x2( ... ); ... */
  }
}

Blocked FLAME/C Code
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
  /* ... FLA_Part_2x2( ... ); ... */
  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************* */  /* ******************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*------------------------------------------------------------*/
    FLA_Cholesky_unb( A11 );                  /* A11 := Cholesky( A11 )    */
    FLA_Trsm_rltn( FLA_ONE, A11, A21 );       /* A21 := A21 * inv( A11 )^T */
    FLA_Syrk_ln( FLA_MINUS_ONE, A21, A22 );   /* A22 := A22 - A21 * A21^T  */
    /*------------------------------------------------------------*/
    /* ... FLA_Cont_with_3x3_to_2x2( ... ); ... */
  }
}
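For context, a driver for the blocked routine above might look roughly as follows. This is a hedged sketch, not code from the talk: it assumes the usual libflame object-management calls (FLA_Init, FLA_Obj_create, FLA_Obj_free, FLA_Finalize), and it leaves the initialization of the matrix as a comment because the exact helper used would depend on the library version.

#include "FLAME.h"

int main( void )
{
  FLA_Obj A;
  int     n = 4096, nb_alg = 128;

  FLA_Init();
  FLA_Obj_create( FLA_FLOAT, n, n, 0, 0, &A );   /* default (column-major) strides */

  /* ... fill A with a symmetric positive definite matrix
         (e.g., by writing into its buffer or using a library helper) ... */

  FLA_Cholesky_blk( A, nb_alg );                 /* the blocked routine from the slide */

  FLA_Obj_free( &A );
  FLA_Finalize();
  return 0;
}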

FLAME Code
Visit http://www.cs.utexas.edu/users/flame/spark/
C code: FLAME/C
M-script code for MATLAB: FLAME@lab
Other APIs: FLaTeX, Fortran-77, LabView
Message-passing parallel: PLAPACK
FLAG: GPUs

Outline 1 Motivation 2 Cholesky factorization (Overview of FLAME) 3 Parallelization 4 Targeting platforms with multiple accelerators 5 Experimental results 6 Concluding remarks

Parallelization on Multithreaded Architectures
LAPACK parallelization: multithreaded BLAS.
The blocked algorithm partitions
      ( A11       )
A  =  ( A21  A22  ),  where A11 is b x b.
Pro?: evolves legacy code.
Con: continues to code in the LINPACK style (1970s).
Each call to the BLAS (compute kernels) is a synchronization point for the threads.
As the number of threads increases, serial operations with cost O(n b^2) are no longer negligible compared with O(n^2 b).
A sketch of this fork-join style appears below.
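As an illustration of that fork-join pattern, here is a rough sketch of a LAPACK-style blocked Cholesky loop in which all parallelism comes from multithreaded BLAS calls (cblas_strsm, cblas_ssyrk); chol_unb is the unblocked panel kernel sketched earlier. This is illustrative code, not the talk's implementation: the worker threads fork and join inside every BLAS call, so each call acts as a global synchronization point.

#include <cblas.h>

void chol_unb( int n, float *A, int lda );   /* unblocked kernel (sketched earlier) */

/* LAPACK-style blocked Cholesky: parallelism only inside the BLAS calls. */
void chol_blk( int n, int b, float *A, int lda )
{
  for ( int k = 0; k < n; k += b ) {
    int nb = ( n - k < b ) ? n - k : b;

    /* Factor the b x b diagonal block: serial work, all threads wait. */
    chol_unb( nb, &A[ k + k*lda ], lda );

    if ( k + nb < n ) {
      int m = n - k - nb;
      /* A21 := A21 * inv( A11 )^T  -- threads fork/join inside the call. */
      cblas_strsm( CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                   m, nb, 1.0f, &A[ k + k*lda ], lda, &A[ (k+nb) + k*lda ], lda );
      /* A22 := A22 - A21 * A21^T  -- another fork/join. */
      cblas_ssyrk( CblasColMajor, CblasLower, CblasNoTrans,
                   m, nb, -1.0f, &A[ (k+nb) + k*lda ], lda,
                   1.0f, &A[ (k+nb) + (k+nb)*lda ], lda );
    }
  }
}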

Parallelization on Multithreaded Architectures
Separation of concerns.
Improve parallelism and data locality: algorithms-by-blocks.
  The matrix becomes a matrix of blocks.
  Matrix blocks are the unit of data.
  Computations with matrix blocks are the unit of computation.
  Execute the sequential code to generate a DAG of tasks.
SuperMatrix:
  Runtime system for scheduling tasks to threads.
  Sequential kernels executed by the threads.
Always be sure to make the machine-specific part someone else's problem.

Matrix of blocks
      ( A(0,0)                                        )
      ( A(1,0)    A(1,1)                              )
A  =  ( A(2,0)    A(2,1)    A(2,2)                    )
      ( ...       ...       ...       ...             )
      ( A(M-1,0)  A(M-1,1)  A(M-1,2)  ...  A(M-1,N-1) )
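One simple way to realize storage-by-blocks is a two-level structure: an array of pointers to contiguously stored b x b blocks. The sketch below is a generic illustration of the idea with hypothetical names (blocked_matrix, block_at); it is not the FLAME data structure, which hides this layout behind its object API.

#include <stdlib.h>

/* Hypothetical storage-by-blocks container: an M x N array of pointers,
   each pointing to a contiguously stored b x b block (column-major). */
typedef struct {
  int     M, N, b;      /* blocks per column/row, block size   */
  float **blocks;       /* blocks[ i + j*M ] is block A(i,j)   */
} blocked_matrix;

blocked_matrix *blocked_create( int M, int N, int b )
{
  blocked_matrix *A = malloc( sizeof(*A) );
  A->M = M;  A->N = N;  A->b = b;
  A->blocks = malloc( (size_t) M * N * sizeof(float *) );
  for ( int j = 0; j < N; j++ )
    for ( int i = 0; i < M; i++ )
      A->blocks[ i + j*M ] = calloc( (size_t) b * b, sizeof(float) );
  return A;
}

/* Block A(i,j) as a dense b x b column-major matrix with leading dimension b. */
static inline float *block_at( blocked_matrix *A, int i, int j )
{
  return A->blocks[ i + j*A->M ];
}

Because each block is contiguous, a block can be handed to a sequential kernel (or shipped to an accelerator) as an ordinary dense matrix, which is what makes blocks a natural unit of both data and computation.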

FLAME implementation of algorithm-by-blocks: no change
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
  /* ... FLA_Part_2x2( ... ); ... */
  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );
    /* ... FLA_Repart_2x2_to_3x3( ... ); ... */
    /*------------------------------------------------------------*/
    FLA_Cholesky_unb( A11 );                  /* A11 := Cholesky( A11 )    */
    FLA_Trsm_rltn( FLA_ONE, A11, A21 );       /* A21 := A21 * inv( A11 )^T */
    FLA_Syrk_ln( FLA_MINUS_ONE, A21, A22 );   /* A22 := A22 - A21 * A21^T  */
    /*------------------------------------------------------------*/
    /* ... FLA_Cont_with_3x3_to_2x2( ... ); ... */
  }
}
The FLAME runtime system pre-executes the code: whenever one of these routines is encountered, a pending task is recorded in a global task queue instead of being executed.
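The mechanism can be pictured as follows: during the pre-execution pass, each routine records a task descriptor and its operands instead of calling the kernel. The sketch below is a schematic illustration under that assumption; the names (task, enqueue_task, FLA_Syrk_ln_task) are hypothetical and are not the SuperMatrix API.

#include <stdlib.h>

typedef enum { TASK_CHOL, TASK_TRSM, TASK_SYRK } task_type;

/* Hypothetical task descriptor: which kernel to run, on which blocks,
   and how many of its inputs are still being produced by other tasks. */
typedef struct task {
  task_type    type;
  void        *operands[3];     /* pointers to the blocks involved        */
  int          deps_pending;    /* unsatisfied input dependencies         */
  struct task *next;
} task;

static task *queue_head = NULL;  /* global list of pending tasks */

static void enqueue_task( task_type type, void *a, void *b, void *c )
{
  task *t = malloc( sizeof(*t) );
  t->type = type;
  t->operands[0] = a;  t->operands[1] = b;  t->operands[2] = c;
  t->deps_pending = 0;             /* filled in later by dependence analysis */
  t->next = queue_head;  queue_head = t;
}

/* During the pre-execution pass, a wrapper records a task instead of computing: */
void FLA_Syrk_ln_task( void *A21, void *A22 )
{
  enqueue_task( TASK_SYRK, A21, A22, NULL );   /* A22 := A22 - A21 * A21^T, later */
}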

[Figure: two copies of the matrix of blocks A(0,0) ... A(M-1,N-1), one per view of the data.]

FLAME Parallelization: SuperMatrix
At runtime, build the DAG of tasks; for the matrix of blocks, the first tasks are:
  A(0,0) := Chol_unb( A(0,0) )
  A(1,0) := A(1,0) * inv( tril( A(0,0) ) )^T
  A(2,0) := A(2,0) * inv( tril( A(0,0) ) )^T
  ...
  A(1,1) := A(1,1) - A(1,0) * A(1,0)^T
  ...
SuperMatrix:
  Once all tasks are entered in the DAG, the real execution begins!
  Tasks with all input operands available are ready; other tasks must wait in the global queue.
  Upon termination of a task, the corresponding thread updates the list of pending tasks.
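A scheduler of this kind can be sketched as a simple worker loop: each thread repeatedly takes a ready task (all dependencies satisfied), runs the corresponding sequential kernel, and then decrements the dependence counters of the tasks that consume its output, possibly making them ready. This is a schematic sketch reusing the hypothetical task type from the previous sketch, not the SuperMatrix implementation; pop_ready_task, run_kernel, num_successors, successor, and mark_ready are hypothetical names.

#include <pthread.h>

extern task *pop_ready_task( void );            /* NULL when everything is done   */
extern void  run_kernel( task *t );             /* executes the sequential kernel */
extern int   num_successors( task *t );
extern task *successor( task *t, int k );
extern void  mark_ready( task *t );

static pthread_mutex_t dag_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker( void *arg )                       /* one such thread per core */
{
  task *t;
  while ( ( t = pop_ready_task() ) != NULL ) {
    run_kernel( t );                            /* sequential kernel on blocks  */
    pthread_mutex_lock( &dag_lock );
    for ( int k = 0; k < num_successors( t ); k++ ) {
      task *s = successor( t, k );
      if ( --s->deps_pending == 0 )             /* last input just became ready */
        mark_ready( s );
    }
    pthread_mutex_unlock( &dag_lock );
  }
  return NULL;
}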

Outline 1 Motivation 2 Cholesky factorization (Overview of FLAME) 3 Parallelization 4 Targeting platforms with multiple accelerators 5 Experimental results 6 Concluding remarks

Mapping the Approach to MultiGPU Systems The potential of a GPU

CUDA Hardware
A CUDA-enabled device is seen as a coprocessor to the CPU, capable of executing a very high number of threads in parallel.
Example: the NVIDIA G80, a set of SIMD multiprocessors with on-chip shared memory:
  Up to 128 Streaming Processors (SPs), grouped in clusters.
  SPs are SIMD processors.
  Small and fast shared memory, shared per SP cluster.
  Local 32-bit registers per processor.

CUBLAS Example
int main( void ){
  ...
  float *h_vector, *d_vector;
  h_vector = (float *) malloc( M * sizeof(float) );
  ...                                   /* Initialize vector of M floats */
  cublasAlloc( M, sizeof(float), (void **) &d_vector );
  cublasSetVector( M, sizeof(float), h_vector, 1, d_vector, 1 );
  cublasSscal( M, ALPHA, d_vector, 1 );
  cublasGetVector( M, sizeof(float), d_vector, 1, h_vector, 1 );
  cublasFree( d_vector );
  ...
}
A typical CUDA (and CUBLAS) program has three phases:
1. Allocation and transfer of data to the GPU.
2. Execution of the BLAS kernel.
3. Transfer of results back to main memory.
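Matrix blocks are moved the same way. The following hedged sketch (not from the talk) shows one b x b block being transferred with the legacy cublasSetMatrix/cublasGetMatrix calls and updated on the device with cublasSgemm; cublasInit() is assumed to have been called already, and error checking is omitted.

#include "cublas.h"

/* Sketch: update the host block C := C - A * B^T for b x b column-major blocks,
   performing the computation on the GPU with the legacy CUBLAS API. */
void gemm_block_on_gpu( int b, const float *A, const float *B, float *C )
{
  float *dA, *dB, *dC;

  cublasAlloc( b * b, sizeof(float), (void **) &dA );
  cublasAlloc( b * b, sizeof(float), (void **) &dB );
  cublasAlloc( b * b, sizeof(float), (void **) &dC );

  cublasSetMatrix( b, b, sizeof(float), A, b, dA, b );    /* RAM -> video memory */
  cublasSetMatrix( b, b, sizeof(float), B, b, dB, b );
  cublasSetMatrix( b, b, sizeof(float), C, b, dC, b );

  cublasSgemm( 'N', 'T', b, b, b, -1.0f, dA, b, dB, b, 1.0f, dC, b );

  cublasGetMatrix( b, b, sizeof(float), dC, b, C, b );    /* video memory -> RAM */

  cublasFree( dA );  cublasFree( dB );  cublasFree( dC );
}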

How to use multiple GPUs?
[Figure (two slides): the SuperMatrix runtime dispatches tasks from the DAG to Thread 1, Thread 2, Thread 3, and Thread 4, one thread per GPU.]

Porting SuperMatrix to multiple GPUs
Employ the equivalence: 1 core ≡ 1 GPU.
Difference: explicitly transfer data from RAM to video memory.
The run-time system (scheduling), the storage, and the code are independent.
No significant modification to the FLAME codes: just interface to CUBLAS.
The run-time system can implement different scheduling policies.
Programming effort?

Porting SuperMatrix to multiple GPUs
Employ the equivalence: 1 core ≡ 1 GPU.
Difference: explicitly transfer data from RAM to video memory.
The run-time system (scheduling), the storage, and the code are independent.
No significant modification to the FLAME codes: just interface to CUBLAS.
The run-time system can implement different scheduling policies.
Programming effort? Two hours!
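Under the 1 core ≡ 1 GPU mapping, each worker thread binds itself to its device once and then wraps every kernel it executes with the required host/device transfers. The sketch below illustrates this under those assumptions; it is not the SuperMatrix code, it reuses the hypothetical task and pop_ready_task from the earlier sketches, and execute_on_gpu is a hypothetical helper.

#include <cuda_runtime.h>
#include "cublas.h"

extern task *pop_ready_task( void );
extern void  execute_on_gpu( int gpu_id, task *t );   /* hypothetical helper */

/* Hypothetical per-GPU worker: thread i drives GPU i exclusively. */
void *gpu_worker( void *arg )
{
  int gpu_id = *(int *) arg;

  cudaSetDevice( gpu_id );     /* bind this thread to its GPU      */
  cublasInit();                /* legacy CUBLAS context per thread */

  task *t;
  while ( ( t = pop_ready_task() ) != NULL ) {
    /* Transfer the operand blocks to video memory (unless already cached there),
       run the CUBLAS kernel, and copy written blocks back to RAM. */
    execute_on_gpu( gpu_id, t );
    /* ... update dependence counters of successor tasks, as before ... */
  }

  cublasShutdown();
  return NULL;
}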

Outline 1 Motivation 2 Cholesky factorization (Overview of FLAME) 3 Parallelization 4 Targeting platforms with multiple accelerators 5 Experimental results 6 Concluding remarks

Porting SuperMatrix. Experimental results
[Figure: Cholesky factorization on 4 GPUs of an NVIDIA S870. Curves: AB + no 2-D; AB + 2-D; AB + 2-D + block cache; AB + 2-D + block cache + page-locked memory. Axes: GFLOPS (0-140) vs. matrix size (0-6144).]

Porting SuperMatrix. Experimental results
A more elaborate port is required for high performance:
  2-D work distribution.
  Memory/cache coherence techniques to reduce transfers between RAM and video memory: write-back and write-invalidate.
For details, see the PPoPP09 paper.
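For concreteness, a 2-D cyclic work distribution can be as simple as assigning block (i, j) to a GPU determined by its coordinates on a virtual r x c grid of accelerators; the block cache then keeps recently used blocks resident in that GPU's video memory and writes them back lazily. The function below is a minimal illustration of the owner computation only, not the distribution used in the paper.

/* Map block (i, j) to one of r*c GPUs arranged as a virtual r x c grid
   (2-D cyclic distribution). Purely illustrative. */
int owner_gpu( int i, int j, int r, int c )
{
  return ( i % r ) * c + ( j % c );
}

/* Example: with 4 GPUs on a 2 x 2 grid, block (3, 5) is owned by
   owner_gpu(3, 5, 2, 2) = (3 % 2) * 2 + (5 % 2) = 3. */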

Porting SuperMatrix. Experimental results
[Figure: Cholesky factorization on several platforms. Curves: dynamic spotrf on Tesla S870 (4 GPUs); dynamic spotrf on SGI Altix 350 (16 Intel Itanium2); MKL spotrf on Xeon QuadCore (4 cores); MKL spotrf on SGI Altix 350 (16 Itanium2). Axes: GFLOPS (0-500) vs. matrix size (0-20000).]

More recent results
[Figure: Performance of the Cholesky factorization on GPU/CPU. Curves: MKL 10.0 spotrf on two Intel Xeon QuadCore (2.2 GHz); algorithm-by-blocks on Tesla S870; algorithm-by-blocks on Tesla S1070. Axes: GFLOPS (0-700) vs. matrix size (0-20000).]

Other Operations
Level-3 BLAS (matrix-matrix operations): follow very similarly.
LU/QR factorization:
  New algorithms-by-blocks have been developed.
  They require kernels for an individual GPU to be written.

Outline 1 Motivation 2 Cholesky factorization (Overview of FLAME) 3 Parallelization 4 Targeting platforms with multiple accelerators 5 Experimental results 6 Concluding remarks

Related Approaches
Cilk (MIT) and SMPSs (Barcelona Supercomputing Center):
  General-purpose parallel programming.
  Cilk targets irregular/recursive problems.
  SMPSs is more general and also manages dependencies.
  High-level language based on OpenMP-like pragmas + compiler + runtime system.
  Modest results for dense linear algebra.
PLASMA project:
  Next step in the LAPACK evolutionary path.
  Traditional style of implementing algorithms.
  Does not solve the programmability problem.

Outline 1 Motivation 2 Cholesky factorization (Overview of FLAME) 3 Parallelization 4 Targeting platforms with multiple accelerators 5 Experimental results 6 Concluding remarks

Conclusion
A success story:
  For the domain of dense linear algebra libraries, FLAME + SuperMatrix appears to solve the programmability problem.
  Very good performance can be achieved with multiple accelerators in this domain.
For more information, visit http://www.cs.utexas.edu/users/flame/