NEW ADVANCES IN GPU LINEAR ALGEBRA

1 GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012

2 QUICK ABOUT US
» HPC/GPU Consulting Firm
» Specializations in:
  » Electromagnetics
  » Image Processing
  » Fluid Dynamics
  » Linear Algebra

3 INTRODUCTION TO OUR LIBRARIES
» CULA Dense
  » Linear algebra routines
» CULA Sparse
  » Iterative sparse system solvers and preconditioners
» pcula
  » Scalable solvers for multiple GPUs
» Ongoing work

4 INTRODUCTION - COMMON POINTS
» Easy to use
  » No GPU programming experience necessary
  » dgetrf( ) → culadgetrf( )
» Exhaustively tested and benchmarked
  » Accuracy & stability first!
» Cross platform
  » Linux, Windows, Mac OS X
» Multiple languages
  » C/C++, Fortran, Python, Matlab

5 CULA DENSE

6 CULA DENSE INTRODUCTION
» First released in 2009
» LAPACK and BLAS implementations
» Host or device memory
» Almost 300 routines
» Upcoming release (R15)
  » Tuned for Kepler architecture
  » Now free for personal academic use

7 CULA DENSE - FUNCTIONALITY
LAPACK: LU factorization, Cholesky factorization, QR decomposition, Orthogonal factorization, Least squares, System solve, Eigenvalue routines, Matrix inversion, Singular value decomposition, Auxiliary routines
BLAS: Matrix-matrix multiply, Matrix-vector multiply, Rank updates, Conjugate, Transpose

8 CULA DENSE - PERFORMANCE
[Figure: CULA Dense - Cholesky Factorization (SPOTRF). GFLOPs versus matrix size, CPU (MKL) against GPU (GTX680).]
Performance numbers include transfer time across PCI-Express (Gen2) bus
CPU: Intel Core i7 2600K
GPU: NVIDIA GTX 680 (1.5 GB)

9 CULA DENSE LINK INTERFACE
» GPU acceleration with no code changes!
» Intercepts calls to BLAS & LAPACK libraries
» Analyze routine, parameters, and hardware
» Forward to GPU if appropriate
» Pass-through to CPU otherwise
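The analyze-then-forward decision on this slide could be sketched roughly as below. This is not the CULA link interface itself: the routine set, the size threshold, and the names `Call`, `choose_backend`, `GPU_ROUTINES`, and `MIN_GPU_DIM` are all hypothetical, chosen only to illustrate the idea of falling back to the CPU when offloading is unlikely to pay off.

```python
from dataclasses import dataclass

# Hypothetical threshold: below this dimension, PCI-Express transfer cost is
# assumed to outweigh any GPU speedup. Not a value taken from CULA.
MIN_GPU_DIM = 512

# Hypothetical set of routines assumed to have a GPU implementation.
GPU_ROUTINES = {"dgetrf", "dpotrf", "dgemm"}


@dataclass
class Call:
    routine: str  # BLAS/LAPACK routine name, e.g. "dgetrf"
    m: int        # number of rows of the main operand
    n: int        # number of columns of the main operand


def choose_backend(call: Call, gpu_available: bool) -> str:
    """Return "gpu" or "cpu" for an intercepted BLAS/LAPACK call."""
    if not gpu_available:
        return "cpu"                      # hardware check
    if call.routine not in GPU_ROUTINES:
        return "cpu"                      # routine check: pass through
    if min(call.m, call.n) < MIN_GPU_DIM:
        return "cpu"                      # parameter check: too small
    return "gpu"
```

For example, `choose_backend(Call("dgetrf", 4096, 4096), True)` selects the GPU, while a 64 x 64 call to the same routine stays on the CPU.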

10 CULA SPARSE (ITERATIVE)

11 CULA SPARSE INTRODUCTION
» First released in 2011
» Iterative solvers and preconditioners
» Multiple matrix storage formats supported
» Upcoming release (S3)
  » Tuned for Kepler
  » Free for personal academic use

12 CULA SPARSE - FUNCTIONALITY
Solvers: CG, BiCG, BiCG-Stab / BiCG-Stab(L), GMRES, MINRES
Preconditioners: Jacobi, Block Jacobi, ILU0, Reordered ILU0
Data: Double / Complex; CSR / CSC / COO
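To make one solver/preconditioner/format combination from this slide concrete, here is a minimal pure-Python sketch of conjugate gradient (CG) with a Jacobi preconditioner on a CSR matrix. It is a textbook formulation for illustration only, not CULA Sparse code; the function names and the tolerance default are assumptions, and the matrix is assumed symmetric positive definite with a nonzero diagonal.

```python
def csr_matvec(n, indptr, indices, data, x):
    """Compute y = A @ x for an n x n matrix stored in CSR form."""
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y


def csr_diagonal(n, indptr, indices, data):
    """Extract the main diagonal of a CSR matrix."""
    d = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                d[i] = data[k]
    return d


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def jacobi_pcg(n, indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b with Jacobi-preconditioned CG; returns (x, iterations)."""
    inv_diag = [1.0 / d for d in csr_diagonal(n, indptr, indices, data)]
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # apply M^-1 = diag(A)^-1
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for it in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:      # relative residual test
            return x, it
        ap = csr_matvec(n, indptr, indices, data, p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter
```

The sparse matrix-vector product and the dot products dominate the cost of each iteration, and both are data-parallel operations.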

13 CULA SPARSE - PERFORMANCE
[Figure: Iterative Solver Performance. Speedup (axis from 0x to 16x) for CG, GMRES, BiCG, MINRES, BiCGSTAB, and BiCGSTABL.]
System Size = 1.5M
GPU = NVIDIA C2070
CPU = Xeon X5560 (MKL)

14 CULA SPARSE PERFORMANCE FEATURES
» Hybrid performance
  » CPU begins working during initial transfer
  » Preconditioner generation
  » Initial iterations
» Matrix reordering
  » Can increase parallelism

15 PCULA MULTI-GPU + CPU PERFORMANCE

16 PCULA INTRODUCTION
» Scale to multiple GPUs and CPUs in a single node
» Currently in alpha release
» Greatly increased performance, scalability, and functionality coming soon!

17 PCULA TILED ALGORITHMS
[Figure: an m × n original matrix beside its tiled form, a 4 × 4 grid of tiles indexed (0,0) through (3,3).]
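The tiling shown in the figure can be illustrated with a short sketch that splits a dense matrix (a list of rows) into tiles keyed by (tile row, tile column). The helper names `tile_ranges` and `tile_matrix` are hypothetical, and this copy-based version only shows the indexing; it says nothing about how pcula stores its tiles.

```python
def tile_ranges(extent, tile):
    """Half-open [start, stop) ranges covering 0..extent in steps of tile."""
    return [(start, min(start + tile, extent))
            for start in range(0, extent, tile)]


def tile_matrix(a, tile_rows, tile_cols):
    """Split matrix a (list of equal-length rows) into a dict of tiles.

    Keys are (i, j) tile coordinates, matching the (0,0)..(3,3) labels in
    the figure; edge tiles are smaller when the size is not a multiple of
    the tile size.
    """
    m, n = len(a), len(a[0])
    tiles = {}
    for i, (r0, r1) in enumerate(tile_ranges(m, tile_rows)):
        for j, (c0, c1) in enumerate(tile_ranges(n, tile_cols)):
            tiles[(i, j)] = [row[c0:c1] for row in a[r0:r1]]
    return tiles
```

An 8 x 8 matrix with 2 x 2 tiles yields the 16 tiles of the figure, and each tile becomes an independent unit of work and of data transfer.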

18 PCULA TASK SCHEDULING
[Figure: task scheduler state.]
Completed Tasks: POTRF, TRSM, TRSM, TRSM, SYRK
Pending Valid Tasks: GEMM, GEMM, GEMM, POTRF, SYRK, SYRK
Hardware: Busy (3 tasks), Free (1 task), Free (0 tasks)
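The task names on this slide are those of a tiled Cholesky factorization. As a sketch of where such a task list comes from, the code below generates the tasks of the standard right-looking tiled Cholesky (lower triangle) and derives which tasks are ready to run. The `Task` record and the functions `cholesky_tasks`, `dependencies`, and `ready` are hypothetical names; pcula's actual scheduler is not described in the slides.

```python
from collections import namedtuple

# kind: kernel name; writes: tile updated in place; reads: input tiles.
Task = namedtuple("Task", "kind writes reads")


def cholesky_tasks(t):
    """List the tasks of a right-looking tiled Cholesky on a t x t tile grid."""
    tasks = []
    for k in range(t):
        tasks.append(Task("POTRF", (k, k), ()))            # factor diagonal tile
        for i in range(k + 1, t):
            tasks.append(Task("TRSM", (i, k), ((k, k),)))  # solve panel tile
        for i in range(k + 1, t):
            tasks.append(Task("SYRK", (i, i), ((i, k),)))  # update diagonal
            for j in range(k + 1, i):
                tasks.append(Task("GEMM", (i, j), ((i, k), (j, k))))
    return tasks


def dependencies(tasks):
    """For each task, the set of earlier task indices it must wait for.

    Only the last writer of each touched tile is tracked, which is enough
    here because no tile is overwritten after a later task has read it.
    """
    last_writer = {}
    deps = []
    for idx, task in enumerate(tasks):
        needed = set()
        for tile in (task.writes,) + task.reads:
            if tile in last_writer:
                needed.add(last_writer[tile])
        deps.append(needed)
        last_writer[task.writes] = idx
    return deps


def ready(tasks, deps, done):
    """Indices of tasks not yet done whose dependencies are all done."""
    return [i for i in range(len(tasks)) if i not in done and deps[i] <= done]
```

On a 4 x 4 tile grid only the first POTRF is ready at the start; once it completes, the three TRSM tasks of the first panel become valid at the same time and can be handed to different devices.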

19 PCULA HETEROGENEOUS TASK SCHEDULING
» Data locality is critical
» Hardware performance
  » Persistent live tuning performance database
» Task queue depth
  » Too long: idle hardware if not perfect
  » Too short: worker starvation

20 PCULA OUT OF (GPU) CORE
» Solve problems larger than GPU memory
» Natural extension of tiled data partitioning
» MESI memory coherence protocol
» Least recently used replacement strategy
[Figure: two 4 × 4 tile grids indexed (0,0) through (3,3), with tiles marked by MESI state: Modified, Exclusive, Shared, or Invalid.]
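A least-recently-used tile cache with write-back of modified tiles might look like the sketch below. It is a simplification under stated assumptions: a single device, a plain dirty flag standing in for the Modified state rather than the full MESI protocol, and a hypothetical `TileCache` class whose `transfers` counter models host/device copies.

```python
from collections import OrderedDict


class TileCache:
    """Device-side tile cache with LRU eviction and write-back."""

    def __init__(self, capacity, host_tiles):
        self.capacity = capacity       # number of tiles that fit on the device
        self.host = host_tiles         # dict: tile id -> data in host memory
        self.device = OrderedDict()    # tile id -> [data, dirty], LRU first
        self.transfers = 0             # count of host<->device copies

    def _load(self, tile_id):
        if tile_id in self.device:
            self.device.move_to_end(tile_id)        # mark most recently used
            return self.device[tile_id]
        if len(self.device) >= self.capacity:
            victim, (data, dirty) = self.device.popitem(last=False)
            if dirty:                               # modified: write back first
                self.host[victim] = data
                self.transfers += 1
        entry = [self.host[tile_id], False]         # copy in, clean
        self.device[tile_id] = entry
        self.transfers += 1
        return entry

    def read(self, tile_id):
        return self._load(tile_id)[0]

    def write(self, tile_id, data):
        entry = self._load(tile_id)
        entry[0] = data
        entry[1] = True                             # device copy is now newer
```

With a capacity of two tiles, reading tile 0, writing tile 1, reading tile 0 again, and then reading tile 2 evicts tile 1 and copies its modified contents back to the host before tile 2 is loaded.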

21 PCULA FUNCTION LIST
» Currently supports
  » BLAS Routines (GEMM, TRSM, GEMV)
  » LU Factorization & Solve (GETRF + GESV)
  » Cholesky Factorization & Solve (POTRF + POSV)
  » QR Factorization & Solve (GEQRF + GEQRS)
» Eigenvalue and SVD routines in future release

22 PCULA - PERFORMANCE
[Figure: pcula - DGEMM Performance. GFLOPs versus matrix size for CPU, GPU, CPU + GPU, and CPU + 2xGPU.]
Performance numbers include transfer time across PCI-Express (Gen2) bus
CPU: Intel Xeon 5560
GPU: 2x NVIDIA C

23 ONGOING WORK

24 ONGOING WORK - CULA
» CULA Dense
  » More routines/tuning
» CULA Sparse
  » Direct solvers
  » Algebraic Multi-Grid (AMG)
» pcula
  » Multi-node cluster support
  » NUMA optimizations

25 ONGOING WORK - C++ AMP
» Microsoft's C++ AMP library
» ampblas development project
  » Linear algebra to C++ AMP ecosystem
» Multiple talks today and tomorrow
» C++ AMP Lounge

26 CULA PARTNERS & INTEGRATORS
» Here at GTC

27 THANKS!
Questions?
» Convention booth #20
» More
