GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)


Natalia Gimelshein (NVIDIA), Anshul Gupta (IBM), Steve Rennich (NVIDIA), Seid Koric (NCSA)

WATSON SPARSE MATRIX PACKAGE (WSMP)
- Cholesky, LDL^T, and LU factorization
- Distributed- and shared-memory algorithms
- Scalable: employs the theoretically most scalable factorization algorithm
- Uses the multifrontal method
- This work concentrates on accelerating the numerical factorization phase

OBJECTIVE
- Accelerate WSMP
- Demonstrate the suitability of GPUs for sparse direct solves
- Demonstrate the suitability of GPUs for very irregular workloads
- Evaluate methods for accelerating WSMP:
  - simple methods can work well
  - more sophisticated methods can work better

OUTLINE
- Minimally invasive acceleration of BLAS-intensive applications
- High-level interface
- Acceleration techniques
- MPI acceleration

GPU ACCELERATION: MINIMALLY INVASIVE APPROACH
- Intercept BLAS level 3 calls with large dimensions and send them to the GPU
- Tile to hide copies (timeline diagram: CPU work, H2D copy, kernel, and D2H copy overlapped over time)
- PCIe and host memory bandwidth are limiting
- But many of the matrices are not big enough to be tiled

GPU ACCELERATION: MINIMALLY INVASIVE APPROACH
- We intercept simultaneous moderate-size calls; each one on its own is not big enough to fully occupy the GPU
- Use streams to:
  - increase GPU utilization
  - hide copy-up and copy-down
  - let kernels overlap
(Diagram: calls from threads 0-3 interleaved with CPU work across streams)

GPU ACCELERATION: MINIMALLY INVASIVE APPROACH
- Use pinned host buffers to increase copy-up/copy-down speed and enable asynchronous memory copies
- Send small BLAS calls to the CPU (m, n < 128; k < 512)
- Large matrices are tiled individually
- Can be used with ANY application, not just WSMP
- No recompilation of the code required
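To make the interception idea concrete, here is a minimal, hypothetical sketch of a preloadable dgemm_ wrapper applying the cutoffs above: small calls fall through to the host BLAS, large ones go to cuBLAS. It omits the tiling, streams, and pinned staging buffers the slides describe, so it only illustrates the dispatch, not WSMP's actual interception layer.

```c
/* Hedged sketch of a drop-in BLAS interceptor: built as a shared library and
 * preloaded (e.g. via LD_PRELOAD); names and cutoffs are illustrative. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

typedef void (*dgemm_fn)(const char*, const char*, const int*, const int*, const int*,
                         const double*, const double*, const int*,
                         const double*, const int*, const double*, double*, const int*);

void dgemm_(const char *ta, const char *tb, const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb, const double *beta, double *C, const int *ldc)
{
    static dgemm_fn host_dgemm = NULL;
    static cublasHandle_t handle = NULL;          /* sketch only: not thread-safe */
    if (!host_dgemm) host_dgemm = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");

    /* Small calls stay on the CPU (cutoffs from the slide: m,n < 128, k < 512). */
    if (*m < 128 || *n < 128 || *k < 512) {
        host_dgemm(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }
    if (!handle) cublasCreate(&handle);

    cublasOperation_t opA = (*ta == 'N' || *ta == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    cublasOperation_t opB = (*tb == 'N' || *tb == 'n') ? CUBLAS_OP_N : CUBLAS_OP_T;
    int rA = (opA == CUBLAS_OP_N) ? *m : *k, cA = (opA == CUBLAS_OP_N) ? *k : *m;
    int rB = (opB == CUBLAS_OP_N) ? *k : *n, cB = (opB == CUBLAS_OP_N) ? *n : *k;

    /* A real implementation would tile and stage through pinned buffers on
     * streams to hide the copies; here everything is copied synchronously. */
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, (size_t)rA * cA * sizeof(double));
    cudaMalloc((void**)&dB, (size_t)rB * cB * sizeof(double));
    cudaMalloc((void**)&dC, (size_t)(*m) * (*n) * sizeof(double));
    cublasSetMatrix(rA, cA, sizeof(double), A, *lda, dA, rA);
    cublasSetMatrix(rB, cB, sizeof(double), B, *ldb, dB, rB);
    cublasSetMatrix(*m, *n, sizeof(double), C, *ldc, dC, *m);

    cublasDgemm(handle, opA, opB, *m, *n, *k, alpha, dA, rA, dB, rB, beta, dC, *m);

    cublasGetMatrix(*m, *n, sizeof(double), dC, *m, C, *ldc);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```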

RESULTS: SYSTEM USED
- Dual-socket Ivy Bridge Xeon E5-2690 v2 @ 3.0 GHz, 20 cores total, PCIe gen3
- Tesla K40, ECC on
- Intel MKL for host BLAS calls
- CUDA 6.5; standard cuBLAS routines are used for BLAS and BLAS-like operations

RESULTS: DROP-IN GPU ACCELERATION
(Chart: numerical factorization performance, GFlop/s, CPU vs. drop-in GPU acceleration across ten test matrices; per-matrix speedups over the CPU of 1.48, 1.45, 1.66, 1.53, 1.37, 1.67, 1.77, 0.88, 1.30, 1.82; the slide calls out 1.7x)
2 x Xeon E5-2690 v2 + K40 (max boost, ECC on)

SPARSE DIRECT SOLVERS
- Supernodes: collections of columns with similar non-zero pattern
  - provide an opportunity for dense matrix math
  - grow with mesh size due to fill: the larger the model, the larger the supernodes
- Supernodes in the factors are detected during symbolic factorization

DENSE BLOCK CHOLESKY
Basis for sparse direct algorithms:
  L11 L11^T = A11              POTRF (element-wise Cholesky factorization)
  L21 L11^T = A21              TRSM (triangular solve)
  A22* = A22 - L21 L21^T       SYRK/GEMM (Schur complement)
In block form:
  [ A11  A21^T ]   [ L11  0 ]   [ I   0                ]   [ L11^T  L21^T ]
  [ A21  A22   ] = [ L21  I ] x [ 0   A22 - L21 L21^T  ] x [ 0      I     ]
- Emphasizes dense math
- Dominated by the computation of the Schur complement
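For concreteness (this is not WSMP code), the POTRF/TRSM/SYRK sequence above is the classic right-looking blocked Cholesky; a minimal host-side sketch using LAPACKE/CBLAS, with an illustrative block size, looks like this:

```c
/* Minimal sketch of the dense block Cholesky step named on the slide:
 * POTRF on the diagonal block, TRSM for the panel, SYRK for the Schur
 * complement. Column-major, lower triangular; nb is illustrative. */
#include <cblas.h>
#include <lapacke.h>

void block_cholesky(double *A, int n, int lda, int nb)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;

        /* POTRF: factor the diagonal block, A(j:j+jb, j:j+jb) = L11 L11^T */
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, &A[j + (size_t)j * lda], lda);

        if (j + jb < n) {
            int m = n - j - jb;
            /* TRSM: solve L21 L11^T = A21 for the panel L21 */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                        m, jb, 1.0, &A[j + (size_t)j * lda], lda,
                        &A[(j + jb) + (size_t)j * lda], lda);
            /* SYRK: Schur complement, A22 := A22 - L21 L21^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        m, jb, -1.0, &A[(j + jb) + (size_t)j * lda], lda,
                        1.0, &A[(j + jb) + (size_t)(j + jb) * lda], lda);
        }
    }
}
```

The same three dense kernels dominate the frontal matrices in the multifrontal factorization, which is what makes the method a good fit for the GPU.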

PARALLEL MULTIFRONTAL METHOD
- The task that owns a supernode:
  - assembles the frontal matrix
  - factors the supernode
  - computes the update matrix that will be used by the parent task
- In the beginning, there are many independent supernodes to be factored by parallel threads
- In the end, threads cooperate on the factorization of the fewer remaining supernodes
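A hypothetical sketch of this tree parallelism, using OpenMP tasks rather than WSMP's own scheduler: children of a supernode are factored as independent tasks, and the parent waits for their update matrices before assembling and factoring its own front. All helper names are placeholders.

```c
#include <omp.h>

typedef struct node {
    struct node **children;   /* children in the elimination/assembly tree */
    int nchildren;
    /* frontal matrix, supernode columns, update (Schur) matrix, ... */
} node_t;

/* Placeholder stand-ins for the real assembly and dense kernels. */
static void assemble_front(node_t *s)     { (void)s; /* gather A entries + children's updates */ }
static void factor_supernode(node_t *s)   { (void)s; /* dense POTRF/TRSM/SYRK on the front    */ }
static void form_update_matrix(node_t *s) { (void)s; /* Schur complement passed to the parent */ }

static void factor_subtree(node_t *s)
{
    /* Independent subtrees run as parallel tasks (plenty of them near the leaves). */
    for (int i = 0; i < s->nchildren; ++i) {
        node_t *child = s->children[i];
        #pragma omp task firstprivate(child)
        factor_subtree(child);
    }
    #pragma omp taskwait      /* the parent needs all children's update matrices */

    assemble_front(s);
    factor_supernode(s);
    form_update_matrix(s);
}

void multifrontal_factor(node_t *root)
{
    #pragma omp parallel
    #pragma omp single        /* one thread seeds the task tree */
    factor_subtree(root);
}
```

This sketch factors each front with a single task; in the real solver the few large fronts near the root are themselves factored cooperatively by multiple threads, which is the "threads cooperate" case on the slide.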

POTENTIAL FOR IMPROVEMENT
- Potentially big BLAS calls are split into smaller ones, hurting performance
- The same data is moved back and forth many times:
  - a lot of pressure on PCIe for data movement
  - a lot of pressure on host memory bandwidth
- A few GB of memory are allocated on the device and have to be pinned on the host

SOLUTION: HIGH-LEVEL INTERFACE
- Most of the work in Cholesky factorization is performed in dtrsmsyrk calls (a dtrsm followed by a dsyrk)
- Bigger BLAS dimensions are more favorable for the GPU
- Big dtrsmsyrk calls are sent to the GPU (inner dimension >= 512)
- Data transfer still needs to be hidden, but there is much less data motion between the host and the device
- Less memory is needed on the device, and less pinned memory on the host

DTRSM IMPLEMENTATION
Solve L21 L11^T = A21:
- dtrsm is tiled; tiles and the related copies are submitted to different streams
- results are kept on the GPU, as they will be needed for the dsyrk
- a copy is sent to the CPU
- host buffers are used for staging host data
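A simplified CUDA/cuBLAS sketch of this tiled dtrsm (assumptions: column-major storage, L11 already resident on the device, pinned host memory for A21, and illustrative tile size and stream count; the copy of the result back to the CPU is left out):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NSTREAMS 4

/* Solve L21 * L11^T = A21 for L21. L11 is k x k lower triangular on the
 * device (d_L11); h_A21 is m x k on the host (pinned, leading dim lda);
 * d_L21 (m x k, leading dim m) receives the result and stays on the GPU
 * for the following dsyrk. */
void tiled_dtrsm(cublasHandle_t handle, cudaStream_t streams[NSTREAMS],
                 const double *d_L11, int k,
                 const double *h_A21, int lda, double *d_L21, int m,
                 int tile_rows)
{
    const double one = 1.0;
    for (int r = 0, t = 0; r < m; r += tile_rows, ++t) {
        int mb = (m - r < tile_rows) ? m - r : tile_rows;
        cudaStream_t s = streams[t % NSTREAMS];
        cublasSetStream(handle, s);

        /* copy-up of this row tile of A21 through the stream */
        cublasSetMatrixAsync(mb, k, sizeof(double),
                             &h_A21[r], lda, &d_L21[r], m, s);

        /* X * L11^T = A21_tile  ->  X = L21 tile, left in device memory */
        cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                    mb, k, &one, d_L11, k, &d_L21[r], m);
    }
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamSynchronize(streams[i]);
}
```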

DSYRK IMPLEMENTATION
Compute A22* = A22 - L21 L21^T:
- dsyrk is tiled; different tiles and the related copies are submitted to different streams
- the L21 L21^T result is sent to a host buffer
- the update A22 - L21 L21^T is performed on the host
- m = 384 with 4 streams is enough to keep the GPU occupied
(Diagram: L21 tile of height m feeding the dsyrk in device memory; A22 update in host memory)
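And a correspondingly simplified sketch of this update step (assumptions: L21 from the dtrsm is still in device memory, a single stream with per-tile synchronization instead of the four overlapping streams on the slide, dgemm panels standing in for the tiled dsyrk, and illustrative staging buffers d_work/h_work, with h_work pinned):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* A22 (n x n, column-major, leading dim lda22) lives on the host; only its
 * lower triangle is updated. d_L21 is n x k in device memory (leading dim n).
 * d_work and h_work each hold at least n*nb doubles. */
void tiled_syrk_update(cublasHandle_t handle, cudaStream_t stream,
                       const double *d_L21, int n, int k,
                       double *A22, int lda22,
                       double *d_work, double *h_work, int nb)
{
    const double one = 1.0, zero = 0.0;
    cublasSetStream(handle, stream);

    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;
        int mj = n - j;                        /* rows j..n-1 of this panel */

        /* panel of L21*L21^T: W = L21(j:n-1, :) * L21(j:j+jb-1, :)^T */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, mj, jb, k,
                    &one, &d_L21[j], n, &d_L21[j], n, &zero, d_work, mj);

        /* copy-down into the pinned staging buffer */
        cublasGetMatrixAsync(mj, jb, sizeof(double), d_work, mj, h_work, mj, stream);
        cudaStreamSynchronize(stream);

        /* host update of the lower triangle: A22 := A22 - W */
        for (int c = 0; c < jb; ++c)
            for (int r = c; r < mj; ++r)       /* skip the strictly upper part */
                A22[(j + r) + (size_t)(j + c) * lda22] -= h_work[r + (size_t)c * mj];
    }
}
```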

TILED DTRSM/DTRSMSYRK
- When the inner dimension of the dtrsm is too big, it has to be tiled
- X1 is calculated from X1 A11^T = B1 as described before
- The dsyrk update is performed if needed
- B2 <- B2 - X1 A21^T, where X1 is in GPU memory and B2 and A21 are on the CPU
- The process is repeated for the remainder of the matrix
(Diagram: blocks A11, A21, A22 and panels B1/X1, B2)

MEMORY REQUIREMENTS
- For the tiled dgemm, with at most 2048 x 2048 quad-tiles: 12 tiles, ~400 MB (12 x 2048 x 2048 x 8 B)
- For the tiled dtrsm-syrk, with at most 2048 x 2048 tiles: 12 tiles, ~400 MB
- Buffer for dtrsm results: with k <= 2048, 1 GB is enough for front sizes up to 61000 (61000 x 2048 x 8 B); bigger fronts can be handled by successive calls to dtrsm and the tiled dsyrk
- Total: less than 2 GB

RESULTS
(Chart: numerical factorization performance, GFlop/s, for the CPU, drop-in, and high-level versions across ten test matrices; per-matrix speedups over the CPU: drop-in 1.48, 1.45, 1.66, 1.53, 1.37, 1.67, 1.77, 0.88, 1.52, 1.82; high-level 1.97, 1.93, 2.30, 2.23, 1.67, 2.36, 2.01, 1.12, 2.22, 2.48)
2 x Xeon E5-2690 v2 + K40 (max boost, ECC on)

PROFILE
(Profiler timelines comparing the drop-in and the high-level API versions)

HIGH-LEVEL INTERFACE FOR LDL FACTORIZATION
- WSMP signals what data is likely to be reused, and that data is cached on the GPU

LDL^T FACTORIZATION
1) L21 L11^T = A21        dtrsm
2) S21 = L21 D            scaling (cublasDdgmm)
3) C = C - S21 L21^T      symmetric update (cublasDsyrkx)
L21 and S21 are kept in GPU memory between operations
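A minimal sketch of these three calls with cuBLAS, keeping L21 and S21 in device memory as the slide describes (assumptions: column-major storage, unit-diagonal L11, D stored as a length-k vector, and all device buffers already allocated; error handling and tiling omitted):

```c
#include <cublas_v2.h>

/* Shapes: L11 is k x k lower triangular (unit diagonal assumed for LDL^T),
 * d_D holds the k diagonal entries of D, A21/L21/S21 are m x k, and C is the
 * m x m block whose lower triangle is updated. */
void ldlt_update(cublasHandle_t handle,
                 const double *d_L11, const double *d_D, int k,
                 double *d_L21 /* in: A21, out: L21 */, int m,
                 double *d_S21, double *d_C)
{
    const double one = 1.0, minus_one = -1.0;

    /* 1) dtrsm: solve L21 * L11^T = A21 (A21 is overwritten with L21) */
    cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_UNIT,
                m, k, &one, d_L11, k, d_L21, m);

    /* 2) scaling: S21 = L21 * D (scale each column by the diagonal of D) */
    cublasDdgmm(handle, CUBLAS_SIDE_RIGHT, m, k,
                d_L21, m, d_D, 1, d_S21, m);

    /* 3) symmetric update: C = C - S21 * L21^T (lower triangle only) */
    cublasDsyrkx(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                 m, k, &minus_one, d_S21, m, d_L21, m, &one, d_C, m);
}
```

cublasDsyrkx computes C = alpha * A * B^T + beta * C while touching only the indicated triangle of C, which matches the asymmetric rank-k form C - S21 L21^T needed here.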

LDL RESULTS
(Chart: numerical factorization performance, GFlop/s, LDL on CPU vs. GPU across ten test matrices; per-matrix speedups of 1.70, 1.60, 2.03, 2.12, 1.30, 1.94, 1.40, 1.36, 2.25, 2.21)
2 x Xeon E5-2690 v2 + K40 (max boost, ECC on)

SPEEDUP
(Scatter plot: speed-up, 0 to 3.0, for Cholesky and LDL versus the fraction of factorization flops with k >= 512)

DISTRIBUTED MEMORY PARALLEL FACTORIZATION (MPI)
- Factorization by p processes; best performance when p is a power of 2
- Factorization of the levels below the top log p levels of the elimination tree is performed by the processes independently
- Processes cooperate on factoring the upper levels of the elimination tree

DISTRIBUTED MEMORY PARALLEL FACTORIZATION
- Update matrices are distributed between the processes working on them in a block-cyclic fashion
- A smaller block size provides better load balancing
- With a bigger block size, the efficiency of BLAS3 operations increases
(Diagram: block-cyclic assignment of matrix blocks to processes 0-7)

DISTRIBUTED MEMORY PARALLEL FACTORIZATION
- On the GPU, the block size ideally should be 512 or more
  - determined by PCIe and host memcopy speed
- Large tiles hurt load balancing
- For our test system, a block size of 512 provides the best performance
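For reference, a toy sketch of the block-cyclic ownership rule behind this distribution (a 2D layout, the process grid dimensions, the block size, and row-major rank numbering are all assumptions for illustration, not WSMP's actual mapping):

```c
/* Owner of global matrix entry (i, j) under a 2D block-cyclic distribution
 * over a pr x pc process grid with square blocks of size nb. */
int block_cyclic_owner(int i, int j, int nb, int pr, int pc)
{
    int bi = (i / nb) % pr;   /* block row, cycled over process rows    */
    int bj = (j / nb) % pc;   /* block column, cycled over process cols */
    return bi * pc + bj;      /* row-major rank numbering (assumption)  */
}
```

A smaller nb spreads blocks more evenly across ranks, while a larger nb gives each rank bigger, more BLAS3-friendly blocks; the 512 cutoff above is where that trade-off lands for the GPU on the test system.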

CACHING FOR MPI
- dtrsm, dgemm, and dsyrk results are kept in the GPU cache
- One of the matrices, the one that is going to be reused, is put into the GPU cache in the course of performing the dgemm operation
- One of the matrices is in the GPU cache; the other matrix is not reused
- A matrix can be in the GPU cache from previous operations, or on the host

MPI SCALING RESULTS
(Chart: performance in GFlop/s versus number of MPI ranks, with 10 cores and 1 GPU per rank, for the Serena and Audi matrices on CPU and GPU)

MPI SCALING ON BLUE WATERS
(Chart: factorization performance versus number of CPU cores, 16 to 4096, for matrices M10 and M20 on CPU and GPU, with speedup callouts of 2.7x and 3.4x)
Blue Waters, AMD Interlagos: 16 CPU cores vs. 8 CPU cores + 1 K20 GPU

FUTURE WORK
- Share the work with the CPU: for many models it is now idle, as almost all work is sent to the GPU
- Multi-GPU support
- Tuning to automatically set off-load cutoffs for a variety of systems

ACKNOWLEDGEMENTS
- The Private Sector Program at NCSA
- The Blue Waters project, supported by NSF (award number OCI 07-25070) and the state of Illinois