Modern GPUs (Graphics Processing Units)

Modern GPUs (Graphics Processing Units). A powerful data-parallel computation platform: high computation density, high memory bandwidth, relatively low cost. Example: the NVIDIA GTX 580, with 512 cores, ~1.6 TFLOPS, 1.5 GB of memory, and ~200 GB/s bandwidth, at $499.

GPU for Scientific Computing [NVIDIA]

GPU-based QUIC-SVD Algorithm. Department of Computer Science.

Introduction. Singular Value Decomposition (SVD) of large matrices; low-rank approximation. QUIC-SVD: fast SVD using cosine trees, by Michael Holmes, Alexander Gray, and Charles Isbell, NIPS 2008.
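
For reference, the target of low-rank approximation is the truncated SVD: keeping the $k$ largest singular values gives the best rank-$k$ approximation of $A$ in the least-squares sense (Eckart-Young),

$$A \approx U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}.$$

QUIC-SVD builds an orthogonal basis whose span captures $A$ to within a user-specified error, without ever computing the full decomposition.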

QUIC-SVD: Fast SVD Using Cosine Trees [Holmes et al. 2008]

QUIC-SVD: Fast SVD Using Cosine Trees [Holmes et al. 2008] Start from a root node that owns the entire matrix.

QUIC-SVD: Fast SVD Using Cosine Trees [Holmes et al. 2008] Select a pivot row by sampling from the squared-length distribution of the rows; compute the inner product of every row with the pivot row.

QUIC-SVD: Fast SVD Using Cosine Trees [Holmes et al. 2008] The inner-product results determine how the rows are partitioned into two subsets, each represented by a child node.
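
A minimal NumPy sketch of one such split, assuming the steps just described; the exact partition rule in the paper may differ (here we simply split at the midpoint of the inner-product range for illustration):

```python
import numpy as np

def split_node(rows):
    """One cosine-tree split (illustrative sketch).

    rows: (n, d) array of the row vectors owned by this node.
    Returns the index sets of the two child nodes."""
    # Sample a pivot row with probability proportional to squared length.
    sq_len = np.einsum("ij,ij->i", rows, rows)
    pivot = np.random.choice(len(rows), p=sq_len / sq_len.sum())

    # Inner product of every row with the pivot row: one mat-vec,
    # which is exactly what the GPU implementation parallelizes.
    ips = rows @ rows[pivot]

    # Partition the rows by their inner product with the pivot
    # (midpoint split here; the paper's rule may differ).
    mid = 0.5 * (ips.min() + ips.max())
    return np.flatnonzero(ips <= mid), np.flatnonzero(ips > mid)
```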

QUIC-SVD: Fast SVD Using Cosine Trees [Holmes et al. 2008] Compute the row mean (centroid) of each subset and add it to the current basis set (keeping the basis orthogonalized).

QUIC-SVD: Fast SVD Using Cosine Trees [Holmes et al. 2008] Repeat until the estimated error falls below a threshold. The final basis set then provides a good low-rank approximation.
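
Putting the steps together, the overall loop has roughly the following shape. This is a hypothetical sketch, not the paper's exact control flow: estimated_error, pick_leaf, orthogonalize_and_append, and svd_from_basis are assumed helper names, and split_node is the sketch above.

```python
import numpy as np

def quic_svd_sketch(A, eps, max_rank):
    """High-level shape of the QUIC-SVD loop (hypothetical sketch)."""
    basis = np.empty((0, A.shape[1]))       # orthonormal basis, one row each
    leaves = [np.arange(A.shape[0])]        # each leaf owns a set of row indices
    while estimated_error(A, basis) > eps and len(basis) < max_rank:
        node = leaves.pop(pick_leaf(leaves))        # leaf with largest error estimate
        left, right = split_node(A[node])           # see the sketch above
        for child in (node[left], node[right]):
            c = A[child].mean(axis=0)               # centroid of the subset
            basis = orthogonalize_and_append(basis, c)  # keep basis orthogonal
            leaves.append(child)
    # Project A onto span(basis) and take a small exact SVD to recover U, S, V.
    return svd_from_basis(A, basis)
```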

GPU-based Implementation of QUIC-SVD. Most computation time is spent on splitting nodes. The computationally expensive steps are computing vector inner products, computing row means (centroids), and basis orthogonalization. Key: find enough computation to keep the GPU busy!

Parallelize Vector Inner Products and Row Means. Compute all inner products and row means in parallel; there are enough row vectors to keep the GPU busy.
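
These per-split operations reduce to dense BLAS-style computations, which is what makes them GPU-friendly. A NumPy sketch with illustrative sizes (on the GPU these would become cuBLAS calls or custom kernels):

```python
import numpy as np

rows = np.random.rand(8192, 4096)    # row vectors owned by the node being split
pivot = rows[0]                      # previously selected pivot row

ips = rows @ pivot                   # all inner products in one mat-vec
left = ips <= np.median(ips)         # illustrative partition mask
mean_left = rows[left].mean(axis=0)      # centroid of one subset
mean_right = rows[~left].mean(axis=0)    # centroid of the other
```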

Parallelize Gram-Schmidt Orthogonalization. Classical Gram-Schmidt: assume the vectors $u_1, u_2, \ldots, u_k$ are already orthogonalized; to orthogonalize a new vector $v$ with respect to them:

$$u_{k+1} = v - \frac{\langle v, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1 - \frac{\langle v, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2 - \cdots - \frac{\langle v, u_k \rangle}{\langle u_k, u_k \rangle} u_k$$

This is easy to parallelize (each projection is computed independently), but it has poor numerical stability due to rounding errors.
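
A minimal sketch of the classical step: every projection coefficient uses the original $v$, so all of them can be computed in one batched operation.

```python
import numpy as np

def classical_gs_step(U, v):
    """Classical Gram-Schmidt: orthogonalize v against the rows of U
    (assumed already orthogonal). All projections use the original v,
    so they batch into one mat-vec -- GPU-friendly, numerically fragile."""
    coeffs = (U @ v) / np.einsum("ij,ij->i", U, U)   # all <v,u_i>/<u_i,u_i>
    return v - coeffs @ U                            # subtract every projection at once
```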

Parallelize Gram-Schmidt Orthogonalization. Modified Gram-Schmidt:

$$v^{(1)} = v - \frac{\langle v, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1, \qquad v^{(2)} = v^{(1)} - \frac{\langle v^{(1)}, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2, \qquad v^{(3)} = v^{(2)} - \frac{\langle v^{(2)}, u_3 \rangle}{\langle u_3, u_3 \rangle} u_3, \qquad \ldots$$

$$u_{k+1} = v^{(k-1)} - \frac{\langle v^{(k-1)}, u_k \rangle}{\langle u_k, u_k \rangle} u_k$$

This has good numerical stability, but it is hard to parallelize because the computation is sequential: each step operates on the result of the previous one.
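
The corresponding sketch of the modified step makes the sequential dependency explicit:

```python
import numpy as np

def modified_gs_step(U, v):
    """Modified Gram-Schmidt: each projection uses the partially
    orthogonalized vector, so the loop is inherently sequential --
    numerically stable, but a poor fit for the GPU."""
    for u in U:
        v = v - (np.dot(v, u) / np.dot(u, u)) * u
    return v
```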

Parallelize Gram-Schmidt Orthogonalization. Our approach: blocked Gram-Schmidt. Project against blocks of $m$ basis vectors at a time, classically (in parallel) within a block and sequentially across blocks:

$$v^{(1)} = v - \frac{\langle v, u_1 \rangle}{\langle u_1, u_1 \rangle} u_1 - \frac{\langle v, u_2 \rangle}{\langle u_2, u_2 \rangle} u_2 - \cdots - \frac{\langle v, u_m \rangle}{\langle u_m, u_m \rangle} u_m$$

$$v^{(2)} = v^{(1)} - \frac{\langle v^{(1)}, u_{m+1} \rangle}{\langle u_{m+1}, u_{m+1} \rangle} u_{m+1} - \frac{\langle v^{(1)}, u_{m+2} \rangle}{\langle u_{m+2}, u_{m+2} \rangle} u_{m+2} - \cdots - \frac{\langle v^{(1)}, u_{2m} \rangle}{\langle u_{2m}, u_{2m} \rangle} u_{2m}$$

$$u_{k+1} = v^{(2)} - \frac{\langle v^{(2)}, u_{2m+1} \rangle}{\langle u_{2m+1}, u_{2m+1} \rangle} u_{2m+1} - \frac{\langle v^{(2)}, u_{2m+2} \rangle}{\langle u_{2m+2}, u_{2m+2} \rangle} u_{2m+2} - \cdots - \frac{\langle v^{(2)}, u_k \rangle}{\langle u_k, u_k \rangle} u_k$$

This hybrid approach proves to be both numerically stable and GPU-friendly.
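
The blocked step composes the two previous sketches; the block size m is a tunable assumption, not a value given in the slides.

```python
def blocked_gs_step(U, v, m=64):
    """Blocked Gram-Schmidt (sketch): classical (parallel) within each
    block of m basis vectors, modified (sequential) across blocks.
    Reuses classical_gs_step from the sketch above."""
    for start in range(0, len(U), m):
        v = classical_gs_step(U[start:start + m], v)
    return v
```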

Partitioned QUIC-SVD. As the advantage of exploiting the GPU is only apparent for large-scale problems, we need to test our algorithm on large matrices (>10,000 x 10,000). For dense matrices, this quickly becomes an out-of-core problem. We modified our algorithm to use a matrix-partitioning scheme, so that each partition fits in GPU memory.
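
The slides show the partitioning scheme only as a figure; one plausible arrangement is to stream row partitions through GPU memory, as in this hypothetical sketch:

```python
import numpy as np

def partitioned_inner_products(A, pivot, block_rows=2048):
    """Out-of-core pattern (hypothetical sketch): stream row partitions
    that fit in GPU memory, compute on each, concatenate the results.
    The actual scheme in the implementation may differ."""
    out = []
    for start in range(0, A.shape[0], block_rows):
        block = A[start:start + block_rows]   # this partition fits on the GPU
        out.append(block @ pivot)             # compute on the resident partition
    return np.concatenate(out)
```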

Partitioned QUIC-SVD [figure: illustration of the matrix-partitioning scheme]

Performance. We generate test data by multiplying two randomly generated left and right matrices, resulting in a low-rank matrix. We set the termination error in QUIC-SVD to $10^{-6}$, which forces the algorithm to produce accurate results. We compare 1) MATLAB's svds; 2) a CPU implementation of QUIC-SVD; 3) a GPU implementation of QUIC-SVD. The CPU version is considerably optimized using Intel MKL and tested on an Intel Core i7; the GPU version is tested on an NVIDIA GTX 480.
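
Low-rank test data of the kind described above can be generated as follows (the size and rank here are illustrative, not the slides' exact settings):

```python
import numpy as np

n, r = 10_000, 200
L = np.random.rand(n, r)    # random left factor
R = np.random.rand(r, n)    # random right factor
A = L @ R                   # n x n test matrix of rank at most r
```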

Performance [charts comparing MATLAB's svds, the CPU QUIC-SVD, and the GPU QUIC-SVD]

Integration with Manifold Alignment Framework [figure]

Conclusion and Future Work. Our initial effort has shown that a GPU-based implementation of QUIC-SVD can achieve reasonable speedup; with additional optimizations, a 10X speedup is foreseeable. Future work: handle sparse matrices; turn components from this project into fundamental building blocks for other algorithms (random projection trees, diffusion wavelets, etc.); design new algorithms that are suitable for data-parallel computation.


Programming on the GPU. General-purpose GPU programming languages: CUDA, OpenCL, DirectCompute, BrookGPU. [NVIDIA]