A GPU-based Approximate SVD Algorithm. Blake Foster, Sridhar Mahadevan, Rui Wang


A GPU-based Approximate SVD Algorithm. Blake Foster, Sridhar Mahadevan, Rui Wang. University of Massachusetts Amherst.

Introduction: Singular Value Decomposition (SVD). A: m × n matrix (m ≥ n); U, V: orthogonal matrices; Σ: diagonal matrix. The exact SVD can be computed in O(mn²) time.
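For reference, the factorization referred to here can be written out as follows (standard definition, stated in the thin form; not taken verbatim from the slides):

```latex
% Thin SVD of a tall matrix A (standard statement, added for reference).
% U is m x n with orthonormal columns, Sigma is n x n diagonal, V is n x n orthogonal.
\[
  A = U \Sigma V^{T}, \qquad
  \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n), \quad
  \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0 .
\]
```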

Introduction: Approximate SVD. A: m × n matrix whose intrinsic dimensionality is far less than n. It is usually sufficient to compute the k (k ≪ n) largest singular values and the corresponding singular vectors. An approximate SVD can be computed in O(mnk) time (compared to O(mn²) for the full SVD).
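The rank-k approximation alluded to here is the truncated SVD (standard notation, added for reference):

```latex
% Rank-k truncated SVD: keep only the k largest singular values and vectors.
\[
  A \;\approx\; A_k \;=\; U_k \Sigma_k V_k^{T},
  \qquad U_k \in \mathbb{R}^{m \times k},\;
  \Sigma_k \in \mathbb{R}^{k \times k},\;
  V_k \in \mathbb{R}^{n \times k},\; k \ll n .
\]
```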

Introduction: Sample-based Approximate SVD. Construct a basis V by directly sampling rows of A or by taking linear combinations of rows of A; an approximate SVD can then be extracted from the projection of A onto V. Sampling methods: length-squared sampling [Frieze et al. 2004, Deshpande et al. 2006]; random projection [Friedland et al. 2006, Sarlos et al. 2006, Rokhlin et al. 2009].
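As a concrete illustration of this family of methods (not QUIC-SVD itself), here is a minimal NumPy sketch of the random-projection flavor: form a small number of random linear combinations of rows, orthonormalize them into a basis, and extract an approximate SVD from the projection of A onto that basis. The function name, the Gaussian mixing, and the oversampling parameter are illustrative assumptions.

```python
import numpy as np

def sample_based_svd(A, k, oversample=10, rng=None):
    """Illustrative sample-based approximate SVD via random row mixing.

    Builds a basis from random linear combinations of the rows of A, then
    extracts an approximate SVD from the projection of A onto that basis.
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape
    r = min(n, k + oversample)
    Y = rng.standard_normal((r, m)) @ A     # r random linear combinations of rows
    Q, _ = np.linalg.qr(Y.T)                # n x r orthonormal basis for the sampled row space
    B = A @ Q                               # projection of A onto the basis (m x r)
    U, S, Wt = np.linalg.svd(B, full_matrices=False)
    V = Q @ Wt.T                            # rotate the basis to the right singular vectors
    return U[:, :k], S[:k], V[:, :k]

# Example: a rank-50 test matrix is recovered almost exactly.
A = np.random.randn(1000, 50) @ np.random.randn(50, 400)
U, S, V = sample_based_svd(A, k=50)
print(np.linalg.norm(A - (U * S) @ V.T) / np.linalg.norm(A))
```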

Introduction: QUIC-SVD [Holmes et al., NIPS 2008]. A sample-based approximate SVD that progressively constructs a basis V using a cosine tree. The result has controllable L2 error: the basis construction adapts to the matrix and to the desired error.

Our Goal: map QUIC-SVD onto the GPU (Graphics Processing Unit). GPUs are a powerful data-parallel computation platform: high computation density, high memory bandwidth, relatively low cost. For example, the NVIDIA GTX 580: 512 cores, 1.6 TFLOPS, 3 GB memory, 200 GB/s bandwidth, $549.

Related Work. OpenGL-based (rasterization, framebuffers, shaders): [Galoppo et al. 2005], [Bondhugula et al. 2006]. CUDA-based: matrix factorizations [Volkov et al. 2008]; QR decomposition [Kerr et al. 2009]; SVD [Lahabar et al. 2009]; a commercial GPU linear algebra package [CULA 2010].

Our Work. A CUDA implementation of QUIC-SVD, with up to 6-7x speedup over an optimized CPU version, and a partitioned version that can process large, out-of-core matrices on a single GPU. The largest matrices we tested are 22,000 x 22,000 (double-precision floating point).

Overview of QUIC-SVD [Holmes et al. 2008]. (Figure: the matrix A and the basis set V.)

Overview of QUIC-SVD [Holmes et al. 2008]. Start from a root node that owns the entire matrix (all rows).

Overview of QUIC-SVD [Holmes et al. 2008]. Select a pivot row by importance sampling on the squared magnitudes of the rows.

Overview of QUIC-SVD [Holmes et al. 2008]. Compute the inner product of every row with the pivot row (this is why it is called a cosine tree).

Overview of QUIC-SVD [Holmes et al. 2008]. Based on the inner products, the rows are partitioned into two subsets (those whose inner products are closer to zero, and the rest), each represented by a child node.

Overview of QUIC-SVD [Holmes et al. 2008]. Compute the row mean (average) of each subset and add it to the current basis set (which is kept orthogonalized).

Overview of QUIC-SVD [Holmes et al. 2008]. Repeat until the estimated L2 error is below a threshold. The basis set then provides a good low-rank approximation.

Overview of QUIC-SVD. Monte Carlo error estimation. Input: any subset of rows A_s, the basis V, and δ ∈ [0, 1]. Output: an error estimate ê such that, with probability at least 1 − δ, the squared L2 error satisfies ‖A − A V Vᵀ‖² ≤ ê. Termination criterion: ê ≤ ε ‖A‖², i.e. the estimated error is at most an ε fraction of the squared norm of A.
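A minimal sketch of the kind of Monte Carlo estimate described here, assuming rows are importance-sampled in proportion to their squared magnitude; the exact estimator and the (1 − δ) confidence bound used by QUIC-SVD are not reproduced, and the function name is illustrative.

```python
import numpy as np

def mc_residual_estimate(A, V, n_samples=50, rng=None):
    """Monte Carlo estimate of ||A - A V V^T||_F^2.

    Rows are importance-sampled with probability proportional to their
    squared magnitude, which makes the weighted average below an unbiased
    estimate of the total squared residual.
    """
    rng = np.random.default_rng(rng)
    row_sq = np.einsum('ij,ij->i', A, A)      # squared row magnitudes
    p = row_sq / row_sq.sum()                 # sampling distribution
    idx = rng.choice(A.shape[0], size=n_samples, p=p)
    rows = A[idx]
    resid = rows - (rows @ V) @ V.T           # residual after projection onto V
    resid_sq = np.einsum('ij,ij->i', resid, resid)
    # Importance weighting: E[resid_sq / p] equals the total squared residual.
    return np.mean(resid_sq / p[idx])
```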

Overview of QUIC-SVD: five steps. 1. Select the leaf node with the maximum estimated L2 error. 2. Split that node and create two child nodes. 3. Compute the mean vector of each child and insert it into the basis set V (replacing the parent's vector), keeping the basis orthogonalized. 4. Estimate the error of the newly added child nodes. 5. Repeat steps 1-4 until the termination criterion is satisfied. (A compact sketch of this loop follows.)
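Putting the tree construction and this loop together, below is a simplified NumPy sketch of the idea. It is not the authors' CUDA implementation: for brevity it evaluates each node's residual error exactly instead of with the Monte Carlo estimate, it appends the child means without removing the parent's basis vector, and the names (Node, build_basis, etc.) are illustrative.

```python
import heapq
import numpy as np

class Node:
    """A cosine-tree node owning a subset of the rows of A (by index)."""
    def __init__(self, rows):
        self.rows = rows
        self.error = 0.0

def split(A, node, rng):
    """Importance-sample a pivot row, then partition the node's rows by their
    inner product with the pivot (rows closest to zero vs. the rest)."""
    rows = node.rows
    sq = np.einsum('ij,ij->i', A[rows], A[rows])
    pivot = rows[rng.choice(len(rows), p=sq / sq.sum())]
    dots = np.abs(A[rows] @ A[pivot])
    order = np.argsort(dots)                  # rows with smallest |dot| first
    half = len(rows) // 2
    return Node(rows[order[:half]]), Node(rows[order[half:]])

def add_to_basis(V, v):
    """Orthogonalize v against the columns of V (plain Gram-Schmidt) and append it."""
    if V.shape[1] > 0:
        v = v - V @ (V.T @ v)
    norm = np.linalg.norm(v)
    return V if norm < 1e-12 else np.hstack([V, (v / norm)[:, None]])

def node_error(A, node, V):
    """Exact squared residual of the node's rows (stands in for the MC estimate)."""
    R = A[node.rows]
    resid = R - (R @ V) @ V.T
    return float(np.sum(resid * resid))

def build_basis(A, eps=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    m, n = A.shape
    V = np.zeros((n, 0))
    root = Node(np.arange(m))
    root.error = node_error(A, root, V)
    heap = [(-root.error, 0, root)]           # max-heap keyed on node error
    total = float(np.sum(A * A))
    total_err, counter = root.error, 1
    while heap and total_err > eps * total and V.shape[1] < min(m, n):
        _, _, node = heapq.heappop(heap)      # 1. leaf with maximum estimated error
        total_err -= node.error
        if len(node.rows) < 2:                # a single row: just add it to the basis
            V = add_to_basis(V, A[node.rows[0]])
            continue
        left, right = split(A, node, rng)     # 2. split into two child nodes
        for child in (left, right):           # 3. add each child's row mean to the basis
            V = add_to_basis(V, A[child.rows].mean(axis=0))
        for child in (left, right):           # 4. estimate the error of the new children
            child.error = node_error(A, child, V)
            total_err += child.error
            heapq.heappush(heap, (-child.error, counter, child))
            counter += 1
    return V                                  # 5. stop once the error criterion is met
```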

GPU Implementation of QUIC-SVD. The computational cost is mostly in constructing the tree. The most expensive steps are computing vector inner products and row means, and basis orthogonalization.

Parallelize Vector Inner Products and Row Means. Compute all inner products and row means in parallel. Assuming a large matrix, there are enough rows to engage all GPU threads. An index array is used to point to the physical rows.
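Both reductions map to one data-parallel pass over the node's rows. A small NumPy illustration of the layout described here (the index array selects the node's physical rows, so the matrix itself is never reordered; on the GPU each row would be handled by its own thread or thread block; the function name and the simplified partition rule are illustrative):

```python
import numpy as np

def node_inner_products_and_means(A, row_idx, pivot_idx):
    """Per-node reductions, driven by an index array into the physical rows.

    On the GPU these are data-parallel over rows: one matrix-vector product
    for all the inner products, then two masked row reductions for the means.
    (The partition rule is simplified here; the tree uses the cosine-tree split.)
    """
    rows = A[row_idx]                    # gather via the index array, without reordering A
    dots = rows @ A[pivot_idx]           # inner product of every row with the pivot
    mask = np.abs(dots) < np.median(np.abs(dots))
    return dots, rows[mask].mean(axis=0), rows[~mask].mean(axis=0)
```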

Parallelize Gram-Schmidt Orthogonalization. Classical Gram-Schmidt: assume v_1, v_2, ..., v_n are in the current basis set (i.e. they are orthonormal vectors). To add a new vector r:

r' = r − proj(r, v_1) − proj(r, v_2) − ... − proj(r, v_n), where proj(r, v) = (r · v) v.

This is easy to parallelize (each projection is computed independently), but it has poor numerical stability.

Parallelize Gram-Schmidt Orthogonalization. Modified Gram-Schmidt:

r_1 = r − proj(r, v_1), r_2 = r_1 − proj(r_1, v_2), ..., r_k = r_{k−1} − proj(r_{k−1}, v_k), ..., up to r_n.

This has good numerical stability, but it is hard to parallelize, because the computation is sequential.

Parallelize Gram-Schmidt Orthogonalization. Our approach: blocked Gram-Schmidt, with block size m:

v^(1) = v − proj(v, u_1) − proj(v, u_2) − ... − proj(v, u_m)
v^(2) = v^(1) − proj(v^(1), u_{m+1}) − proj(v^(1), u_{m+2}) − ... − proj(v^(1), u_{2m})
...
and so on, block by block, until all current basis vectors u_1, ..., u_k have been processed; the final residual is the orthogonalized vector.

This hybrid approach combines the advantages of the previous two: it is numerically stable and GPU-friendly.
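A compact NumPy sketch of this blocked scheme (block size m is a tuning parameter; the existing orthonormal basis is assumed to be stored as the columns of U). Within a block, all projections are subtracted with one small matrix product, which is the GPU-friendly part; blocks are processed one after another, which preserves the stability of modified Gram-Schmidt:

```python
import numpy as np

def blocked_gram_schmidt(U, v, m=8):
    """Orthogonalize v against the orthonormal columns of U, m columns at a time.

    Within each block the projections are computed together as a small matrix
    product (parallel, like classical Gram-Schmidt); blocks are applied one
    after another (sequential, like modified Gram-Schmidt).
    """
    k = U.shape[1]
    for start in range(0, k, m):
        block = U[:, start:start + m]       # the next m basis vectors
        v = v - block @ (block.T @ v)       # subtract all their projections at once
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```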

Extract the Final Result. Input: matrix A, subspace basis V. Output: U, Σ, V, the best approximate SVD of A within the subspace spanned by V. Algorithm: compute AV, then (AV)ᵀ(AV), and its SVD U′Σ′V′ᵀ = (AV)ᵀ(AV); this is the SVD of a small k × k matrix (k ≪ n). Then let V ← VV′, Σ ← (Σ′)^(1/2), U ← (AV)V′Σ⁻¹, and return U, Σ, V.
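The extraction step above translates almost line for line into NumPy; a sketch, assuming V has orthonormal columns and AV has full column rank:

```python
import numpy as np

def extract_svd(A, V):
    """Extract the approximate SVD of A restricted to the subspace basis V
    (V is n x k with orthonormal columns), following the steps above."""
    AV = A @ V                            # m x k
    M = AV.T @ AV                         # small k x k matrix
    U_p, S_p, _ = np.linalg.svd(M)        # U' S' V'^T = (AV)^T (AV); U' == V' since M is symmetric
    S = np.sqrt(S_p)                      # Sigma = (Sigma')^(1/2)
    V_out = V @ U_p                       # V <- V V'
    U_out = (AV @ U_p) / S                # U <- (AV) V' Sigma^(-1)
    return U_out, S, V_out
```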

Partitioned QUIC-SVD. The advantage of the GPU is only obvious for large-scale problems, e.g. matrices larger than 10,000². For dense matrices, this quickly becomes an out-of-core problem. We modified our algorithm to use a matrix partitioning scheme, so that each partition fits in GPU memory.

Partitioned QUIC-SVD. Partition the matrix into fixed-size blocks A_1, A_2, ..., A_s.
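The slides do not spell out the partitioned algorithm, but the general out-of-core pattern is to stream fixed-size blocks through GPU memory and accumulate results. A hedged NumPy illustration of that pattern for the core product AV (the block size and scheduling in the actual implementation may differ):

```python
import numpy as np

def blocked_matmul(A, V, block_rows=4096):
    """Compute A @ V one row block at a time, so that only one block of A
    needs to be resident in (GPU) memory at once. Illustrative pattern only."""
    m = A.shape[0]
    out = np.empty((m, V.shape[1]), dtype=A.dtype)
    for start in range(0, m, block_rows):
        stop = min(start + block_rows, m)
        block = A[start:stop]        # on the GPU: copy this block to device memory
        out[start:stop] = block @ V  # multiply, then write the result back
    return out
```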

Partitioned QUIC-SVD. The termination criterion is still guaranteed.

Implementation Details. The main implementation is done in CUDA. We use the CULA library to extract the final SVD (this involves processing a k × k matrix, where k ≪ n). Selection of the splitting node is done with a priority queue maintained on the CPU side. Source code will be made available on our site: http://graphics.cs.umass.edu

Performance. Test environment: CPU: Intel Core i7 2.66 GHz (8 hardware threads), 6 GB memory; GPU: NVIDIA GTX 480. Test data: given a rank k and matrix size n, we generate an n × k left matrix and a k × n right matrix, both filled with random numbers; their product gives the test matrix. Result accuracy: both the CPU and GPU versions use double precision; termination error ε = 10⁻¹², Monte Carlo estimation δ = 10⁻¹².
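The test matrices described here are straightforward to reproduce; a short NumPy equivalent (the sizes and random number generator are arbitrary choices):

```python
import numpy as np

def make_test_matrix(n, k, rng=None):
    """Rank-k test matrix built as described on the slide: an n x k left
    factor times a k x n right factor, both filled with random numbers."""
    rng = np.random.default_rng(rng)
    left = rng.standard_normal((n, k))
    right = rng.standard_normal((k, n))
    return left @ right          # n x n matrix with rank (at most) k

A = make_test_matrix(n=4096, k=64, rng=0)
```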

Performance. Comparisons: 1. Matlab's svds; 2. CPU-based QUIC-SVD (optimized, multi-threaded, implemented using the Intel Math Kernel Library); 3. GPU-based QUIC-SVD; 4. Tygert SVD (an approximate SVD algorithm based on random projection, implemented in Matlab).

Performance: running time (chart).

Performance: speedup factors (charts).

Performance: CPU vs. GPU QUIC-SVD, running time and speedup (chart).

Performance: GPU QUIC-SVD vs. Tygert SVD, running time and speedup (chart).

Integration with Manifold Alignment Framework. (Figure.)

Conclusion. A fast GPU-based implementation of an approximate SVD algorithm, with reasonable (up to 6-7x) speedup over an optimized CPU implementation and over Tygert SVD. Our partitioned version allows processing out-of-core matrices.

Future Work. Support for sparse matrices. Applications in solving large graph Laplacians. Components from this project can become building blocks for other algorithms: random projection trees, diffusion wavelets, etc.

Acknowledgements. Alexander Gray; the PPAM reviewers; NSF Grant FODAVA-1025120.