QR Decomposition on GPUs

Slide 1
QR Decomposition | QR Algorithms | Block Householder QR
Andrew Kerr*¹, Dan Campbell¹, Mark Richards²
¹ Georgia Tech Research Institute
² School of Electrical and Computer Engineering, Georgia Institute of Technology
March 8, 2009, GPGPU '09
This work was supported in part by DARPA and AFRL under contracts FA and FA C. The opinions expressed are those of the authors.

Slide 2: Outline
1. QR Decomposition
2. QR Algorithms
   - Algorithms
   - Householder Reflections
   - GPU Implementation
3. Block Householder QR
   - Problem
   - Algorithm
   - Implementation
   - Performance

Slide 3: QR Decomposition
Matrix factorization: $A = QR$
- $Q^T Q = I$, $Q$ is unitary
- $R$ is upper triangular
- $O(N^3)$

Slide 4: Applications of QR
QR decomposition is used to compute:
- least squares
- other matrix factorizations (Toeplitz, SVD)
- an orthogonal basis for a collection of vectors
- matrix eigendecomposition

Slide 5: QR on GPUs
Challenges of QR decomposition on GPUs:
- parallel computations require fine-grain synchronization and communication
- divergent control flow
- low compute intensity
- GPUs lack large caches and have high memory latencies

Slide 6: QR Algorithms
Modified Gram-Schmidt (MGS)
- computes $A = Q_1 R_1$ directly by solving normal equations
- parallel blocked QR via MGS is unstable
QR via Orthogonal Transformations
- orthogonal transformations triangularize $A$
- Givens: compute a rotation for each element below the main diagonal to place zeros
- Householder: compute a reflection for each column to place zeros

Slide 7: Parallel QR via Householder Reflections
Approach
- basic linear algebra procedures are parallel and perform well on GPUs
- large problem sizes minimize kernel launch overhead
Constraints
- the algorithm consists mostly of matrix-vector operations

Slide 8: Householder Reflections
Compute a Householder reflection $P$ from a vector $x$:
$v = x \pm \|x\|_2 e_1$
$P = I - \frac{2}{v^T v} v v^T$
such that $P x = \mp \|x\|_2 e_1$.
$PA$ may be computed without explicitly forming $P$:
$PA = \left(I - \frac{2}{v^T v} v v^T\right) A = A - \frac{2}{v^T v} v (v^T A)$
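The reflection on this slide can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' GPU code: the names `house` and `apply_reflection` are hypothetical, matrices are stored as lists of rows, and the sign in $v = x \pm \|x\|_2 e_1$ is chosen to match the sign of $x(1)$, a common choice that avoids cancellation.

```python
import math

def house(x):
    """Return (v, beta) so that P = I - beta * v v^T maps x onto a multiple of e_1."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    v = list(x)
    # Assumed sign choice: add sign(x[0]) * ||x|| so the sum cannot cancel.
    v[0] += math.copysign(norm_x, x[0])
    vtv = sum(vi * vi for vi in v)
    beta = 2.0 / vtv if vtv > 0.0 else 0.0
    return v, beta

def apply_reflection(v, beta, A):
    """Compute P A = A - beta * v (v^T A) without forming P."""
    rows, cols = len(A), len(A[0])
    # w = v^T A: one dot product per column of A.
    w = [sum(v[i] * A[i][j] for i in range(rows)) for j in range(cols)]
    # Rank-1 update: subtract beta * v * w from A.
    return [[A[i][j] - beta * v[i] * w[j] for j in range(cols)] for i in range(rows)]

x = [3.0, 4.0, 0.0]
v, beta = house(x)
# Treat x as a 3-by-1 matrix; the result is approximately [[-5.0], [0.0], [0.0]].
Px = apply_reflection(v, beta, [[xi] for xi in x])
```

Here $\|x\|_2 = 5$ and the plus sign is taken, so $Px = -5 e_1$. Only a matrix-vector product and a rank-1 update are needed, which is why the unblocked algorithm is dominated by matrix-vector operations.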

Slides 9-16: Householder Reflections (step by step, $A$ with 5 rows)

Step 1, operating on $A$:
$x_1 = A(:, 1)$, $v_1 = \begin{bmatrix} x_1(1) \pm \|x_1\|_2 \\ x_1(2:) \end{bmatrix}$
$P = I_5 - \frac{2}{v_1^T v_1} v_1 v_1^T$, $P_1 = \begin{bmatrix} I_0 & 0 \\ 0 & P \end{bmatrix} = P$

Step 2, operating on $P_1 A$ (stored in place of $A$):
$x_2 = A(2:, 2)$, $v_2 = \begin{bmatrix} x_2(1) \pm \|x_2\|_2 \\ x_2(2:) \end{bmatrix}$
$P = I_4 - \frac{2}{v_2^T v_2} v_2 v_2^T$, $P_2 = \begin{bmatrix} I_1 & 0 \\ 0 & P \end{bmatrix}$

Step 3, operating on $P_2 P_1 A$:
$x_3 = A(3:, 3)$, $v_3 = \begin{bmatrix} x_3(1) \pm \|x_3\|_2 \\ x_3(2:) \end{bmatrix}$
$P = I_3 - \frac{2}{v_3^T v_3} v_3 v_3^T$, $P_3 = \begin{bmatrix} I_2 & 0 \\ 0 & P \end{bmatrix}$

Step 4, operating on $P_3 P_2 P_1 A$:
$x_4 = A(4:, 4)$, $v_4 = \begin{bmatrix} x_4(1) \pm \|x_4\|_2 \\ x_4(2:) \end{bmatrix}$
$P = I_2 - \frac{2}{v_4^T v_4} v_4 v_4^T$, $P_4 = \begin{bmatrix} I_3 & 0 \\ 0 & P \end{bmatrix}$

Slide 17: Householder Reflections
After four reflections, $P_4 P_3 P_2 P_1 A$ is upper triangular.
- Accumulate $Q$: $Q = P_1^T P_2^T P_3^T P_4^T$
- Triangularize $A$ in place: $A = (P_1^T P_2^T P_3^T P_4^T)(P_4 P_3 P_2 P_1 A)$
- $A = QR$, with $Q$ orthogonal and $R$ upper triangular
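The full sequence of reflections can be written as one loop. The following is a minimal CPU sketch in standard-library Python, assuming dense list-of-rows storage and at least as many rows as columns. The function name `householder_qr` and the sign choice for $v$ are assumptions for illustration, not taken from the slides.

```python
import math

def householder_qr(A):
    """Unblocked Householder QR of an m-by-n matrix (list of rows), m >= n.

    Returns (Q, R) with Q m-by-m orthogonal and R m-by-n upper triangular.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]  # work on a copy; R is triangularized in place
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(min(n, m - 1)):
        # x_j = R(j:, j), v_j = x_j + sign(x_j(1)) * ||x_j|| * e_1
        x = [R[i][j] for i in range(j, m)]
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        beta = 2.0 / sum(vi * vi for vi in v)
        # R <- P_j R on rows j.. and columns j..: R -= beta * v (v^T R)
        for c in range(j, n):
            w = sum(v[i - j] * R[i][c] for i in range(j, m))
            for i in range(j, m):
                R[i][c] -= beta * v[i - j] * w
        # Q <- Q P_j on columns j..: Q -= beta * (Q v) v^T
        for r in range(m):
            w = sum(Q[r][i] * v[i - j] for i in range(j, m))
            for i in range(j, m):
                Q[r][i] -= beta * w * v[i - j]
    return Q, R

A = [[2.0, -1.0, 0.5, 3.0],
     [1.0, 4.0, -2.0, 0.0],
     [0.0, 1.0, 3.0, 1.0],
     [-1.0, 2.0, 1.0, 5.0],
     [3.0, 0.0, -1.0, 2.0]]
Q, R = householder_qr(A)
```

For this 5-by-4 input the loop performs four reflections, matching $P_1$ through $P_4$ above. Because each $P_j$ is symmetric, accumulating $Q \leftarrow Q P_j$ yields $Q = P_1^T P_2^T P_3^T P_4^T$. Entries of `R` below the diagonal are zero only up to rounding.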

Slide 18: Implementation on GPUs
Initial strategy: Householder QR
- matrix dimensions constrained to multiples of 32
- use CUBLAS to compute Householder reflections and vector outer products
- write a kernel with CUDA to do better than CUBLAS's cublasSgemv
  - 128-byte aligned accesses to global memory
  - CUDA grid and block sizes chosen to avoid guard conditionals

Slide 19: Matrix-vector multiply CUDA kernel
[Kernel listing]

Slide 20: Performance of matrix-vector product
[Figure: matrix-vector product, GFLOP/s vs. matrix order (m). Series: gtsgemv GTX280, cublasSgemv GTX280, theoretical GTX280, gtsgemv 9800, cublasSgemv 9800, theoretical 9800GX2]
- TeraFLOP-capable GPU achieves 70 GFLOP/s
- All computations in the Householder algorithm are bandwidth limited, including the norm and the vector outer product
- Custom kernel does significantly better for problem sizes of interest
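A back-of-the-envelope estimate suggests why a bandwidth-limited matrix-vector product lands near 70 GFLOP/s. The sketch below rests on two assumptions that are not on the slide: the matrix is read once from global memory while vector traffic is neglected, and the peak memory bandwidth is the commonly quoted 141.7 GB/s for the GeForce GTX 280.

```latex
\begin{aligned}
\text{FLOPs for } y = Ax,\ A \in \mathbb{R}^{m \times m} &: \quad 2m^2 \\
\text{bytes read (single precision)} &: \quad 4m^2 \\
\text{arithmetic intensity} &: \quad \frac{2m^2}{4m^2} = 0.5\ \text{FLOP/byte} \\
\text{bandwidth-bound rate} &: \quad 0.5\ \tfrac{\text{FLOP}}{\text{byte}} \times 141.7\ \tfrac{\text{GB}}{\text{s}} \approx 70.9\ \text{GFLOP/s}
\end{aligned}
```

Under these assumptions the bound is independent of $m$, so no amount of additional arithmetic throughput helps. This motivates moving the work into matrix-matrix products.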

Slide 21: Problem with Householder QR
Matrix-vector operations are:
- memory bound
- inefficient
- large
- numerous

Slide 22: Solution
- reduce the problem size for matrix-vector operations
- apply reflections in rank-$r$ updates to the identity, for $r > 1$
- let the high performance of the matrix-matrix product offset the cost of an increased FLOP count

Slide 23: A More Efficient Approach
Block Householder Representation
- $P = P_1 P_2 \cdots P_r$, where each $P_j$ is a rank-1 update to $I$
- $P$ may be written as $P = I + W Y^T$, so that $P^T = I + Y W^T$
- $W$ and $Y$ are $m$-by-$r$
- $A \leftarrow P^T A = (I + Y W^T) A = A + Y W^T A$
- $Q \leftarrow Q P = Q (I + W Y^T) = Q + Q W Y^T$
- matrix-matrix multiply is efficient on GPUs
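The product of reflections keeps the form $I + W Y^T$ because each new reflection only appends one column to $W$ and one to $Y$. A short derivation, using the notation $\beta_j = 2 / (v_j^T v_j)$ and $P_j = I - \beta_j v_j v_j^T$ (the indexing is introduced here for illustration):

```latex
\begin{aligned}
(I + W Y^T)\,P_j
  &= (I + W Y^T)\,(I - \beta_j v_j v_j^T) \\
  &= I + W Y^T - \beta_j (I + W Y^T)\, v_j v_j^T \\
  &= I + \begin{bmatrix} W & z \end{bmatrix}
         \begin{bmatrix} Y & v_j \end{bmatrix}^T,
  \qquad z = -\beta_j (I + W Y^T)\, v_j .
\end{aligned}
```

Starting from $W = -\beta_1 v_1$ and $Y = v_1$, which gives $I + W Y^T = P_1$, applying this step for $j = 2, \dots, r$ yields $P_1 P_2 \cdots P_r = I + W Y^T$ with $W$ and $Y$ of size $m$-by-$r$.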

Slides 24-31: Block Householder QR Algorithm

1.) $A = [A_1 \; A_2 \; A_3]$. The input matrix is partitioned into blocks $A_1, A_2, \ldots, A_p$, each with $r$ columns.
2.) A Householder reflection is computed from the first column
3.) and applied to the remaining columns in $A_1$, giving $[P_1 A_1 \; A_2 \; A_3]$.
4.) A Householder reflection is computed from the second column
5.) and applied to the remaining columns in $A_1$, giving $[P_2 P_1 A_1 \; A_2 \; A_3]$.
6.) After $r$ reflections are applied to block $A_1$, $W$ is computed from $Y$. Then the matrix $[A_2 \; A_3 \; \cdots \; A_p]$ and $Q$ are updated according to
    $A_{[2 \ldots p]} \leftarrow A_{[2 \ldots p]} + Y W^T A_{[2 \ldots p]}$
    $Q \leftarrow Q + Q W Y^T$
7.) The result is $[P_2 P_1 A_1 \; P^T A_2 \; P^T A_3]$. Applying the block Householder update $I + Y W^T$ to $A$ is equivalent to performing the first $r$ Householder reflections according to the original algorithm. Problem sizes for the matrix-vector product are much smaller. $Q$ is updated strictly with matrix-matrix products.
8.) Repeat with the next block until all of $A$ is triangularized: $R = [A_1 \; A_2 \; A_3]$.

Slide 32: Block Householder QR Algorithm

Q ← I
partition A into [A_1 A_2 ... A_{n/r}]
for k = 1 to n/r do
    for j = 1 to r do
        s = j + (k − 1) r
        v = house(A_k(s:, j))
        V(:, j) = v
        β(j) = 2 / (v^T v)
        A_k ← (I − β v v^T) A_k
    end for
    compute W from V and β
    [A_{k+1} ... A_{n/r}] ← P^T [A_{k+1} ... A_{n/r}] = (I + Y W^T) [A_{k+1} ... A_{n/r}]
    Q ← Q P = Q (I + W Y^T)
end for
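The blocked algorithm can be prototyped on the CPU before moving it to a GPU. The following standard-library Python sketch follows the loop structure of this slide under several assumptions: list-of-rows storage, `n` divisible by the block width `r`, reflection vectors zero-padded to length `m`, and a sign choice for `v` that avoids cancellation. The helper names `matmul`, `transpose`, and `block_householder_qr` are hypothetical.

```python
import math

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def block_householder_qr(A, r):
    """Blocked Householder QR of an m-by-n matrix, m >= n, n divisible by r."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n // r):
        c0 = k * r                              # first column of block A_k
        W = [[0.0] * r for _ in range(m)]       # m-by-r
        Y = [[0.0] * r for _ in range(m)]       # m-by-r
        for j in range(r):
            s = c0 + j                          # 0-based pivot row and column
            # v = house(A_k(s:, j)), zero-padded above row s
            v = [0.0] * s + [R[i][s] for i in range(s, m)]
            norm_x = math.sqrt(sum(t * t for t in v))
            v[s] += math.copysign(norm_x, v[s])
            vtv = sum(t * t for t in v)
            beta = 2.0 / vtv if vtv > 0.0 else 0.0
            # A_k <- (I - beta v v^T) A_k: matrix-vector work on at most r columns
            for c in range(s, c0 + r):
                w = beta * sum(v[i] * R[i][c] for i in range(s, m))
                for i in range(s, m):
                    R[i][c] -= w * v[i]
            # Append column j: W(:, j) = -beta (I + W Y^T) v, Y(:, j) = v
            ytv = [sum(Y[i][p] * v[i] for i in range(m)) for p in range(j)]
            for i in range(m):
                W[i][j] = -beta * (v[i] + sum(W[i][p] * ytv[p] for p in range(j)))
                Y[i][j] = v[i]
        # Trailing blocks: A <- A + Y (W^T A), two matrix-matrix products
        if c0 + r < n:
            T = [row[c0 + r:] for row in R]
            U = matmul(Y, matmul(transpose(W), T))
            for i in range(m):
                for c in range(c0 + r, n):
                    R[i][c] += U[i][c - c0 - r]
        # Q <- Q + (Q W) Y^T, two matrix-matrix products
        U = matmul(matmul(Q, W), transpose(Y))
        Q = [[Q[i][c] + U[i][c] for c in range(m)] for i in range(m)]
    return Q, R

A = [[4.0, 1.0, -2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [-2.0, 0.0, 3.0, -2.0],
     [2.0, 1.0, -2.0, -1.0],
     [0.0, 3.0, 1.0, 2.0],
     [1.0, -1.0, 2.0, 0.0]]
Q, R = block_householder_qr(A, 2)
```

The inner loop touches only the `r` columns of the current block. Everything applied to the trailing columns and to `Q` goes through `matmul`, which is the part that would map to a GPU matrix-matrix product.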

Slide 33: Computing $P = I + W Y^T$
Compute W and Y from V and β for block k:

Y = V(:, 1)
W = −β(1) V(:, 1)
for j = 2 to r do
    z = −β(j) (I + W Y^T) V(:, j)
    W = [W z]
    Y = [Y V(:, j)]
end for
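The recurrence on this slide can be checked numerically on a small case. The sketch below is standard-library Python and takes `W` starting from $-\beta(1) V(:, 1)$, so that $I + W Y^T$ equals the product $P_1 \cdots P_r$ of reflections $P_j = I - \beta(j) v_j v_j^T$. The function name `wy_from_reflectors` and the column-list storage are assumptions for illustration.

```python
def wy_from_reflectors(V, beta):
    """Build W and Y with P_1 ... P_r = I + W Y^T.

    V is a list of r Householder vectors, each of length m, and
    beta[j] = 2 / (v_j^T v_j). W and Y are returned as lists of columns.
    """
    m = len(V[0])
    Y = [V[0][:]]
    W = [[-beta[0] * vi for vi in V[0]]]
    for j in range(1, len(V)):
        v = V[j]
        # y_dot[k] = Y(:, k)^T v, so (I + W Y^T) v = v + sum_k W(:, k) * y_dot[k]
        y_dot = [sum(Y[k][i] * v[i] for i in range(m)) for k in range(len(Y))]
        z = [-beta[j] * (v[i] + sum(W[k][i] * y_dot[k] for k in range(len(W))))
             for i in range(m)]
        W.append(z)        # W = [W z]
        Y.append(v[:])     # Y = [Y v]
    return W, Y

# Two reflectors in R^3 with v^T v = 9 and 25.
V = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]
beta = [2.0 / 9.0, 2.0 / 25.0]
W, Y = wy_from_reflectors(V, beta)
```

Each step costs one matrix-vector product with the current `W` and `Y`, which have only `r` columns. This is far smaller than the matrix-vector products of the unblocked algorithm.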

Slide 34: GPU Implementation
Improved strategy for GPU implementation:
- blocked Householder algorithm using CUBLAS and a custom matrix-vector kernel
- matrix sizes are multiples of 32
- more rows than columns
- block size of 32 columns
- efficient matrix-matrix product using CUBLAS
- pad loads and stores to ensure alignment and maximize memory bandwidth

Slide 35: Experimental Configuration
Test Platform
- GeForce GTX stream processors
- Intel Core2 Xeon GHz, 6 MB L2 cache per pair of cores
- Intel Math Kernel Library 10 (MKL): sgeqrf(), sormqr()
- Linux x
- CUDA 2.0
- timing excludes transfer between system and GPU memory

Slide 36: Experimental Configuration
Input Test Data
- single-precision, real-valued data
- $A$ initialized to a lower-triangular matrix
  - diagonal elements initialized to 1
  - random values $|a| \le 1$ below the diagonal
- random Givens rotations applied to $A$ to conceal structure
Result Verification
- All results satisfy:
  $\|A - QR\| \le m \, 2^{-23} \, \|A\|$
  $\|Q^T Q - I\| \le m \, 2^{-23}$
- $Q$ is explicitly formed
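A check of this kind is easy to script. The sketch below reads the verification criteria as $\|A - QR\| \le m \, 2^{-23} \|A\|$ and $\|Q^T Q - I\| \le m \, 2^{-23}$ and uses the Frobenius norm. The norm is an assumption, since the slide does not name one, and `verify_qr` is a hypothetical helper. Python floats are double precision, so this only illustrates the test and does not reproduce single-precision rounding.

```python
import math

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def verify_qr(A, Q, R, eps=2.0 ** -23):
    """Return True if both bounds hold (Frobenius norm assumed)."""
    m = len(A)
    QR = matmul(Q, R)
    resid = frobenius([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, QR)])
    QtQ = matmul([list(c) for c in zip(*Q)], Q)
    size = len(QtQ)
    orth = frobenius([[QtQ[i][j] - (i == j) for j in range(size)] for i in range(size)])
    return resid <= m * eps * frobenius(A) and orth <= m * eps

# A 2-by-2 example built from a known rotation Q and triangular R.
Q = [[0.6, -0.8], [0.8, 0.6]]
R = [[5.0, 1.0], [0.0, 2.0]]
A = [[3.0, -1.0], [4.0, 2.0]]
ok = verify_qr(A, Q, R)

# Perturbing one entry of Q breaks orthogonality and the check fails.
Q_bad = [[0.6, -0.8], [0.8, 0.7]]
ok_bad = verify_qr(A, Q_bad, R)
```

The constant $2^{-23}$ is the spacing of single-precision numbers near 1, so the bounds scale the allowed error with the number of rows $m$.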

Slide 37: Performance
[Figure: GFLOP/s vs. matrix rows. Series: GTX, GX2. Peak performance annotated in GFLOP/s]

Slide 38: Runtime Distribution
Table: Runtime in seconds for phases of blocked Householder QR on GPUs
Columns: Operation | GeForce GTX 280 | GeForce GTX 280
Rows:
- Problem size
- Householder, $A_k \leftarrow P A_k$
- WY computation
- $A \leftarrow (I + W Y^T)^T A$
- $Q \leftarrow Q (I + W Y^T)$
- Total (seconds)
- GFLOP/s: 129 GFLOP/s, 143 GFLOP/s

Slide 39: Performance of $QP$ and $P^T A$
[Figure: average performance of $QP$ and $P^T A$, GFLOP/s vs. matrix rows. Series: QP GTX280, P^H A GTX280, QP 9800GX2, P^H A 9800GX2]
Peak performance
- $Q \leftarrow QP$: 334 GFLOP/s
- $A \leftarrow P^T A$: 237 GFLOP/s

Slide 40: Speedup
[Figure: speedup vs. matrix rows. Series: MKL 1 thread, MKL 2 threads, MKL 4 threads. Peak speedup (1 thread) annotated]

Slide 41: Future Work
- Attempt a custom matrix-matrix product kernel to achieve higher performance
- Extend to complex-valued data
- Support arbitrarily sized matrices
- GPU VSIPL

Slide 42: Conclusions
- Dense, block-oriented algorithms with large problem sizes do well on GPUs
- GPUs can efficiently compute QR decomposition
  - 143 GFLOP/s, 4.9x speedup
- Enhancements to CUBLAS are still possible

Slide 43: Questions?
