Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Size: px

Start display at page:

Download "Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement"

Morgan Oliver
5 years ago
Views:

1 Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement Tim Davis (Texas A&M University) with Sanjay Ranka, Mohamed Gadou (University of Florida) Nuri Yeralan (Microsoft) NVIDIA GTC 2015 March 2015

2 Outline Combinatorial scientific computing: math + CS + applications Multifrontal methods for factorizing sparse matrices fill-in creates cliques in the graph cliques connected via the elimination tree Sparse LU: square cliques assembled via addition Sparse LU: rectangular, via addition Sparse QR: rectangular, via concatenation GPU kernel design for sparse QR Bucket scheduler for single-gpu QR Extended for multi-gpu QR Performance results

3 Combinatorial Scientific Computing: math + CS + applications

4 Cliques in the graph: the multifrontal method Cliques + elimination tree = sequence of frontal matrices Dense factorization within a front; assemble data into parent regular + irregular computation

5 UMFPACK: unsymmetric multifrontal method Frontal matrices become rectangular Assemble data into ancestors, not just parents

6 UMFPACK: unsymmetric multifrontal method Key results / impact high-performance via dense matrix kernels within each front symbolic preordering and analysis, followed by revised local pivot search with approximate unsymmetric degree update widely used sparse LU and x=a\b in MATLAB Mathematica IBM circuit simulation application finite-element solvers: NASTRAN, FEniCS,... Berkeley Design Automation: circuit simulation CVXOPT...

7 SuiteSparseQR: multifrontal sparse QR factorization Key results / impact rectangular fronts like UMFPACK, but simpler frontal matrix assembly (concatenation, not summation) (Duff, Puglisi) rank approximation (Heath, extended to multifrontal case) multicore parallelism on multicore CPU (70 Gflop theoretical peak): up to 30 Gflops sparse qr in MATLAB, and x=a\b today s talk: GPU algorithm novel Bucket QR scheduler and custom GPU kernels up to 150 GFlops on one Kepler K20c, 286 GFlops on 4 Tesla C2070 s up to 28x speedup vs CPU algorithm (10x typical for large problems)

8 A column elimination tree and its supernodes SuiteSparseQR

9 Frontal matrix assembly SuiteSparseQR

10 concatenation, resulting a staircase matrix SuiteSparseQR

11 SuiteSparseQR

12 Prior methods one front at a time on the GPU assembly on CPU panel factorization on the CPU, applied on GPU Our multifrontal QR many fronts on the GPU (entire subtree) assembly on GPU: data concatenation, not summation entire dense QR of each front on the GPU Multifrontal factorization and assembly

13 Consider a subtree of frontal matrices on the GPU

14 Expanded to show GPU kernel launches

15 Bucket QR factorization

16 Bucket QR factorization

17 Bucket QR factorization

18 Bucket QR factorization

19 Bucket QR factorization

20 Bucket QR factorization

21 Bucket QR factorization

22 Bucket QR factorization

23 Bucket QR factorization

25 Bucket QR requires two kernels on the GPU QR factorization of a t-by-1 tile, t = 1, 2, or 3 creates a block Householder details on next slides Apply a block Householder: A = A V (T (V A)) A is t-by-s, where s can be large thread-block iterates 2 column blocks at atime (details omitted) GPU kernels for Bucket QR

26 Block Householder QR = A where Q = (I VT V ), and R is upper triangular GPU kernel: Block Householder QR [m n] = size (A) for k = 1:n [tau, v] = house (A (k:m,k)) A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v * A (k:m,k:n))) V (k:m,k) = v ; Tau (k) = tau end T = zeros (n) for k = 1:n tau = Tau (k) ; v = V (k:m,k) z = - tau * v * V (k:m,1:k-1) T (1:k-1,k) = T (1:k-1,1:k-1) * z T (k,k) = tau end

27 GPU kernel: Block Householder QR Householder update A(k:m,k:n) = A(k:m,k:n) -... v*(tau*(v *A(k:m,k:n))) Construct T z = - tau * v * V (k:m,1:k-1) T (1:k-1,k) = T(1:k-1,1:k-1)*z T (k,k) = tau

28 Towards an GPU kernel overwrite tril(a,-1) with V, and fold in construction of T. [m n] = size (A) T = zeros (n) for k = 1:n [tau, v] = house (A (k:m,k)) A (k:m,k:n) = A (k:m,k:n) -... v * (tau * (v * A (k:m,k:n))) V1 (k) = v (1) A (k+1:m,k) = v (2:end) z = - tau * v * A (k:m,1:k-1) T (1:k-1,k) = T (1:k-1,1:k-1) * z T (k,k) = tau end GPU kernel: Block Householder QR

29 The GPU kernel update A and construct T in parallel: A (k:m,k:n) = A (k:m,k:n) -... v * (tau * (v * A (k:m,k:n))) z = - tau * (v * A (k:m,1:k-1)) T (1:k-1,k) = T (1:k-1,1:k-1) * z T (k,k) = tau GPU kernel: Block Householder QR becomes z = -tau * v * A (k:m,:) A (k:m,k:n) = A (k:m,k:n) + v * z (k:n) T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1) T (k,k) = tau

30 z = -tau * v * A (k:m,:) A (k:m,k:n) =... A (k:m,k:n) +... v * z (k:n) T (1:k-1,k) =... T (1:k-1,1:k-1)... * z (1:k-1) T (k,k) = tau GPU kernel: Block Householder QR

31 The GPU kernel thread-level parallelism z = -tau * v * A (k:m,:) A (k:m,k:n) = A (k:m,k:n) + v * z (k:n) T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1) T (k,k) = tau GPU kernel: Block Householder QR A is 96-by-32, in register during factorization, and finally in global memory at the end (V and R) each thread owns an 8-by-1 bitty block of A v is 96-by-1, in shared memory z is 32-by-1, in shared. Requires 12-by-32 shared space for v *A(k:m,:) reduction

32 GPU kernel: Block Householder QR Putting it all together. At the kth step: threads that own column k write A to shared thread zero computes Householder coefficients z = -tau * v * A (k:m,:) each thread computes 8-by-1 dot product in parallel writes scalar result in 12-by-32 reduction space warp zero sums reduction space to get z A (k:m,k:n) = A (k:m,k:n) + v * z (k:n) only done by threads that own columns k:n threads that own column k+1 compute norm of that column of A, for next Householder coef, saving result in 12-by-1 reduction space T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1) only done by threads 1:k-1 thread zero sums up reduction space for norm of column k+1

33 GPU kernel: Block Householder QR z = -tau * v * A (k:m,:) each thread computes 8-by-1 dot product in parallel writes scalar result in 12-by-32 reduction space warp zero sums reduction space to get z

threads that own column k+1 compute norm of that column of

34 GPU kernel: Block Householder QR A (k:m,k:n) = A (k:m,k:n) + v * z (k:n) only done by threads that own columns k:n threads that own column k+1 compute norm of that column of A, for next Householder coef, saving result in 12-by-1 reduction space

35 GPU kernel: Block Householder QR T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1) only done by threads 1:k-1

36 Putting it all together... Performance results Fermi K20 K40 GPU kernels: apply block Householder 183 Gflops 260 Gflops 360 Gflops factorize 3 tiles 27 Gflops 20 Gflops dense QR for large front 107 Gflops 120 Gflops ( bare metal flops) 154 Gflops 172 Gflops sparse QR on GPU 80 Gflops 150 Gflops peak speedup over CPU 11x 20x typical speedup over CPU 5x 10x Single GPU performance results

37 GPU Kernel pitfalls What we ll do differently in our kernel design Householder block-apply using too much shared memory uberkernel approach each thread block determines what to do from a task list (QR, apply, assemble) pros: single large kernel launch, lots of parallelism con: all tasks use same thread geometry con: QR of panel needs higher occupancy to hide scalar (d) to do: block apply kernel needs to stage A = A WYA by not keeping W and Y in shared. split the uberkernel so QR panel can have higher occupancy

38 Single GPU performance on many matrices

39 Multiple GPUs: decomposing the tree

40 Multiple GPUs: Performance Results Results on NVIDIA Tesla C2070 GPUs problem CPU 1 GPU 2 GPU 4 GPU speedup speedup GFlop GFlop GFlop GFlop vs CPU vs 1 GPU 1500:2D :2D :2D lp nug ch7-8-b ch7-8-b3:

41 Multiple GPUs: for large fronts

42 Multiple GPUs: bucket scheduler on the large scale

43 Multiple GPUs: bucket scheduler on the large scale

44 Multiple GPUs: bucket scheduler on the large scale

45 Multiple GPUs: bucket scheduler on the large scale

46 Multiple GPUs: bucket scheduler on the large scale

47 Multiple GPUs: bucket scheduler on the large scale

48 Multiple GPUs: bucket scheduler on the large scale

49 Multiple GPUs: bucket scheduler on the large scale

50 Multiple GPUs: bucket scheduler on the large scale

51 Multiple GPUs: bucket scheduler on the large scale

52 Multiple GPUs: bucket scheduler on the large scale

53 Multiple GPUs: bucket scheduler on the large scale

54 Multiple GPUs: bucket scheduler on the large scale

55 Multiple GPUs: bucket scheduler on the large scale

56 Multiple GPUs: bucket scheduler on the large scale

57 Multiple GPUs: bucket scheduler on the large scale

58 Multiple GPUs: bucket scheduler on the large scale

59 Multiple GPUs: bucket scheduler on the large scale

60 Multiple GPUs: bucket scheduler on the large scale

61 Multiple GPUs: bucket scheduler on the large scale

62 Multiple GPUs: bucket scheduler on the large scale

63 Multiple GPUs: bucket scheduler on the large scale

64 Multiple GPUs: bucket scheduler on the large scale

65 Multiple GPUs: bucket scheduler on the large scale

66 Multiple GPUs: bucket scheduler on the large scale

70 Acknowledgements: National Science Foundation NVIDIA Texas A&M University

71 Summary: Sparse QR on GPUs Fronts live and die on the GPU, reduces CPU-GPU traffic Bucket scheduler: extends Communication-Avoiding QR method Single GPU: speedup 5x to 20x on one GPU Multi GPU prototype: speedup over 3x on 4 GPUs Code: SuiteSparse.com and developer.nvidia.com/cholmod SuiteSparse logo, and music to art via math: NotesArtStudio.com

Sparse Matrix Algorithms

Sparse Matrix Algorithms combinatorics + numerical methods + applications Math + X Tim Davis University of Florida June 2013 contributions to the field current work vision for the future Outline Math+X