On the limits of (and opportunities for?) GPU acceleration

1 On the limits of (and opportunities for?) GPU acceleration
Aparna Chandramowlishwaran, Jee Choi, Kenneth Czechowski, Murat (Efe) Guney, Logan Moon, Aashay Shringarpure, Richard (Rich) Vuduc
HotPar '10, Berkeley, June 15, 2010

2

3 Q: Port to GPU? (Posed to me by Scott Klasky at ORNL)
A: Given roughly the same level of tuning & power*, GPU ≈ CPU
Meta-analysis, for semi-irregular sci. comp. + data analytics apps (sparse iterative + direct solvers; tree-based particle methods)

4 Summary of limits
Bandwidth-bound: aggregate bandwidth < 3x, restricted access patterns
Compute-bound: 5 to 10x peak potential, but there is a multithreading granularity mismatch
PCIe: Will architects replicate the GPU memory system in an on-die CPU/GPU?
Productivity: To 0th order, tuning is required on all platforms (i.e., Bailey's Rule #6)

5 Limitations of this meta-analysis
Mix of partial results, very rough back-of-the-envelope estimates, apples vs. oranges. But: "It's all fruit." (My Big Fat Greek Wedding)
Narrow: Scientific computing? (yawn) But: More physically realistic games & graphics, signal analysis, data analytics
Limited to today's platforms, and excludes Fermi. What about tomorrow? But: "Prediction is very difficult, especially if it's about the future." (N. Bohr)

6 Other voices
Vishkin, et al., HotPar '10 poster!
Bordawekar, et al. (IBM). Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application! IBM Technical Report RC24982, April 2010.
Lee, et al. (Intel). Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. ISCA, June 2010.

7 Performance and productivity expectations

8 Easy to write in NESL, but at what cost?
[Chart: lines of code vs. running time for parallel sorting, comparing NESL (16 cores), C (1 core), Cilk++ (16 cores), a tuned library (16 cores), and CUDA (GPU).]
Worth the effort? Algorithm: parallel sorting

9 [Single-precision roofline, 1:1 socket comparison: Gflop/s vs. arithmetic intensity (flop : byte). Ridge points (intensity, peak Gflop/s): Fermi (7.2, 1030), T10P (9.1, 933), Nehalem (3.3, 86).]

10 [Same single-precision roofline, annotated: ~4x (Case study 1), ~12x (Case study 2), ~8x (Case study 3).]

11 [Double-precision roofline, 1:1 socket comparison: Gflop/s vs. intensity (flop : byte). Ridge points: Fermi (3.6, 515), T10P (0.8, 78), Nehalem (1.7, 43). Annotations: ~2 to 12x and ~3 to 4x.]
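The attainable rates on these roofline plots follow from min(peak Gflop/s, bandwidth x intensity). Below is a minimal C sketch of that bound; the double-precision peaks and bandwidths are approximate values consistent with the ridge points quoted above, not exact vendor specifications.

    #include <stdio.h>

    /* Attainable Gflop/s under the roofline model:
     * min(peak compute, memory bandwidth * arithmetic intensity). */
    static double roofline(double peak_gflops, double bw_gb_per_s, double intensity)
    {
        double mem_bound = bw_gb_per_s * intensity;  /* Gflop/s if bandwidth-limited */
        return (mem_bound < peak_gflops) ? mem_bound : peak_gflops;
    }

    int main(void)
    {
        /* Approximate double-precision peaks (Gflop/s) and bandwidths (GB/s)
         * implied by the ridge points above -- assumed, not vendor-exact. */
        double intensity = 0.25;  /* e.g., an SpMV-like 1/4 flop per byte */
        printf("Fermi:   %6.1f Gflop/s\n", roofline(515.0, 144.0, intensity));
        printf("T10P:    %6.1f Gflop/s\n", roofline( 78.0, 102.0, intensity));
        printf("Nehalem: %6.1f Gflop/s\n", roofline( 43.0,  25.6, intensity));
        return 0;
    }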

12 Case study 1: Sparse iterative solvers
S. Williams, N. Bell, J. Choi, M. Garland, L. Oliker, R. Vuduc. SpMV on MC & Acc. In BDK book (2010).

13 Anatomy of a sparse iterative solver
do { y ← A*x (SpMV) } while (!converged)
Bottleneck: sparse matrix-vector multiply (SpMV); a generic CSR kernel is sketched below
Memory bandwidth-limited (streams A): GPU / CPU ~ 3x
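For reference, a minimal CSR sparse matrix-vector multiply in C, the kernel inside the loop above. This is the generic textbook formulation, not the tuned CPU or GPU code from the study; it makes the low flop : byte ratio evident (2 flops per nonzero against roughly 12 or more bytes of value, index, and vector traffic).

    /* y = A*x with A in compressed sparse row (CSR) format.
     * Row i owns nonzeros val[rowptr[i] .. rowptr[i+1]-1] in columns
     * colidx[...]. Each nonzero costs 2 flops but streams ~12+ bytes,
     * hence the kernel is memory-bandwidth bound. */
    void spmv_csr(int n, const int *rowptr, const int *colidx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                sum += val[k] * x[colidx[k]];
            y[i] = sum;
        }
    }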

14 [Bar chart: double-precision SpMV Gflop/s by matrix (Dense, Protein, Spheres, Cantilever, WindTunnel, Harbor, QCD, Ship, Economics, Epidemiology, Accelerator, Circuit, Webbase, LP), comparing Nehalem, GTX 285, and Nehalem (tuned).]
S. Williams, N. Bell, J. Choi, M. Garland, L. Oliker, R. Vuduc. SpMV on MC & Acc. In BDK book (2010).

15 [Same SpMV chart, annotated: ~6 to 7x.]

16 [Same SpMV chart, annotated: ~2.1x (vs. ~3x bandwidth).]

17 Beyond the kernel
Optimal data structures differ between CPU & GPU
Startup, transfer cost: in distributed memory, need to transfer vectors; PCIe limits
Need ~100 iterations to break even, ~840 to get 2x on the actual solver (arithmetic sketched below)
Big or not? Depends on app & input
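The break-even figures follow from simple arithmetic: if the GPU saves (t_cpu - t_gpu) per iteration but pays a one-time transfer/startup cost t_xfer, the solver must run t_xfer / (t_cpu - t_gpu) iterations to break even. A sketch in C with illustrative placeholder timings, not measurements from the study:

    #include <stdio.h>

    /* Break-even iteration count when a GPU kernel is faster per iteration
     * but pays a one-time PCIe transfer/startup cost:
     *   n * t_cpu = t_xfer + n * t_gpu   =>   n = t_xfer / (t_cpu - t_gpu).
     * For an overall speedup s: n = s * t_xfer / (t_cpu - s * t_gpu).
     * The timings below are assumed placeholders, not measured values. */
    int main(void)
    {
        double t_cpu  = 10.0e-3;  /* seconds per CPU solver iteration (assumed) */
        double t_gpu  =  4.0e-3;  /* seconds per GPU solver iteration (assumed) */
        double t_xfer =  0.6;     /* one-time matrix/vector transfer cost (assumed) */
        double s      =  2.0;     /* target end-to-end speedup */

        printf("break even after ~%.0f iterations\n", t_xfer / (t_cpu - t_gpu));
        printf("reach %.0fx after ~%.0f iterations\n",
               s, s * t_xfer / (t_cpu - s * t_gpu));
        return 0;
    }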

18 Case study 2: The Fast Multipole Method
(1) Lashuk, et al., SC'09. (2) Chandramowlishwaran, Williams, et al., IPDPS '10.

19 Blood cell simulation (movie), in 2-D with deformations

20 Anatomy of an FMM
Want: all pair interactions among the green dots: O(n^2)
FMM idea: build a tree, traverse & prune (approx.): O(n log n), with an accuracy guarantee
Bottleneck: little all-pair interactions (b^2) among leaves
High compute intensity: 12x possible?

21 The story
Baseline: Lashuk, et al., SC'09: an FMM that scales to 100k cores of Kraken (UTK)
Good overall scalability, but low within-node performance. Try GPUs?
Gumerov & Duraiswami (JCP '08) suggest 30 to 60x speedups on GPUs
We successfully replicate this on a 256-GPU system at UIUC (1 MPI task / GPU)
But a stubborn student (Aparna C.) is skeptical of the speedup

22 [Bar chart: speedup over the Nehalem baseline for three code versions (Baseline, Opt, Opt+Par) on three platforms (Nehalem, C1060, C1060 x 2).]

23 [Same chart, annotated with speedups of 53x and 31x and a ratio of ~0.9x.]

24 Recall: Anatomy of an FMM
Want: all pair interactions among the green dots: O(n^2)
FMM idea: build a tree, traverse & prune (approx.): O(n log n), with an accuracy guarantee
Bottleneck: little all-pair interactions (b^2) among nodes
High compute intensity: 12x possible?

25 Direct O(n^2) n-body algorithm
[Chart: Gflop/s vs. number of particles for AMD Barcelona, Intel Harpertown, NVIDIA Tesla C1060, NVIDIA Tesla C870, and STI PowerXCell/8i.]

26 Direct O(n^2) n-body algorithm
[Same chart, annotated: ~500 Gflop/s (!).]

27 Direct O(n^2) n-body algorithm
[Same chart as slide 26.]
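For context, the kernel behind these plots is an all-pairs force evaluation. A generic softened-gravity version in C is below (roughly 20 flops per pair on data that stays in registers or cache, which is why it approaches compute peak); this is a sketch of the standard formulation, not the benchmark code used for the measurements.

    #include <math.h>

    /* Direct O(n^2) n-body acceleration update with a softened potential.
     * Generic sketch, not the benchmark code behind the plots above. */
    void nbody_direct(int n, const float *x, const float *y, const float *z,
                      const float *m, float *ax, float *ay, float *az,
                      float soft2)
    {
        for (int i = 0; i < n; ++i) {
            float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
            for (int j = 0; j < n; ++j) {
                float dx = x[j] - x[i];
                float dy = y[j] - y[i];
                float dz = z[j] - z[i];
                float r2 = dx * dx + dy * dy + dz * dz + soft2;
                float inv_r = 1.0f / sqrtf(r2);
                float w = m[j] * inv_r * inv_r * inv_r;  /* m_j / r^3 */
                axi += w * dx; ayi += w * dy; azi += w * dz;
            }
            ax[i] = axi; ay[i] = ayi; az[i] = azi;
        }
    }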

28 Case study 3: Sparse direct solvers
(1) M. Efe Guney, Ph.D. Thesis, Georgia Tech, May '10. (2) Guney, Czechowski, et al. (in progress)

29 Anatomy of a sparse direct solver
Sparse Cholesky factorization, A = L L^T, where A & L are sparse
Mixed compute intensity, averaging ~4 flops : byte for the sample problem

30 Real elimination tree example. Independent subtrees may be processed in parallel.

31 [Chart: flops (%) and cumulative flops (%) vs. elimination-tree depth.]

32 Finer-grained dependencies. Colored circles on the right are BLAS calls on operands of varying size.

33 [Chart: fraction of time spent in BLAS calls (dgemm, dpotrf, dsyrk, dtrsm).]

34 Intensity mix
[Chart: fraction of flops vs. arithmetic intensity (flop : byte), broken down by BLAS call (dgemm, dpotrf, dsyrk, dtrsm).]
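The per-call intensities on this slide can be estimated from operand shapes. For a double-precision GEMM C(m,n) += A(m,k) * B(k,n), roughly 2mnk flops move against the compulsory operand traffic; the C sketch below counts only that traffic, so the result is an upper bound on the achievable intensity, and the test sizes are illustrative rather than taken from the solver.

    #include <stdio.h>

    /* Rough flop : byte ratio of a double-precision GEMM C(m,n) += A(m,k)*B(k,n):
     * 2*m*n*k flops over 8*(m*k + k*n + 2*m*n) bytes if each operand is touched
     * once (C read and written). Compulsory traffic only, hence an upper bound. */
    static double dgemm_intensity(long m, long n, long k)
    {
        double flops = 2.0 * (double)m * (double)n * (double)k;
        double bytes = 8.0 * ((double)m * k + (double)k * n + 2.0 * (double)m * n);
        return flops / bytes;
    }

    int main(void)
    {
        printf("64^3 dgemm:          %.1f flop/byte\n", dgemm_intensity(64, 64, 64));
        printf("512^3 dgemm:         %.1f flop/byte\n", dgemm_intensity(512, 512, 512));
        printf("512x512x16 (skinny): %.1f flop/byte\n", dgemm_intensity(512, 512, 16));
        return 0;
    }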

35 [Chart: % of operations by BLAS call (dgemm, dpotrf, dsyrk, dtrsm) vs. elimination-tree depth.]

36 [Chart: arithmetic intensity (flop : byte) and % of operations, broken down by BLAS call (dgemm, dpotrf, dsyrk, dtrsm) and elimination-tree depth.]

37 [Bar chart: double-precision Gflop/s vs. frontal matrix size (S15x15x50, F15x15x50, S30x30x30, S100x50x13, F30x30x30, S50x50x50, F45x45x45), comparing GPU Gflop/s with copy, GPU Gflop/s without copy, and 1-core CPU Gflop/s with copy (4-core ~2 to 4x). GPU rates range roughly 14 to 35 Gflop/s; 1-core CPU rates roughly 9.5 to 11 Gflop/s.]

38 Summary: Definite potential, but no magic
For a set of semi-irregular computations, 1 GPU ≈ 1 to 2 CPUs. Study heterogeneous computing, but scale back expectations.
Barriers?
Bandwidth-bound: aggregate GPU : CPU bandwidth < 3x
PCIe: Will we really replicate the GPU memory system in an on-die CPU/GPU?
Compute-bound: 5 to 10x potential, but how? E.g., FMM granularity mismatch
Productivity: To 0th order, tuning is required on all platforms (Bailey's Rule #6)
