Accelerating cuBLAS/cuDNN using Input-Aware Auto-Tuning

Accelerating cuBLAS/cuDNN using Input-Aware Auto-Tuning: The ISAAC Library
Philippe Tillet, Harvard University

Introduction

cuBLAS does not always achieve peak performance. Writing (M, N, K) for the product of an M×K matrix by a K×N matrix:
- (M, N, K) = (4096, 4096, 4096): 95% of peak
- (M, N, K) = (1760, 32, 1760): 15% of peak
- (M, N, K) = (16, 16, 128,000): 0.1% of peak

Yes, some of these configurations are IO-bound, but still... (A sketch of how such numbers can be measured follows below.)
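To make the claim concrete, here is a minimal measurement sketch (mine, not the talk's code). It assumes PyTorch, whose torch.matmul dispatches to cuBLAS for CUDA tensors, and converts the timed loop into achieved TFLOPS using the 2·M·N·K flop count of a GEMM:

```python
import torch

def gemm_tflops(M, N, K, iters=50):
    # Time an FP32 GEMM on the GPU and convert to achieved TFLOPS.
    a = torch.randn(M, K, device="cuda")
    b = torch.randn(K, N, device="cuda")
    torch.matmul(a, b)  # warm-up (context/library initialization)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    return 2 * M * N * K / seconds / 1e12            # one GEMM costs 2*M*N*K flops

for shape in [(4096, 4096, 4096), (1760, 32, 1760), (16, 16, 128000)]:
    print(shape, "%.2f TFLOPS" % gemm_tflops(*shape))
```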

Introduction

Figure: cuBLAS (GEMM) vs. the roofline model on a Pascal Titan X. Performance [TFLOPS] against operational intensity [FLOPs/byte], with points for LinPack, DeepBench, covariance, and LAPACK shapes plotted under the theoretical peak.

Introduction

cuBLAS/cuDNN are good:
- Better than anything else so far
- Achieve peak performance... sometimes

...but not perfect:
- They lack performance portability (across hardware and tensor shapes)

Can we do better?

Method

Performance portability across hardware is a solved problem. Assume the existence of a kernel generator for GEMM/CONV:
- x_k: kernel parameters (e.g., tile sizes)
- x_i: input parameters (e.g., tensor shape, data type)
- y(x_i, x_k): performance of a given kernel on given inputs

Classical auto-tuning (ATLAS, clBLAS, etc.):
- Offline: choose x_i; find argmax_{x_k} y(x_i, x_k). (A minimal sketch of this loop follows below.)
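As an illustration of that classical offline loop (my sketch, under stated assumptions): compile_kernel and benchmark are hypothetical placeholders for the kernel generator and a timing harness, and the parameter grid is a toy example, not ATLAS's actual search space.

```python
import itertools

def autotune_offline(x_i, compile_kernel, benchmark):
    """Classical ATLAS-style tuning: fix the inputs x_i, then sweep a grid
    of kernel parameters x_k and keep the empirically fastest kernel."""
    tile_m = [16, 32, 64, 128]
    tile_n = [16, 32, 64, 128]
    split_k = [1, 2, 4, 8]
    best_xk, best_perf = None, 0.0
    for x_k in itertools.product(tile_m, tile_n, split_k):
        kernel = compile_kernel(x_k)     # may fail for invalid configurations
        if kernel is None:
            continue
        perf = benchmark(kernel, x_i)    # y(x_i, x_k), e.g., in TFLOPS
        if perf > best_perf:
            best_xk, best_perf = x_k, perf
    return best_xk, best_perf
```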

Method

ISAAC adds input portability: we want to retain good performance across the entire space of inputs.

Input-aware auto-tuning:
- Offline: build a predictive model ŷ for y.
- Online: x_i is imposed; find argmax_{x_k} ŷ(x_i, x_k).

Method

Figure: Flowchart of ISAAC.

Kernel Generation

Goal: transform kernel parameters x_k into functional binaries. Typical kernel parameters are tile sizes, reduction splits, and pre-fetching factors.

Kernel Generation

Figure: Parameterization of GEMM, x_k = (M_L, N_L, M_S, N_S, P, K_G, K_L, K_S).
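For concreteness, the parameter tuple could be carried around as a small record. The field comments below are my reading of a typical block-tile/micro-tile/split-K hierarchy, not definitions taken from the slides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmParams:
    M_L: int  # block-level tile along M (assumed meaning)
    N_L: int  # block-level tile along N
    M_S: int  # per-thread micro-tile along M
    N_S: int  # per-thread micro-tile along N
    P: int    # pre-fetching factor
    K_G: int  # global (inter-block) split of the K reduction
    K_L: int  # block-level K tile
    K_S: int  # per-thread K step

    def threads_per_block(self) -> int:
        # Illustrative derived quantity, assuming one thread per micro-tile.
        return (self.M_L // self.M_S) * (self.N_L // self.N_S)
```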

Kernel Generation

Implementation details:
- Double-buffered memory loads
- Vector loads/stores are used when possible
- CONV is essentially GEMM with a look-up table (see the sketch below)
- PTX code generation: faster compilation (and hence faster auto-tuning), no CUDA SDK dependency, and a 20-30% performance gain vs. CUDA C (via predicates)
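The "GEMM with a look-up table" remark refers to implicit GEMM: the kernel reads input patches through a table of offsets instead of materializing them. The simplest version to reason about is explicit im2col, sketched below in NumPy (my illustration, not ISAAC's code):

```python
import numpy as np

def im2col(x, R, S, stride=1):
    """Unroll a (C, H, W) input into a (C*R*S, P*Q) matrix of patches, so
    that convolution becomes one GEMM with a (K, C*R*S) filter matrix."""
    C, H, W = x.shape
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    row = 0
    for c in range(C):
        for r in range(R):
            for s in range(S):
                # All output positions that multiply filter tap (c, r, s).
                patch = x[c, r:r + stride * P:stride, s:s + stride * Q:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols

# CONV as GEMM: (K, C*R*S) filters @ (C*R*S, P*Q) patches -> (K, P*Q) output.
x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(16, 3 * 3 * 3).astype(np.float32)
out = w @ im2col(x, 3, 3)   # shape (16, 36)
```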

Data Generation

Goal: generate a set of pairs (x_n, y_n), where x = (x_i, x_k).
Method: sample x and measure y. About 99.9% of the generated configurations are invalid! Hence, build a generative model for valid x. (A naive rejection-sampling baseline is sketched below.)
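A naive way to see why a generative model helps (my sketch; the constraints below are plausible hardware limits, not the talk's actual validity rules): rejection sampling wastes almost all of its draws.

```python
import random

def is_valid(x_k, smem_bytes=48 * 1024):
    # Hypothetical validity rules: tiles divide evenly, the thread count is
    # within CUDA limits, and double-buffered FP32 tiles fit in shared memory.
    M_L, N_L, M_S, N_S, P, K_G, K_L, K_S = x_k
    if M_L % M_S or N_L % N_S or K_L % K_S:
        return False
    threads = (M_L // M_S) * (N_L // N_S)
    if not 32 <= threads <= 1024:
        return False
    return 2 * 4 * K_L * (M_L + N_L) <= smem_bytes  # x2 for double buffering

def sample_xk():
    p2 = lambda lo, hi: 2 ** random.randint(lo, hi)
    return (p2(4, 8), p2(4, 8), p2(0, 4), p2(0, 4),
            p2(0, 2), p2(0, 3), p2(2, 6), p2(0, 3))

draws = [sample_xk() for _ in range(100_000)]
valid = [c for c in draws if is_valid(c)]
print(f"{100 * len(valid) / len(draws):.1f}% valid")  # only a small fraction
```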

Regression Analysis

Goal: given X, Y, build a predictive model ŷ(x).
Method: MLPs are a good choice because:
- Generating data points is cheap
- Inference is fast and batched

Vanilla ML algorithms are not good at handling multiplications/divisions, hence the feature transformation x ← log x. (A minimal sketch follows below.)
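A minimal version of such a regressor, assuming scikit-learn and hypothetical data files (configs.npy holding concatenated (x_i, x_k) rows, tflops.npy the measured performance); the layer sizes are my guesses, not the paper's architecture:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.load("configs.npy")   # hypothetical: rows of concatenated (x_i, x_k)
Y = np.load("tflops.npy")    # hypothetical: measured performance per row

# log2 turns the multiplicative structure of GEMM performance (products and
# ratios of sizes) into additive features that an MLP models easily.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(np.log2(X), Y)
```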

Runtime Inference

Goal: given x_i, find the best possible x_k.
Method: compute argmax_{x_k} ŷ(x_i, x_k).
- Exhaustive search: millions of candidate x_k can be evaluated in one second, and the global maximum of the model is guaranteed. Other choices: genetic algorithms, simulated annealing...
- Re-benchmark the 10 best predictions and pick the actual fastest. (Both steps are sketched below.)
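Putting the two online steps together (my sketch; benchmark is again a hypothetical timing harness, and model is a regressor like the one above):

```python
import numpy as np

def select_kernel(x_i, candidates, model, benchmark):
    # Score every candidate x_k in one batched forward pass of the model...
    feats = np.log2([list(x_i) + list(x_k) for x_k in candidates])
    scores = model.predict(feats)
    # ...then re-benchmark the 10 best predictions and keep the true fastest.
    top10 = np.argsort(scores)[-10:]
    best = max(top10, key=lambda i: benchmark(candidates[i], x_i))
    return candidates[best]
```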

Method Summary

- Build a parameterized code generator for GEMM and CONV
- Benchmark random kernels on random input configurations
- Build a predictive model for the performance of any kernel on any shape
- For a fixed shape, maximize the model over kernels

Benchmarks

Figure: SGEMM on Titan X (Pascal), ISAAC vs. cuBLAS (TFLOPS). Panels: LinPack (M=N=K ∈ {512, 1024, 2048}), DeepBench [F] (M=K=2560, N ∈ {16, 32, 64, 128}), DeepBench [B] (M=K=2560, N ∈ {16, 32, 64, 128}), ICA (K=60,000, M=N ∈ {16, 64, 256}), blocked SVD (K=32, M=N ∈ {896, 2048, 4096}).

Benchmarks

Figure: Roofline model, revisited. Performance [TFLOPS] of the LinPack, DeepBench, covariance, and LAPACK shapes against the theoretical peak.

Benchmarks

Figure: HGEMM/DGEMM on P100, our framework vs. cuBLAS (TFLOPS). Panels: LinPack (double precision, M=N=K ∈ {512, 1024, 2048}), DeepBench (half precision, N ∈ {16, 32, 64, 128}), ICA (double precision, M=N ∈ {16, 64, 256}), blocked SVD (double precision, M=N ∈ {896, 2048, 4096}).

Benchmarks

Figure: SCONV on Titan X (Pascal), ISAAC vs. cuDNN (TFLOPS) on DeepBench.

Benchmarks

Figure: HCONV on P100, ISAAC vs. cuDNN (TFLOPS) on DeepBench.

Conclusions

- Presented the design and implementation of ISAAC
- Performance improvements of 0.8-9x over cuDNN
- Performance improvements of 0.9-3x (and > 30x on ICA) over cuBLAS
- Fast release cycle (auto-tuning takes ~3 hours)

git clone -b v2.0 https://github.com/ptillet/isaac.git

Thanks for your attention!

Benchmarks

Figure: SGEMM on GTX 980, ISAAC vs. cuBLAS (TFLOPS). Panels: LinPack (M=N=K ∈ {512, 1024, 2048}), DeepBench (M=K=1760, N ∈ {16, 32, 64, 128}), ICA (K=60,000, M=N ∈ {16, 64, 256}), blocked SVD (K=32, M=N ∈ {896, 2048, 4096}).