Accelerating cuBLAS/cuDNN using Input-Aware Auto-Tuning

Accelerating cuBLAS/cuDNN using Input-Aware Auto-Tuning: The ISAAC Library
Philippe Tillet, Harvard University

Introduction

cuBLAS does not always achieve peak performance. Writing (M, N, K) for the product of an M×K matrix by a K×N matrix:
- (M, N, K) = (4096, 4096, 4096): 95% of peak
- (M, N, K) = (1760, 32, 1760): 15% of peak
- (M, N, K) = (16, 16, 128,000): 0.1% of peak

Yes, some of these configurations are IO-bound, but still... (A sketch of how such numbers can be measured follows below.)
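To make the claim concrete, here is a minimal measurement sketch (mine, not the talk's code). It assumes PyTorch, whose torch.matmul dispatches to cuBLAS for CUDA tensors, and converts the timed loop into achieved TFLOPS using the 2·M·N·K flop count of a GEMM:

```python
import torch

def gemm_tflops(M, N, K, iters=50):
    # Time an FP32 GEMM on the GPU and convert to achieved TFLOPS.
    a = torch.randn(M, K, device="cuda")
    b = torch.randn(K, N, device="cuda")
    torch.matmul(a, b)  # warm-up (context/library initialization)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    return 2 * M * N * K / seconds / 1e12            # one GEMM costs 2*M*N*K flops

for shape in [(4096, 4096, 4096), (1760, 32, 1760), (16, 16, 128000)]:
    print(shape, "%.2f TFLOPS" % gemm_tflops(*shape))
```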

Introduction

Figure: cuBLAS (GEMM) vs. the roofline model on a Pascal Titan X. Performance [TFLOPS] against operational intensity [FLOPs/byte], with points for LinPack, DeepBench, covariance, and LAPACK shapes plotted under the theoretical peak.

Introduction

cuBLAS/cuDNN are good:
- Better than anything else so far
- Achieve peak performance... sometimes

...but not perfect:
- They lack performance portability (across hardware and tensor shapes)

Can we do better?

Method

Performance portability across hardware is a solved problem. Assume the existence of a kernel generator for GEMM/CONV:
- x_k: kernel parameters (e.g., tile sizes)
- x_i: input parameters (e.g., tensor shape, data type)
- y(x_i, x_k): performance of a given kernel on given inputs

Classical auto-tuning (ATLAS, clBLAS, etc.):
- Offline: choose x_i; find argmax_{x_k} y(x_i, x_k). (A minimal sketch of this loop follows below.)
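As an illustration of that classical offline loop (my sketch, under stated assumptions): compile_kernel and benchmark are hypothetical placeholders for the kernel generator and a timing harness, and the parameter grid is a toy example, not ATLAS's actual search space.

```python
import itertools

def autotune_offline(x_i, compile_kernel, benchmark):
    """Classical ATLAS-style tuning: fix the inputs x_i, then sweep a grid
    of kernel parameters x_k and keep the empirically fastest kernel."""
    tile_m = [16, 32, 64, 128]
    tile_n = [16, 32, 64, 128]
    split_k = [1, 2, 4, 8]
    best_xk, best_perf = None, 0.0
    for x_k in itertools.product(tile_m, tile_n, split_k):
        kernel = compile_kernel(x_k)     # may fail for invalid configurations
        if kernel is None:
            continue
        perf = benchmark(kernel, x_i)    # y(x_i, x_k), e.g., in TFLOPS
        if perf > best_perf:
            best_xk, best_perf = x_k, perf
    return best_xk, best_perf
```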

Method

ISAAC adds input portability: we want to retain good performance across the entire space of inputs.

Input-aware auto-tuning:
- Offline: build a predictive model ŷ for y.
- Online: x_i is imposed; find argmax_{x_k} ŷ(x_i, x_k).

Method

Figure: Flowchart of ISAAC.

Kernel Generation

Goal: transform kernel parameters x_k into functional binaries. Typical kernel parameters are tile sizes, reduction splits, and pre-fetching factors.

Kernel Generation

Figure: Parameterization of GEMM, x_k = (M_L, N_L, M_S, N_S, P, K_G, K_L, K_S).
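For concreteness, the parameter tuple could be carried around as a small record. The field comments below are my reading of a typical block-tile/micro-tile/split-K hierarchy, not definitions taken from the slides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GemmParams:
    M_L: int  # block-level tile along M (assumed meaning)
    N_L: int  # block-level tile along N
    M_S: int  # per-thread micro-tile along M
    N_S: int  # per-thread micro-tile along N
    P: int    # pre-fetching factor
    K_G: int  # global (inter-block) split of the K reduction
    K_L: int  # block-level K tile
    K_S: int  # per-thread K step

    def threads_per_block(self) -> int:
        # Illustrative derived quantity, assuming one thread per micro-tile.
        return (self.M_L // self.M_S) * (self.N_L // self.N_S)
```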

Kernel Generation

Implementation details:
- Double-buffered memory loads
- Vector loads/stores are used when possible
- CONV is essentially GEMM with a look-up table (see the sketch below)
- PTX code generation: faster compilation (and hence faster auto-tuning), no CUDA SDK dependency, and a 20-30% performance gain vs. CUDA C (via predicates)
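The "GEMM with a look-up table" remark refers to implicit GEMM: the kernel reads input patches through a table of offsets instead of materializing them. The simplest version to reason about is explicit im2col, sketched below in NumPy (my illustration, not ISAAC's code):

```python
import numpy as np

def im2col(x, R, S, stride=1):
    """Unroll a (C, H, W) input into a (C*R*S, P*Q) matrix of patches, so
    that convolution becomes one GEMM with a (K, C*R*S) filter matrix."""
    C, H, W = x.shape
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    row = 0
    for c in range(C):
        for r in range(R):
            for s in range(S):
                # All output positions that multiply filter tap (c, r, s).
                patch = x[c, r:r + stride * P:stride, s:s + stride * Q:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols

# CONV as GEMM: (K, C*R*S) filters @ (C*R*S, P*Q) patches -> (K, P*Q) output.
x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(16, 3 * 3 * 3).astype(np.float32)
out = w @ im2col(x, 3, 3)   # shape (16, 36)
```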

Data Generation

Goal: generate a set of pairs (x_n, y_n), where x = (x_i, x_k).
Method: sample x and measure y. About 99.9% of the generated configurations are invalid! Hence, build a generative model for valid x. (A naive rejection-sampling baseline is sketched below.)
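A naive way to see why a generative model helps (my sketch; the constraints below are plausible hardware limits, not the talk's actual validity rules): rejection sampling wastes almost all of its draws.

```python
import random

def is_valid(x_k, smem_bytes=48 * 1024):
    # Hypothetical validity rules: tiles divide evenly, the thread count is
    # within CUDA limits, and double-buffered FP32 tiles fit in shared memory.
    M_L, N_L, M_S, N_S, P, K_G, K_L, K_S = x_k
    if M_L % M_S or N_L % N_S or K_L % K_S:
        return False
    threads = (M_L // M_S) * (N_L // N_S)
    if not 32 <= threads <= 1024:
        return False
    return 2 * 4 * K_L * (M_L + N_L) <= smem_bytes  # x2 for double buffering

def sample_xk():
    p2 = lambda lo, hi: 2 ** random.randint(lo, hi)
    return (p2(4, 8), p2(4, 8), p2(0, 4), p2(0, 4),
            p2(0, 2), p2(0, 3), p2(2, 6), p2(0, 3))

draws = [sample_xk() for _ in range(100_000)]
valid = [c for c in draws if is_valid(c)]
print(f"{100 * len(valid) / len(draws):.1f}% valid")  # only a small fraction
```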

Regression Analysis

Goal: given X, Y, build a predictive model ŷ(x).
Method: MLPs are a good choice because:
- Generating data points is cheap
- Inference is fast and batched

Vanilla ML algorithms are not good at handling multiplications/divisions, hence the feature transformation x ← log x. (A minimal sketch follows below.)
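A minimal version of such a regressor, assuming scikit-learn and hypothetical data files (configs.npy holding concatenated (x_i, x_k) rows, tflops.npy the measured performance); the layer sizes are my guesses, not the paper's architecture:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.load("configs.npy")   # hypothetical: rows of concatenated (x_i, x_k)
Y = np.load("tflops.npy")    # hypothetical: measured performance per row

# log2 turns the multiplicative structure of GEMM performance (products and
# ratios of sizes) into additive features that an MLP models easily.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(np.log2(X), Y)
```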

Runtime Inference

Goal: given x_i, find the best possible x_k.
Method: compute argmax_{x_k} ŷ(x_i, x_k).
- Exhaustive search: millions of candidate x_k can be evaluated in one second, and the global maximum of the model is guaranteed. Other choices: genetic algorithms, simulated annealing...
- Re-benchmark the 10 best predictions and pick the actual fastest. (Both steps are sketched below.)
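Putting the two online steps together (my sketch; benchmark is again a hypothetical timing harness, and model is a regressor like the one above):

```python
import numpy as np

def select_kernel(x_i, candidates, model, benchmark):
    # Score every candidate x_k in one batched forward pass of the model...
    feats = np.log2([list(x_i) + list(x_k) for x_k in candidates])
    scores = model.predict(feats)
    # ...then re-benchmark the 10 best predictions and keep the true fastest.
    top10 = np.argsort(scores)[-10:]
    best = max(top10, key=lambda i: benchmark(candidates[i], x_i))
    return candidates[best]
```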

Method Summary

- Build a parameterized code generator for GEMM and CONV
- Benchmark random kernels on random input configurations
- Build a predictive model for the performance of any kernel on any shape
- For a fixed shape, maximize the model over kernels

Benchmarks

Figure: SGEMM on Titan X (Pascal), ISAAC vs. cuBLAS (TFLOPS). Panels: LinPack (M=N=K ∈ {512, 1024, 2048}), DeepBench [F] (M=K=2560, N ∈ {16, 32, 64, 128}), DeepBench [B] (M=K=2560, N ∈ {16, 32, 64, 128}), ICA (K=60,000, M=N ∈ {16, 64, 256}), blocked SVD (K=32, M=N ∈ {896, 2048, 4096}).

Benchmarks

Figure: Roofline model, revisited. Performance [TFLOPS] of the LinPack, DeepBench, covariance, and LAPACK shapes against the theoretical peak.

Benchmarks

Figure: HGEMM/DGEMM on P100, our framework vs. cuBLAS (TFLOPS). Panels: LinPack (double precision, M=N=K ∈ {512, 1024, 2048}), DeepBench (half precision, N ∈ {16, 32, 64, 128}), ICA (double precision, M=N ∈ {16, 64, 256}), blocked SVD (double precision, M=N ∈ {896, 2048, 4096}).

Benchmarks

Figure: SCONV on Titan X (Pascal), ISAAC vs. cuDNN (TFLOPS) on DeepBench.

Benchmarks

Figure: HCONV on P100, ISAAC vs. cuDNN (TFLOPS) on DeepBench.

Conclusions

- Presented the design and implementation of ISAAC
- Performance improvements of 0.8-9x over cuDNN
- Performance improvements of 0.9-3x (and > 30x on ICA) over cuBLAS
- Fast release cycle (auto-tuning takes ~3 hours)

git clone -b v2.0 https://github.com/ptillet/isaac.git

Thanks for your attention!

Benchmarks

Figure: SGEMM on GTX 980, ISAAC vs. cuBLAS (TFLOPS). Panels: LinPack (M=N=K ∈ {512, 1024, 2048}), DeepBench (M=K=1760, N ∈ {16, 32, 64, 128}), ICA (K=60,000, M=N ∈ {16, 64, 256}), blocked SVD (K=32, M=N ∈ {896, 2048, 4096}).