OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

1 OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi. GTC, 17 Mar 2015.

2 Outline: About nervana; Optimizing deep learning at assembler level; Limited precision for deep learning; neon benchmarks.

3 About nervana: a platform for machine intelligence. Enable deep learning at scale, optimized from algorithms to silicon. (Slide footer sections: About, Kernels, neon, Summary.)

4-5 Verticals: Medical, Finance, Pharma, Oil & Gas, Agriculture. Deep learning is supplanting traditional approaches everywhere. Small improvements have large impact. Customers require a clear roadmap that scales to growing need.

6-7 nervana platform for deep learning: train, explore, deploy. [Diagram labels: nervana framework, Data, nervana cloud, Solutions, GPUs, CPUs, nervana engine.]

8-10 maxas: a Maxwell Assembler (Scott Gray). Full control of: register allocation; instruction ordering; control codes (barriers, stall counts). Built-in scheduler (optional). Meta-programming. See GitHub repo for docs and examples.

11 ptxas struggles with Instruction Level Parallelism. [Figure: distribution of the number of instructions between LDS and dependent FFMA operands. x-axis: FFMA Line# - LDS Line#; y-axis: Count; series: ptx, cublas; annotations: Bad, Good. Courtesy Scott Gray.]

12 Easy register allocation through maxas. Register banking for outer products: $c = a b^{\top}$. [Diagram: register block c with operand vectors a and b.]
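
The outer-product view of GEMM on this slide can be sketched in plain Python. This is only an illustration, not maxas code: the function name `gemm_outer_product` and the list-of-lists layout are assumptions made here. On the GPU, each element of c would be held in a register while a and b are loaded from shared memory.

```python
def gemm_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer product) updates.

    A is M x K and B is K x N, both lists of lists. For each k, column k
    of A (a) and row k of B (b) contribute the outer product a b^T to C.
    Hypothetical helper, for illustration only.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):
        a = [A[i][k] for i in range(M)]  # column k of A
        b = B[k]                         # row k of B
        for i in range(M):
            for j in range(N):
                # One multiply-add per element of C (an FFMA on the GPU).
                C[i][j] += a[i] * b[j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = gemm_outer_product(A, B)
```

Each pass over k reuses every loaded value of a and b across a whole row or column of c, which is why the placement of c, a, and b in registers matters for this loop.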

13-19 Example GEMM code in maxas. [Code listing not preserved. Annotated features: dual-issue instructions, load from shared memory, set barrier, barrier sync, control codes, fused fp32 multiply-add.]

20 Convolution kernels for deep learning. Input * Filters = Output.
C: number of input channels
H x W: input spatial dims
R x S: filter spatial dims
K: number of filters
P x Q: output spatial dims
N: mini-batch dim (not shown)
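
As a small worked example of how these dimensions relate, the sketch below computes an output spatial dim from an input dim and a filter dim. The slide does not mention padding or stride, so the `pad` and `stride` parameters (and the function name `conv_output_dim`) are assumptions added for illustration, with defaults that correspond to an unpadded, stride-1 convolution.

```python
def conv_output_dim(in_dim, filter_dim, pad=0, stride=1):
    """Output size along one spatial axis of a convolution.

    in_dim is H or W, filter_dim is R or S. pad and stride are
    assumptions; the slide does not specify them.
    """
    return (in_dim - filter_dim + 2 * pad) // stride + 1


# Dimensions from the small example used later in the deck.
H = W = 3
R = S = 2
P = conv_output_dim(H, R)
Q = conv_output_dim(W, S)
```

With H = W = 3 and R = S = 2 this gives P = Q = 2, which agrees with the values shown on the matrix-lowering slides.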

21-24 Access patterns for matrix lowering. Convolution kernels: fprop, bprop, update.
bprop (backprop step 1). [Figure labels: δ1 with P = Q = 2, K = 2; filters with C = 3, K = 2, R = S = 2; δ0 with N = 3, C = 3, H = W = 3.]
update (backprop step 2, weight updates). [Figure labels: δ1 with P = Q = 2, K = 2; weight updates with C = 3, K = 2, R = S = 2; output of the previous layer with N = 3, C = 3, H = W = 3.]
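
The figures for these slides are not preserved, so the exact access patterns are not recoverable. As a stand-in, the sketch below shows the common im2col form of matrix lowering for fprop, in which the input is unrolled into a matrix so that the convolution becomes a single matrix product. It assumes stride 1, no padding, and a single image (the mini-batch dim N is omitted); `im2col` and `conv_fprop` are hypothetical names, and the kernels described in the talk may well avoid materializing the lowered matrix.

```python
def im2col(x, R, S):
    """Lower one image x[C][H][W] to a (C*R*S) x (P*Q) matrix.

    Assumes stride 1 and no padding. Row order is (c, r, s);
    column order is (p, q).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    P, Q = H - R + 1, W - S + 1
    cols = []
    for c in range(C):
        for r in range(R):
            for s in range(S):
                cols.append([x[c][p + r][q + s]
                             for p in range(P) for q in range(Q)])
    return cols, P, Q


def conv_fprop(x, filters):
    """Forward convolution as one matrix product.

    filters has shape [K][C][R][S]; the result has shape [K][P][Q].
    """
    R, S = len(filters[0][0]), len(filters[0][0][0])
    cols, P, Q = im2col(x, R, S)
    # Flatten each filter to a row of length C*R*S, in (c, r, s) order.
    rows = [[w for chan in f for row in chan for w in row] for f in filters]
    # (K x C*R*S) times (C*R*S x P*Q) gives K x P*Q.
    flat = [[sum(w * col[j] for w, col in zip(row, cols))
             for j in range(P * Q)] for row in rows]
    return [[fk[p * Q:(p + 1) * Q] for p in range(P)] for fk in flat]


# Slide example sizes: C = 3, H = W = 3, K = 2, R = S = 2, so P = Q = 2.
x = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
filters = [[[[float(k + 1)] * 2 for _ in range(2)] for _ in range(3)]
           for k in range(2)]
out = conv_fprop(x, filters)
```

With an all-ones input, each output element is the sum of one filter's C*R*S = 12 weights, which makes the result easy to check by hand.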

25-26 Deep learning with low precision works.
"Improving the speed of neural networks on CPUs." Vincent Vanhoucke (Google, Inc., Mountain View, CA), Andrew Senior (Google, Inc., New York, NY), Mark Z. Mao (Google, Inc., Mountain View, CA).
Abstract: Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

27 Deep learning with low precision works.
"Low Precision Arithmetic for Deep Learning." Matthieu Courbariaux and Jean-Pierre David (Department of Electrical Engineering, École Polytechnique de Montréal, Montréal, QC H3T 1J4, Canada), Yoshua Bengio (Department of Computer Science and Operations Research, Université de Montréal, Montréal, QC H3T 1J4, Canada).
Abstract: We simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with around 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.

28 Deep learning with low precision works.
"Deep Learning with Limited Numerical Precision." Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan (IBM T. J. Watson Research Center, Yorktown Heights, NY), Pritish Narayanan (IBM Almaden Research Center, San Jose, CA).
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
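
Stochastic rounding, as mentioned in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the paper or from neon: the function name `stochastic_round`, the `frac_bits` parameter, and the fixed-point grid with spacing 2**-frac_bits are all choices made here.

```python
import math
import random


def stochastic_round(x, frac_bits=8, rng=random):
    """Round x onto a fixed-point grid with spacing 2**-frac_bits.

    x is rounded up with probability equal to its fractional distance
    above the lower grid point, and down otherwise, so the expected
    value of the result equals x. Hypothetical helper.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


rng = random.Random(0)  # seeded for repeatability
x = 0.3
# With 2 fractional bits the grid spacing is 0.25, so 0.3 rounds to
# either 0.25 or 0.5.
samples = [stochastic_round(x, frac_bits=2, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would map 0.3 to 0.25 every time, so small updates below half a grid step would be lost; with stochastic rounding the mean of the samples stays close to 0.3.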

29-35 neon: nervana python deep learning library.
User-friendly, extensible, abstracts parallelism.
Support for many deep learning models.
Interface to nervana cloud.
Supports multiple backends: nervana engine, GPU cluster, CPU cluster (e.g. Cray XC30), Xeon Phi cluster (soon).
Multiple limited precision options.
Optimized for Maxwell at assembler level.

36-40 neon: easy model configuration. [Configuration listing not preserved. Annotated sections: Dataset, Weight initialization, Learning rule, Model layers and cost.]

41-47 neon experiments in fp16/32.
Use 16-bit floating point (fp16) as memory format.
Multiply-and-adds use fp32.
Kernel support for: GEMM; stochastic rounding; dropout / maxout; conv {f,b}prop and update; max pooling; statistics collection.
Python element-wise operations auto-compiled into kernels.
fp16 accumulations done carefully to minimize errors.
Working with collaborators (Baidu, Bengio lab) to improve.
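
The split between fp16 storage and fp32 arithmetic can be emulated with the Python standard library, which helps show why the width of the accumulator matters. This is a sketch under stated assumptions, not neon's kernel code: `to_fp16`, `to_fp32`, and `dot_fp16_storage` are hypothetical helpers, and the rounding is emulated by packing Python floats with the `struct` module's `'e'` (half precision) and `'f'` (single precision) formats.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]


def dot_fp16_storage(a, b, accumulate=to_fp32):
    """Dot product of vectors stored as fp16.

    The running sum is rounded by `accumulate` after every multiply-add,
    which emulates the precision of the accumulator.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc = accumulate(acc + to_fp16(x) * to_fp16(y))
    return to_fp16(acc)  # the result is written back to memory as fp16


ones = [1.0] * 4096
wide = dot_fp16_storage(ones, ones, accumulate=to_fp32)    # fp32 accumulator
narrow = dot_fp16_storage(ones, ones, accumulate=to_fp16)  # fp16 accumulator
```

With an fp32 accumulator the sum reaches 4096.0. With an fp16 accumulator it stalls at 2048.0, because 2049 is not representable in fp16 and the tie rounds back to 2048.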

48-50 fp16/32 accuracy. No accuracy loss going from fp32 to fp16. [Figure: error (%) distribution over 25 runs; y-axis: Count; series: fp32, fp16, "fp16 sto".]

51-54 Speed benchmarks (1): fp16 vs others. [Figure: time per layer (ms) for 5 convolutional layers, forward pass and backward pass; series: neon fp16, neon, Cuda-convnet2, cudanet, Torch7, cudnn*.] Lower times are better. Benchmarks on GTX980.
* Some layers (2nd, 3rd) do not fit on a 4GB card.
(1) Soumith Chintala, github.com/soumith/convnet-benchmarks

55-58 Benchmarks (1) show 2x performance. Raw numbers (averaged over 10 runs) for Alexnet, Overfeat, and VGG (N=64): Avg(10) fprop, Avg(10) bprop, and Avg(10) total, each in msecs and gflops [values not preserved]. Maximum practical peak is 4700 gflops. More than double speed (2) with half memory storage / bandwidth. [Figure labels: Time (s), Speed (TFLOPS), Alexnet, Cuda-Convnet, fp16.]
(1) Using conventions here: Soumith Chintala, github.com/soumith/convnet-benchmarks
(2) Numbers are relative to Titan Black (Kepler architecture).
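
To relate the msecs and gflops columns above, the sketch below converts a layer's running time into achieved gflops and into a fraction of the stated 4700 gflops practical peak. It assumes the usual convention of two floating-point operations per multiply-add for a convolution forward pass; whether the benchmark uses exactly this count is not stated in the slides, and the function names are hypothetical.

```python
PEAK_GFLOPS = 4700.0  # practical peak quoted on the slide


def conv_fprop_flops(N, C, K, R, S, P, Q):
    """Floating-point operations for one conv forward pass.

    Assumes 2 operations (multiply and add) per filter tap per output.
    """
    return 2 * N * K * P * Q * C * R * S


def gflops(flops, msecs):
    """Achieved throughput in gflops for a given running time in ms."""
    return flops / (msecs * 1e-3) / 1e9


def utilization(flops, msecs, peak=PEAK_GFLOPS):
    """Fraction of the practical peak that was achieved."""
    return gflops(flops, msecs) / peak


# Operation count for the small example from the matrix-lowering slides.
flops = conv_fprop_flops(N=3, C=3, K=2, R=2, S=2, P=2, Q=2)
```

For the small example this gives 576 operations; a measured time for a real layer would be substituted for `msecs` to obtain a gflops figure comparable to the table.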

59-65 Summary.
neon: User-friendly python library.
maxas: Powerful tool for optimizing deep learning.
Fast performance, full utilization of GPU.
Limited precision allows for larger models.
Toolbox for exploring numerical representations.

66 GTC 2015. Contact us at [address not preserved]. We are hiring: cloud engineers, GPU experts, machine learning engineers, software engineers. Sign up to try neon, our deep learning library. We can help solve your problem.
