MIXED PRECISION TRAINING OF NEURAL NETWORKS. Carl Case, Senior Architect, NVIDIA

1 MIXED PRECISION TRAINING OF NEURAL NETWORKS Carl Case, Senior Architect, NVIDIA

2 OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3. Mixed precision training in practice 2

3 OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3. Mixed precision training in practice 3

4 BACKGROUND: VOLTA TENSOR CORES Hardware support for accelerated 16-bit FP math. Peak throughput of 125 TFLOPS (8x FP32). Used by the cuDNN and cuBLAS libraries to accelerate matrix multiply and convolution. Internal accumulation occurs in FP32 to maintain accuracy*. Exposed in CUDA as WMMA. [Diagram: FP16 storage/input; full-precision products summed with an FP32 accumulator, repeated across products.] *FP16 accumulator is also available for inference 4

5 MIXED PRECISION TRAINING In a nutshell. Goal: keep stored values in FP16 (weights and activations, along with their gradients); use Tensor Cores to accelerate math and maintain accuracy. Benefits: up to 8x math speedup (depends on arithmetic intensity); half the memory traffic; half the memory storage; can enable larger model or batch sizes 5

6 MIXED PRECISION FOR RESNET-50 Four lines of code => 2.3x training speedup in PyTorch. Mixed precision training uses half-precision floating point (FP16) to accelerate training. You can start using mixed precision today with four lines of code. This example uses AMP: Automatic Mixed Precision, a PyTorch library. No hyperparameters changed.

+ amp_handle = amp.init()

  # ... Define model and optimizer

  for x, y in dataset:
      prediction = model(x)
      loss = criterion(prediction, y)
-     loss.backward()
+     with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
+         scaled_loss.backward()
      optimizer.step()
6

7 MIXED PRECISION FOR RESNET-50 Four lines of code => 2.3x training speedup. Real-world single-GPU runs using the default PyTorch ImageNet example, NVIDIA PyTorch py3 container, AMP for mixed precision, minibatch=256. Single-GPU RN-50 speedup for FP32 -> M.P. (with 2x batch size): MXNet: 2.9x; TensorFlow: 2.2x; TensorFlow + XLA: ~3x; PyTorch: 2.3x. Work ongoing to bring to 3x everywhere 7

8 MIXED PRECISION FOR RESNET-50 Speedup at scale. Real-world 8-GPU runs using MXNet on DGX-1, NVIDIA mxnet py3 container, minibatch=256 per GPU. Note: less inter-iteration loss variance since the total batch size is 8x larger. Total time to run a full 90 epochs of training in mixed precision is well under four hours 8

9 MIXED PRECISION IS GENERAL PURPOSE Models trained to match FP32 results (same hyperparameters). Image Classification: AlexNet, VGG-D, GoogLeNet, Inception v{2,3}, ResNet-50, ResNeXt-50. Detection: Faster R-CNN, Multibox SSD, NVIDIA-proprietary automotive. Translation, other NLP: GNMT (RNN-based), FairSeq (convolution-based), sentiment classification. Speech: Deep Speech 2 (recognition); Tacotron 2 + WaveNet (synthesis). Image Synthesis: Progressive GANs, partial image inpainting 9

10 MIXED PRECISION SPEEDUPS Not limited to image classification. Model: FP32 -> M.P. speedup (comments):
GNMT (Translation): 2.3x (iso-batch size)
FairSeq Transformer (Translation): 2.9x (iso-batch size); 4.9x (2x lr + larger batch)
ConvSeq2Seq (Translation): 2.5x (2x batch size)
Deep Speech 2 (Speech recognition): 4.5x (larger batch)
wav2letter (Speech recognition): 3.0x (2x batch size)
Nvidia Sentiment (Language modeling): 4.0x (larger batch)
*In all cases trained to the same accuracy as the FP32 model **No hyperparameter changes, except as noted 10

11 MIXED PRECISION IN DL RESEARCH Large Scale Language Modeling: Converging on 40GB of Text in Four Hours [Nvidia] We train our recurrent models with mixed precision FP16/FP32 arithmetic, which speeds up training on a single V100 by 4.2X over training in FP32. Scaling Neural Machine Translation [Facebook] This paper shows that reduced precision and large batch training can speedup training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. 11

12 OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3. Mixed precision training in practice 12

13 MIXED PRECISION METHODOLOGY For training. Goal: training with FP16 is general purpose, not only for a limited class of applications. In order to train with no architecture or hyperparameter changes, we need to give consideration to the reduced precision inherent in using only 16 bits. Note: this is true for any reduced precision format, though the specifics may be different. Three parts: 1. Model conversion, with careful handling of non-Tensor Core ops 2. Master weight copy 3. Loss scaling 13

14 1. MODEL CONVERSION For Tensor Core ops For most of the model, we make simple type updates to each layer: Use FP16 values for the weights (layer parameters) Ensure the inputs are FP16, so the layer runs on Tensor Cores 14
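
As a concrete illustration of these type updates, here is a minimal sketch assuming PyTorch (the layer and sizes are illustrative, not from the slides):

    import torch

    # FP16 weights: convert the layer parameters to half precision
    layer = torch.nn.Linear(1024, 1024).cuda().half()
    # FP16 inputs: the matrix multiply can then run on Tensor Cores
    x = torch.randn(64, 1024).cuda().half()
    y = layer(x)   # output is FP16 as well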

15 1. MODEL CONVERSION Pointwise and reduction ops Common operations that are not matrix multiply or convolution: Activation functions: ReLU, sigmoid, tanh, softplus Normalization functions: batchnorm, layernorm, sum, mean, softmax Loss functions: cross entropy, L2 loss, weight decay Miscellaneous: exp, log, pointwise-{add, subtract, multiply, divide} We want to maintain the accuracy of these operations, even though they will not run on Tensor Cores 15

16 POINTWISE AND REDUCTION OPS Principles. Tensor Cores increase precision in two ways: 1. Each individual multiply is performed in high precision 2. The sum of the products is accumulated in high precision. For non-Tensor Core operations, we want to adhere to those same principles: 1. Keep intermediate or temporary values in high precision 2. Perform sums (reductions) in high precision 16

17 POINTWISE AND REDUCTION OPS Intermediate and temporary values in high precision. For pointwise operations, it is generally fine to operate directly on FP16 values. FP32 math and storage are recommended for ops where |f(x)| >> |x| (or the same for grads), e.g. exp, log, pow. Pointwise op fusion is a good next step for performance and precision 17
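
A minimal sketch of that recommendation, assuming PyTorch (the shapes are illustrative): compute the value-expanding op in FP32 and cast the result back to FP16.

    x = torch.randn(1024).cuda().half()
    # exp can overflow/underflow FP16 quickly; do the math in FP32,
    # then store the final result in FP16
    y = x.float().exp().half()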

18 POINTWISE AND REDUCTION OPS Perform sums / reductions in high precision. It is common to normalize a large set of FP16 values in, e.g., a softmax layer. Two choices: (1) sum all the values directly into an FP16 accumulator, then perform the division in FP16; (2) perform the math in high precision (FP32 accumulator and division), then write the final result in FP16. The first introduces the possibility of compounding precision error. The second does what Tensor Cores do: it limits reduced precision to the final output. This is the desired behavior 18
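
A hedged sketch of the second choice, assuming PyTorch: the accumulation and division happen in FP32, and only the final result is written in FP16.

    logits = torch.randn(64, 10000).cuda().half()
    # reduce (sum) and divide in FP32, write the final probabilities in FP16
    probs = torch.softmax(logits.float(), dim=-1).half()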

19 POINTWISE AND REDUCTION OPS Practical recommendations. Nonlinearities: fine for FP16, except watch out for exp, log, pow. Normalization: input/output in FP16; intermediate results stored in FP32; ideally fused into a single op (example: cuDNN BatchNorm). Loss functions: input/output in FP32. Also: attention modules (softmax) 19

20 2. MASTER WEIGHTS At each iteration of training, perform a weight update of the form w_{t+1} = w_t - α ∇_t, where the w's are weights, the ∇'s are gradients, and α is the learning rate. As a rule, gradients are smaller than weights, and the learning rate is less than one. Consequence: the weight update can be a no-op, since you can't get to the next representable value. Conservative solution: keep a high-precision copy of the weights so small updates accumulate across iterations. [Figure: no-op weight update]
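
The no-op update is easy to reproduce; a small demonstration (not from the slides, assuming PyTorch) shows the same small step vanishing in FP16 but accumulating in FP32:

    w16 = torch.tensor(1.0, dtype=torch.float16)
    w32 = torch.tensor(1.0, dtype=torch.float32)
    update = 1e-4                      # lr * gradient, smaller than the FP16 spacing near 1.0
    print(bool(w16 + update == w16))   # True: the FP16 weight is unchanged (no-op)
    print(bool(w32 + update == w32))   # False: the FP32 master weight accumulates the update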

21 3. LOSS SCALING Range representable in FP16: ~40 powers of 2. Gradients are small: some are lost to zero, while ~15 powers of 2 remain unused. Loss scaling: multiply the loss by a constant S; all gradients are scaled up by S (chain rule). Unscale the weight gradient (in FP32) before the weight update. [Figure: FP16 exponent histograms for weights, activations, weight grads, and activation grads] 21

22 3. LOSS SCALING Automatically choosing a scaling factor S Intuition: 0. Start with a very large scaling factor. 1. If an inf or NaN is present in the gradient, decrease the scale. And skip the update, including optimizer state. 2. If no inf or NaN has occurred for some time, increase the scale. 22

23 3. LOSS SCALING Automatically choosing a scaling factor S Intuition: 0. Start with a very large scaling factor. 1. If an inf or NaN is present in the gradient, decrease the scale. And skip the update, including optimizer state. 2. If no inf or NaN has occurred for some time, increase the scale. Described in detail in the references at the end; implementations are available in major frameworks (see references at the end) 23
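
A sketch of that intuition in Python (assuming PyTorch for the gradient check; the class name, constants, and window are illustrative, not from a specific library):

    class DynamicLossScaler:
        def __init__(self, init_scale=2.**24, factor=2., window=2000):
            self.scale = init_scale      # 0. start with a very large scaling factor
            self.factor = factor
            self.window = window
            self.good_steps = 0

        def found_overflow(self, params):
            # any inf/NaN in the (scaled) gradients?
            return any(p.grad is not None and not torch.isfinite(p.grad).all()
                       for p in params)

        def update(self, overflow):
            if overflow:                 # 1. inf/NaN present: back off
                self.scale /= self.factor
                self.good_steps = 0      # caller also skips optimizer.step()
            else:                        # 2. no overflow for `window` steps: grow
                self.good_steps += 1
                if self.good_steps % self.window == 0:
                    self.scale *= self.factor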

24 OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3. Mixed precision training in practice 24

25 MAKING USE OF TENSOR CORES Performance guidelines. Matrix multiply: all three dimensions (M, N, K) should be multiples of 8; larger K (the shared dimension) is more efficient (it amortizes write overhead); note this makes wider recurrent cells more practical (K is the input layer width). Convolution: the number of channels for input and output should be a multiple of 8 (they can be different). If you are designing models: choose layer widths, batch sizes, and channel counts that are multiples of 8. Am I using Tensor Cores? cuBLAS and cuDNN are optimized for Tensor Cores, and coverage is always increasing. Run with nvprof and look for s[some digits] in the kernel name, e.g. volta_fp16_s884gemm_fp16_128x128_ldg8_f2f_nn 25
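
For the multiples-of-8 guideline, a small helper like the following (illustrative, not from the slides) can be used when choosing layer widths, vocabulary sizes, or batch sizes:

    def pad_to_multiple(n, multiple=8):
        # round n up to the next multiple (of 8 by default) for Tensor Core efficiency
        return ((n + multiple - 1) // multiple) * multiple

    vocab_size = pad_to_multiple(33278)   # -> 33280
    hidden_size = pad_to_multiple(1000)   # already a multiple of 8 -> 1000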

26 ENABLING MIXED PRECISION In real-world training pipelines Three ways to switch to mixed precision training: 1. Add all the necessary code yourself 2. Use libraries that enable safe updates; manage model conversion by hand 3. Use a fully-automated tool to perform necessary transformations under-the-hood Recommendation: (3) to get familiar with mixed precision training; (2) for more control (and occasionally better performance); (1) if using a custom setup for which libraries don't exist 26

27 DEBUGGING MIXED PRECISION Notes and what to watch for. The unreasonable effectiveness of gradient descent: bugs in code for mixed precision steps often manifest as slightly worse training accuracy. Be sure to follow good software engineering practices, especially testing. Common mistakes: gradients not unscaled correctly before the weight update (AdaGrad / Adam will try to handle this!); gradient clipping or regularization improperly using scaled gradients; incorrectly synchronizing master weight updates across multiple GPUs; not running the loss function in FP32 27
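
For the gradient-clipping pitfall in particular, a hedged sketch (using the master_weights, loss_scale, and optimizer names from the DIY example later in this deck): unscale first, then clip, then step.

    # gradients currently hold loss_scale * true_gradient
    for master_w in master_weights:
        if master_w.grad is not None:
            master_w.grad.data.mul_(1. / loss_scale)               # unscale first
    torch.nn.utils.clip_grad_norm_(master_weights, max_norm=1.0)   # then clip the true gradients
    optimizer.step()                                               # then update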

28 RUNNING EXAMPLE FOR ENABLING MIXED PRECISION Toy Model in PyTorch

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
28

29 DIY: MODEL CONVERSION

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda().half()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
29

30 DIY: MASTER WEIGHTS

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda().half()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda().half()
master_weights = [p.clone().detach().float() for p in model.parameters()]
for p in master_weights:
    p.requires_grad = True
optimizer = torch.optim.SGD(master_weights, lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y)

    optimizer.zero_grad()
    loss.backward()
    for mod_w, master_w in zip(model.parameters(), master_weights):
        # copy the FP16 model grads into the FP32 master grads
        # (allocate the FP32 grad tensor on the first iteration)
        if master_w.grad is None:
            master_w.grad = mod_w.grad.detach().float()
        else:
            master_w.grad.data.copy_(mod_w.grad.data)
    optimizer.step()
    for mod_w, master_w in zip(model.parameters(), master_weights):
        mod_w.data.copy_(master_w.data)
30

31 DIY: LOSS SCALING (FIXED VALUE)

N, D_in, D_out = 64, 1024, 512
loss_scale = 256.

x = torch.randn(N, D_in).cuda().half()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda().half()
master_weights = [p.clone().detach().float() for p in model.parameters()]
for p in master_weights:
    p.requires_grad = True
optimizer = torch.optim.SGD(master_weights, lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y) * loss_scale

    optimizer.zero_grad()
    loss.backward()
    for mod_w, master_w in zip(model.parameters(), master_weights):
        if master_w.grad is None:
            master_w.grad = mod_w.grad.detach().float()
        else:
            master_w.grad.data.copy_(mod_w.grad.data)
        master_w.grad.data.mul_(1. / loss_scale)   # unscale in FP32 before the update
    optimizer.step()
    for mod_w, master_w in zip(model.parameters(), master_weights):
        mod_w.data.copy_(master_w.data)
31
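
A possible extension of this fixed-scale example (a sketch, not from the slides) checks the FP16 gradients for inf/NaN and skips the update when they overflow; this is the building block behind the automatic scaling described on the loss scaling slides.

    # inside the training loop, after loss.backward():
    overflow = any(not torch.isfinite(mod_w.grad).all()
                   for mod_w in model.parameters())
    if not overflow:
        # copy + unscale into the FP32 master grads and step, as above
        for mod_w, master_w in zip(model.parameters(), master_weights):
            master_w.grad.data.copy_(mod_w.grad.data)
            master_w.grad.data.mul_(1. / loss_scale)
        optimizer.step()
        for mod_w, master_w in zip(model.parameters(), master_weights):
            mod_w.data.copy_(master_w.data)
    # else: skip the step, leaving weights and optimizer state untouched,
    # and (with dynamic scaling) reduce loss_scale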

32 LIBRARIES FOR MIXED PRECISION

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
32

33 LIBRARIES FOR MIXED PRECISION

from apex.fp16_utils import FP16_Optimizer

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda().half()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda().half()
optimizer = FP16_Optimizer(torch.optim.SGD(model.parameters(), lr=1e-3),
                           dynamic_loss_scale=True)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred.float(), y)

    optimizer.zero_grad()
    optimizer.backward(loss)   # replaces loss.backward()
    optimizer.step()
33

34 AUTOMATIC MIXED PRECISION

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
34

35 AUTOMATIC MIXED PRECISION

from apex import amp
amp_handle = amp.init()

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in).cuda()
y = torch.randn(N, D_out).cuda()

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)

    optimizer.zero_grad()
    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()   # replaces loss.backward()
    optimizer.step()
35

36 RESOURCES (1) Examples. New mixed precision model examples on GitHub. DGX/NGC deep learning framework containers: latest versions of the software stack and Tensor Core optimized examples (DGX/NGC Registry) 36

37 RESOURCES (2) Reading: Mixed Precision Training (ICLR 2018); Mixed Precision Guide; OpenSeq2Seq (describes mixed precision for sequence models). GTC 2018 Sessions: Training Neural Networks with Mixed Precision: Theory and Practice; Training Neural Networks with Mixed Precision: Real Examples 37

38 RESOURCES (3) Tools. TensorFlow: OpenSeq2Seq, a full implementation of mixed precision for sequence-to-sequence models, plus building blocks of mixed precision for TensorFlow. Apex for PyTorch: tools for master weights and loss scaling. amp for PyTorch: fully automatic mixed precision training 38

39 CONCLUSION Mixed precision training is a general-purpose technique with tremendous benefits: Math and memory speedups Memory savings, enabling larger models (or minibatches) Accuracy matches FP32 training across a wide range of models (all we have tried) Significant speedups are common and getting more common with each new library release Enabling mixed precision depends on a specific methodology: Model conversion with special care for pointwise and reduction ops Safe updates with master weights and loss scaling There are many tools to make it easier to use mixed precision all the way up to fully automatic! 39

40

41 MODEL ACCURACY REFERENCES WITH MIXED PRECISION TRAINING 41

42 ILSVRC12 Classification Networks, Top-1 Accuracy (FP32 Baseline / Mixed Precision):
AlexNet 56.8% / 56.9%
VGG-D 65.4% / 65.4%
GoogLeNet 69.3% / 69.3%
Inception v2 70.0% / 70.0%
Inception v3 73.9% / 74.1%
ResNet-50 [?] / 76.0%
ResNeXt-50 [?] / 77.5%
A number of these train fine in mixed precision even without loss-scaling. (C) NVIDIA 42

43 Detection Networks, mAP (FP32 Baseline / Mixed Precision):
Faster R-CNN, VOC 07 data: 69.1% / 69.7%
Multibox SSD, VOC data: 76.9% / 77.1%
NVIDIA's proprietary automotive networks train with mixed precision matching FP32 baseline accuracy. (C) NVIDIA 43

44 Language Translation. GNMT: German -> English (train on WMT, test on newstest2015); 8-layer encoder, 8-layer decoder, 1024 LSTM cells, attention. FP32 and mixed precision: ~29 BLEU using SGD; both equally lower with Adam, matching the paper. FairSeq: convolutional net for translation, English -> French. FP32 and mixed precision: ~40.5 BLEU after 12 epochs (C) NVIDIA 44

45 Speech. Courtesy of Baidu: 2 2D-conv layers, 3 GRU layers, 1D conv; Baidu internal datasets. [Table: Character Error Rate (lower is better), FP32 baseline vs. mixed precision, for English and Mandarin] (C) NVIDIA 45

46 Progressive Growing of GANs Generates 1024x1024 face images. No perceptible difference between FP32 and mixed-precision training. Loss-scaling: separate scaling factors for generator and discriminator (you are training 2 networks). Automatic loss scaling greatly simplified training: gradient stats shift drastically when image resolution is increased. (C) NVIDIA 46

47 Sentiment Analysis Multiplicative LSTM, based on Amazon Reviews. [Figure: training BPC curves for FP32 and mixed precision] [Table: train BPC, val BPC, SST accuracy, IMDB accuracy for FP32 and mixed precision] (C) NVIDIA 47

48 Image Inpainting Fill in arbitrary holes. Network architecture: U-Net with partial convolutions; VGG16-based perceptual loss + style loss. Speedup: 3x, at 2x bigger batch size (we can increase the batch size only in mixed precision). [Images: input and inpainted result] (C) NVIDIA 48

49 Image Inpainting: result. [Figures: testing input, training loss curve, mixed precision result, FP32 result] (C) NVIDIA 49

50 Text to speech synthesis Using Tacotron 2 (Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel-Spectrogram Predictions")

51 Text to speech synthesis: results. [Figures: predicted mel-spectrograms and predicted alignments, mixed precision (pink) vs. FP32 (green)]

52 WaveNet 12 layers of dilated convolutions; dilations reset every 6 layers; 128 channels for dilated convs (64 per nonlinearity); 64 channels for residual convs; 256 channels for skip convs. (Van den Oord et al., "WaveNet: A Generative Model for Raw Audio")

53 WaveNet: results. [Figure: training results in mixed precision / FP16 (pink) and FP32 (green)]
