OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS

1 OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS
Jeremy Appleyard, 7 April 2016
April 4-7, 2016, Silicon Valley

2 RECURRENT NEURAL NETWORKS
Output is fed into input.
Perform the same operation repeatedly on a combination of:
- Previous state
- New input

4 MANY RECURRENT STEPS
Parameters in each recurrent step are shared.

5 POTENTIALLY MANY LAYERS
Each layer has unique parameters.

6 CASE STUDY: LSTM
Long Short-Term Memory
- First published in 1997 by Sepp Hochreiter and Jürgen Schmidhuber
- Recurrent neural network with potential for long-term memory
- Designed to deal with the vanishing gradient problem by propagating a linear component

7 LSTM: Viewed as a black box
Inputs and outputs are batched vectors, i.e. a minibatch.
- Typical vector length is [value missing]
- Typical batch size is [value missing]
[Figure: LSTM cell with inputs h_{n-1}, c_{n-1} and i_n, and outputs h_n, c_n]

8 LSTM: Computational requirements
- Four matrix products with input h
- Four matrix products with input i
- Many point-wise operations on the results of these matrix products and c
[Figure: LSTM cell with inputs h_{n-1}, c_{n-1} and i_n, and outputs h_n, c_n]

9 STARTING STATE: Case Study
For this talk I will consider an LSTM with the following properties:
- 512 hidden units
- 100 recurrent iterations
- Minibatch 64
- Four layers

10 STARTING POINT: Performance
RNNs with a minibatch of 64 are expected to be bound by floating point performance (see: roofline model).
But a naïve implementation achieves about 6% of peak FLOPs on an M40!
Profiling shows that the implementation is not exposing enough parallelism: there are fewer blocks than streaming multiprocessors (SMs).
Optimization target: increase parallelism!

11 STARTING POINT: Performance
OPTIMIZATION          RUNTIME   SPEEDUP
Naïve                 777us     (1.0x)
Time per cell, 512 hidden units per layer, minibatch 64, M40
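
The 6% figure can be roughly cross-checked against the 777us baseline. The sketch below is only back-of-envelope arithmetic under assumptions that are not in the slides: each of the eight matrix products is taken to be a 512x512 weight matrix times a 512x64 batch, a multiply-add is counted as 2 FLOPs, point-wise work is ignored, and the M40 peak is assumed to be 3072 CUDA cores at a 948 MHz base clock.

```python
# Rough FLOP count for one LSTM cell (assumptions are labeled in comments).
HIDDEN = 512        # hidden units (from the case study)
BATCH = 64          # minibatch size (from the case study)
NUM_GEMMS = 8       # four products with h plus four with i

# Assumption: each product is (HIDDEN x HIDDEN) @ (HIDDEN x BATCH);
# one multiply-add is counted as 2 FLOPs; point-wise ops are ignored.
flops_per_cell = 2 * HIDDEN * HIDDEN * BATCH * NUM_GEMMS

naive_time_s = 777e-6                      # naive time per cell (from the slide)
achieved = flops_per_cell / naive_time_s   # FLOP/s

# Assumption: M40 single-precision peak at base clock,
# 3072 CUDA cores * 2 FLOPs per cycle * 948 MHz.
assumed_peak = 3072 * 2 * 948e6

fraction_of_peak = achieved / assumed_peak
print(f"{flops_per_cell} FLOPs per cell")
print(f"{achieved / 1e9:.0f} GFLOP/s achieved")
print(f"{100 * fraction_of_peak:.1f}% of assumed peak")
```

Under these assumptions the fraction comes out just below 6%, in line with the figure quoted for the naïve implementation.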

12 SINGLE CELL OPTIMIZATION

13 GEMM PERFORMANCE: Optimization #1
Separate products:
  [A_1][h] = [x_1]
  [A_2][h] = [x_2]
  [A_3][h] = [x_3]
  [A_4][h] = [x_4]
Combined product:
  [A][h] = [x]
As our matrix operations share inputs, we can combine them.
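
The equivalence behind this optimization can be sketched in a few lines of plain Python. This is an illustration only, not the GPU implementation: the tiny sizes, the random weights, and the helper names (matmul, rand_matrix) are all hypothetical. It stacks the rows of A_1..A_4 into one matrix and checks that a single product gives the same results as four separate ones.

```python
# Sketch: four products that share the input h can be done as one larger product.
import random

def matmul(a, b):
    """Plain triple-loop matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
HIDDEN, BATCH = 4, 3   # tiny sizes for illustration (the talk uses 512 and 64)

# Hypothetical gate weight matrices A_1..A_4 and a batched input h.
weights = [rand_matrix(HIDDEN, HIDDEN) for _ in range(4)]
h = rand_matrix(HIDDEN, BATCH)

# Four separate products: A_k h = x_k.
separate = [matmul(a_k, h) for a_k in weights]

# One combined product: stack the rows of A_1..A_4 into a (4*HIDDEN x HIDDEN) matrix.
stacked = [row for a_k in weights for row in a_k]
combined = matmul(stacked, h)

# Rows k*HIDDEN..(k+1)*HIDDEN of the combined result are x_k.
split = [combined[k * HIDDEN:(k + 1) * HIDDEN] for k in range(4)]
same = split == separate
print(same)
```

The arithmetic is unchanged; the benefit on a GPU comes from issuing one larger GEMM, which exposes more parallelism than four small ones.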

14 GEMM PERFORMANCE: Optimization #2
We are still doing two independent matrix products. These can be performed in parallel with each other using streams.
  [A_1; A_2; A_3; A_4][h] = x
  [B_1; B_2; B_3; B_4][i] = y

15 LSTM: GEMM Optimization
OPTIMIZATION          RUNTIME   SPEEDUP
Naïve                 777us     (1.0x)
Combined GEMMs        400us     1.9x
GEMM streaming        280us     2.8x
Time per cell, 512 hidden units per layer, minibatch 64, M40

16 POINTWISE OPERATIONS: Optimization #3
It is inefficient to launch a new GPU kernel for each pointwise operation:
- Launch overheads can be costly for small operations
- Streaming data back and forth between DRAM and SM is wasteful
Solution: fuse all pointwise operations into one kernel.
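
The idea of fusion can be illustrated on the CPU with plain Python. The sketch below assumes the commonly used LSTM update (sigmoid input, forget and output gates, tanh candidate), which the slides do not spell out; pre_i, pre_f, pre_g and pre_o are hypothetical names for the per-gate sums of the matrix-product results. The unfused version makes one pass over the data per operation and keeps a temporary each time, which mirrors one kernel launch per operation; the fused version does everything for an element in a single pass.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_pointwise_unfused(pre_i, pre_f, pre_g, pre_o, c_prev):
    """One pass over the data per operation, with a temporary list each time."""
    i = [sigmoid(v) for v in pre_i]
    f = [sigmoid(v) for v in pre_f]
    g = [math.tanh(v) for v in pre_g]
    o = [sigmoid(v) for v in pre_o]
    fc = [a * b for a, b in zip(f, c_prev)]
    ig = [a * b for a, b in zip(i, g)]
    c = [a + b for a, b in zip(fc, ig)]
    tc = [math.tanh(v) for v in c]
    h = [a * b for a, b in zip(o, tc)]
    return h, c

def lstm_pointwise_fused(pre_i, pre_f, pre_g, pre_o, c_prev):
    """Single pass: each element is read once and c, h are written once."""
    h, c = [], []
    for pi, pf, pg, po, cp in zip(pre_i, pre_f, pre_g, pre_o, c_prev):
        c_new = sigmoid(pf) * cp + sigmoid(pi) * math.tanh(pg)
        c.append(c_new)
        h.append(sigmoid(po) * math.tanh(c_new))
    return h, c

# Small made-up pre-activations for the four gates and a previous cell state.
pre = [[0.1 * (k + j) - 0.5 for j in range(6)] for k in range(4)]
c_prev = [0.2 * j - 0.4 for j in range(6)]

h_a, c_a = lstm_pointwise_unfused(*pre, c_prev)
h_b, c_b = lstm_pointwise_fused(*pre, c_prev)
max_diff = max(abs(x - y) for x, y in zip(h_a + c_a, h_b + c_b))
print(max_diff)
```

Both versions compute the same values; the fused form removes the nine intermediate lists, which on a GPU corresponds to avoiding repeated launches and round trips through DRAM.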

17 LSTM: Single Cell Optimization
OPTIMIZATION          RUNTIME   SPEEDUP
Naïve                 777us     (1.0x)
Combined GEMMs        400us     1.9x
GEMM streaming        280us     2.8x
Fused point-wise ops  146us     5.3x
Time per cell, 512 hidden units per layer, minibatch 64, M40

18 OPTIMIZATION WITHIN A LAYER

19 PRE-TRANSPOSING: Optimization #4
All weight matrices in a layer are the same, so a pre-processing step before we start can make each GEMM cheaper.
- One of forward propagation or backward propagation requires the weight matrix to be transposed
- GEMM operations have different efficiency depending on whether matrices are transposed or not
- Spending time transposing up-front can improve performance for all GEMMs

20 COMBINING INPUT OPERATIONS: Optimization #5
Matrix operations from the previous layer can be combined, but there's a trade-off:
- Good: larger matrix operations are more parallel
- Bad: adds more complex dependencies. E.g. if all the input operations are fused, you can't progress on the recurrent operations until that is done.
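
One reading of this optimization is that the input-side products for several time steps do not depend on the recurrence, so their inputs can be concatenated along the batch dimension and multiplied in one larger GEMM. The plain-Python sketch below illustrates that reading only; the sizes, the made-up values, and the names B, inputs and matmul are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

HIDDEN, BATCH, STEPS = 3, 2, 4   # tiny sizes for illustration

# Hypothetical input weight matrix B and one (HIDDEN x BATCH) input per time step.
B = [[0.1 * (r + 1) - 0.05 * c for c in range(HIDDEN)] for r in range(HIDDEN)]
inputs = [[[0.01 * (t + 1) * (r + 1) + 0.1 * c for c in range(BATCH)]
           for r in range(HIDDEN)] for t in range(STEPS)]

# Unfused: one small product per time step.
per_step = [matmul(B, i_t) for i_t in inputs]

# Fused: concatenate the inputs column-wise into HIDDEN x (STEPS * BATCH)
# and do a single, larger product.
wide = [[v for i_t in inputs for v in i_t[r]] for r in range(HIDDEN)]
fused = matmul(B, wide)

# Slice the wide result back into per-step blocks and compare.
sliced = [[row[t * BATCH:(t + 1) * BATCH] for row in fused] for t in range(STEPS)]
same = sliced == per_step
print(same)
```

The trade-off described above shows up directly: none of the per-step results is available until the whole fused product has finished, so fusing fewer steps at a time keeps the recurrent work from stalling.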

21 LSTM: Optimization Across a Layer
OPTIMIZATION          RUNTIME   SPEEDUP   SCOPE
Naïve                 777us     (1.0x)
Combined GEMMs        400us     1.9x      Single Cell
GEMM streaming        280us     2.8x      Single Cell
Fused point-wise ops  146us     5.3x      Single Cell
Matrix transposition  125us     6.2x      Single Layer
Fusing inputs         119us     6.5x      Single Layer
Time per cell, 512 hidden units per layer, minibatch 64, M40

22 OPTIMIZATION ACROSS LAYERS

23-31 RNN Dependency Graph
[Figure sequence: RNN dependency graph, built up over nine slides]

32 PERFORMANCE: Using Streams + Layers
Use cudaEventRecord and cudaStreamWaitEvent to express dependencies.
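
The amount of parallelism available across layers can be sketched by grouping cells into waves of mutually independent work. The sketch assumes the usual dependency rule for a stacked RNN, which the slides show only as a figure: cell (layer, step) needs the cell below it at the same step and the cell before it in the same layer. It does not call the CUDA API; wavefronts is a hypothetical helper that only models which cells could be in flight together.

```python
from collections import defaultdict

def wavefronts(num_layers, num_steps):
    """Group cells (layer, step) by the earliest wave in which they can run.

    Assumed dependency rule: cell (l, t) needs (l - 1, t) and (l, t - 1).
    """
    wave_of = {}
    for layer in range(num_layers):
        for step in range(num_steps):
            deps = []
            if layer > 0:
                deps.append(wave_of[(layer - 1, step)])
            if step > 0:
                deps.append(wave_of[(layer, step - 1)])
            # A cell runs one wave after its latest dependency.
            wave_of[(layer, step)] = 1 + max(deps) if deps else 0
    waves = defaultdict(list)
    for cell, wave in wave_of.items():
        waves[wave].append(cell)
    return [waves[w] for w in range(len(waves))]

# Case-study shape: four layers, 100 recurrent iterations.
schedule = wavefronts(num_layers=4, num_steps=100)
num_waves = len(schedule)
max_parallel = max(len(w) for w in schedule)
print(num_waves, max_parallel)
```

For the case-study shape this gives 103 waves with up to 4 independent cells each, compared with 400 cells run strictly one after another. On the GPU, one plausible mapping is a stream per layer, with an event recorded after each cell and waited on by the stream of the layer above.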

33 LSTM: From naïve to optimised
OPTIMIZATION          RUNTIME   SPEEDUP   SCOPE
Naïve                 777us     (1.0x)
Combined GEMMs        400us     1.9x      Single Cell
GEMM streaming        280us     2.8x      Single Cell
Fused point-wise ops  146us     5.3x      Single Cell
Matrix transposition  125us     6.2x      Single Layer
Fusing inputs         119us     6.5x      Single Layer
4 layers              70us      11.1x     Entire RNN
Time per cell, 512 hidden units per layer, minibatch 64, M40

34 RNN: Backward Pass
Data gradients:
- Similar problem, but backwards
- Same optimisations apply
Weight gradients:
- No dependencies from recurrence
- Large GEMMs are used, which is very efficient

35 RECURRENT NEURAL NETWORKS: cuDNN v5
New feature in cuDNN v5: optimised path for basic RNNs, GRU and LSTM.
cuDNN features for all network types:
- Uni/Bi-directional
- Dropout between layers
- Variable length sequences within a minibatch

36 CUDNN V5 LSTM PERFORMANCE
[Figure: two panels, Minibatch 32 and Minibatch 64. Y-axis: Performance (GFLOPS). X-axis: Hidden Layer Size. Peak is marked for reference.]
cuDNN v5 RC, M40, base clocks

37 THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
