Natural Language Processing with Deep Learning. CS224N/Ling284. Lecture 5: Backpropagation Kevin Clark


1 Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 5: Backpropagation Kevin Clark

2 Announcements Assignment 1 due Thursday, 11:59pm. You can use up to 3 late days (making it due Sunday at midnight). Default final project will be released February 1st, to help you choose which project option you want to do. Final project proposal due February 8th. See website for details and inspiration.

3 Overview Today: From one-layer to multi-layer neural networks! Fully vectorized gradient computation. The backpropagation algorithm. (Time permitting) Class project tips.

4 Remember: One-layer Neural Net x = [ x_museums x_in x_Paris x_are x_amazing ] (a concatenated window of word vectors)

5 Two-layer Neural Net x = [ x_museums x_in x_Paris x_are x_amazing ]

6 Repeat as Needed! x = [ x_museums x_in x_Paris x_are x_amazing ]

7 Why Have Layers? Hierarchical representations -> neural net can represent complicated features. Better results! [Figure: machine translation score vs. # layers, from the Transformer network (will cover in a later lecture)]

8 Remember: Gradient Descent Update equation: θ_new = θ_old − α ∇_θ J(θ), where α = step size or learning rate.

9 Remember: Gradient Descent Update equation: θ_new = θ_old − α ∇_θ J(θ), where α = step size or learning rate. This Lecture: How do we compute ∇_θ J(θ)? By hand, and algorithmically (the backpropagation algorithm).
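To make the update rule concrete, here is a minimal numpy sketch; the quadratic objective J(θ) = ||θ||^2 is a hypothetical stand-in for a real loss:

```python
import numpy as np

# A minimal sketch of the update equation above. The objective
# J(theta) = ||theta||^2 is a hypothetical stand-in for a real loss.
def grad_J(theta):
    return 2 * theta            # gradient of ||theta||^2

theta = np.array([3.0, -1.5])   # parameters
alpha = 0.1                     # alpha = step size or learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)

print(theta)                    # converges toward the minimum at 0
```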

10 Why learn all these details about gradients? Modern deep learning frameworks compute gradients for you. But why take a class on compilers or systems when they are implemented for you? Understanding what is going on under the hood is useful! Backpropagation doesn't always work perfectly; understanding why is crucial for debugging and improving models. Example in a future lecture: exploding and vanishing gradients.

11 Computing Gradients by Hand Review of multivariable derivatives. Fully vectorized gradients: much faster and more useful than non-vectorized gradients, but doing a non-vectorized gradient can be good practice; see slides in last week's lecture for an example. Lecture notes cover this material in more detail.

12 Gradients Given a function f(x_1, ..., x_n) with 1 output and n inputs, its gradient is a vector of partial derivatives.
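The defining formula (the slide's equation was an image; this reconstruction follows the lecture notes' notation):

```latex
% Gradient of f(x_1, ..., x_n): one partial derivative per input
\nabla f = \left[ \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \ldots,\ \frac{\partial f}{\partial x_n} \right]
```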

13 Jacobian Matrix: Generalization of the Gradient Given a function with m outputs and n inputs, its Jacobian is an m x n matrix of partial derivatives.
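Written out (again reconstructing the image-only formula in the lecture notes' notation):

```latex
% Jacobian of f: R^n -> R^m; entry (i, j) is \partial f_i / \partial x_j
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
```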

14 Chain Rule For Jacobians For one-variable functions: multiply derivatives. For multiple variables: multiply Jacobians.
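In symbols, for the composition used later in this lecture (h = f(z), z = Wx + b):

```latex
% One-variable chain rule: dz/dx = (dz/dy)(dy/dx).
% Vector version: multiply Jacobians.
\frac{\partial \mathbf{h}}{\partial \mathbf{x}}
  = \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\,
    \frac{\partial \mathbf{z}}{\partial \mathbf{x}}
```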

15 Example Jacobian: Elementwise activation function h = f(z), i.e. h_i = f(z_i).

16 Example Jacobian: The function has n outputs and n inputs -> n by n Jacobian.
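The resulting Jacobian, consistent with the lecture notes:

```latex
% Elementwise h_i = f(z_i): off-diagonal partials vanish, so the
% n x n Jacobian is diagonal (matching the result in the lecture notes):
\frac{\partial \mathbf{h}}{\partial \mathbf{z}}
  = \mathrm{diag}\left(f'(z_1), \ldots, f'(z_n)\right)
  = \mathrm{diag}\left(f'(\mathbf{z})\right)
```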

17-19 Example Jacobian: (build slides working entry by entry toward the diagonal form above; see the lecture notes for the full derivation)

20-23 Other Jacobians Compute these at home for practice! Check your answers with the lecture notes.
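The practice Jacobians these slides refer to are most likely the following identities; they are the ones the derivations below use, so check them against the lecture notes:

```latex
% Standard identities (per the lecture notes):
\frac{\partial}{\partial \mathbf{x}}(\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{W}
\qquad
\frac{\partial}{\partial \mathbf{b}}(\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{I}
\qquad
\frac{\partial}{\partial \mathbf{u}}(\mathbf{u}^{\top}\mathbf{h}) = \mathbf{h}^{\top}
```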

24 Back to Neural Nets! x = [ x_museums x_in x_Paris x_are x_amazing ]

25 Back to Neural Nets! Let's find ∂s/∂b. In practice we care about the gradient of the loss, but we will compute the gradient of the score for simplicity. x = [ x_museums x_in x_Paris x_are x_amazing ]
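For reference, a minimal sketch of this network's forward pass, assuming the form used in the lecture (window classifier with score s = u^T f(Wx + b)); the sizes, values, and choice of sigmoid below are hypothetical:

```python
import numpy as np

# Forward pass of the assumed model: s = u^T f(Wx + b), f elementwise.
def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid nonlinearity (an assumption)

rng = np.random.default_rng(0)
n, m = 8, 20                          # hidden size n, input size m (made up)
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x = rng.standard_normal(m)            # concatenated window of word vectors

z = W @ x + b                         # linear layer
h = f(z)                              # hidden activations
s = u @ h                             # scalar score
print(s)
```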

26 1. Break up the equations into simple pieces: s = u^T h, h = f(z), z = Wx + b (x is the input).

27-30 2. Apply the chain rule: ∂s/∂b = (∂s/∂h)(∂h/∂z)(∂z/∂b).

31-35 3. Write out the Jacobians, using the useful Jacobians from the previous slides: ∂s/∂h = u^T, ∂h/∂z = diag(f'(z)), ∂z/∂b = I, so ∂s/∂b = u^T diag(f'(z)).

36-38 Re-using Computation Suppose we now want to compute ∂s/∂W. Using the chain rule again: ∂s/∂W = (∂s/∂h)(∂h/∂z)(∂z/∂W). The first two factors are the same as for ∂s/∂b! Let's avoid duplicated computation: define δ = (∂s/∂h)(∂h/∂z) = u^T diag(f'(z)), the local error signal.

39-40 Derivative with respect to a Matrix What does ∂s/∂W look like? 1 output, nm inputs: a 1 by nm Jacobian? Inconvenient to work with. Instead follow the convention: the shape of the gradient is the shape of the parameters, so ∂s/∂W is n by m, just like W.

41 Derivative with respect to a Matrix Remember that δ is going to be in our answer. The other term should be x^T, because z = Wx + b. It turns out ∂s/∂W = δ^T x^T.
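Putting the pieces together (a reconstruction consistent with the lecture notes; δ is the error signal defined on slides 36-38):

```latex
% delta is 1 x n, so delta^T is n x 1, x^T is 1 x m,
% and the product is n x m, matching W.
\delta = \frac{\partial s}{\partial \mathbf{h}}\,
         \frac{\partial \mathbf{h}}{\partial \mathbf{z}}
       = \mathbf{u}^{\top}\,\mathrm{diag}\!\left(f'(\mathbf{z})\right)
\qquad
\frac{\partial s}{\partial \mathbf{W}} = \delta^{\top}\,\mathbf{x}^{\top}
```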

42-43 Why the Transposes? Hacky answer: this makes the dimensions work out (δ^T is n x 1 and x^T is 1 x m, so δ^T x^T is n x m, matching W). Useful trick for checking your work! Full explanation in the lecture notes.

44 What shape should ∂s/∂b be? The Jacobian ∂s/∂b is a row vector, but convention says our gradient should be a column vector because b is a column vector. There is a disagreement between Jacobian form (which makes the chain rule easy) and the shape convention (which makes implementing SGD easy). We expect answers to follow the shape convention, but Jacobian form is useful for computing the answers.

45 What shape should ∂s/∂b be? Two options: 1. Use Jacobian form as much as possible and reshape to follow the convention at the end: what we just did, but at the end transpose to make the derivative a column vector, resulting in δ^T. 2. Always follow the convention: look at dimensions to figure out when to transpose and/or reorder terms.

46 Notes on PA1 Don't worry if you used some other method for gradient computation (as long as your answer is right and you are consistent!). This lecture we computed the gradient of the score, but in PA1 it's the gradient of the loss. Don't forget to replace f' with the actual derivative of your nonlinearity. PA1 uses a different convention for the linear transformation (xW + b rather than Wx + b): the gradients are different!

47 Compute gradients algorithmically Converting what we just did by hand into an algorithm. Used by deep learning frameworks (TensorFlow, PyTorch, etc.).

48-50 Computation Graphs Representing our neural net equations as a graph. Source nodes: inputs. Interior nodes: operations. Edges pass along the result of the operation. Running the graph from the inputs through to the output is forward propagation.

51 Backpropagation Go backwards along the edges, passing along gradients.

52-55 Single Node A node receives an upstream gradient; the goal is to pass on the correct downstream gradient. Each node has a local gradient: the gradient of its output with respect to its input. Chain rule! [downstream gradient] = [upstream gradient] x [local gradient].
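A minimal sketch of this rule for a single, hypothetical sigmoid node:

```python
import numpy as np

# The single-node rule: downstream = upstream * local,
# illustrated with a sigmoid node (a hypothetical example).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                        # node input
h = sigmoid(z)                 # forward: node output
upstream = 2.0                 # dL/dh, handed back by the next node
local = h * (1.0 - h)          # dh/dz: the sigmoid's local gradient
downstream = upstream * local  # dL/dz, passed on to the previous node
print(downstream)
```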

56-57 Single Node What about nodes with multiple inputs? Multiple inputs -> multiple local gradients: one downstream gradient per input, each equal to the upstream gradient times that input's local gradient.

58-64 An Example f(x, y, z) = (x + y) max(y, z), evaluated at x = 1, y = 2, z = 0. Forward prop steps: a = x + y = 3, b = max(y, z) = 2, f = a * b = 6. Local gradients: ∂f/∂a = b = 2, ∂f/∂b = a = 3; ∂a/∂x = 1, ∂a/∂y = 1; ∂b/∂y = 1 (since y > z), ∂b/∂z = 0.

65-68 An Example (backward pass) upstream * local = downstream: start from ∂f/∂f = 1. Through *: ∂f/∂a = 1*2 = 2, ∂f/∂b = 1*3 = 3. Through max: 3*1 = 3 to y, 3*0 = 0 to z. Through +: 2*1 = 2 to x, 2*1 = 2 to y. Final gradients: ∂f/∂x = 2, ∂f/∂y = 2 + 3 = 5, ∂f/∂z = 0.
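A quick numerical check of the example, assuming (as the forward values 3 and 6 and the final gradients 2, 5, 0 suggest) that the function is f(x, y, z) = (x + y) * max(y, z) at x = 1, y = 2, z = 0:

```python
# Central differences confirm the hand-computed gradients.
def f(x, y, z):
    return (x + y) * max(y, z)

x, y, z = 1.0, 2.0, 0.0
print(f(x, y, z))  # 6.0

h = 1e-6
df_dx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)  # ~2
df_dy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)  # ~5
df_dz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)  # ~0
print(df_dx, df_dy, df_dz)
```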

69-70 Gradients add at branches When a variable feeds into multiple nodes (like y above, which goes into both + and max), its downstream gradients from each branch are summed.

71-73 Node Intuitions + distributes the upstream gradient to each input. max routes the upstream gradient: the input that achieved the max gets it all, the others get 0. * switches the upstream gradient: each input's downstream gradient is the upstream gradient scaled by the other input's value.

74-76 Efficiency: compute all gradients at once Incorrect way of doing backprop: first compute the gradient with respect to one input, then independently compute the gradient with respect to another, re-traversing the graph each time: duplicated computation! Correct way: compute all the gradients at once, in a single backward pass. Analogous to re-using δ when we computed gradients by hand.

77-79 Backprop: the forward/backward API Each node implements a forward function (compute its output from its inputs, saving any intermediate values needed later) and a backward function (compute its inputs' gradients from its output's gradient). The graph calls forward on every node in topological order, then backward in the reverse order.
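A minimal sketch of this API with a hypothetical multiply gate; real frameworks implement the same interface for every operation:

```python
# Forward/backward API sketch: a single multiply gate.
class MultiplyGate:
    def forward(self, x, y):
        # save intermediate values needed by backward
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # * "switches" the upstream gradient (see Node Intuitions above)
        dx = upstream * self.y
        dy = upstream * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, 2.0)   # forward pass: 6.0
dx, dy = gate.backward(1.0)    # backward pass: dx = 2.0, dy = 3.0
print(out, dx, dy)
```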

80 An alternative to backprop: Numeric Gradient For small h, f'(x) ≈ (f(x + h) − f(x − h)) / 2h. Easy to implement, but approximate and very slow: you have to recompute f for every parameter of our model. Useful for checking your implementation.
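A minimal numeric gradient checker along these lines (a sketch, not PA1's exact interface):

```python
import numpy as np

# Perturb each parameter by +/- h and apply the central-difference
# formula from the slide; note f is recomputed twice per parameter.
def numeric_gradient(f, params, h=1e-5):
    grad = np.zeros_like(params)
    for i in range(params.size):
        old = params.flat[i]
        params.flat[i] = old + h
        fp = f(params)
        params.flat[i] = old - h
        fm = f(params)
        params.flat[i] = old                # restore the parameter
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Example: the gradient of f(w) = sum(w^2) is 2w.
w = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda p: np.sum(p**2), w))  # ~[2, -4, 6]
```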

81 Summary Backpropagation: recursively apply the chain rule along the computation graph. [downstream gradient] = [upstream gradient] x [local gradient]. Forward pass: compute the results of operations and save intermediate values. Backward pass: apply the chain rule to compute gradients.

82 Class Project Tips

83 Project Types 1. Apply an existing neural network model to a new task. 2. Implement a complex neural architecture(s): this is what PA4 will have you do! 3. Come up with a new model/training algorithm/etc.: get 1 or 2 working first. See the project page for some inspiration.

84 Must-haves (choose-your-own final project) 10,000+ labeled examples by the milestone. Feasible task. Automatic evaluation metric. NLP is central.

85 Details matter! Split your data into train/dev/test: only look at test for final experiments. Look at your data, collect summary statistics. Look at your model's outputs, do error analysis. Tuning hyperparameters is important. Writeup quality is important. Look at last year's prize winners for examples.

86 Project Advice Implement the simplest possible model first (e.g., average word vectors and apply logistic regression) and improve it: having a baseline system is crucial. First overfit your model to the train set (get really good training set results), then regularize it so it does well on the dev set. Start early!
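A minimal sketch of such a baseline; the toy embeddings and data below are hypothetical stand-ins for real pretrained vectors and a real dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Baseline: average word vectors, then logistic regression.
dim = 4
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(dim) for w in
              ["museums", "in", "paris", "are", "amazing", "boring"]}

def featurize(sentence):
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

sents = ["museums in paris are amazing", "museums are boring"]
labels = [1, 0]
X = np.stack([featurize(s) for s in sents])
clf = LogisticRegression().fit(X, labels)
print(clf.score(X, labels))  # train accuracy; evaluate on dev for real
```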
