Gated Recurrent Models. Stephan Gouws & Richard Klein


1 Gated Recurrent Models Stephan Gouws & Richard Klein

2 Outline Part 1: Intuition, Inference and Training Building intuitions: From Feedforward to Recurrent Models Inference in RNNs: Fprop Training in RNNs: Backpropagation-through-time (BPTT) SHORT BREAK

3 Outline Part 2: Gated models & Applications Long Short-Term Memory (LSTMs) Gated Recurrent Units (GRUs) Applications: Image captioning Sequence classification (Practical 4: MNIST) Language modeling Sequence labeling (lots of NLP tasks, e.g. POS tagging, NER, etc.) Sequence-to-sequence learning (Machine translation, Dialogue modeling, etc.)

4 Recurrent Models PART 1: Intuition, Inference and Training

5 Introduction?

6 Introduction? x2

7 Introduction? x3

8 We need to be able to remember information from previous time steps

9 Recurrent neural networks: Intuition

10 Long-term dependencies: Why do they matter? "Michel C. was born in Paris, France. He is married and has three children. He received a M.S. in neurosciences from the University Pierre & Marie Curie and the Ecole Normale Supérieure in 1987, and then spent most of his career in Switzerland, at the Ecole Polytechnique de Lausanne. He specialized in child and adolescent psychiatry and his first field of research was severe mood disorders in adolescents, the topic of his PhD in neurosciences (2002). His mother tongue is ?????" With only the short context ("His mother tongue is") the answer could be English, German, Russian, French, ... With the long context ("born in Paris, France") it is French. Problem: the relevant information lies many time-steps back.

11 Types of Sequence Models FFNs Image Captioning MNIST predictor Seq2Seq Sequence labeling (e.g. NER)

13 ? Types of Sequence Models We'll talk about this a little later. We'll also implement this in today's practical! FFNs Image Captioning MNIST predictor Seq2Seq Sequence labeling (e.g. NER)

17 FFNs vs RNNs Classify the following examples: [images of handwritten digits]

18 FFNs vs RNNs PREDICT: 1 [FFN diagram: X: inputs (an image of a handwritten digit), hidden layer H with tanh, softmax output layer, Y: outputs]

19 FFNs vs RNNs PREDICT: 9 [same FFN, next digit image]

20 FFNs vs RNNs PREDICT: 2 [same FFN, next digit image]

21 FFNs vs RNNs But what if these were not digits, but longer numbers? Problem: variable-length inputs.

22 FFNs vs RNNs [RNN diagram: X: inputs at step t (Xt), RNN cell H with tanh, softmax output Yt; prediction so far: 1?]

23 FFNs vs RNNs [RNN diagram with H: internal state; H at t-1 (carrying 1) is fed back in together with Xt; prediction so far: 19?]

24 FFNs vs RNNs Maintains a state (memory) that carries information between inputs! [H at t-1 (carrying 19) is fed back in together with Xt; prediction so far: 192?]
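To make the intuition on the digit example above concrete, here is a tiny sketch (not from the lecture) of what "carrying a state between inputs" means: the state is the number read so far, and feeding the digits 1, 9, 2 one step at a time yields 192.

def step(prev_state, digit):
    # The state is the number read so far; the input is the current digit.
    return prev_state * 10 + digit

state = 0
for digit in [1, 9, 2]:
    state = step(state, digit)
print(state)  # 192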

25 The RNN API prev_state next_state recurrent_fn() x outputs

26 The RNN Computation Graph yt t-1 xt Feedback loop / state / memory / stack (previous time-step)

27 Unrolling the RNN Computation Graph y1 h0 h1 x1

28 Unrolling the RNN Computation Graph h0 x1 y1 y2 h1 h2 x2

29 Unrolling the RNN Computation Graph h0 x1 y1 y2 h1 h2 x2 Unrolled over n time-steps. yn... x3 hn

30 Unrolling the RNN Computation Graph NB: We reuse the same weights at every time-step! h0 y1 y2 h1 h2 x1 x2 The same Unrolled over n time-steps. yn... x3 hn

31 Unrolling the RNN Computation Graph NB: We reuse the same weights at every time-step! We can therefore think of an RNN as a composition of identical feedforward neural networks (with replicated/tied weights), one for each moment or step in time. [h0 → h1 → h2 → ... → hn, with inputs x1, x2, x3, ... and outputs y1, y2, ..., yn; unrolled over n time-steps; the same weights throughout]
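As a rough illustration of this unrolling (a sketch with assumed toy sizes, not the lecturers' code), the same recurrent function and the same weight matrices are applied once per time-step:

import numpy as np

W_xh = np.random.randn(4, 3)   # assumed toy sizes: state size 4, input size 3
W_hh = np.random.randn(4, 4)

def recurrent_fn(x, h_prev):
    # One "copy" of the feedforward block; the weights are shared across copies.
    return np.tanh(W_xh @ x + W_hh @ h_prev)

h = np.zeros(4)                               # h0
xs = [np.random.randn(3) for _ in range(5)]   # x1 ... x5
states = []
for x in xs:                                  # unrolled over 5 time-steps
    h = recurrent_fn(x, h)                    # the SAME weights at every step
    states.append(h)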

32 The FFN API

class FeedForwardModel():
    # ...
    def forward(self, x):
        # Compute activations on the hidden layer.
        hidden_layer = self.act_fn(np.dot(self.w_xh, x) + self.b)
        # Compute the (linear) output layer activations.
        y = np.dot(self.w_hy, hidden_layer)
        return y


37 The RNN API

class RecurrentModel():
    # ...
    def recurrent_fn(self, x, prev_state):
        # Compute the new state based on the previous state and current input.
        new_state = self.act_fn(np.dot(self.w_xh, x) + np.dot(self.w_hh, prev_state) + self.b)
        # Compute the output vector.
        y = np.dot(self.w_hy, new_state)
        return new_state, y


39 The RNN API New state = Recurrent function(Input at current time-step, Previous state)

class RecurrentModel():
    # ...
    def recurrent_fn(self, x, prev_state):
        # New state from the previous state and the input at the current time-step.
        new_state = self.act_fn(np.dot(self.w_xh, x) + np.dot(self.w_hh, prev_state) + self.b)
        # Compute the output vector.
        y = np.dot(self.w_hy, new_state)
        return new_state, y


41 The RNN API

def forward(self, data_sequence, initial_state):
    state = initial_state
    loss, cache = 0.0, []
    for x, y in data_sequence:
        new_state, y_pred = self.recurrent_fn(x, state)
        # Accumulate the loss over all time-steps.
        loss += cross_entropy(y_pred, y)
        cache.append((new_state, y_pred))
        state = new_state
    return loss, cache


44 Math: FFNs v RNNs NOTATION: Wxh is a matrix that maps a vector x into a vector h. [FFN diagram: input x, activation function, hidden layer, output y]

48 Math: FFNs v RNNs NOTATION: Wxh is a matrix that maps a vector x into a vector h. [RNN diagram: input xt at the current time-step, previous state at t-1, recurrent weights, recurrent function, new state, output yt]
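The equations behind these diagrams did not survive transcription; a standard reconstruction consistent with the code on the earlier slides (biases included) is:

FFN:  h = f(W_{xh}\,x + b_h), \qquad y = W_{hy}\,h + b_y

RNN:  h_t = f(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h), \qquad y_t = W_{hy}\,h_t + b_y

where f is the activation function (e.g. tanh).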

51 Inference & Training How do we make predictions using RNNs? Forward propagation: Fprop Essentially a composition of functions: a2 = f2(f1(x)). We unroll the computational graph over time-steps. How do we train RNNs? Backward propagation: Backprop-through time We need to consider predictions over several time-steps! Credit assignment over time. We work backwards in time from the last state to the first.

52 Training: Ways to Train RNNs Echo State Networks: Initialize Wxh, Whh, Who carefully, then only train Who! Backpropagation through time (BPTT): Propagate errors backwards through the unrolled graph. There are other options.

53 Training: ESNs Simple solution: don't train the recurrent weights (Whh & Wxh)! Initialization is very important. Super simple. However, with recent improvements in initialization etc., BPTT does better! [Scholarpedia]
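A rough sketch of the ESN idea (illustrative sizes and scaling, not from the lecture): keep Wxh and Whh fixed and random, collect the states, and fit only the readout Who, e.g. by least squares.

import numpy as np

n_in, n_state, n_out, T = 3, 100, 2, 500
W_xh = 0.1 * np.random.randn(n_state, n_in)
W_hh = 0.1 * np.random.randn(n_state, n_state)   # never trained

X = np.random.randn(T, n_in)     # toy input sequence
Y = np.random.randn(T, n_out)    # toy targets

h, states = np.zeros(n_state), []
for x in X:
    h = np.tanh(W_xh @ x + W_hh @ h)
    states.append(h)
H = np.stack(states)             # shape (T, n_state)

W_ho, *_ = np.linalg.lstsq(H, Y, rcond=None)   # train ONLY the readout weights
predictions = H @ W_ho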

54 Inference & Training How do we make predictions using RNNs? Forward propagation: Fprop Essentially a composition of functions: a2 = f2(f1(x)). We unroll the computational graph over time-steps. Error How do we train RNNs? Propagate errors backwards through unrolled graph: Backprop-through time (BPTT). We need to consider predictions over several time-steps! Credit assignment over time. We work backwards in time from the last state to the first.

55 Training: BPTT Intuition

56 Training: Truncated BPTT


59 Unrolling the RNN Computation Graph h0 x1 y1 y2 h1 h2 x2 Unrolled over n time-steps. yt... x3 ht


61 Step 1: Compute all errors. Unrolling the RNN Computation Graph y1 h0 y2 E1 h1 x1 yt E2 h2 x2 Unrolled over n time-steps.... x3 ht Et

62 Unrolling the RNN Computation Graph y1 h0 y2 E1 h1 x1 Step 1: Compute all errors. Step 2: Pass error back for each time-step from n back to 1. yt E2 h2 x2 Unrolled over n time-steps.... x3 ht Et

63 Unrolling the RNN Computation Graph y1 h0 y2 E1 h1 x1 Step 1: Compute all errors. Step 2: Pass error back for each time-step from n back to 1. Step 3: Update weights. yt E2 h2 x2 Unrolled over n time-steps.... x3 ht Et

64 Unrolling the RNN Computation Graph yt-2 yt-1 yt ht-2 ht-1 ht xt-1 xt

65 Unrolling the RNN Computation Graph yt-2 yt-1 yt ht-2 ht-1 ht Et

66 Unrolling the RNN Computation Graph yt ht-2 ht-1 ht Et


72 Unrolling the RNN Computation Graph yt Et ht-2 ht-1 ht (the error Et is propagated back through ht, ht-1, ht-2, ... via the chain rule)

73 Unrolling the RNN Computation Graph yt-2 Et-2 yt-1 Et-1 yt Et ht-2 ht-1 ht Total Error = E1 + E2 + ... + Et Total gradient = sum of all dEt/dθ's
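Written out (a standard formulation, reconstructed rather than copied verbatim from the slides), the total error and gradient are

E = \sum_{t=1}^{n} E_t, \qquad \frac{\partial E}{\partial \theta} = \sum_{t=1}^{n} \frac{\partial E_t}{\partial \theta},

and for the recurrent weights each term expands through earlier states by the chain rule:

\frac{\partial E_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial h_t} \Big( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \Big) \frac{\partial h_k}{\partial W_{hh}}.

The repeated product of \partial h_j / \partial h_{j-1} factors is what makes gradients vanish or explode (see the gradient-flow slides below).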

74 Training: Truncated BPTT Code

def bptt(model, x_train, y_train, initial_state):
    # Forward pass over the (truncated) sequence: accumulate the loss and cache
    # the per-step states and predictions.
    loss, caches = forward(x_train, y_train, model, initial_state)
    avg_loss = loss / y_train.shape[0]

    # Backward pass: from the last time-step back to the first.
    ys = [y_pred for (_, y_pred) in caches]
    last_state = caches[-1][0]
    dh_next = np.zeros((1, last_state.shape[0]))
    grads = {k: np.zeros_like(v) for k, v in model.items()}
    for t in reversed(range(len(x_train))):
        grad, dh_next = cell_fn_backward(ys[t], y_train[t], dh_next, caches[t])
        for k in grads.keys():
            grads[k] += grad[k]
    return grads, avg_loss

75 Training: Truncated BPTT Code The cell_fn_backward call for each time-step is a reusable lego block. Total gradient = Sum of these lego-gradients over time!
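A hypothetical usage sketch (learning_rate, x_chunk, y_chunk and the plain-SGD update are assumptions, not the practical's code): one truncated-BPTT training step is a forward pass, a backward pass, and a parameter update.

grads, avg_loss = bptt(model, x_chunk, y_chunk, initial_state)
for k in model.keys():
    model[k] -= learning_rate * grads[k]   # SGD update for every weight matrix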

76 Vanilla RNN Gradient Flow


81 Vanilla RNN Gradient Flow Part 2!

82 Gated Recurrent Models PART II: Gated Architectures & Applications

83 RECAP: The RNN API [prev_state, x → recurrent_fn() → next_state, outputs] New state = Recurrent function(Input at current time-step, Previous state)

84 The GATED RNN API It is the same! Just a different way of computing the outputs. [prev_state, x → recurrent_fn() → next_state, outputs; inside the cell: a memory cell, gates, and a controller]

85 Implementing a memory cell in a neural network

86 Propagating through a memory cell

87 Backpropagating through a memory cell?

88 LSTM = Long Short Term Memory

concatenation:   X = [Xt, Ht-1]              (size p+n)
forget gate:     f = σ(X.Wf + bf)            (size n)
update gate:     u = σ(X.Wu + bu)            (size n)
result gate:     r = σ(X.Wr + br)            (size n)
candidate input: X' = tanh(X.Wc + bc)        (size n)
new C:           Ct = f * Ct-1 + u * X'      (size n)
new H:           Ht = r * tanh(Ct)           (size n)
output:          Yt = softmax(Ht.W + b)      (size m)
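A minimal NumPy sketch of one LSTM step following the equations above (the weight/bias names and shapes, e.g. Wf of shape (p+n, n), are assumptions for illustration, not the lecturers' code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wu, bu, Wr, br, Wc, bc):
    X = np.concatenate([x_t, h_prev])   # X = [Xt, Ht-1], size p+n
    f = sigmoid(X @ Wf + bf)            # forget gate
    u = sigmoid(X @ Wu + bu)            # update gate
    r = sigmoid(X @ Wr + br)            # result gate
    x_cand = np.tanh(X @ Wc + bc)       # candidate input X'
    c_t = f * c_prev + u * x_cand       # new cell state Ct
    h_t = r * np.tanh(c_t)              # new hidden state Ht
    return h_t, c_t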

89 LSTM [cell diagram: inputs Xt and Ht-1, cell state Ct-1 → Ct, outputs Ht and Yt; σ/tanh boxes are neural-net layers, the remaining nodes are element-wise operations]

90 LSTM concatenation (of Xt and Ht-1)

91 LSTM What to forget?

92 LSTM Actually forget

93 LSTM What's the new value? What to update?

94 LSTM Actually update

95 LSTM Updated!

96 LSTM Result Gate! Calculate Result

97 LSTM Remember the result for next time step. Output

98 LSTM = Long Short Term Memory [cell diagram alongside the equations]

concatenation: X = [Xt, Ht-1]
f = σ(X.Wf + bf)
u = σ(X.Wu + bu)
r = σ(X.Wr + br)
X' = tanh(X.Wc + bc)
Ct = f * Ct-1 + u * X'
Ht = r * tanh(Ct)
Yt = softmax(Ht.W + b)


100 Gru!

101 Gated Recurrent Units (GRUs) GRU = Gated Recurrent Unit: 2 gates instead of 3 => cheaper

concatenation: X = [Xt, Ht-1]               (size p+n)
update gate:   z = σ(X.Wz + bz)             (size n)
reset gate:    r = σ(X.Wr + br)             (size n)
reset input:   Xr = [Xt, r * Ht-1]          (size p+n)
candidate:     X' = tanh(Xr.Wc + bc)        (size n)
new H:         Ht = (1-z) * Ht-1 + z * X'   (size n)
output:        Yt = softmax(Ht.W + b)       (size m)
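And the corresponding sketch for one GRU step (again with assumed weight/bias names), showing why it is cheaper: two gates and no separate cell state:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, bz, Wr, br, Wc, bc):
    X = np.concatenate([x_t, h_prev])         # [Xt, Ht-1], size p+n
    z = sigmoid(X @ Wz + bz)                  # update gate
    r = sigmoid(X @ Wr + br)                  # reset gate
    Xr = np.concatenate([x_t, r * h_prev])    # [Xt, r * Ht-1]
    x_cand = np.tanh(Xr @ Wc + bc)            # candidate state X'
    h_t = (1 - z) * h_prev + z * x_cand       # new hidden state Ht
    return h_t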


103 Long Short-term Memory (LSTM): Gradient Flow


105 Applications/Tasks Image captioning Sequence classification (Practical 4: MNIST) Language modeling Sequence labeling (lots of NLP tasks, e.g. POS tagging, NER, etc.) Sequence-to-sequence learning (Machine translation, Summarization, etc.)

106 One-to-Many: Image captioning GOAL: Given image, generate a sentence to describe its content.


108 Many-to-one: Sequence Classifier (Prac 4) GOAL: Given a sequence of inputs, predict the label for the whole sequence. Examples: Given a sentence, say if it is {negative, neutral, positive} Given the words in an email, predict if it is a spam message. Given pieces of an image, predict what number is in the image.

109 Many-to-1: Polarity/Sentiment Classifier We feed all the words into the model one at a time, and make one prediction at the end: [h0 → h1 → h2 → h3; inputs: Cats, are, awesome; output at the end: Pos/Neg?]

110 Many-to-1: Spam Classifier We feed all the words into the model one at a time, and make one prediction at the end: [h0 → h1 → h2 → h3; inputs: Viagra, for, cheap; output at the end: Spam/Ham?]

111 Many-to-1: Image Classifier (Prac 4) We chop up the image and feed all the pieces through the model, and then make one prediction at the end: [h0 → h1 → h2 → h3; inputs: pieces of the image; output at the end: yimage]
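A minimal many-to-one sketch (rnn_step, W_hy and b_y are assumed names, not the practical's code): run the RNN over all the pieces and classify from the final state only.

import numpy as np

def classify_sequence(pieces, initial_state, rnn_step, W_hy, b_y):
    state = initial_state
    for x in pieces:                 # e.g. rows/patches of an MNIST image
        state = rnn_step(x, state)   # only the state is carried forward
    logits = W_hy @ state + b_y      # ONE prediction, at the very end
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # softmax over the 10 digit classes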

112 Next-token Prediction: Language modeling 1. Computing p(next word | previous words) 2. p(the), p(cat | the), p(sits | the cat), ..., p(xm | ...) [h0 → h1 → h2 → h3; inputs: the, cat, ...]
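In symbols (a standard factorisation, not copied from the slide), a language model decomposes the probability of a sequence with the chain rule, and the RNN state that has read the previous words provides each factor (with W_{hy}, b_y the output weights as before):

P(x_1, x_2, \dots, x_m) = \prod_{t=1}^{m} P(x_t \mid x_1, \dots, x_{t-1}), \qquad P(x_t \mid x_1, \dots, x_{t-1}) = \mathrm{softmax}(W_{hy}\, h_{t-1} + b_y)_{x_t}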

113 Next-token Prediction: Language modeling x1 x2 x3 h0 h1 h2 x1 x2 xn... x3 hn

114 Many-to-many: Sequence labeling Mapping each input x1, x2, ..., xn to its own label y1, y2, ..., yn (Notice: same length; each input has an output.) A lot of NLP tasks fall in this category, e.g.: Part-of-speech tagging: map words to their parts-of-speech (noun, verb, etc.). Named-entity Recognition: identify mentions of people, places, etc. in text Semantic Role Labeling: find the main actions, and who performs them on whom/what

115 Many-to-many: Sequence labeling Part-of-speech tagging


117 Many-to-many: Sequence labeling Example: "Heat water in a large vessel." [h0 → h1 → h2 → ... → hn; inputs: heat, water, in, ...; outputs: VERB, NOUN, ...]
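A minimal many-to-many sketch (assumed names again): unlike the many-to-one classifier above, we emit one label per time-step.

import numpy as np

def label_sequence(words, initial_state, rnn_step, W_hy, b_y):
    state, labels = initial_state, []
    for x in words:                             # e.g. word embeddings for heat, water, in, ...
        state = rnn_step(x, state)
        logits = W_hy @ state + b_y             # a prediction at EVERY step
        labels.append(int(np.argmax(logits)))   # e.g. the index of VERB, NOUN, ...
    return labels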

118 Many-to-many: Sequence-to-Sequence Modeling Mapping input sequences {x1, x2, ..., xm} to output sequences {y1, y2, ..., yn}. Note: arbitrary m and n. Many applications, most notably Machine Translation. Also: POS tagging, Parsing, Summarization, Dialogue
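A schematic sketch of sequence-to-sequence inference with greedy decoding (encoder_step, decoder_step and the token handling are assumptions, not a particular library's API):

def seq2seq_translate(source_tokens, encoder_step, decoder_step, h0, start, eos, max_len=50):
    # Encoder: compress the source sequence into a final state.
    h = h0
    for x in source_tokens:
        h = encoder_step(x, h)
    # Decoder: generate target tokens one at a time, feeding each back in.
    y, outputs = start, []
    for _ in range(max_len):
        y, h = decoder_step(y, h)   # next token and new decoder state
        if y == eos:
            break
        outputs.append(y)
    return outputs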

119 Many-to-many: Sequence-to-Sequence Modeling MACHINE TRANSLATION [Encoder RNN reads: the blue house; Decoder RNN is fed <s> la maison bleu and predicts la maison bleu <eos>]

120 Many-to-many: Sequence-to-Sequence Modeling DIALOGUE MODELING Prompt: You can't eat pasta every day. You can't live only on Italian food. System: I don't eat Italian food, but I do eat a lot of Italian food.

121 Many-to-many: Sequence-to-Sequence Modeling Prompt: What do you have planned tonight? Baseline: I have no idea what I want to do with my life. I have no idea what I want to (Bad, Mediocre, Bad) Ours: I was planning on doing a few of these, but I'm not sure what to choose. I need to find a place to camp, and I don't have a lot of time to make a plan. (Excellent, Good, Mediocre)

122 Many-to-many: Sequence-to-Sequence Modeling Prompt: Have you seen Suicide Squad? Baseline: I have not, but I will check it out. (Good, Good, Good) Ours: I saw it in a theater once and it was great. It was very... I don't know, a little dark. I like the psychological effects and the way it portrays the characters in the movie. (Excellent, Excellent, Excellent)

123 Key take-aways RNNs have memory/state that evolves over time. We unroll the graph over time to do forward propagation. Backprop-through-time (BPTT): perform the chain rule over the unrolled graph efficiently by saving and reusing previous computations; dE/dW is a sum over all time-steps (because of tied weights); vanilla RNNs suffer from vanishing/exploding gradients. Gated architectures: the state is selectively overwritten per time-step, giving uninterrupted gradient flow through time (no vanishing/exploding gradients!). Lots of cool applications!

124 Slide Credits Thank-you to the following resources, from which some of these slides were drawn and adapted. Stanford CS231n TensorFlow without a PhD

125 The end.
