Recurrent Neural Nets II

1 Recurrent Neural Nets II Steven Spielberg Pon Kumar, Tingke (Kevin) Shen Machine Learning Reading Group, Fall November, 2016

2 Outline: 1. Introduction; 2. Problem Formulations with RNNs; 3. LSTM for Optimization; 4. Seq2Seq Learning

3 Introduction Feed-Forward Neural Networks (NN)

4 Introduction Rolling NN over time

7 Introduction Computation Flow in RNN

19 Introduction RNN Representation

22 Problem Formulations with RNNs Time Series Prediction Given $x_{t-3}, x_{t-2}, x_{t-1}, x_t$, find $x_{t+1}$.

23 Problem Formulations with RNNs Time Series Prediction - Learning Error: $e = \sum_{i=1}^{n} (\hat{x}^{i}_{t+1} - x^{i}_{t+1})^2$. Update the weights $\theta$ with the error $e$ using backpropagation through time. E.g. weather forecasting, stock prediction, etc.
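The prediction and error above can be sketched as follows. This is a minimal illustration that assumes a vanilla tanh RNN with scalar inputs and a linear readout; the names `rnn_predict`, `W_xh`, `W_hh`, `w_hy`, the hidden size and the toy sine signal are all assumptions, not taken from the slides. The weights are random, so this shows only the forward pass and the error, not training.

```python
import numpy as np

def rnn_predict(window, W_xh, W_hh, w_hy):
    """Roll a vanilla tanh RNN over a window of scalars and predict the next value."""
    h = np.zeros(W_hh.shape[0])
    for x in window:                      # x_{t-3}, ..., x_t
        h = np.tanh(W_xh * x + W_hh @ h)  # hidden state update
    return float(w_hy @ h)                # linear readout gives x_hat_{t+1}

def squared_error(predictions, targets):
    """e = sum_i (x_hat^i_{t+1} - x^i_{t+1})^2 over the n training windows."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sum((predictions - targets) ** 2))

rng = np.random.default_rng(0)
hidden = 8  # hypothetical hidden size
W_xh = rng.normal(scale=0.5, size=hidden)
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
w_hy = rng.normal(scale=0.5, size=hidden)

series = np.sin(0.3 * np.arange(50))  # toy signal standing in for real data
windows = [series[i:i + 4] for i in range(len(series) - 4)]
targets = [series[i + 4] for i in range(len(series) - 4)]
preds = [rnn_predict(w, W_xh, W_hh, w_hy) for w in windows]
e = squared_error(preds, targets)
```

Backpropagation through time would differentiate `e` with respect to the three weight arrays through every step of the loop in `rnn_predict`.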

24 Problem Formulations with RNNs Implementation

25 Problem Formulations with RNNs Sentence Classification - RNN Error: $e = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$. Update the weights $\theta$ with the error $e$ using backpropagation through time.
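A small sketch of this cross-entropy error, assuming the sum runs over class labels with a one-hot target $y$ and that $\hat{y}$ comes from a softmax over the RNN's final output. The logits and the number of classes are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    z = logits - np.max(logits)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    """e = -sum_i y_i log(y_hat_i); eps guards against log(0)."""
    return float(-np.sum(y * np.log(y_hat + eps)))

logits = np.array([2.0, 0.5, -1.0])  # hypothetical scores from the last RNN step
y_hat = softmax(logits)
y = np.array([1.0, 0.0, 0.0])        # one-hot label: class 0 is correct
e = cross_entropy(y, y_hat)
```

With a one-hot target, the error reduces to the negative log-probability assigned to the correct class.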

26 Problem Formulations with RNNs Sentence Classification - Bidirectional RNN Error: $e = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$. Update the weights $\theta$ with the error $e$ using backpropagation through time.

27 Problem Formulations with RNNs Character Level RNN

29 Problem Formulations with RNNs Sampled Examples from Character level RNN

30 Problem Formulations with RNNs Dynamic Systems

31 Problem Formulations with RNNs Dynamic Systems Example

32 LSTM for Optimization Optimization with LSTM Gradient descent: $\theta_{t+1} = \theta_t + \alpha\, g(\nabla f(\theta_t))$, where $g(\nabla f(\theta_t))$ is a handcrafted update rule. Learning the gradient descent update rule: $\theta_{t+1} = \theta_t + g(\nabla f(\theta_t), \phi)$, where $\phi$ denotes the parameters of the LSTM.
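The contrast between the two update rules can be sketched on a toy problem. This assumes that the argument of $g$ is the gradient of $f$, and it replaces the LSTM with a hypothetical one-parameter linear map `learned_g`, purely to show where $\phi$ enters; the value of `phi` is picked by hand here, not trained.

```python
import numpy as np

def grad_f(theta):
    """Gradient of the toy objective f(theta) = 0.5 * ||theta||^2."""
    return theta

def handcrafted_g(grad):
    """Handcrafted rule: step along the negative gradient."""
    return -grad

def learned_g(grad, phi):
    """Stand-in for the LSTM: a coordinate-wise linear map with parameter phi."""
    return phi * grad

def run(update, theta0, steps=20):
    """Iterate theta_{t+1} = theta_t + update(grad f(theta_t))."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta + update(grad_f(theta))
    return theta

theta0 = np.array([1.0, -2.0, 3.0])
alpha = 0.1  # fixed learning rate for the handcrafted rule
theta_gd = run(lambda g: alpha * handcrafted_g(g), theta0)

phi = -0.5   # hypothetical value; a real phi would itself be learned
theta_learned = run(lambda g: learned_g(g, phi), theta0)
```

In the learned setting, `phi` would be optimised so that the trajectory of `theta` reaches low values of `f` quickly, which is what makes the optimizer itself trainable by gradient descent.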

33 LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent

40 LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent Figure 4: Comparisons between learned and hand-crafted optimizers performance. Learned optimizers are shown with solid lines and hand-crafted optimizers are shown with dashed lines. Units for the y axis in the MNIST plots are logits. Left: Performance of different optimizers on randomly sampled 10-dimensional quadratic functions. Center: the LSTM optimizer outperforms standard methods training the base network on MNIST. Right: Learning curves for steps by an optimizer trained to optimize for 100 steps (continuation of center plot). Legend: LSTM, LSTM+GAC, NTM-BFGS, ADAM, RMSprop, Rprop, Adadelta, Adagrad, SGD.

41 LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent

Figure 5: Comparisons between learned and hand-crafted optimizers performance. Units for the y axis are logits. Left: Generalization to the different number of hidden units (40 instead of 20). Center: Generalization to the different number of hidden layers (2 instead of 1). This optimization problem is very hard, because the hidden layers are very narrow. Right: Training curves for an MLP with 20 hidden units using ReLU activations. The LSTM optimizer was trained on an MLP with sigmoid activations.

Figure 6: Examples of images styled using the LSTM optimizer. Each triple consists of the content image (left), style (right) and image generated by the LSTM optimizer (center). Left: The result of applying the training style at the training resolution to a test image. Right: The result of applying a new style to a test image at double the resolution on which the optimizer was trained.

a learning rate (e.g. decay coefficients for ADAM) we use the default values from the optim package in Torch7. Initial values of all optimizee parameters were sampled from an IID Gaussian distribution.

42 Seq2Seq Learning Sequence to Sequence Learning We covered how to use an LSTM with fixed-length inputs and fixed-length outputs. Sequence to sequence learning uses two RNNs to solve the general problem of mapping an input sequence to an output sequence of a different length. E.g. machine translation, chatbots, image captioning, etc.

43 Seq2Seq Learning Machine Translation

48 Seq2Seq Learning Machine Translation Error: $e = -\sum_{i=1}^{n} \sum_{t=1}^{T} y^{i}_{t} \log(\hat{y}^{i}_{t})$. Update the weights $\theta$ with the error $e$ using backpropagation through time.
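A short sketch of this double sum, assuming $i$ indexes the $n$ training sentences, $t$ indexes output time steps, and each $y^{i}_{t}$ is a one-hot vector over the target vocabulary. The sizes and the uniform predictions are illustrative only.

```python
import numpy as np

def sequence_cross_entropy(y, y_hat, eps=1e-12):
    """e = -sum_i sum_t y^i_t log(y_hat^i_t).

    y, y_hat: arrays of shape (n, T, V) holding one-hot targets and
    predicted distributions over a vocabulary of size V.
    """
    return float(-np.sum(y * np.log(y_hat + eps)))

n, T, V = 2, 3, 4                           # hypothetical sizes
targets = np.array([[0, 1, 3], [2, 2, 0]])  # target word ids per sentence and step
y = np.eye(V)[targets]                      # one-hot targets, shape (n, T, V)
y_hat = np.full((n, T, V), 1.0 / V)         # uniform predictions as a baseline
e = sequence_cross_entropy(y, y_hat)
```

For uniform predictions the error equals `n * T * log(V)`, which is a useful sanity check for an untrained model.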

49 Seq2Seq Learning GPU Implementation Diagram labels: one row per GPU (GPU3 to GPU6 are legible), each processing the sequence A B C D. The output layer is an 80k softmax by 1000 dims, which is very big, so the softmax is split across 4 GPUs. LSTM cells: 2000 dims per timestep, x 4 = 8k dims per sentence. 160k vocab in input language.
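Splitting a large softmax across devices can be sketched on the CPU with NumPy. This is only an illustration of the arithmetic, not the authors' implementation: each "shard" stands for the slice of the output matrix held by one GPU, and the sizes are scaled down from 80k x 1000 to 80 x 10. The function name `sharded_softmax` is an assumption.

```python
import numpy as np

def sharded_softmax(h, W_shards):
    """Softmax over a vocabulary whose output matrix is split into shards.

    Each shard (one per device in a real system) computes its own logits;
    only the per-shard max and sum of exponentials need to be combined.
    """
    logits = [W @ h for W in W_shards]                 # local matrix products
    global_max = max(z.max() for z in logits)          # for numerical stability
    exps = [np.exp(z - global_max) for z in logits]
    total = sum(x.sum() for x in exps)                 # global normaliser
    return np.concatenate([x / total for x in exps])

rng = np.random.default_rng(0)
vocab, dims, n_shards = 80, 10, 4   # scaled-down stand-ins for 80k x 1000 on 4 GPUs
W = rng.normal(size=(vocab, dims))
h = rng.normal(size=dims)
probs = sharded_softmax(h, np.split(W, n_shards, axis=0))

# Reference: the same softmax computed in one piece.
full = np.exp(W @ h - (W @ h).max())
full /= full.sum()
```

The design point is that each device only exchanges two scalars (its max and its partial sum) rather than its full slice of logits.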

56 Seq2Seq Learning Projection of the Encoder LSTM

a sizeable margin, despite its inability to handle out-of-vocabulary words. The LSTM is within 0. BLEU points of the previous state of the art by rescoring the 1000-best list of the baseline system.

3.7 Performance on long sentences: We were surprised to discover that the LSTM did well on long sentences, which is shown quantitatively in figure 3. Table 3 presents several examples of long sentences and their translations.

3.8 Model Analysis

Phrases plotted in the figure: "Mary admires John", "Mary is in love with John", "Mary respects John", "John admires Mary", "John is in love with Mary", "John respects Mary"; "I gave her a card in the garden", "In the garden, I gave her a card", "She was given a card by me in the garden", "She gave me a card in the garden", "In the garden, she gave me a card", "I was given a card by her in the garden".

Figure 2: The figure shows a 2-dimensional PCA projection of the LSTM hidden states that are obtained after processing the phrases in the figures. The phrases are clustered by meaning, which in these examples i...
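A 2-dimensional PCA projection of hidden states like the one described in Figure 2 can be sketched as below. The hidden states here are random stand-ins (6 sentences, 1000 dims each, both hypothetical), since the real encoder states are not available; `pca_2d` is an illustrative helper name.

```python
import numpy as np

def pca_2d(states):
    """Project row vectors onto their first two principal components."""
    centered = states - states.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Hypothetical stand-in for final encoder hidden states, one row per sentence.
hidden_states = rng.normal(size=(6, 1000))
points = pca_2d(hidden_states)  # shape (6, 2), ready for a scatter plot
```

With real encoder states, sentences that land close together in `points` are the ones the encoder represents similarly.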

57 Seq2Seq Learning Seq2Seq Application: Video to Text Input: a video clip. Output: a sentence, e.g. "A monkey is pulling a dog's tail and is chased by the dog." Supervised problem: given video and sentence pairs.

58 Seq2Seq Learning Why Describe Videos? Robotics applications: human to robot interaction; describing videos for the blind; video indexing; just because.

59 Seq2Seq Learning Alternative Models Holistic video representation: train classifiers to suggest subject, object, actions; combine objects/actions with a language model using a graphical model and real-world knowledge; pick the most probable (subject, action, object) triplet; insert the triplet into a sentence template. Fix the video sequence length; encode the video with a vanilla NN; output the sentence using an LSTM.

60 Seq2Seq Learning Recall: why Sequence to Sequence? The model sees the entire input sequence before starting to output. The output sequence length is not fixed to be equal to the input sequence length.

61 Seq2Seq Learning Model

62 Seq2Seq Learning Predictions We want to pick the most probable sequence of words. Either generate sentences greedily, picking the highest softmax probability at each time step, or use beam search: e.g. try the top 3 most probable words at each time step and keep only the top 3 most probable sequences so far.
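The two decoding strategies can be compared with a small sketch. The decoder is replaced by a hypothetical lookup table `toy_model`, and the end-of-sentence token `</s>` and the function names are assumptions for illustration. Greedy decoding is the special case of beam search with a beam width of 1.

```python
import math

def beam_search(step_log_probs, beam_width=3, max_len=4, eos="</s>"):
    """Keep the `beam_width` most probable partial sequences at each step.

    step_log_probs(prefix) -> dict mapping each next word to its log-probability.
    """
    beams = [((), 0.0)]  # (words so far, total log-probability)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == eos:  # finished sequences carry over
                candidates.append((words, score))
                continue
            nxt = step_log_probs(words)
            # Expand only the top `beam_width` words for this prefix.
            top = sorted(nxt.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
            candidates.extend((words + (w,), score + lp) for w, lp in top)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

def toy_model(prefix):
    """Hypothetical decoder where the greedy first word leads to a weaker sentence."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.55, "cat": 0.45},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {w: math.log(p) for w, p in probs.items()}

greedy = beam_search(toy_model, beam_width=1)[0]  # commits to "a" at step 1
best = beam_search(toy_model, beam_width=3)[0]    # keeps "the" alive and finds "the dog"
```

In this toy table, greedy decoding ends with "a dog" (probability 0.33), while beam search with width 3 recovers "the dog" (probability 0.36), because it does not discard the less likely first word.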

63 Seq2Seq Learning Examples Demo: Paper, code, examples: vsub/s2vt.html

64 Seq2Seq Learning Seq2Seq Models and Attention Problem: basic seq2seq RNN models cannot handle very long sequences. Paper: "Neural Machine Translation by Jointly Learning to Align and Translate". Authors: Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio.

65 Seq2Seq Learning Basic Seq2Seq Model $p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c)$

66 Seq2Seq Learning Attention Seq2Seq Model $p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$, with $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$

67 Seq2Seq Learning Encoder: Bidirectional RNN Two independent RNNs, one for each direction. The overall hidden state is the concatenation of the two independent hidden states. The hidden state at each time step contains information from the entire input. $h_j$ is influenced most by the inputs around $x_j$.
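A minimal sketch of such an encoder, assuming vanilla tanh RNN cells instead of the gated units used in practice; the sizes, the weight scale and the names `run_rnn`, `bidirectional_encode` and `make_params` are illustrative assumptions.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    """Vanilla tanh RNN; returns the hidden state at every time step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each position j."""
    h_fwd = run_rnn(xs, *fwd_params)              # reads x_1 ... x_T
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]  # reads x_T ... x_1, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4  # hypothetical sizes

def make_params():
    """Random weights for one direction (untrained, for shape checking only)."""
    return (rng.normal(scale=0.5, size=(d_h, d_in)),
            rng.normal(scale=0.5, size=(d_h, d_h)))

xs = rng.normal(size=(T, d_in))
fwd, bwd = make_params(), make_params()
h = bidirectional_encode(xs, fwd, bwd)  # shape (T, 2 * d_h)
```

Row `j` of `h` combines a forward state that has seen the inputs up to position `j` with a backward state that has seen the inputs from position `j` onward, so together they cover the entire input.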

68 Seq2Seq Learning Decoder: Attention $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$, where $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$ and $e_{ij} = a(s_{i-1}, h_j)$
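One decoder step of this attention computation can be sketched as follows. The slide leaves the alignment model $a$ unspecified; the sketch assumes a one-hidden-layer feed-forward network of the form $v^\top \tanh(W_s s_{i-1} + W_h h_j)$, and the weight names and sizes are hypothetical.

```python
import numpy as np

def attention_context(s_prev, H, W_s, W_h, v):
    """Compute alpha_i and c_i for one decoder step.

    s_prev: previous decoder state s_{i-1}, shape (d_s,)
    H:      encoder hidden states h_1..h_T, shape (T, d_h)
    """
    # e_ij = a(s_{i-1}, h_j); here a is assumed to be a small feed-forward net.
    e = np.tanh(W_s @ s_prev + H @ W_h.T) @ v  # shape (T,)
    e = e - e.max()                            # stabilise the softmax
    alpha = np.exp(e) / np.exp(e).sum()        # alpha_ij, sums to 1 over j
    c = alpha @ H                              # c_i = sum_j alpha_ij h_j
    return alpha, c

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 6, 8, 5, 7  # hypothetical sizes
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)
alpha, c = attention_context(s_prev, H, W_s, W_h, v)
```

Stacking the `alpha` vectors for every output step `i` gives the matrix of $\alpha_{ij}$ weights that an attention visualization displays.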

69 Seq2Seq Learning Improvements

70 Seq2Seq Learning Attention Visualization The picture shows the matrix of $\alpha_{ij}$, where $\alpha_{ij}$ is the relevance of input word $j$ to output word $i$. White is 1, black is 0.

71 Seq2Seq Learning End Thank you!
