Recurrent Neural Nets II. Steven Spielberg Pon Kumar, Tingke (Kevin) Shen. Machine Learning Reading Group, Fall 2016. 9 November 2016.
Outline: 1. Introduction 2. Problem Formulations with RNNs 3. LSTM for Optimization 4. Seq2Seq Learning
Introduction Feed-Forward Neural Networks (NN)
Introduction Rolling NN over time
Introduction Computation Flow in RNN
Introduction RNN Representation
Problem Formulations with RNNs Time Series Prediction Given $x_{t-3}, x_{t-2}, x_{t-1}, x_t$, find $x_{t+1}$
Problem Formulations with RNNs Time Series Prediction - Learning Error: $e = \sum_{i=1}^{n} (\hat{x}^i_{t+1} - x^i_{t+1})^2$. Update weights $\theta$ with error $e$ using Backpropagation Through Time. E.g., weather forecasting, stock prediction, etc.
Problem Formulations with RNNs Implementation
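Below is a minimal NumPy sketch of this formulation (not the original slide's code): a vanilla RNN rolled over a toy sine-wave window to predict the next value. All sizes, names, and the toy data are illustrative assumptions.

```python
# One-step-ahead time-series prediction with a vanilla RNN, NumPy only.
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 20                      # hidden size, unrolled window length

# Parameters: input->hidden, hidden->hidden, hidden->output.
Wxh = rng.normal(0, 0.1, (H, 1))
Whh = rng.normal(0, 0.1, (H, H))
Why = rng.normal(0, 0.1, (1, H))

def forward(xs):
    """Roll the RNN over xs (length T) and predict x_{t+1} at the last step."""
    h = np.zeros((H, 1))
    for x in xs:
        h = np.tanh(Wxh @ np.array([[x]]) + Whh @ h)
    return (Why @ h).item()

# Toy data: predict the next point of a sine wave from the previous T points.
series = np.sin(np.linspace(0, 8 * np.pi, 500))
window, target = series[:T], series[T]
pred = forward(window)
print(f"error e = (x_hat - x)^2 = {(pred - target) ** 2:.4f}")
# Training would backpropagate e through the unrolled graph (BPTT)
# and update Wxh, Whh, Why by gradient descent.
```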
Problem Formulations with RNNs Sentence Classification - RNN Error: $e = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$. Update weights $\theta$ with error $e$ using Backpropagation Through Time.
Problem Formulations with RNNs Sentence Classification - Bidirectional RNN Error: $e = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$. Update weights $\theta$ with error $e$ using Backpropagation Through Time.
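A minimal NumPy sketch of the bidirectional classifier and the cross-entropy error above; shapes, names, and the random toy sentence are illustrative assumptions, not the slide's original code.

```python
# Bidirectional RNN sentence classifier: run one RNN in each direction,
# concatenate the final hidden states, classify with a softmax.
import numpy as np

def run_rnn(xs, Wxh, Whh):
    """Vanilla RNN over a list of word vectors; returns final hidden state."""
    h = np.zeros(Whh.shape[0])
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)
    return h

rng = np.random.default_rng(0)
D, H, C = 8, 4, 3                             # word dim, hidden dim, classes
sentence = [rng.normal(size=D) for _ in range(5)]

params = lambda: (rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)))
h_fwd = run_rnn(sentence, *params())          # left-to-right pass
h_bwd = run_rnn(sentence[::-1], *params())    # right-to-left pass
h = np.concatenate([h_fwd, h_bwd])            # bidirectional: concatenate

W_out = rng.normal(0, 0.1, (C, 2 * H))
logits = W_out @ h
y_hat = np.exp(logits) / np.exp(logits).sum() # softmax over classes
y = np.eye(C)[1]                              # one-hot true label
e = -(y * np.log(y_hat)).sum()                # cross-entropy from the slide
print(f"cross-entropy e = {e:.4f}")
```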
Problem Formulations with RNNs Character Level RNN
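A minimal NumPy sketch of character-level sampling: feed each sampled character back in as the next input. Weights are random here, so the output is gibberish; the toy vocabulary and sizes are illustrative assumptions.

```python
# Sampling from a character-level RNN (in the spirit of char-rnn).
import numpy as np

rng = np.random.default_rng(0)
chars = list("helo ")                 # toy vocabulary
V, H = len(chars), 8

Wxh = rng.normal(0, 0.1, (H, V))
Whh = rng.normal(0, 0.1, (H, H))
Why = rng.normal(0, 0.1, (V, H))

def sample(seed_idx, n):
    """Generate n characters, feeding each sample back as the next input."""
    h = np.zeros(H)
    x = np.eye(V)[seed_idx]           # one-hot encoding of the seed char
    out = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h)
        p = np.exp(Why @ h)
        p /= p.sum()                  # softmax over characters
        idx = rng.choice(V, p=p)
        out.append(chars[idx])
        x = np.eye(V)[idx]
    return "".join(out)

print(sample(seed_idx=0, n=20))
```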
Problem Formulations with RNNs Sampled Examples from Character level RNN
Problem Formulations with RNNs Dynamic Systems
Problem Formulations with RNNs Dynamic Systems Example https://youtu.be/meyqhkfwupg
LSTM for Optimization Optimization with LSTM Gradient descent: $\theta_{t+1} = \theta_t + \alpha\, g(\nabla f(\theta))$, where $g(\nabla f(\theta))$ is a handcrafted update rule. Learning the gradient descent update rule: $\theta_{t+1} = \theta_t + g(\nabla f(\theta), \phi)$, where $\phi$ are the parameters of the LSTM.
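A minimal sketch of the idea, with one loud caveat: in the paper, g is a coordinatewise LSTM with parameters phi, while here g is a toy linear rule so the example stays self-contained and runnable. Everything below is an illustrative assumption, not the paper's implementation.

```python
# Learned update rule: replace the handcrafted step -alpha * grad with a
# parametric function g(grad; phi), trained so that f decreases quickly.
import numpy as np

def f(theta):                  # optimizee: a simple quadratic
    return ((theta - 3.0) ** 2).sum()

def grad_f(theta):
    return 2.0 * (theta - 3.0)

def g(grad, phi):
    """Stand-in for the paper's LSTM optimizer: here, a scaled negative
    gradient, where phi plays the role of the learned parameters."""
    return -phi * grad

theta = np.array([10.0, -4.0])
phi = 0.2                      # would itself be trained by gradient descent
for t in range(50):
    theta = theta + g(grad_f(theta), phi)   # theta_{t+1} = theta_t + g(., phi)
print(theta, f(theta))         # approaches the minimum at [3, 3]
```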
LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent
LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent Figure 4: Comparisons between learned and hand-crafted optimizers' performance (learned optimizers shown with solid lines, hand-crafted with dashed lines; optimizers compared: LSTM, LSTM+GAC, NTM-BFGS, ADAM, RMSprop, Rprop, Adadelta, Adagrad, SGD). Units for the y axis in the MNIST plots are logits. Left: performance of different optimizers on randomly sampled 10-dimensional quadratic functions. Center: the LSTM optimizer outperforms standard methods training the base network on MNIST. Right: learning curves for steps 100-200 by an optimizer trained to optimize for 100 steps (continuation of center plot).
LSTM for Optimization Learning to Learn Gradient Descent by Gradient Descent Figure 5: Comparisons between learned and hand-crafted optimizers' performance. Units for the y axis are logits. Left: generalization to a different number of hidden units (40 instead of 20). Center: generalization to a different number of hidden layers (2 instead of 1); this optimization problem is very hard, because the hidden layers are very narrow. Right: training curves for an MLP with 20 hidden units using ReLU activations; the LSTM optimizer was trained on an MLP with sigmoid activations. Figure 6: Examples of images styled using the LSTM optimizer. Each triple consists of the content image (left), style (right), and the image generated by the LSTM optimizer (center). Left: the result of applying the training style at the training resolution to a test image. Right: the result of applying a new style to a test image at double the resolution on which the optimizer was trained.
Seq2Seq Learning Sequence to Sequence Learning We covered how to use LSTMs with fixed-length inputs and fixed-length outputs. Sequence-to-sequence learning uses two RNNs to solve the general problem of mapping an input sequence to an output sequence of a different length. E.g., machine translation, chatbots, image captioning, etc.
Seq2Seq Learning Machine Translation
Seq2Seq Learning Machine Translation Error: $e = -\sum_{i=1}^{n} \sum_{t=1}^{T} y^i_t \log(\hat{y}^i_t)$. Update weights $\theta$ with error $e$ using Backpropagation Through Time.
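A worked NumPy example of this loss: cross-entropy summed over timesteps and sentences. All shapes and the random logits are illustrative assumptions.

```python
# Sequence cross-entropy: e = -sum_i sum_t y_t^i log(y_hat_t^i).
import numpy as np

rng = np.random.default_rng(0)
n, T, V = 2, 4, 5                       # sentences, timesteps, vocab size

logits = rng.normal(size=(n, T, V))     # decoder outputs per timestep
y_hat = np.exp(logits)
y_hat /= y_hat.sum(axis=-1, keepdims=True)      # softmax over the vocabulary

targets = rng.integers(0, V, size=(n, T))       # true word index per timestep
y = np.eye(V)[targets]                          # one-hot targets, (n, T, V)

e = -(y * np.log(y_hat)).sum()          # summed over sentences and timesteps
print(f"e = {e:.4f}")
# BPTT pushes d e / d logits back through the decoder, then the encoder.
```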
Seq2Seq Learning GPU Implementation Diagram: the softmax (80k words by 1000 dims, "this is very big!") is split across 4 GPUs; the LSTM layers (1000 LSTM cells, 2000 dims per timestep, 2000 x 4 = 8k dims per sentence) run on the remaining GPUs; the input language has a 160k-word vocabulary.
Seq2Seq Learning Projection of the Encoder LSTM Excerpt from the paper: the LSTM outperforms the baseline by a sizeable margin, despite its inability to handle out-of-vocabulary words, and is within 0.5 BLEU points of the previous state of the art after rescoring the 1000-best list of the baseline system; it also did surprisingly well on long sentences. Figure 2: a 2-dimensional PCA projection of the LSTM hidden states obtained after processing phrases such as "Mary admires John" / "John admires Mary" and "I gave her a card in the garden" / "In the garden, she gave me a card"; the phrases are clustered by meaning.
Seq2Seq Learning Seq2Seq Application: Video to Text Input: https://www.youtube.com/watch?v=iidyue5qm0a Output: A monkey is pulling a dog's tail and is chased by the dog. Supervised problem: Given video and sentence pairs
Seq2Seq Learning Why Describe Videos? Robotics applications: human to robot interaction Describing videos for the blind Video indexing Just because
Seq2Seq Learning Alternative Models Holistic video representation: train classifiers to suggest subjects, objects, and actions; combine objects/actions with a language model using a graphical model and real-world knowledge; pick the most probable (subject, action, object) triplet; insert the triplet into a sentence template. Fixed-length encoding: fix the video sequence length, encode the video with a vanilla NN, and output the sentence using an LSTM. Holistic video representation: http://www.aclweb.org/anthology/w13-1302
Seq2Seq Learning Recall: Why Sequence to Sequence? The model sees the entire input sequence before starting to output. The output sequence length is not constrained to equal the input sequence length.
Seq2Seq Learning Model
Seq2Seq Learning Predictions We want to pick the most probable sequence of words. Option 1: generate sentences greedily, picking the highest-softmax-probability word at each time step. Option 2: use beam search, e.g., try the top 3 most probable words at each time step and keep only the top 3 most probable sequences so far (see the sketch below).
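A minimal beam-search sketch matching the slide's width-3 example. Here next_probs is a hypothetical stand-in for the decoder's softmax; everything below is an illustrative assumption, not the S2VT code.

```python
# Beam search over word sequences, beam width 3.
import numpy as np

rng = np.random.default_rng(0)
V, BEAM, STEPS = 10, 3, 5

def next_probs(seq):
    """Stand-in for p(word | sequence so far); a real model runs the RNN."""
    rng2 = np.random.default_rng(hash(tuple(seq)) % (2**32))
    p = rng2.random(V)
    return p / p.sum()

beams = [([], 0.0)]                      # (sequence, log-probability)
for _ in range(STEPS):
    candidates = []
    for seq, logp in beams:
        p = next_probs(seq)
        for w in np.argsort(p)[-BEAM:]:  # top-3 words for this beam
            candidates.append((seq + [int(w)], logp + np.log(p[w])))
    # Keep only the top-3 most probable sequences so far.
    beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:BEAM]

for seq, logp in beams:
    print(seq, f"log p = {logp:.3f}")
```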
Seq2Seq Learning Examples Demo: https://www.youtube.com/watch?v=per0mjzsyam Paper, code, examples: https://www.cs.utexas.edu/~vsub/s2vt.html
Seq2Seq Learning Seq2Seq Models and Attention Problem: basic seq2seq RNN models cannot handle very long sequences. Paper: Neural Machine Translation by Jointly Learning to Align and Translate, by Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio.
Seq2Seq Learning Basic Seq2Seq Model $p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c)$, where $c$ is a single fixed context vector for the whole output.
Seq2Seq Learning Attention Seq2Seq Model $p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$, where $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$
Seq2Seq Learning Encoder: Bidirectional RNN Two independent RNNs, one in each direction. The overall hidden state is the concatenation of the two independent hidden states. The hidden state at each time step contains information from the entire input; $h_j$ is influenced most by the inputs around $x_j$.
Seq2Seq Learning Decoder: Attention $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$, $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$, $e_{ij} = a(s_{i-1}, h_j)$
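A minimal NumPy sketch of these equations, using the paper's one-layer-MLP alignment model $a(s_{i-1}, h_j) = v^\top \tanh(W_a s_{i-1} + U_a h_j)$; sizes and the random states are illustrative assumptions.

```python
# Attention: scores e_ij, softmax to alpha_ij, context c_i as weighted sum.
import numpy as np

rng = np.random.default_rng(0)
T, H = 6, 4                              # input length, hidden size

h = rng.normal(size=(T, H))              # encoder hidden states h_1..h_T
s_prev = rng.normal(size=H)              # decoder state s_{i-1}

# Alignment model a(s_{i-1}, h_j): one-layer MLP, as in Bahdanau et al.
Wa, Ua = rng.normal(size=(H, H)), rng.normal(size=(H, H))
v = rng.normal(size=H)
e = np.array([v @ np.tanh(Wa @ s_prev + Ua @ h[j]) for j in range(T)])

alpha = np.exp(e) / np.exp(e).sum()      # alpha_ij = softmax_j(e_ij)
c = alpha @ h                            # c_i = sum_j alpha_ij h_j, shape (H,)
print(alpha.round(3), c.round(3))
```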
Seq2Seq Learning Improvements
Seq2Seq Learning Attention Visualization The picture shows the matrix of $\alpha_{ij}$: the relevance of input word $j$ to output word $i$. White is 1, black is 0.
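A minimal matplotlib sketch of this visualization; the random matrix here is an illustrative stand-in for real attention weights.

```python
# Plot the alpha_ij matrix in grayscale: white ~ 1, black ~ 0.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
alpha = rng.random((7, 9))                       # output words x input words
alpha /= alpha.sum(axis=1, keepdims=True)        # each row sums to 1

plt.imshow(alpha, cmap="gray", vmin=0, vmax=1)
plt.xlabel("input word j")
plt.ylabel("output word i")
plt.show()
```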
Seq2Seq Learning End Thank you!