Gated Recurrent Models. Stephan Gouws & Richard Klein


Outline Part 1: Intuition, Inference and Training
- Building intuitions: from feedforward to recurrent models
- Inference in RNNs: forward propagation (fprop)
- Training in RNNs: backpropagation through time (BPTT)
SHORT BREAK

Outline Part 2: Gated Models & Applications
- Long Short-Term Memory (LSTMs)
- Gated Recurrent Units (GRUs)
- Applications: image captioning; sequence classification (Practical 4: MNIST); language modeling; sequence labeling (lots of NLP tasks, e.g. POS tagging, NER, ...); sequence-to-sequence learning (machine translation, dialogue modeling, ...)

Recurrent Models PART 1: Intuition, Inference and Training

Introduction

We need to be able to remember information from previous time steps

Recurrent neural networks: Intuition

Long-term dependencies: Why do they matter? "Michel C. was born in Paris, France. He is married and has three children. He received a M.S. in neurosciences from the University Pierre & Marie Curie and the Ecole Normale Supérieure in 1987, and then spent most of his career in Switzerland, at the Ecole Polytechnique de Lausanne. He specialized in child and adolescent psychiatry and his first field of research was severe mood disorders in adolescents, topic of his PhD in neurosciences (2002). His mother tongue is _____." With only the short context ("His mother tongue is ..."), many answers are plausible (English, German, Russian, French); the long context ("Michel C. was born in Paris, France ...") tells us the answer is French. Problem: the relevant information can lie many time-steps in the past.

Types of Sequence Models: FFNs, Image Captioning, MNIST predictor, Seq2Seq, Sequence labeling (e.g. NER). We'll talk about these a little later, and we'll also implement one in today's practical!

FFNs vs RNNs. Classify the following examples (handwritten digits).

FFNs vs RNNs. A feedforward network maps each input X through a tanh hidden layer H and a softmax to the output Y, treating every example independently: PREDICT: 1, then 9, then 2 for the three digit images.

FFNs vs RNNs. But what if these were not single digits, but longer numbers? Problem: variable-length inputs.

FFNs vs RNNs. An RNN cell instead keeps an internal state H (memory) that is fed back in at the next step: the inputs at step t are Xt and the state H from t-1, and the outputs are Yt (via tanh and softmax) and the updated state. Reading the digits one at a time, it predicts 1, then 19, then 192: it maintains a state (memory) that carries information between inputs!
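The digit example can be boiled down to a toy sketch (plain Python, not the slides' code): the "state" is just the number built so far, and each new input updates it, exactly the 1 -> 19 -> 192 progression above.

    # Toy sketch of "state carries information between inputs":
    # the state is the number built so far; each digit updates it.
    def update_state(prev_state, digit):
        return prev_state * 10 + digit

    state = 0
    for digit in [1, 9, 2]:
        state = update_state(state, digit)
        print(state)   # 1, then 19, then 192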

The RNN API: recurrent_fn() takes the current input x and prev_state, and returns the outputs and next_state.

The RNN Computation Graph: the output yt is computed from the input xt and the state from step t-1; this feedback loop is the network's state / memory (the previous time-step).

Unrolling the RNN Computation Graph: starting from h0, each step consumes xt and the previous state and produces yt and ht (x1 -> y1, h1; x2 -> y2, h2; ...), unrolled over n time-steps.


Unrolling the RNN Computation Graph. NB: we reuse the same weights at every time-step! We can therefore think of an RNN as a composition of identical feedforward neural networks (with replicated/tied weights), one for each moment or step in time.
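As a concrete sketch of this unrolling (a hypothetical NumPy example with made-up sizes, not code from the slides), note that the same weight matrices are applied at every step; only the state changes:

    import numpy as np

    # Minimal unrolling sketch: the SAME weights are reused at every time-step.
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(8, 4))   # input -> hidden
    W_hh = rng.normal(scale=0.1, size=(8, 8))   # hidden -> hidden (recurrent)
    b_h = np.zeros(8)

    def step(h_prev, x):
        # One "copy" of the network, with weights tied to all the other copies.
        return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

    h = np.zeros(8)                                # h0
    xs = [rng.normal(size=4) for _ in range(5)]    # x1..x5
    states = []
    for x in xs:                                   # unrolled over 5 time-steps
        h = step(h, x)
        states.append(h)
    print(len(states), states[-1].shape)           # 5 (8,)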

The FFN API

    class FeedForwardModel():
        # ...
        def forward(self, x):
            # Compute activations on the hidden layer.
            hidden_layer = self.act_fn(np.dot(self.w_xh, x) + self.b)
            # Compute the (linear) output layer activations.
            y = np.dot(self.w_hy, hidden_layer)
            return y

The RNN API: the new state is a recurrent function of the input at the current time-step and the previous state.

    class RecurrentModel():
        # ...
        def recurrent_fn(self, x, prev_state):
            # Compute the new state based on the previous state and current input.
            new_state = self.act_fn(np.dot(self.w_xh, x)
                                    + np.dot(self.w_hh, prev_state)
                                    + self.b)
            # Compute the output vector.
            y = np.dot(self.w_hy, new_state)
            return new_state, y

The RNN API

    def forward(self, data_sequence, initial_state):
        state = initial_state
        loss = 0.0
        cache = []
        for x, y in data_sequence:
            new_state, y_pred = self.recurrent_fn(x, state)
            loss += cross_entropy(y_pred, y)
            cache.append((new_state, y_pred))
            state = new_state
        return loss, cache

Math: FFNs v RNNs. NOTATION: Wxh is a matrix that maps a vector x into a vector h.
FFN: the hidden layer is h = f(Wxh x + b), where f is the activation function and x is the input; the output is y = Why h.
RNN: the new state is ht = f(Wxh xt + Whh ht-1 + b), where xt is the input at the current time-step, Whh are the recurrent weights and ht-1 is the previous state; the output is yt = Why ht.

Inference & Training. How do we make predictions using RNNs? Forward propagation (fprop): essentially a composition of functions, a2 = f2(f1(x)); we unroll the computational graph over time-steps. How do we train RNNs? Backpropagation through time (BPTT): we need to consider predictions over several time-steps (credit assignment over time), so we work backwards in time from the last state to the first.

Training: Ways to Train RNNs. Echo State Networks: initialize Wxh, Whh, Who carefully, then only train Who! Backpropagation through time (BPTT): propagate errors backwards through the unrolled graph. There are other options.

Training: ESNs. Simple solution: don't train the recurrent weights (Whh & Wxh)! Initialization is very important, but the method is super simple. However, with recent improvements in initialization etc., BPTT does better! [Scholarpedia]
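A rough sketch of the echo-state idea (sizes, scalings and the ridge-regression readout are illustrative assumptions, not the slides' recipe): fix random Wxh and Whh, run the reservoir forward, and fit only the readout by least squares.

    import numpy as np

    # Echo State Network sketch: recurrent weights stay fixed; only the readout is trained.
    rng = np.random.default_rng(1)
    n_in, n_res, n_out, T = 3, 50, 2, 200

    W_xh = rng.uniform(-0.5, 0.5, size=(n_res, n_in))    # fixed (never trained)
    W_hh = rng.uniform(-0.5, 0.5, size=(n_res, n_res))   # fixed (never trained)
    W_hh *= 0.9 / np.abs(np.linalg.eigvals(W_hh)).max()  # keep spectral radius < 1

    X = rng.normal(size=(T, n_in))    # toy input sequence
    Y = rng.normal(size=(T, n_out))   # toy targets

    # Run the fixed reservoir forward and collect its states.
    H = np.zeros((T, n_res))
    h = np.zeros(n_res)
    for t in range(T):
        h = np.tanh(W_xh @ X[t] + W_hh @ h)
        H[t] = h

    # Train ONLY the readout, here with ridge regression.
    lam = 1e-3
    W_ho = np.linalg.solve(H.T @ H + lam * np.eye(n_res), H.T @ Y)
    print(W_ho.shape)   # (50, 2)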


Training: BPTT Intuition

Training: Truncated BPTT
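To make "truncated" concrete, here is a toy scalar example (an illustrative sketch with a one-weight linear "RNN" and squared error, not the practical's code): the state is carried forward across chunks, but gradients only flow backwards within each chunk.

    import numpy as np

    # Truncated BPTT sketch on a tiny scalar "RNN" h_t = w*h_{t-1} + x_t.
    rng = np.random.default_rng(2)
    w = 0.9                                   # the single (tied) recurrent weight
    xs = rng.normal(size=20)                  # a long input sequence
    targets = rng.normal(size=20)
    chunk_len = 5

    h, grad_w = 0.0, 0.0
    for start in range(0, len(xs), chunk_len):
        h_in = h                              # state carried in, treated as a constant
        hs = []
        # ---- forward within the chunk ----
        for x in xs[start:start + chunk_len]:
            h = w * h + x
            hs.append(h)
        # ---- backward within the chunk only (the truncation) ----
        dh = 0.0
        for t in reversed(range(chunk_len)):
            dh += hs[t] - targets[start + t]  # dE_t/dh_t for squared error
            h_prev = hs[t - 1] if t > 0 else h_in
            grad_w += dh * h_prev             # accumulate dE/dw (tied weight)
            dh *= w                           # pass the gradient one step back in time
    print(grad_w)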

Unrolling the RNN Computation Graph (recap): h0, x1 -> y1, h1; x2 -> y2, h2; ...; unrolled over n time-steps up to yt, ht.

Unrolling the RNN Computation Graph. Step 1: compute all the errors E1, E2, ..., Et. Step 2: pass the error back for each time-step, from n back to 1. Step 3: update the weights.

Unrolling the RNN Computation Graph: the error Et at the last step is passed back through ht, then ht-1, ht-2, ..., applying the chain rule through the unrolled graph (the same tied weights appear at every step).

Unrolling the RNN Computation Graph: every output yt-2, yt-1, yt contributes an error Et-2, Et-1, Et. Total error = E1 + E2 + ... + Et; total gradient = the sum of all the dEt/dθ's.

Training: Truncated BPTT Code. The backward pass through each cell is a reusable "lego block"; the total gradient is the sum of these lego-gradients over time!

    def bptt(model, x_train, y_train, initial_state):
        # Forward pass over the whole (truncated) sequence.
        loss, caches = forward(x_train, y_train, model, initial_state)
        avg_loss = loss / y_train.shape[0]

        # Backward pass: one "lego block" per time-step, from the last step back to the first.
        dh_next = np.zeros((1, caches[-1][0].shape[0]))
        grads = {k: np.zeros_like(v) for k, v in model.items()}
        for t in reversed(range(len(x_train))):
            y_pred_t = caches[t][1]   # the prediction made at step t during fprop
            grad, dh_next = cell_fn_backward(y_pred_t, y_train[t], dh_next, caches[t])
            for k in grads:
                grads[k] += grad[k]   # total gradient = sum of the per-step gradients
        return grads, avg_loss
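Once bptt() returns the summed "lego" gradients, a plain SGD step closes the training loop. A hedged sketch, assuming model and grads are matching dicts of NumPy arrays as above (the learning rate is an arbitrary choice):

    # Outer training loop around bptt() (sketch, not the practical's code).
    def sgd_step(model, grads, lr=0.01):
        for k in model:
            model[k] -= lr * grads[k]   # gradient descent on the tied weights
        return model

    # for epoch in range(num_epochs):
    #     grads, avg_loss = bptt(model, x_train, y_train, initial_state)
    #     model = sgd_step(model, grads)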

Vanilla RNN Gradient Flow
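The gradient-flow pictures boil down to repeated multiplication by the same recurrent Jacobian. A small numerical illustration (all values assumed; the tanh' factors, which only make the shrinking worse, are ignored):

    import numpy as np

    # Backprop through a vanilla RNN multiplies the gradient by W_hh^T at every step,
    # so its norm shrinks or grows roughly geometrically with the number of steps.
    rng = np.random.default_rng(3)
    for scale in (0.05, 0.5):                      # "small" vs "large" recurrent weights
        W_hh = rng.normal(scale=scale, size=(32, 32))
        g = np.ones(32)                            # pretend this is dE/dh_T
        norms = []
        for _ in range(50):                        # 50 steps back in time
            g = W_hh.T @ g
            norms.append(np.linalg.norm(g))
        print(scale, norms[0], norms[-1])          # vanishes for 0.05, explodes for 0.5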

Vanilla RNN Gradient Flow: this is what brings us to Part 2!

Gated Recurrent Models PART II: Gated Architectures & Applications

RECAP: The RNN API. recurrent_fn() computes the new state (next_state) and the outputs from the input x at the current time-step and the previous state (prev_state).

The GATED RNN API: it is the same! Just a different way of computing the outputs: inside recurrent_fn(), a controller uses gates to read from and write to a memory cell, mapping (x, prev_state) to (outputs, next_state).

Implementing a memory cell in a neural network

Propagating through a memory cell

Backpropagating through a memory cell?

LSTM = Long Short-Term Memory
Vector sizes: the input Xt has size p, the state Ht and cell Ct have size n, the output Yt has size m.
concatenate:  X = [Xt, Ht-1]          (size p+n)
forget gate:  f = σ(X.Wf + bf)        (size n)
update gate:  u = σ(X.Wu + bu)        (size n)
result gate:  r = σ(X.Wr + br)        (size n)
input:        X' = tanh(X.Wc + bc)    (size n)
new C:        Ct = f * Ct-1 + u * X'  (size n)
new H:        Ht = r * tanh(Ct)       (size n)
output:       Yt = softmax(Ht.W + b)  (size m)

LSTM (walking through the diagram): Xt and Ht-1 are concatenated; the first σ layer decides what to forget, and Ct-1 is then actually forgotten (element-wise multiplication); a second σ layer decides what to update and a tanh layer proposes the new value, which are used to actually update the cell (added into Ct) — updated!; the result gate (another σ layer) then calculates the result Ht = r * tanh(Ct), which is remembered for the next time step and also produces the output Yt. (The σ and tanh boxes are neural-net layers; *, + are element-wise operations.)

LSTM = Long Short-Term Memory (summary)
X = [Xt, Ht-1]            (concatenation)
f = σ(X.Wf + bf)
u = σ(X.Wu + bu)
r = σ(X.Wr + br)
X' = tanh(X.Wc + bc)
Ct = f * Ct-1 + u * X'
Ht = r * tanh(Ct)
Yt = softmax(Ht.W + b)
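A direct NumPy transcription of these equations (an illustrative sketch: weight shapes are assumed to be (p+n, n) for the gates and (n, m) for the output, and the gate names follow the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def lstm_step(x_t, h_prev, c_prev, params):
        # One LSTM step, following the slide's equations.
        X = np.concatenate([x_t, h_prev])                  # X = [Xt, Ht-1], size p+n
        f = sigmoid(X @ params["Wf"] + params["bf"])       # forget gate
        u = sigmoid(X @ params["Wu"] + params["bu"])       # update gate
        r = sigmoid(X @ params["Wr"] + params["br"])       # result gate
        x_bar = np.tanh(X @ params["Wc"] + params["bc"])   # candidate input X'
        c_t = f * c_prev + u * x_bar                       # new cell state Ct
        h_t = r * np.tanh(c_t)                             # new hidden state Ht
        y_t = softmax(h_t @ params["W"] + params["b"])     # output Yt
        return h_t, c_t, y_t

    # Toy usage with assumed sizes p=4 (input), n=8 (state), m=3 (output):
    rng = np.random.default_rng(4)
    p, n, m = 4, 8, 3
    params = {k: rng.normal(scale=0.1, size=(p + n, n)) for k in ("Wf", "Wu", "Wr", "Wc")}
    params.update({k: np.zeros(n) for k in ("bf", "bu", "br", "bc")})
    params["W"], params["b"] = rng.normal(scale=0.1, size=(n, m)), np.zeros(m)
    h, c = np.zeros(n), np.zeros(n)
    h, c, y = lstm_step(rng.normal(size=p), h, c, params)
    print(y.sum())   # the softmax output sums to ~1.0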

Gru!

Gated Recurrent Units (GRUs). GRU = Gated Recurrent Unit: 2 gates instead of 3 => cheaper.
Vector sizes: [Xt, Ht-1] has size p+n; z, r and Ht have size n; Yt has size m.
X = [Xt, Ht-1]              (concatenation)
z = σ(X.Wz + bz)
r = σ(X.Wr + br)
X' = [Xt, r * Ht-1]         (size p+n)
X'' = tanh(X'.Wc + bc)
Ht = (1-z) * Ht-1 + z * X''
Yt = softmax(Ht.W + b)
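The matching GRU step in NumPy (same assumed shape conventions as the LSTM sketch above; sigmoid/softmax repeated so the snippet stands alone):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def gru_step(x_t, h_prev, params):
        # One GRU step, following the slide's equations: 2 gates (z, r) instead of 3.
        X = np.concatenate([x_t, h_prev])                   # [Xt, Ht-1], size p+n
        z = sigmoid(X @ params["Wz"] + params["bz"])        # update gate
        r = sigmoid(X @ params["Wr"] + params["br"])        # reset gate
        Xr = np.concatenate([x_t, r * h_prev])              # [Xt, r * Ht-1]
        x_bar = np.tanh(Xr @ params["Wc"] + params["bc"])   # candidate state
        h_t = (1.0 - z) * h_prev + z * x_bar                # interpolate old and new state
        y_t = softmax(h_t @ params["W"] + params["b"])      # output Yt
        return h_t, y_t

    # Toy usage with assumed sizes p=4, n=8, m=3:
    rng = np.random.default_rng(5)
    p, n, m = 4, 8, 3
    params = {k: rng.normal(scale=0.1, size=(p + n, n)) for k in ("Wz", "Wr", "Wc")}
    params.update({k: np.zeros(n) for k in ("bz", "br", "bc")})
    params["W"], params["b"] = rng.normal(scale=0.1, size=(n, m)), np.zeros(m)
    h, y = gru_step(rng.normal(size=p), np.zeros(n), params)
    print(h.shape, y.sum())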

Long Short-term Memory (LSTM): Gradient Flow


Applications/Tasks: image captioning; sequence classification (Practical 4: MNIST); language modeling; sequence labeling (lots of NLP tasks, e.g. POS tagging, NER, ...); sequence-to-sequence learning (machine translation, summarization, ...).

One-to-Many: Image captioning. GOAL: Given an image, generate a sentence to describe its content.

Many-to-one: Sequence Classifier (Prac 4) GOAL: Given a sequence of inputs, predict the label for the whole sequence. Examples: Given a sentence, say if it is {negative, neutral, positive} Given the words in an email, predict if it is a spam message. Given pieces of an image, predict what number is in the image.

Many-to-1: Polarity/Sentiment Classifier. We feed all the words into the model one at a time, and make one prediction at the end: h0 -> h1 ("Cats") -> h2 ("are") -> h3 ("awesome") -> Pos/Neg?

Many-to-1: Spam Classifier. We feed all the words into the model one at a time, and make one prediction at the end: h0 -> h1 ("Viagra") -> h2 ("for") -> h3 ("cheap") -> Spam/Ham?

Many-to-1: Image Classifier (Prac 4). We chop up the image and feed all the pieces through the model, and then make one prediction at the end: h0 -> h1 -> h2 -> h3 -> y_image.
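In code, many-to-one just means: run the recurrent function over the whole sequence and only classify the final state. A minimal sketch with made-up sizes and random (untrained) weights, not the practical's solution:

    import numpy as np

    # Many-to-one sketch: read the whole sequence, predict once at the end.
    rng = np.random.default_rng(6)
    n_in, n_hid, n_classes = 4, 16, 2
    W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
    W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
    W_hy = rng.normal(scale=0.1, size=(n_classes, n_hid))

    def recurrent_fn(x, h_prev):
        return np.tanh(W_xh @ x + W_hh @ h_prev)

    xs = [rng.normal(size=n_in) for _ in range(10)]   # e.g. word vectors or image pieces
    h = np.zeros(n_hid)
    for x in xs:
        h = recurrent_fn(x, h)        # feed everything in, one step at a time
    logits = W_hy @ h                 # ONE prediction, from the final state only
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(probs)                      # e.g. p(positive), p(negative)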

Next-token Prediction: Language modeling. Compute p(next word | previous words), e.g. p(the), p(cat | the), p(sits | the cat), ..., with the RNN states h1, h2, ... summarising the words seen so far.

Next-token Prediction: Language modeling. At each step the RNN reads xt, updates ht, and predicts the next token, unrolled over the whole sequence x1, ..., xn.
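The chain-rule factorisation turns into a simple loop: at each step the softmax over the state gives p(next word | words so far), and the sentence probability is the product. A toy sketch with an assumed five-word vocabulary and random (untrained) weights:

    import numpy as np

    # Language-model sketch: p(sentence) = product over t of p(word_t | words < t).
    rng = np.random.default_rng(7)
    vocab = ["<s>", "the", "cat", "sits", "</s>"]
    idx = {w: i for i, w in enumerate(vocab)}
    V, n_hid = len(vocab), 16
    E = rng.normal(scale=0.1, size=(V, n_hid))         # toy word embeddings
    W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
    W_hy = rng.normal(scale=0.1, size=(V, n_hid))

    def step(h, w):
        return np.tanh(E[idx[w]] + W_hh @ h)           # state summarises the words so far

    def next_word_probs(h):
        z = W_hy @ h
        e = np.exp(z - z.max())
        return e / e.sum()                             # p(next word | previous words)

    sentence = ["the", "cat", "sits", "</s>"]
    h, prev, log_p = np.zeros(n_hid), "<s>", 0.0
    for w in sentence:
        h = step(h, prev)
        log_p += np.log(next_word_probs(h)[idx[w]])    # add log p(w | previous words)
        prev = w
    print(np.exp(log_p))                               # p(the cat sits </s>)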

Many-to-many: Sequence labeling. Mapping each input x1, x2, ..., xn to its own label y1, y2, ..., yn. (Notice: same length n; each input has an output.) A lot of NLP tasks fall in this category, e.g.: Part-of-speech tagging: map words to their parts of speech (noun, verb, etc.). Named-entity recognition: identify mentions of people, places, etc. in text. Semantic role labeling: find the main actions, and who performs them on whom/what.

Many-to-many: Sequence labeling. Part-of-speech tagging.

Many-to-many: Sequence labeling. Example: "Heat water in a large vessel." The RNN reads "heat", "water", "in", ... and emits a tag at every step: VERB, NOUN, ...
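Many-to-many labeling only differs in that we read the softmax at every step instead of just the last one. A small sketch over the slide's example sentence (embeddings, sizes and weights are illustrative assumptions, so the untrained tags are arbitrary):

    import numpy as np

    # Sequence-labeling sketch: one output per input time-step (e.g. POS tagging).
    rng = np.random.default_rng(8)
    words = ["heat", "water", "in", "a", "large", "vessel"]
    n_in, n_hid, n_tags = 8, 16, 5
    W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
    W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
    W_hy = rng.normal(scale=0.1, size=(n_tags, n_hid))
    embed = {w: rng.normal(size=n_in) for w in words}   # toy word vectors

    h = np.zeros(n_hid)
    for w in words:
        h = np.tanh(W_xh @ embed[w] + W_hh @ h)
        scores = W_hy @ h                # a prediction at EVERY time-step
        print(w, int(scores.argmax()))   # predicted tag id (VERB for "heat" etc., once trained)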

Many-to-many: Sequence-to-Sequence Modeling. Mapping input sequences {x1, x2, ..., xm} to output sequences {y1, y2, ..., yn}. Note: arbitrary m and n. Many applications, most notably machine translation. Also: POS tagging, parsing, summarization, dialogue.

Many-to-many: Sequence-to-Sequence Modeling. MACHINE TRANSLATION: an encoder RNN reads "the blue house"; a decoder RNN, started from <s>, generates "la maison bleue <eos>", feeding each output word back in as the next input.
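A minimal encoder-decoder sketch: one RNN compresses the source into its final state, which seeds a second RNN that emits target tokens until <eos> (vocabulary, weights and greedy decoding are all toy assumptions; untrained, so the output is arbitrary):

    import numpy as np

    # Seq2seq sketch: encoder RNN reads the source; decoder RNN generates the target.
    rng = np.random.default_rng(9)
    src_vocab = ["the", "blue", "house"]
    tgt_vocab = ["<s>", "la", "maison", "bleue", "<eos>"]
    n_hid = 16
    src_E = {w: rng.normal(size=n_hid) for w in src_vocab}
    tgt_E = {w: rng.normal(size=n_hid) for w in tgt_vocab}
    W_enc = rng.normal(scale=0.1, size=(n_hid, n_hid))
    W_dec = rng.normal(scale=0.1, size=(n_hid, n_hid))
    W_out = rng.normal(scale=0.1, size=(len(tgt_vocab), n_hid))

    # Encoder: compress "the blue house" into a single state vector.
    h = np.zeros(n_hid)
    for w in src_vocab:
        h = np.tanh(src_E[w] + W_enc @ h)

    # Decoder: generate greedily, feeding each output back in as the next input.
    prev, output = "<s>", []
    for _ in range(10):                     # cap the length
        h = np.tanh(tgt_E[prev] + W_dec @ h)
        prev = tgt_vocab[int((W_out @ h).argmax())]
        if prev == "<eos>":
            break
        output.append(prev)
    print(output)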

Many-to-many: Sequence-to-Sequence Modeling. DIALOGUE MODELING. Prompt: You can't eat pasta every day. You can't live only on Italian food. System: I don't eat Italian food, but I do eat a lot of Italian food.

Many-to-many: Sequence-to-Sequence Modeling. Prompt: What do you have planned tonight? Baseline: I have no idea what I want to do with my life. I have no idea what I want to (Bad, Mediocre, Bad) Ours: I was planning on doing a few of these, but I'm not sure what to choose. I need to find a place to camp, and I don't have a lot of time to make a plan. (Excellent, Good, Mediocre)

Many-to-many: Sequence-to-Sequence Modeling. Prompt: Have you seen Suicide Squad? Baseline: I have not, but I will check it out. (Good, Good, Good) Ours: I saw it in a theater once and it was great. It was very... I don't know, a little dark. I like the psychological effects and the way it portrays the characters in the movie. (Excellent, Excellent, Excellent)

Key take-aways:
- RNNs have memory/state that evolves over time.
- We unroll the graph over time to do forward propagation.
- Backprop-through-time (BPTT) performs the chain rule over the unrolled graph efficiently by saving and reusing previous computations; dE/dW is a sum over all time-steps (because of the tied weights). Vanishing/exploding gradients motivate gated architectures.
- In gated architectures, the state is selectively overwritten per time-step, giving uninterrupted gradient flow through time: no vanishing/exploding gradients!
- Lots of cool applications!

Slide Credits. Thank you to the following resources, from which some of these slides were drawn and adapted: Stanford CS231n; TensorFlow without a PhD.

The end.