Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra

Roadmap: 1. Motivation; 2. Basic Structure; 3. Issues; 4. Variations; 5. Application 1: Sequence Level Training; 6. Application 2: Tree Structure Prediction; 7. Application 3: Image Classification

Motivation: Sequential data is found everywhere. A sentence is a sequence of words, a video is a sequence of images, and time-series data is a sequence across time. Given a sequence (x_1, x_2, ..., x_i), the x_i's are not i.i.d.! They might be strongly correlated. Can we take advantage of the sequential structure?

Machine Learning Tasks: Sequence Prediction - predict the next element in a sequence: (x_1, x_2, ..., x_i) -> x_{i+1}. Sequence Labelling/Seq-to-Seq - assign a label to each element in a sequence: (x_1, x_2, ..., x_i) -> (y_1, y_2, ..., y_i). Sequence Classification - classify the entire sequence: (x_1, x_2, ..., x_i) -> y

Machine Learning Tasks

Hidden Markov Model (HMM) Generative Model Markov Assumption on Hidden States

Hidden Markov Model (HMM) Weaknesses: N states can only encode log(N) bits of information. In practice, generative models tend to be more computationally intensive and less accurate.

Feedforward Neural Networks

Weaknesses of Feedforward Networks Fixed # outputs and inputs Inputs are independent

Recurrent Neural Networks Overcome These Problems: fixed # of inputs and outputs -> variable # of inputs and outputs; independent inputs -> inputs can be correlated

Roadmap - 2. Basic Structure

Recurrent Neural Network - Structure RNNs are neural networks with loops Previous hidden states or predictions used as inputs

Recurrent Neural Network - Structure How you loop the nodes depends on the problem

Recurrent Neural Networks - Training: Unroll the network through time, so it becomes a feedforward network. Train using gradient descent via backpropagation through time (BPTT).

Recurrent Neural Network - Equations, Updating Hidden States: the hidden state s_t at time t is a function of the previous hidden state s_{t-1} and the current input x_t; it encodes prior information ("memory").

Recurrent Neural Network - Equations, Making Sequential Predictions: the output o_t at time t is a function of the current hidden state s_t.
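
A minimal forward pass matching these two equations, sketched in NumPy. The tanh hidden nonlinearity, the softmax output, and the weight names U (input-to-hidden), W (hidden-to-hidden), and V (hidden-to-output) are illustrative assumptions rather than the slides' exact parameterization; training would backpropagate through the unrolled loop (BPTT).

import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, U, W, V, s0):
    """Vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)."""
    s, states, outputs = s0, [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)         # hidden state encodes prior inputs ("memory")
        states.append(s)
        outputs.append(softmax(V @ s))     # prediction at time t
    return states, outputs

# Toy usage: 4-dimensional one-hot inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(4, 3))
xs = [np.eye(4)[i] for i in (0, 1, 2, 2)]
_, outputs = rnn_forward(xs, U, W, V, np.zeros(3))
print(outputs[-1])                         # predicted distribution after the last input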

Backpropagation - Chain Rule

Backpropagation - Chain Rule

Example 1 - Learning A Language Model: [Figure: a character-level language model. Each input letter of "HELL" is one-hot encoded and passed through the hidden states, and the network outputs a predicted distribution over the next letter, ideally spelling out "ELLO".]
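
Concretely, generation feeds a one-hot letter in, reads out the predicted distribution over the next letter, and feeds that prediction back in. A tiny greedy-decoding sketch; the 4-letter vocabulary, the untrained random weights, and argmax decoding are assumptions for illustration only.

import numpy as np

VOCAB = ['h', 'e', 'l', 'o']                          # illustrative character vocabulary
ONE_HOT = {c: np.eye(len(VOCAB))[i] for i, c in enumerate(VOCAB)}

def step(x, s, U, W, V):
    """One RNN step: new hidden state and distribution over the next letter."""
    s = np.tanh(U @ x + W @ s)
    z = V @ s
    p = np.exp(z - z.max())
    return s, p / p.sum()

def generate(seed, n_chars, U, W, V):
    """Greedy character-level generation starting from a seed letter."""
    s, c, out = np.zeros(U.shape[0]), seed, [seed]
    for _ in range(n_chars):
        s, p = step(ONE_HOT[c], s, U, W, V)
        c = VOCAB[int(np.argmax(p))]                  # feed the prediction back in
        out.append(c)
    return ''.join(out)

rng = np.random.default_rng(1)
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(4, 8))
print(generate('h', 4, U, W, V))                      # untrained weights, arbitrary output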

Example 2 - Machine Translation: [Figure: the one-hot encoded English sentence "I like socks" is translated into "J'aime les chaussettes".]

Example 3 - Sentiment Analysis: [Figure: the one-hot encoded sentence "CS159 is lit!1!!!!!" is classified as POSITIVE with probability 0.9.]

Roadmap - 3. Issues

Vanishing and Exploding Gradient: Composing many nonlinear functions can lead to problems. When we backpropagate through time, the gradient involves repeated multiplication by W, so its magnitude changes exponentially with the number of time steps. The behavior depends on the eigenvalues of W: if the eigenvalues are > 1, W^n may cause the gradient to explode; if the eigenvalues are < 1, W^n may cause the gradient to vanish.
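
The eigenvalue claim is easy to check numerically: repeatedly multiplying a vector by the same matrix W shrinks it or blows it up depending on W's largest eigenvalue, which is the mechanism behind vanishing and exploding gradients. A toy illustration (rescaling W to a chosen spectral radius is an assumption made for the demo):

import numpy as np

def backprop_norm(spectral_radius, steps=50, dim=8, seed=0):
    """Norm of W^steps applied to a fixed vector, as a stand-in for the
    gradient that gets multiplied by W at every step of BPTT."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, dim))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))   # rescale eigenvalues
    g = np.ones(dim)
    for _ in range(steps):
        g = W @ g
    return np.linalg.norm(g)

print(backprop_norm(0.9))   # eigenvalues < 1: the norm shrinks toward 0 (vanishing)
print(backprop_norm(1.1))   # eigenvalues > 1: the norm grows exponentially (exploding)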

Short-term Memory: In practice, simple RNNs tend to retain information from only a few time steps in the past. RNNs are good at remembering recent information, but have a harder time remembering relevant information from farther back.

Roadmap - 4. Variations

Long short-term memory (LSTM) Adds a memory cell with input, forget, and output gates Can help learn long-term dependencies Helps solve exploding/vanishing gradient problem

Forget gate layer

Input gate layer

Update cell state

Generate filtered output
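
These four steps correspond to the standard LSTM cell update. A single-step sketch with the usual sigmoid gates and tanh candidate; the concatenated-input parameterization and the weight names are assumptions rather than the slides' own notation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step on the concatenated input [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)           # forget gate layer: what to drop from the cell
    i = sigmoid(Wi @ z + bi)           # input gate layer: what new information to write
    c_tilde = np.tanh(Wc @ z + bc)     # candidate cell values
    c = f * c_prev + i * c_tilde       # update cell state
    o = sigmoid(Wo @ z + bo)           # output gate: what part of the cell to expose
    h = o * np.tanh(c)                 # generate filtered output (new hidden state)
    return h, c

Because the cell state is updated additively (f * c_prev + i * c_tilde), gradients can flow across many time steps without being repeatedly squashed, which is why the LSTM helps with long-term dependencies and the vanishing-gradient problem.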

But wait... there's more: a number of challenging problems remain in sequence learning. Let's take a look at how the papers address these issues.

Roadmap - 5. Application 1: Sequence Level Training

SEQUENCE LEVEL TRAINING WITH RECURRENT NEURAL NETWORKS By: Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba

Text Generation Consider the problem of text generation Machine Translation Summarization Question Answering

Text Generation: We want to predict a sequence of words [w_1, w_2, ..., w_T] (that makes sense). At each time step, the model may take as input some context c_t.

RNN: Let's use an RNN

Text Generation Simple Elman RNN:
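
A standard Elman-style formulation consistent with this description (the weight names here are placeholders, not necessarily the paper's notation):

h_t = \sigma(W_{xh}\, x_t + W_{hh}\, h_{t-1}), \qquad o_t = \mathrm{softmax}(W_{ho}\, h_t)

where x_t encodes the previous word w_{t-1} (plus any context c_t) and o_t is the predicted distribution over the next word.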

Training: Optimize the cross-entropy loss at each time step, given a target sequence [w_1, w_2, ..., w_T].
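
Written out, this teacher-forced training minimizes the per-step cross-entropy (the indexing convention here is an assumption):

L(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_1, \dots, w_{t-1}, c_t)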

Why is this a problem? Notice that the RNN is trained to maximize p_θ(w_{t+1} | w_t, h_{t+1}), where w_t is the ground truth. The loss function causes the model to be good at predicting the next word given the previous ground-truth word.

Why is this a problem? But we don't have the ground truth at test time! Remember that we're trying to generate sequences: at test time, the model needs to use its own predictions to generate the next words.

Why is this a problem? This can lead to predicting sub-optimal sequences The most likely sequence might actually contain a sub-optimal word at some time-step

Techniques: Beam Search - at each point, instead of just taking the highest-scoring word, keep the k best next-word candidates; significantly slows down the word generation process. Data as Demonstrator - during training, with a certain probability take either the model prediction or the ground truth; suffers from alignment issues ("I took a long walk" vs. "I took a walk"). End-to-End Backprop - instead of the ground truth, propagate the top-k words from the previous time step, weighing each word by its score and re-normalizing.
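
A generic beam search sketch for the first technique: keep the k best partial sequences by cumulative log-probability, expand each with the model's next-word distribution, and prune back to k. The next_word_probs callable, the end-of-sequence handling, and the length cap are assumptions, not the paper's implementation.

import heapq
import math

def beam_search(next_word_probs, start_token, k=5, max_len=20, eos='</s>'):
    """next_word_probs(seq) -> dict of candidate next word -> probability."""
    beams = [(0.0, [start_token])]                       # (cumulative log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == eos:                           # finished sequences carry over
                candidates.append((logp, seq))
                continue
            for word, p in next_word_probs(seq).items():
                candidates.append((logp + math.log(p + 1e-12), seq + [word]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]             # best complete hypothesis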

Reinforcement Learning: Suppose our RNN is an agent. Its parameters define a policy that picks an action (the next word). After taking an action, the agent updates its internal state. At the end of the sequence, it observes a reward. Any reward function can be used; the authors use BLEU and ROUGE-2.

REINFORCE We want to find parameters that maximize expected reward Loss is negative expected reward In practice, we actually approximate the expected reward with a single sample...

REINFORCE: For estimating the gradients, a baseline estimator \bar{r} can be subtracted from the reward to reduce variance. The authors use a linear regressor that takes the RNN's hidden states as input to estimate this baseline; it is unclear what the best technique for selecting it is. If the reward exceeds the baseline, the update encourages the sampled word choice, or else it discourages it.
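
In symbols, this is the usual REINFORCE gradient with a baseline (written generically; the paper's per-step notation may differ slightly):

\nabla_\theta L \approx -\sum_t (r - \bar{r}_t)\, \nabla_\theta \log p_\theta(w^s_t \mid w^s_{t-1}, h_t)

A sampled word whose reward r beats the baseline \bar{r}_t has its log-probability pushed up; one that falls short is pushed down.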

MIXED INCREMENTAL CROSS-ENTROPY REINFORCE (MIXER): Instead of starting from a poor random policy, start from an RNN trained on the ground truth with cross-entropy, then phase in REINFORCE according to an annealing schedule.

Results ROUGE-2 score for summarization BLEU-4 score for translation and image captioning

Results: [Figure: comparison of the algorithms with and without beam search of size k.]

Roadmap - 6. Application 2: Tree Structure Prediction

Tree-Structured Decoding with Doubly-Recurrent Neural Networks By: David Alvarez-Melis, Tommi S. Jaakkola

Motivation: Given an encoding as input, generate a tree structure. RNNs are best suited for sequential data, but trees and graphs do not naturally conform to a linear ordering. Various types of sequential data can be represented as trees: parse trees for sentences, abstract syntax trees for computer programs. Problem: generate a full tree structure with node labels using the encoder-decoder framework.

Encoder-Decoder Framework Given ground-truth tree and string representation of tree Use RNN to encode vector representation from string Use RNN to decode tree from vector representation

Example: [Figure: an input tree (nodes labeled B, W, F, J, V under ROOT) is serialized into a string representation via preorder traversal, encoded into a vector representation, and decoded back into an output tree.]

Challenges with decoding: Must generate the tree top-down. We don't know the node labels beforehand. Generating a child node vs. a sibling node: how to pass information? The labels of siblings are not independent; a verb in a parse tree reduces the chance of sibling nodes being verbs.

Doubly Recurrent Neural Networks (DRNN): Ancestral and fraternal (sibling) flows of information. Two input states: receive from the parent node, update, send to descendants; receive from the previous sibling, update, send to the next sibling. [Figure: unrolled DRNN, with nodes labeled in the order generated; solid lines are ancestral connections, dotted lines are fraternal connections.]

Inside a node: Update the node's hidden ancestral and fraternal states, then combine them to obtain a predictive hidden state.
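
In roughly the paper's form (notation assumed here): the ancestral state is recursed from the parent p(i), the fraternal state from the previous sibling s(i), and the two are combined into the predictive state used for the node's output:

h^a_i = g^a(h^a_{p(i)}, x_{p(i)}), \qquad h^f_i = g^f(h^f_{s(i)}, x_{s(i)}), \qquad h^{pred}_i = \tanh(U^a h^a_i + U^f h^f_i)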

Producing an output: Get the probability of stopping the ancestral and fraternal branches. Combine these to produce the output; the binary stop variables are taken from the ground truth during training or sampled from the predicted probabilities at test time. Finally, assign the node label.

Experiment: Synthetic Tree Recovery. Generate a dataset of labeled trees with a vocabulary of 26 letters; condition the label of each node on its ancestors and siblings; the probabilities of children or next siblings depend only on the label and depth. Generate string representations with preorder traversal. An RNN encoder maps the string to a vector; a DRNN decoder with an LSTM module maps the vector to a tree.

Results Node retrieval: 75% Edge retrieval: 71%

Results

Results

Experiment: Computer Program Generation. Mapping sentences to functional programs: given a sentence description of a computer program, generate its abstract syntax tree. The DRNN performed better than all other baselines.

Summary: DRNNs are an extension of sequential recurrent architectures to tree structures, with information flowing from parent to offspring and from sibling to sibling. They performed better than baselines on tasks involving tree generation.

Roadmap - 7. Application 3: Image Classification

Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition By: Zhiwei Deng, Arash Vahdat, Hexiang Hu, Greg Mori

Group Activity Recognition Classification Problem: What is each person doing? What is the scene as a whole doing?

Group Activity Recognition: [Figures: a CNN is applied to each person to predict that person's action, and to the scene; in an example frame the individual actions are labeled waiting, biking, and walking, while the scene as a whole is labeled waiting.] Classification problem: What is each person doing? What is the scene as a whole doing?

Improving Group Activity Recognition - Model Relationships: Individual actions inform other individual actions and the scene-level action. [Figure: an ambiguous person (waiting? walking?) is disambiguated by relationships with others, e.g. several people waiting together.]

Group Activity Recognition: Relationships depend on spatial distance, relative motions, concurrent actions, and the number of people.

Model Relationships Using RNNs RNN structure reflects learning problem Fully connected Each edge represents a relationship Model learns how important each relationship is Train edge weights

Train Using Iterative Message Passing/Gradient Descent

Why do we need multiple iterations of message passing? The graph is cyclic, so exact inference isn't possible.

Graph Model: [Figure: a fully connected graph over person nodes A, B, C and a scene node S.]

Training the Model: For each iteration t: for each edge (i, j), update the messages m_{i->j}(t) and m_{j->i}(t), update the gates g_{i->j}(t) and g_{j->i}(t), and impose the gates on the messages; for each node i, calculate the prediction c_i(t). Output: the final predictions c_i(T) at time T.

General Message-Passing Update Equations: At time t, the message m(t) is a function f of a weighted sum of the input features x, the last message m(t-1), and the last prediction c(t-1). At time t, the prediction c(t) is a function f of a weighted sum of the input features x and the current message m(t).
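
Putting the training outline and these update equations together, the inference loop can be sketched as gated message passing over the fully connected graph. This is a schematic, not the authors' implementation: the weight names, shapes, scalar gates, and sigmoid/tanh activation choices are all assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def structure_inference(x, edges, W, T=3, msg_dim=16):
    """Schematic gated message passing for group activity recognition.

    x[i]  : input feature vector for node i (person or scene CNN features)
    edges : directed pairs (i, j); a fully connected graph lists both directions
    W     : dict of weight matrices/vectors (names and shapes are assumptions)
    """
    nodes = list(x)
    m = {e: np.zeros(msg_dim) for e in edges}                # messages m_{i->j}(t)
    c = {i: sigmoid(W['cx'] @ x[i]) for i in nodes}          # initial predictions c_i(0)

    for _ in range(T):
        new_m = {}
        for (i, j) in edges:
            prev_in = sum(m[(k, dst)] for (k, dst) in edges if dst == i)
            # message: activation of a weighted sum of the sender's features,
            # its previous prediction, and the previous messages it received
            msg = np.tanh(W['mx'] @ x[i] + W['mc'] @ c[i] + W['mm'] @ prev_in)
            # scalar gate: how useful is this message to the target node j?
            gate = sigmoid(W['gx'] @ x[j] + W['gc'] @ c[j] + W['gm'] @ msg)
            new_m[(i, j)] = gate * msg                       # impose the gate
        m = new_m
        for i in nodes:
            cur_in = sum(m[(k, dst)] for (k, dst) in edges if dst == i)
            # prediction: activation of a weighted sum of features and current messages
            c[i] = sigmoid(W['cx'] @ x[i] + W['cm'] @ cur_in)
    return c                                                 # final predictions c_i(T)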

RNN Model: [Figure: graph with person nodes A, B, C and scene node S.]

RNN Model - Get Feature Inputs: [Figure: a CNN extracts individual features for each person node (A, B, C), and a CNN extracts scene features for the scene node S.]

RNN Model - Initialize Messages at Time t=0

Training the Model - Updating Messages at Time t: the message is an activation function applied to a weighted sum of (1) the input features of the messaging node, (2) the previous prediction by the messaging node, and (3) the previous messages to the messaging node. [Figures: messages passed along the edges between nodes A, B, C and S.]

Training the Model For each iteration t For each edge (i, j) Update messages mi -> j(t) and mj -> i(t) Update gates gi -> j(t) and gj -> i(t) Impose gates on messages For each node i Calculate prediction ci(t) Output: Final predictions at time T, ci(t)

Training the Model - Updating Gates at Time t: the gate is an activation function applied to a weighted sum of (1) the input features of the target node, (2) the previous prediction by the target node, and (3) the current messages to the target node. [Figures: gates on the edges between nodes A, B, C and S.]

Training the Model - Imposing Gates at Time t: just multiply the message vectors by the gate scalars.

Training the Model - Updating Gates at Time t: Gate intuition - a gate determines whether an edge is useful and modulates the importance of that edge.

Training the Model - Imposing Gates: over several iterations, the model learns to ignore some edges and focus more on others.

Training the Model - Getting Predictions c_i(t) at Time t: the prediction is an activation function applied to a weighted sum of (1) the input features of the node and (2) the current messages to the node. [Figures: predictions at nodes A, B, C and S.]

Output - Getting Final Predictions c_i(T) at Time T

Applied to the Collective Activity Dataset: 44 videos, 7 actions (Crossing, Waiting, Queueing, Talking, Jogging, Dancing, N/A).

Visualization of Results on Collective Activity Dataset

Results on Collective Activity Dataset

Key Takeaways: Previous outputs and hidden states can be looped back in as inputs, letting RNNs model problems where the inputs are highly correlated.

thanks