Recurrent Neural Networks. Nand Kishore, Audrey Huang, Rohan Batra

1 Recurrent Neural Networks Nand Kishore, Audrey Huang, Rohan Batra

2 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

3 Motivation Sequential data is found everywhere: a sentence is a sequence of words, a video is a sequence of images, and time-series data is a sequence across time. Given a sequence (x1, x2, ..., xi), the xi's are not i.i.d.! They might be strongly correlated. Can we take advantage of this sequential structure?

4 Machine Learning Tasks Sequence Prediction: predict the next element in a sequence, (x1, x2, ..., xi) -> xi+1. Sequence Labelling/Seq-to-Seq: assign a label to each element in a sequence, (x1, x2, ..., xi) -> (y1, y2, ..., yi). Sequence Classification: classify the entire sequence, (x1, x2, ..., xi) -> y

5 Machine Learning Tasks

6 Hidden Markov Model (HMM) Generative Model Markov Assumption on Hidden States

7 Hidden Markov Model (HMM) Weaknesses: N states can only encode log(N) bits of information. In practice, generative models tend to be more computationally intensive and less accurate.

8 Feedforward Neural Networks

9 Weaknesses of Feedforward Networks Fixed number of inputs and outputs. Inputs are treated as independent.

10 Recurrent Neural Networks Overcome These Problems Feedforward networks: fixed number of inputs and outputs, inputs are independent. Recurrent networks: variable number of inputs and outputs, inputs can be correlated.

11 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

12 Recurrent Neural Network - Structure RNNs are neural networks with loops Previous hidden states or predictions used as inputs

13 Recurrent Neural Network - Structure How you loop the nodes depends on the problem

14 Recurrent Neural Networks - Training Unroll the network through time so that it becomes a feedforward network. Train using gradient descent with backpropagation through time (BPTT).

15 Recurrent Neural Network - Equations Updating the hidden state st at time t: st is a function of the previous hidden state st-1 and the current input xt, typically st = f(U xt + W st-1); it encodes prior information ("memory").

16 Recurrent Neural Network - Equations Making sequential predictions: the output ot at time t is a function of the current hidden state st, typically ot = softmax(V st).
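
To make the two equation slides above concrete, here is a minimal numpy sketch of a vanilla (Elman-style) RNN forward pass; the weight names U, W, V and the tanh/softmax choices are common conventions assumed here, not notation taken from the slides.

    import numpy as np

    def rnn_forward(xs, U, W, V, s0):
        # s_t = tanh(U x_t + W s_{t-1})  -- hidden state update ("memory")
        # o_t = softmax(V s_t)           -- prediction from the current state
        s = s0
        states, outputs = [], []
        for x in xs:
            s = np.tanh(U @ x + W @ s)
            logits = V @ s
            o = np.exp(logits - logits.max())
            o = o / o.sum()
            states.append(s)
            outputs.append(o)
        return states, outputs

    # Toy dimensions: 4-dim inputs, 8-dim hidden state, 4-way output distribution
    rng = np.random.default_rng(0)
    U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(4, 8))
    xs = [rng.normal(size=4) for _ in range(5)]
    states, outputs = rnn_forward(xs, U, W, V, s0=np.zeros(8))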

17 Backpropagation - Chain Rule

18 Backpropagation - Chain Rule
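
The equations on these two chain-rule slides did not survive the transcription. A standard way to write the BPTT gradient, assuming the updates st = f(U xt + W st-1) and outputs ot from the earlier slides and a per-step loss Lt, is:

    \frac{\partial L}{\partial W} \;=\; \sum_{t} \frac{\partial L_t}{\partial o_t}\,\frac{\partial o_t}{\partial s_t} \sum_{k=1}^{t} \left( \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}

The long product of Jacobians over time-steps is what later makes the gradient vanish or explode (slide 26).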

19 Example 1 - Learning A Language Model (figure: one-hot encoding of letters as input, hidden states, and a predicted distribution over letters as output)

20-22 Example 1 - Learning A Language Model (figures: the model predicts each next letter from the previous ones, one step at a time; only scattered letters E, H, L, O remain from the diagrams)
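
A tiny sketch of the one-hot letter encoding this example relies on; the four-letter vocabulary and the training word are assumptions chosen only to keep the sketch small.

    import numpy as np

    vocab = ['h', 'e', 'l', 'o']
    char_to_idx = {c: i for i, c in enumerate(vocab)}

    def one_hot(c):
        v = np.zeros(len(vocab))
        v[char_to_idx[c]] = 1.0
        return v

    # Training pairs: each input character is encoded one-hot, and the target
    # is the index of the next character in the training text.
    text = "hello"
    pairs = [(one_hot(a), char_to_idx[b]) for a, b in zip(text, text[1:])]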

23 Example 2 - Machine Translation I like socks -> J'aime les chaussettes

24 Example 3 - Sentiment Analysis POSITIVE CS159 is lit!1!!!!!

25 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

26 Vanishing and Exploding Gradient Composition of many nonlinear functions can lead to problems. When we backpropagate, the gradient is multiplied by W at every time-step, so it behaves roughly like Wn. Behavior depends on the eigenvalues of W: if the eigenvalues are > 1, Wn may cause the gradient to explode; if the eigenvalues are < 1, Wn may cause the gradient to vanish.
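
An illustrative numpy sketch (not from the slides, and ignoring the nonlinearity) of why the eigenvalues of W matter: backpropagating through n time-steps multiplies the gradient by W roughly n times, so the gradient norm explodes or vanishes with the spectral radius of W.

    import numpy as np

    def grad_norm_after(spectral_radius, n_steps=50, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(dim, dim))
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale eigenvalues
        g = rng.normal(size=dim)          # stand-in for an upstream gradient
        for _ in range(n_steps):
            g = W.T @ g                   # one multiplication per unrolled time-step
        return np.linalg.norm(g)

    print(grad_norm_after(1.1))   # largest eigenvalue magnitude > 1: the norm blows up
    print(grad_norm_after(0.9))   # largest eigenvalue magnitude < 1: the norm shrinks toward 0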

27 Short-term Memory In practice, simple RNNs tend to only retain information from a few time-steps in the past. RNNs are good at remembering recent information but have a harder time remembering relevant information from farther back.

28 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

29 Long short-term memory (LSTM) Adds a memory cell with input, forget, and output gates Can help learn long-term dependencies Helps solve exploding/vanishing gradient problem

30 Forget gate layer

31 Input gate layer

32 Update cell state

33 Generate filtered output
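
The gate equations on slides 30-33 were lost in transcription; this is a minimal numpy sketch of one standard LSTM step in the same order (biases omitted; the variable names are assumptions rather than the slides' notation).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo):
        z = np.concatenate([h_prev, x])  # previous output and current input
        f = sigmoid(Wf @ z)              # forget gate layer (slide 30)
        i = sigmoid(Wi @ z)              # input gate layer (slide 31)
        c_tilde = np.tanh(Wc @ z)        # candidate cell values
        c = f * c_prev + i * c_tilde     # update cell state (slide 32)
        o = sigmoid(Wo @ z)              # output gate
        h = o * np.tanh(c)               # generate filtered output (slide 33)
        return h, c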

34 But wait... there's more A number of challenging problems remain in sequence learning. Let's take a look at how the papers address these issues.

35 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

36 SEQUENCE LEVEL TRAINING WITH RECURRENT NEURAL NETWORKS By: Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba

37 Text Generation Consider the problem of text generation: machine translation, summarization, question answering.

38 Text Generation We want to predict a sequence of words [w1, w2, ..., wT] (that makes sense). At each time step, the model may take as input some context ct.

39 RNN Let's use an RNN

40 Text Generation Simple Elman RNN:

41 Training Optimize the cross-entropy loss at each time-step, given a target sequence [w1, w2, ..., wT]
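
A small sketch (assumed, not the paper's code) of this per-step cross-entropy objective under teacher forcing: at every step the model is conditioned on the ground-truth previous words and scored on the true next word.

    import numpy as np

    def sequence_xent(step_probs, target_ids):
        # step_probs[t]: the model's distribution over the vocabulary at step t,
        # computed with the ground-truth words w_1..w_t fed in as inputs.
        # target_ids[t]: index of the true next word.
        return -sum(np.log(p[w]) for p, w in zip(step_probs, target_ids))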

42 Why is this a problem? Notice that the RNN is trained to maximize pθ(wt+1 | wt, ht+1), where wt is the ground truth. The loss function causes the model to be good at predicting the next word given the previous ground-truth word.

43 Why is this a problem? But we don't have the ground truth at test time!!! Remember that we're trying to generate sequences: at test time, the model needs to use its own predictions to generate the next words.

44 Why is this a problem? This can lead to predicting sub-optimal sequences. The most likely sequence might actually contain a sub-optimal word at some time-step.

45 Techniques Beam Search: at each point, instead of just taking the highest-scoring word, keep the k best next-word candidates; significantly slows down the word generation process. Data as Demonstrator: during training, with a certain probability take either the model prediction or the ground truth as the next input; suffers from alignment issues ("I took a long walk" vs. "I took a walk"). End-to-End Backprop: instead of the ground truth, propagate the top-k words from the previous time-step, weighing each word by its score and re-normalizing.
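
To illustrate the beam-search idea from this slide, here is a minimal generic sketch (not the paper's implementation); next_log_probs stands in for the trained RNN's next-word scores.

    import math

    def beam_search(next_log_probs, k, length):
        # next_log_probs(prefix) -> {word: log-probability of that word coming next}
        beams = [([], 0.0)]                       # (sequence so far, cumulative log-prob)
        for _ in range(length):
            candidates = []
            for seq, score in beams:
                for word, lp in next_log_probs(seq).items():
                    candidates.append((seq + [word], score + lp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beams

    # Toy next-word model that always slightly prefers "a" over "b"
    toy_model = lambda seq: {"a": math.log(0.6), "b": math.log(0.4)}
    print(beam_search(toy_model, k=2, length=3))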

46 Reinforcement Learning Suppose our RNN is an agent: its parameters define a policy, it picks an action, and after taking an action it updates its internal states. At the end of the sequence, it observes a reward. Can use any reward function; the authors use BLEU and ROUGE-2.

47 REINFORCE We want to find parameters that maximize the expected reward, so the loss is the negative expected reward. In practice, we actually approximate the expected reward with a single sample...

48 REINFORCE For estimating the gradients, a baseline estimator can be subtracted from the reward to reduce variance. The authors use linear regression with the hidden states of the RNN as input to estimate this baseline; it is unclear what the best technique for selecting it is. If the reward exceeds the baseline, the gradient encourages the sampled word choice; otherwise it discourages it.
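
In standard notation (assumed here, since the slide's formula did not survive), the single-sample REINFORCE gradient with a baseline b is

    \nabla_\theta L \;\approx\; -\,\big(r(w^s) - b\big)\, \nabla_\theta \log p_\theta(w^s)

where w^s is the sampled sequence and r is the sequence-level reward (e.g. BLEU or ROUGE-2): a sample with reward above the baseline has its word choices made more likely, and one below the baseline has them made less likely.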

49 MIXED INCREMENTAL CROSS-ENTROPY REINFORCE Instead of starting from a poor random policy, start from an RNN trained on the ground truth with cross-entropy, then switch to REINFORCE according to an annealing schedule.

50 Results ROUGE-2 score for summarization BLEU-4 score for translation and image captioning

51 Results (figure: the algorithms combined with beam search of size k)

52 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

53 Tree-Structured Decoding with Doubly-Recurrent Neural Networks By: David Alvarez-Melis, Tommi S. Jaakkola

54 Motivation Given an encoding as input, generate a tree structure. RNNs are best suited for sequential data, but trees and graphs do not naturally conform to a linear ordering. Various types of sequential data can be represented as trees: parse trees for sentences, abstract syntax trees for computer programs. Problem: generate a full tree structure with node labels using an encoder-decoder framework.

55 Encoder-Decoder Framework Given a ground-truth tree and a string representation of the tree: use an RNN to encode a vector representation from the string, then use an RNN to decode the tree from the vector representation.

56 Example (figure: an input tree is flattened by a preorder traversal into a string representation, the encoder maps the string to a vector representation, and the decoder produces the output tree; only scattered node labels such as ROOT, B, W, F, J, V remain from the diagram)

57 Challenges with decoding Must generate the tree top-down. Don't know node labels beforehand. Generating a child node vs. a sibling node: how to pass information? Labels of siblings are not independent: a verb in a parse tree reduces the chance of sibling nodes being verbs.

58 Doubly Recurrent Neural Networks (DRNN) Ancestral and sibling flows of information. Two input states: receive from the parent node, update, send to descendants; receive from the previous sibling, update, send to the next sibling. Unrolled DRNN: nodes are labeled in the order generated; solid lines are ancestral connections, dotted lines are fraternal connections.

59 Inside a node Update the node's hidden ancestral and fraternal states, then combine them to obtain a predictive hidden state.
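
A sketch of the two recurrences inside a DRNN node as described on this slide; the plain tanh updates and weight names are assumptions (the paper uses gated recurrent modules), shown only to make the ancestral/fraternal split concrete.

    import numpy as np

    def drnn_node_states(h_parent, x_parent, h_sibling, x_sibling, params):
        Wa, Ua, Wf, Uf, Va, Vf = params
        h_a = np.tanh(Wa @ h_parent + Ua @ x_parent)    # ancestral state, fed by the parent
        h_f = np.tanh(Wf @ h_sibling + Uf @ x_sibling)  # fraternal state, fed by the previous sibling
        h_pred = np.tanh(Va @ h_a + Vf @ h_f)           # combined predictive hidden state
        return h_a, h_f, h_pred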

60 Producing an output Get the probability of stopping the ancestral or fraternal branch (whether the node has children or a next sibling). Combine with the predictive state to get the output; the corresponding binary variables come from the ground truth during training or from the predicted probabilities at test time. Assign the node label.

61 Experiment: Synthetic Tree Recovery Generate a dataset of labeled trees; the vocabulary is 26 letters. Condition the label of each node on its ancestors and siblings; the probabilities of children or next-siblings depend only on the label and depth. Generate string representations with a pre-order traversal. RNN encoder: string to vector. DRNN (with LSTM module) decoder: vector to tree.

62 Results Node retrieval: 75% Edge retrieval: 71%

63 Results

64 Results

65 Experiment: Computer program generation Mapping sentences to functional programs: given a sentence description of a computer program, generate its abstract syntax tree. The DRNN performed better than all other baselines.

66 Summary DRNNs are an extension of sequential recurrent architectures to tree structures. Information flows from parent to offspring and from sibling to sibling. Performed better than baselines on tasks involving tree generation.

67 Roadmap 1. Motivation 2. Basic Structure 3. Issues 4. Variations 5. Application 1: Sequence Level Training 6. Application 2: Tree Structure Prediction 7. Application 3: Image Classification

68 Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition By: Zhiwei Deng, Arash Vahdat, Hexiang Hu, Greg Mori

69 Group Activity Recognition Classification Problem: What is each person doing? What is the scene as a whole doing?

70 Group Activity Recognition (figure: a CNN per person predicts that person's action) Classification Problem: What is each person doing? What is the scene as a whole doing?

71 Group Activity Recognition Classification Problem: What is each person doing? What is the scene as a whole doing? (figure: a CNN predicts the scene-level action)

72 Group Activity Recognition Classification Problem: What is each person doing? (figure labels: waiting, biking, walking) What is the scene as a whole doing? (figure label: waiting)

73 Improving Group Activity Recognition - Model Relationships Individual actions inform other individual actions & the scene action (figure: relationships between people, e.g. waiting? walking?)

74 Improving Group Activity Recognition - Model Relationships Individual actions inform other individual actions & the scene action (figure: a group waiting together)

75 Group Activity Recognition Relationships depend on: spatial distance, relative motions, concurrent actions, number of people.

76 Model Relationships Using RNNs The RNN structure reflects the learning problem: a fully connected graph in which each edge represents a relationship. The model learns how important each relationship is by training the edge weights.

77 Train Using Iterative Message Passing/Gradient Descent

78 Why do we need multiple iterations of message passing? The graphs are cyclic, so exact inference isn't possible.

79

80 Graph Model A B C S

81 Training the Model For each iteration t For each edge (i, j) Update messages mi -> j(t) and mj -> i(t) Update gates gi -> j(t) and gj -> i(t) Impose gates on messages For each node i Calculate prediction ci(t) Output: Final predictions at time T, ci(T)

82 General Message-Passing Update Equations At time t, the message m(t) is an activation function f of a weighted sum of the input features x, the last messages m(t-1), and the last predictions c(t-1). At time t, the prediction c(t) is an activation function f of a weighted sum of the input features x and the current messages m(t).
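
A sketch of one message-passing iteration following these update equations; the tanh activation, the fully connected message structure, and the weight names are assumptions, and the gating from the later slides is omitted here.

    import numpy as np

    def message_passing_step(x, m_prev, c_prev, Wx, Wm, Wc, Vx, Vm):
        # x[i]: input features of node i; m_prev[(i, j)]: last message from i to j;
        # c_prev[i]: last prediction at node i.
        nodes = range(len(x))
        m = {}
        for i in nodes:
            for j in nodes:
                if i != j:   # message from node i to node j
                    m[(i, j)] = np.tanh(Wx @ x[i] + Wm @ m_prev[(i, j)] + Wc @ c_prev[i])
        c = []
        for j in nodes:      # prediction at node j from its features and incoming messages
            incoming = sum(m[(i, j)] for i in nodes if i != j)
            c.append(np.tanh(Vx @ x[j] + Vm @ incoming))
        return m, c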

83 RNN Model A B C S

84 RNN Model - Get Feature Inputs A B C S (figure: individual features for person nodes A, B, C come from per-person CNNs; scene features for node S come from a scene CNN)

85 RNN Model - Initialize Messages at Time t=0 A B C S

86 Training the Model For each iteration t For each edge (i, j) Update messages mi -> j(t) and mj -> i(t) Update gates gi -> j(t) and gj -> i(t) Impose gates on messages For each node i Calculate prediction ci(t) Output: Final predictions at time T, ci(T)

87 Training the Model - Updating Messages at Time t A B C S

88 Training the Model - Updating Messages at Time t A B C S The message is an activation function on a weighted sum of: input features to the messaging node, the previous prediction by the messaging node, and the previous messages to the messaging node.

89 Training the Model - Updating Messages at Time t A B C S

90 Training the Model For each iteration t For each edge (i, j) Update messages mi -> j(t) and mj -> i(t) Update gates gi -> j(t) and gj -> i(t) Impose gates on messages For each node i Calculate prediction ci(t) Output: Final predictions at time T, ci(T)

91 Training the Model - Updating Gates at Time t A B C S

92 Training the Model - Updating Gates at Time t A B C S The gate is an activation function on a weighted sum of: input features to the target node, the previous prediction by the target node, and the current messages to the target node.

93 Training the Model - Updating Gates at Time t A B C S

94 Training the Model - Imposing Gates at Time t A B C S Just multiply the message vectors by the gate scalars.

95 Training the Model - Updating Gates at Time t A B C S Gate intuition: determines whether an edge is useful; modulates the importance of the edge.

96 Training the Model - Imposing Gates A B C S Over several iterations, the model learns to ignore some edges and focus more on others.

97 Training the Model For each iteration t For each edge (i, j) Update messages mi -> j(t) and mj -> i(t) Update gates gi -> j(t) and gj -> i(t) Impose gates on messages For each node i Calculate prediction ci(t) Output: Final predictions at time T, ci(T)

98 Training the Model - Getting Predictions ci(t) at Time t A B C S

99 Training the Model - Getting Predictions ci(t) at Time t A B C S The prediction is an activation function on a weighted sum of: input features to the node and the current messages to the node.

100 Training the Model - Getting Predictions ci(t) at Time t A B C S

101 Training the Model For each iteration t For each edge (i, j) Update messages mi -> j(t) and mj -> i(t) Update gates gi -> j(t) and gj -> i(t) Impose gates on messages For each node i Calculate prediction ci(t) Output: Final predictions at time T, ci(T)

102 Output - Getting Final Predictions ci(T) at Time T A B C S

103 Applied to Collective Activity Dataset 44 videos, 7 actions: Crossing, Waiting, Queueing, Talking, Jogging, Dancing, N/A

104 Visualization of Results on Collective Activity Dataset

105 Results on Collective Activity Dataset

106 Key Takeaways Previous outputs and hidden states can be looped back in as inputs. RNNs can model problems where the inputs are highly correlated.

107 thanks
