Sequence Modeling: Recurrent and Recursive Nets By Pyry Takala 14 Oct 2015
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 1
Quiz 1. Where can you use RNNs? (Discuss for 1 minute) 2
RNNs model sequential data What are examples of sequential data? Time-series data (e.g. economics), videos, speech, images as perceived by humans, robot sensors, language. Some feed-forward net types can also model sequences (e.g. TDNNs), but they are not ideal for long sequences (memory, network size, etc.) 5
Example application: RNNs can generate handwriting The heatmap shows probability densities for predicted pen locations as the word "under" is written 6 Live: http://www.cs.toronto.edu/~graves/handwriting.html
Example application: RNNs can caption images and videos Live: https://www.youtube.com/watch?v=w2iv8gt5cd4&feature=youtu.be 7
Example application: RNNs can control robots 8
Example application: RNNs can translate text Mielenkiintoinen luento → The interesting lecture 9
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 10
Quiz 2. What algorithm can be used to train RNNs? 11
RNNs store a memory of the hidden state for the next sequence step Legend: x = input, s = state, o = output, U, V, W = weight matrices (the same parameters are shared across all time steps!) 14
RNN computation: forward pass Forward pass: a_t = b + W s_{t-1} + U x_t; s_t = tanh(a_t); o_t = c + V s_t; p_t = softmax(o_t). Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax 15
RNN computation: loss Loss: L = Σ_t L_t with L_t = -log p_t[y_t], the negative log-likelihood of the target class at step t (e.g. a softmax p_t over class1, class2, class3 with target class = 3). Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax, y_t = target class 16
RNNs can be trained with back-propagation through time (BPTT) BPTT: unfold the network over time, then backpropagate the loss, first calculating the gradient ∇_a L for each hidden-unit value a, and then ∇_θ L for each parameter θ in {U, V, W, b, c}; for instance, ∇_V L = Σ_t (∇_{o_t} L) s_t^T. Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax, y_t = target class. Detailed derivative formulas can be found in the book; Theano calculates these automatically 18
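To make the forward pass, loss and BPTT of the last few slides concrete, here is a minimal numpy sketch in the slides' notation (U, V, W, b, c, state s). The layer sizes, random weights and toy data are arbitrary assumptions for illustration; in practice a library such as Theano derives these gradients automatically.

```python
# Minimal vanilla-RNN forward pass, cross-entropy loss and BPTT in numpy.
# Notation follows the slides: x = input, s = state, p = output after softmax,
# U, V, W = weights, b, c = biases. Sizes and data are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5

U = rng.normal(0, 0.1, (n_hidden, n_in))      # input  -> hidden
W = rng.normal(0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (shared over time)
V = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
b, c = np.zeros(n_hidden), np.zeros(n_out)

xs = rng.normal(size=(T, n_in))               # toy input sequence
ys = rng.integers(0, n_out, size=T)           # toy target class per step

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# ---- forward pass: a_t = b + W s_{t-1} + U x_t, s_t = tanh(a_t), p_t = softmax(c + V s_t)
s = {-1: np.zeros(n_hidden)}
p, loss = {}, 0.0
for t in range(T):
    a = b + W @ s[t - 1] + U @ xs[t]
    s[t] = np.tanh(a)
    p[t] = softmax(c + V @ s[t])
    loss += -np.log(p[t][ys[t]])              # negative log-likelihood of target class

# ---- backward pass (BPTT): run time in reverse, accumulating shared-parameter gradients
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
db, dc = np.zeros_like(b), np.zeros_like(c)
ds_next = np.zeros(n_hidden)                  # gradient flowing in from step t+1
for t in reversed(range(T)):
    do = p[t].copy()
    do[ys[t]] -= 1.0                          # d loss / d o_t for softmax + NLL
    dV += np.outer(do, s[t])
    dc += do
    ds = V.T @ do + ds_next                   # gradient w.r.t. state s_t
    da = (1.0 - s[t] ** 2) * ds               # back through tanh
    dU += np.outer(da, xs[t])
    dW += np.outer(da, s[t - 1])
    db += da
    ds_next = W.T @ da                        # passed on to step t-1

print(f"loss = {loss:.3f}, ||dW|| = {np.linalg.norm(dW):.4f}")
```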
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 19
Quiz 3. What are limitations of RNNs? 20
RNNs have good generalization capabilities An RNN learns which aspects of the past sequence to keep, and with what precision. RNNs can generalize because of shared parameters: generalization to different points in a sequence, generalization between sequences of different lengths, and the complexity of the function does not increase with sequence length. Limitations: the hidden state must be large enough to remember all relevant information; stationarity is assumed (this can be overcome, e.g. by feeding an additional input describing the position); and optimization is difficult 23
RNN states simplify the graph while still allowing complex dependencies: compare a graphical model without state variables (inefficient parametrization) vs. an RNN with states (more efficient parametrization) 24
Gradients of RNNs can be unstable The non-linear recurrence composed with itself over many time steps yields a highly non-linear function. Derivatives tend to vanish or explode as the number of steps between two states increases, because the gradient equals a product of state-to-state transition Jacobian matrices; this can cause, for instance, exploding gradients. For details, see chapter 8.2.6 25
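As a toy illustration of the Jacobian-product argument (not from the slides), the sketch below uses a linear recurrence so that the product of state-to-state Jacobians is simply a matrix power; the matrix size, number of steps and spectral radii are arbitrary assumptions.

```python
# Why gradients over many steps vanish or explode: for a linear recurrence
# s_t = W s_{t-1}, the Jacobian of s_T w.r.t. s_0 is W raised to the power T,
# whose norm shrinks or grows geometrically with the spectral radius of W.
import numpy as np

rng = np.random.default_rng(1)
n, steps = 20, 50
base = rng.normal(size=(n, n))
base /= np.abs(np.linalg.eigvals(base)).max()        # normalize to spectral radius 1

for radius in (0.9, 1.0, 1.1):
    W = radius * base
    J = np.linalg.matrix_power(W, steps)             # product of 50 identical Jacobians
    print(f"spectral radius {radius}: ||J after {steps} steps|| = {np.linalg.norm(J):.2e}")
```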
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 26
RNNs can generate sequences Generate an output and feed it back as the input at the next time step. Teacher forcing = feed the actual target sequence as input during training; strict forcing is often not advisable, because at generation time the inputs produced by the net will likely differ from the training inputs. A generative model also needs to stop generation at some point. Alternatives: a) an end-of-sequence symbol, b) a binomial stop/continue output, c) modelling the number of timesteps left (a sampling loop is sketched below) 27
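A minimal sketch of the free-running generation loop with an end-of-sequence symbol, assuming a made-up four-symbol vocabulary and untrained random weights; only the control flow is the point.

```python
# Free-running generation: sample an output at each step, feed it back as the next
# input, and stop when the end-of-sequence symbol (id 0 here) is sampled.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["<eos>", "a", "b", "c"]              # hypothetical vocabulary, id 0 = end of sequence
n_hidden, n_vocab = 16, len(vocab)
U = rng.normal(0, 0.3, (n_hidden, n_vocab))
W = rng.normal(0, 0.3, (n_hidden, n_hidden))
V = rng.normal(0, 0.3, (n_vocab, n_hidden))

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

s = np.zeros(n_hidden)
token = 1                                     # start from some symbol
generated = []
for _ in range(20):                           # hard cap in case <eos> is never sampled
    s = np.tanh(U @ one_hot(token, n_vocab) + W @ s)
    p = softmax(V @ s)
    token = rng.choice(n_vocab, p=p)          # sample the next symbol
    if token == 0:                            # end-of-sequence symbol: stop generating
        break
    generated.append(vocab[token])
print(" ".join(generated))
```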
Adding extra context can be done in several ways: an extra input x (e.g. a fixed context vector) can be provided as the initial hidden state, as an input at every time step, or both 28
A conditional generative RNN assumes that we also want to use an input x to predict y 29
Some tricks of the trade can be useful when training RNNs Gradient explosion can be dealt with e.g. by gradient clipping: when the gradient hits a "wall" in the error surface, the clipped gradient keeps its direction but caps the step size. The heuristic introduces a bias but works well in practice; even taking a random step helps. Gradient vanishing can be dealt with using memory units, e.g. LSTMs; smart initialization of the weights and a squashing non-linearity (e.g. tanh) can also help 31
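A minimal sketch of clipping by global norm; the helper name clip_by_global_norm, the threshold and the toy gradients are illustrative assumptions, not a specific library API.

```python
# Gradient clipping by global norm: if the combined gradient norm exceeds a threshold,
# rescale all gradients so the update direction is kept but its size is capped.
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Return rescaled copies of the gradient arrays if their joint norm exceeds max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# toy example: one "exploding" gradient among normal ones
grads = [np.ones((3, 3)), 100.0 * np.ones(5)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
print(norm_before, np.sqrt(sum((g ** 2).sum() for g in clipped)))  # ~223.6 -> 5.0
```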
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 32
Quiz 4. How can we capture long-term dependencies with RNNs? 33
RNNs have been extended for different purposes Architectural variants with different expressive power: deep RNNs, bi-directional RNNs, recursive nets. Solutions for dealing with long-term dependencies and memory: RNNs with multiple time-scales, LSTM memory units, sequence-to-sequence models, attention, memory nets / Neural Turing Machines 35
Deep RNNs Options: multiple RNN layers, an additional MLP layer, or an additional MLP layer with skip connections. Depth may also hurt: the path from an event to its effect becomes longer, so long-term dependencies become harder to learn 36
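A minimal sketch of the first option, two stacked RNN layers; the layer sizes and random weights are arbitrary assumptions, and only the stacking pattern is the point.

```python
# Two-layer ("deep") RNN step: the hidden state of layer 1 is the input to layer 2
# at the same time step.
import numpy as np

rng = np.random.default_rng(9)
n_in, n_h = 4, 8
U1, W1 = rng.normal(0, 0.2, (n_h, n_in)), rng.normal(0, 0.2, (n_h, n_h))
U2, W2 = rng.normal(0, 0.2, (n_h, n_h)), rng.normal(0, 0.2, (n_h, n_h))

h1, h2 = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(6, n_in)):
    h1 = np.tanh(U1 @ x + W1 @ h1)     # layer 1: reads the external input
    h2 = np.tanh(U2 @ h1 + W2 @ h2)    # layer 2: reads layer 1's state as its input
print(h1.shape, h2.shape)
```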
Bi-directional RNN considers information from two directions We don't always assume a causal left-to-right structure; sometimes the output depends on the whole input. Bi-directional RNNs give more information to your network, but the whole input sequence must be available ahead of time. The idea extends to 2D 37
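A minimal sketch of the bidirectional wiring, assuming separate forward/backward weights (Uf, Wf, Ub, Wb) and random data; only the structure is illustrated.

```python
# Bidirectional RNN forward pass: one state sequence runs left-to-right, another
# right-to-left, and the output at step t sees both, i.e. the whole input sequence.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hidden, n_out, T = 4, 8, 3, 6
Uf, Wf = rng.normal(0, 0.2, (n_hidden, n_in)), rng.normal(0, 0.2, (n_hidden, n_hidden))
Ub, Wb = rng.normal(0, 0.2, (n_hidden, n_in)), rng.normal(0, 0.2, (n_hidden, n_hidden))
V = rng.normal(0, 0.2, (n_out, 2 * n_hidden))
xs = rng.normal(size=(T, n_in))

fwd, bwd = [np.zeros(n_hidden)], [np.zeros(n_hidden)]
for t in range(T):                                  # forward direction
    fwd.append(np.tanh(Uf @ xs[t] + Wf @ fwd[-1]))
for t in reversed(range(T)):                        # backward direction
    bwd.append(np.tanh(Ub @ xs[t] + Wb @ bwd[-1]))
bwd = bwd[:0:-1]                                    # reorder so bwd[t] corresponds to step t

# output at step t combines the forward state (past) and backward state (future)
outputs = [V @ np.concatenate([fwd[t + 1], bwd[t]]) for t in range(T)]
print(len(outputs), outputs[0].shape)
```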
Recursive nets More general than an RNN chain, e.g. a tree structure. Has been used to process data structures as NN inputs, in NLP and in computer vision. For a sequence of length N, the depth is reduced from N (for an RNN) to O(log N). How to structure the tree is unclear: balanced binary, or an external method (e.g. a parse tree for NLP)? 38
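A minimal sketch of the recursive composition idea, assuming shared left/right child weights (Wl, Wr), random leaf vectors and an arbitrary balanced binary tree.

```python
# Recursive net: the same composition weights are applied bottom-up over a tree
# (here a balanced binary tree over four word vectors), so a length-N sequence is
# reduced in O(log N) depth instead of N recurrent steps.
import numpy as np

rng = np.random.default_rng(4)
d = 8
Wl, Wr = rng.normal(0, 0.2, (d, d)), rng.normal(0, 0.2, (d, d))   # shared child weights

def compose(left, right):
    """Combine two child representations into one parent representation."""
    return np.tanh(Wl @ left + Wr @ right)

words = [rng.normal(size=d) for _ in range(4)]    # hypothetical leaf (word) vectors
# balanced binary tree: ((w0 w1) (w2 w3))
parent_left = compose(words[0], words[1])
parent_right = compose(words[2], words[3])
root = compose(parent_left, parent_right)         # representation of the whole sequence
print(root.shape)
```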
Long-term dependencies are hard to capture The hidden state of an RNN needs to remember a lot, which is burdensome especially with long sequences. Neural units that learn to remember some inputs can alleviate this. Echo-state networks (liquid state machines, reservoir computing) fix all weights except the final layer; the weights are set so that the net is at the edge of stability (values around 1 for the leading singular value of the state-to-state transition Jacobian J; a minimal sketch follows after this slide). Long short-term memory (LSTM) units were the first and are the most commonly used memory units: they can accumulate information and forget it once it has been used and is no longer needed, they handle long-term dependencies better than plain RNNs, and they can be trained on tasks requiring memory over >200 steps; they have been very successful at e.g. text generation, handwriting recognition and speech recognition. Other memory units exist, e.g. the GRU and memory units with multiple layers 42
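A minimal sketch of the echo-state idea under several assumptions: the reservoir weights are rescaled to spectral radius 0.9, the readout is fit by ordinary least squares, and the toy task (reproduce the input from two steps ago) is invented for illustration.

```python
# Echo-state network sketch: recurrent and input weights stay fixed; only a linear
# readout is trained on the recorded reservoir states.
import numpy as np

rng = np.random.default_rng(10)
n_h, T = 100, 500
W = rng.normal(0, 1, (n_h, n_h))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()    # edge of stability: spectral radius 0.9
U = rng.normal(0, 0.5, (n_h, 1))

u = rng.normal(size=T)                           # scalar input signal
target = np.roll(u, 2)                           # toy target: input delayed by two steps

H = np.zeros((T, n_h))
h = np.zeros(n_h)
for t in range(T):
    h = np.tanh(W @ h + U[:, 0] * u[t])          # reservoir update, weights never trained
    H[t] = h

V, *_ = np.linalg.lstsq(H[10:], target[10:], rcond=None)   # train the readout only
pred = H[10:] @ V
print("readout MSE:", float(np.mean((pred - target[10:]) ** 2)))
```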
Multiple time scales could be used 43
LSTMs are a common solution Compared with a plain RNN, an LSTM has a path from x_{t-1} to h_{t+1} with no non-linearities; all gates are sigmoid units, and the remembered cell state is passed on. Forget gate: scales the old cell value (= reset). Input gate: scales the input to the cell (= write). Output gate: scales the output from the cell (= read). The state influences the gating decisions at the next time step 45
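A minimal sketch of one LSTM step with the three gates described above; it follows a common formulation with biases omitted for brevity, and the sizes, random weights and data are arbitrary assumptions.

```python
# One LSTM step: forget gate (reset), input gate (write), output gate (read).
import numpy as np

rng = np.random.default_rng(5)
n_in, n_hidden = 4, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate plus the candidate update, each seeing [x_t, h_{t-1}]
Wf, Wi, Wo, Wg = (rng.normal(0, 0.2, (n_hidden, n_in + n_hidden)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)          # forget gate: how much of the old cell value to keep
    i = sigmoid(Wi @ z)          # input gate: how much of the new candidate to write
    o = sigmoid(Wo @ z)          # output gate: how much of the cell to expose
    g = np.tanh(Wg @ z)          # candidate cell update
    c = f * c_prev + i * g       # cell state: a mostly linear path through time
    h = o * np.tanh(c)           # hidden state passed to the next step and the output
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```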
Some LSTM-cells are interpretable 46
An encoder-decoder (sequence-to-sequence) model can capture a different sequence relation 47
RNNs can be used with different kinds of sequences Vanilla mode, no RNN (e.g. image classification); sequence output (e.g. image captioning); sequence input (e.g. sentiment analysis); sequence input and output (encoder-decoder / sequence-to-sequence, e.g. translation, question answering); synced sequence input and output (e.g. labelling each video frame) 48 Live: http://cs.stanford.edu/people/karpathy/recurrentjs/
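A minimal sketch of the encoder-decoder wiring from the previous two slides, assuming made-up token ids, greedy decoding, an end-of-sequence id of 0 and untrained random weights; only the structure is the point.

```python
# Encoder-decoder (sequence-to-sequence): an encoder RNN reads the input sequence into a
# final state, which initializes a decoder RNN that emits output symbols one at a time.
import numpy as np

rng = np.random.default_rng(6)
n_vocab, n_hidden = 5, 16
Ue, We = rng.normal(0, 0.2, (n_hidden, n_vocab)), rng.normal(0, 0.2, (n_hidden, n_hidden))
Ud, Wd = rng.normal(0, 0.2, (n_hidden, n_vocab)), rng.normal(0, 0.2, (n_hidden, n_hidden))
Vd = rng.normal(0, 0.2, (n_vocab, n_hidden))

def one_hot(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

src = [1, 3, 2, 4]                             # hypothetical source-token ids
h = np.zeros(n_hidden)
for tok in src:                                # encoder: compress the source into h
    h = np.tanh(Ue @ one_hot(tok, n_vocab) + We @ h)

out, tok = [], 1                               # decoder: generate until <eos> (id 0)
for _ in range(10):
    h = np.tanh(Ud @ one_hot(tok, n_vocab) + Wd @ h)
    tok = int(np.argmax(softmax(Vd @ h)))      # greedy choice of the next output symbol
    if tok == 0:
        break
    out.append(tok)
print(out)
```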
Attention avoids having to memorize everything (1/2) An encoder RNN needs to store a large amount of information in a small state. An attention mechanism creates an attention vector from all inputs; when generating outputs, the mechanism learns to shift its attention at each step to the most relevant part of the input 49
Attention avoids having to memorize everything (2/2) 50
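A minimal sketch of one soft-attention step, assuming dot-product scoring between the decoder state and each encoder state; the dimensions and the scoring function are illustrative assumptions rather than the mechanism of any specific paper.

```python
# Soft attention: score every encoder state against the current decoder state, turn the
# scores into weights with a softmax, and take the weighted sum as the context vector.
import numpy as np

rng = np.random.default_rng(7)
T, d = 6, 8
encoder_states = rng.normal(size=(T, d))       # one state per input position
decoder_state = rng.normal(size=d)             # current state of the output RNN

scores = encoder_states @ decoder_state        # relevance of each input position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                       # attention distribution over positions
context = weights @ encoder_states             # weighted sum: the attended input summary

print(np.round(weights, 3), context.shape)
```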
Memory networks / Neural Turing Machines (NTMs) can shift their attention and write to memory Neural nets are good at storing implicit knowledge, but bad at storing facts; humans have a working-memory system. Memory networks / NTMs have memory cells that can be read from (as in attention) and written to. A cell stores a vector. Cells can be addressed by location ("access cell 347") or by content ("access the cell that has information about my dad"). Current systems implement soft attention (reading from multiple cells), which is convenient for gradient-based training; hard attention (reading from a specific cell) is currently an active research topic. Successfully used e.g. to learn to sort values and to perform reasoning over simplified text 51
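To make the content-based read concrete, here is a small, hypothetical sketch of soft addressing by content; the memory contents, the cosine similarity measure and the sharpness parameter are illustrative assumptions, not the mechanism of any specific paper.

```python
# Content-based soft read from an external memory: compare a query key against every
# memory cell, softmax the similarities into an attention distribution over cells, and
# read a blend of all cells (differentiable, unlike a hard lookup of one cell).
import numpy as np

rng = np.random.default_rng(8)
n_cells, d = 10, 16
memory = rng.normal(size=(n_cells, d))         # each row is one stored vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def soft_read(memory, key, sharpness=5.0):
    """Read from all cells at once, weighted by content similarity to the key."""
    sims = np.array([cosine(row, key) for row in memory])
    w = np.exp(sharpness * sims)
    w /= w.sum()                               # soft attention over cells
    return w @ memory, w

key = memory[3] + 0.1 * rng.normal(size=d)     # query similar to the content of cell 3
value, w = soft_read(memory, key)
print(int(w.argmax()), np.round(w.max(), 2))   # most weight should fall on cell 3
```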
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 52
Quiz 5. How can neural networks learn to execute programs? 53
State-of-the-art RNNs can learn to predict how a (simple) program would execute LSTM with 2 layers, unrolled for 50 steps, 400 units per layer; parameters initialized uniformly; clipped gradients; custom learning-rate schedule 54
State-of-the-art RNNs can learn to predict conversation responses Sequence-to-sequence model over interactions up to 400 words long; single-layer LSTM with 1024 units; gradient clipping; vocabulary of the 20K most common words; 30M tokens, 3M of them in validation. Larger recurrent networks are trained with 30-40 GPU machines 55
Code-demo 56
Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3–10.7) Demos Next steps & references 57
Quiz 1. Where can you use RNNs? 2. What algorithm can be used to train RNNs? 3. What are limitations of RNNs? 4. How can we capture long-term dependencies with RNNs? 5. How can neural networks learn to execute programs? 58
Exercises Read Chapter 10 (Sequence modeling) Read Chapter 15 (Linear Factor Models and Auto-Encoders) Read the Theano-tutorial on recurrent neural networks: http://deeplearning.net/tutorial/rnnslu.html For practical code examples, other sources may be useful, e.g. https://github.com/gwtaylor/theano-rnn Exercise: Read MNIST columnwise, spit out the class at each step, plot training performance as a function of columns read. No lecture next week 59
References https://github.com/kjw0612/awesome-rnn http://arxiv.org/pdf/1507.01273.pdf http://karpathy.github.io/2015/05/21/rnn-effectiveness/ http://colah.github.io/posts/2015-08-understanding-lstms/ http://arxiv.org/pdf/1211.5063.pdf http://arxiv.org/abs/1506.02078 http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ http://arxiv.org/pdf/1506.03340.pdf http://arxiv.org/abs/1502.03044 http://arxiv.org/abs/1410.3916 http://arxiv.org/abs/1410.5401 http://arxiv.org/abs/1410.4615 http://arxiv.org/pdf/1506.05869.pdf 60