Sequence Modeling: Recurrent and Recursive Nets. By Pyry Takala 14 Oct 2015

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 1

Quiz 1. Where can you use RNNs? 2. Discuss for 1 minute 3. 4. 5. 2

RNNs model sequential data What are examples of sequential data? 3

RNNs model sequential data What are examples of sequential data? Time-series data, e.g. economics Videos Speech Images, as perceived by humans Robot sensors Language 4

RNNs model sequential data What are examples of sequential data? Time-series data, e.g. economics Videos Speech Images, as perceived by humans Robot sensors Language Some feed-forward net types can also model sequences (e.g. TDNN), but are not ideal for long sequences (memory, network size etc.) 5

Example application: RNNs can generate handwriting The heatmap shows probability densities for predicted pen locations as the word "under" is written 6 Live: http://www.cs.toronto.edu/~graves/handwriting.html

Example application: RNNs can caption images and videos Live: https://www.youtube.com/watch?v=w2iv8gt5cd4&feature=youtu.be 7

Example application: RNNs can control robots 8

Example application: RNNs can translate text "Mielenkiintoinen luento" → "The interesting lecture" 9

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 10

Quiz 1. 2. What algorithm can be used to train RNNs? 3. 4. 5. 11

RNNs store a memory of the hidden state for the next sequence step Legend x = input s = state o = output U, V, W = weight matrices 12

RNNs store a memory of the hidden state for the next sequence step Legend x = input s = state o = output U, V, W = weight matrices 13

RNNs store a memory of the hidden state for the next sequence step Legend x = input s = state o = output U, V, W = weight matrices Shared parameters! 14

RNN computation: forward pass Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax 15
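The forward-pass formulas themselves are in the figure; a reconstruction using the legend's symbols, following the book's standard RNN (the tanh hidden activation and the role of each weight matrix, U input-to-hidden, W hidden-to-hidden, V hidden-to-output, are assumptions about the figure):

a_t = b + W s_{t-1} + U x_t
s_t = \tanh(a_t)
o_t = c + V s_t
p_t = \operatorname{softmax}(o_t)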

RNN computation: loss Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax, y_t = target class (figure: outputs p_{t-1}, p_t, p_{t+1} over classes 1-3, with target class = 3) 16
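Written out under the same assumptions (the loss is the negative log-likelihood of the target class, summed over time steps):

L = \sum_t L_t = -\sum_t \log p_t[y_t]

where p_t[y_t] is the probability the softmax assigns to the target class y_t at step t.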

RNNs can be trained with back-propagation through time (BPTT) BPTT: unfold the network, then backpropagate the loss, calculating first the gradient ∇_a L for each hidden unit a Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax, y_t = target class 17

RNNs can be trained with back-propagation through time (BPTT) BPTT: unfold the network, then backpropagate the loss, calculating first the gradient ∇_a L for each hidden unit a and then ∇_θ L for each parameter θ (an example is given below) Legend: x = input, s = state, o = output, U, V, W = weight matrices, b, c = biases, a_t = value to hidden, p_t = output after softmax, y_t = target class Detailed derivative formulas can be found in the book; Theano calculates these automatically 18
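As one concrete example (the standard softmax-plus-cross-entropy result, written out here rather than taken from the slide's figure), the gradient at each output, and from it the gradient for V, are:

\nabla_{o_t} L = p_t - \mathbf{1}_{y_t}
\nabla_V L = \sum_t (\nabla_{o_t} L)\, s_t^{\top}

where \mathbf{1}_{y_t} is the one-hot vector for the target class; every parameter gradient is a sum of per-time-step contributions because the parameters are shared across steps.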

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 19

Quiz 1. 2. 3. What are limitations of RNNs? 4. 5. 20

RNNs have good generalization capabilities RNN learns which aspects of past sequence to keep and with what precision 21

RNNs have good generalization capabilities RNN learns which aspects of past sequence to keep and with what precision RNN can generalize because of shared parameters Generalization to different point in sequence Generalization between sequences of different length Complexity of function does not increase with sequence length 22

RNNs have good generalization capabilities RNN learns which aspects of past sequence to keep and with what precision RNN can generalize because of shared parameters Generalization to different point in sequence Generalization between sequences of different length Complexity of function does not increase with sequence length Limitations Hidden state must be large enough to remember all information Assumes stationarity Can be overcome, e.g. feed an additional input describing the position Difficult optimization 23

RNN states simplify the graph while still allowing complex dependencies Graphical model without states (inefficient parametrization) vs. RNN with states (more efficient parametrization) 24

Gradients of RNNs can be unstable Non-linear recurrence with itself over many time steps → a highly non-linear function Derivatives tend to vanish or explode as the number of steps between two states increases, because the derivative equals a product of state-transition Jacobian matrices This can cause, for instance, exploding gradients For details, see chapter 8.2.6 25
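In symbols (a standard statement of this, consistent with chapter 8.2.6, not copied from the slide):

\frac{\partial s_t}{\partial s_k} = \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}}

If the leading singular value of the state-to-state Jacobian stays above 1, this product grows roughly like \lambda^{t-k} (exploding gradients); if it stays below 1, the product shrinks towards zero (vanishing gradients), so the learning signal between distant time steps is lost.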

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 26

RNNs can generate sequences Generate an output and feed it back as the input at the next time step Teacher forcing = use the actual sequence as input during training Strict forcing is often not advisable: at generation time the inputs produced by the net will likely differ from the training data A generative model needs to stop generation at some point. Alternatives: a) end-of-sequence symbol b) binomial stop/continue output c) model the number of timesteps left 27
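A minimal sketch of such a generation loop with an end-of-sequence symbol (alternative a); rnn_step and sample are hypothetical stand-ins for a trained model's step function and a categorical draw, not anything defined on the slides:

def generate(rnn_step, sample, s0, x0, eos_token, max_len=100):
    """Feed each generated output back in as the next input; stop on the
    end-of-sequence symbol or after max_len steps."""
    s, x, outputs = s0, x0, []
    for _ in range(max_len):
        p, s = rnn_step(x, s)   # one forward step: output distribution and new state
        token = sample(p)       # draw the next symbol from the distribution p
        if token == eos_token:  # alternative a): stop on the EOS symbol
            break
        outputs.append(token)
        x = token               # the output at step t becomes the input at step t+1
    return outputs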

Adding extra context x can be done in several ways 28

A conditional generative RNN assumes that we also want to use x to predict y 29

Some tricks of the trade can be useful when training RNNs Gradient explosion can be dealt with, e.g. using gradient clipping The heuristic introduces a bias but works well in practice Even taking a random step helps (figure: wall in error surface, clipped gradient) 30

Some tricks of the trade can be useful when training RNNs Gradient explosion can be dealt with, e.g. using gradient clipping The heuristic introduces a bias but works well in practice Even taking a random step helps (figure: wall in error surface, clipped gradient) Gradient vanishing can be dealt with using memory units, e.g. LSTMs Smart initialization of weights and use of a squashing non-linearity (e.g. tanh) can also help 31
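A minimal norm-clipping sketch of the heuristic (the threshold of 5.0 is an arbitrary choice, not from the slides):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient norm exceeds the threshold, rescale the gradient to
    that norm, keeping its direction but bounding the update size."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

Clipping the norm rather than the individual components keeps the direction of the update and only bounds its size.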

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 32

Quiz 1. 2. 3. 4. How can we capture long-term dependencies with RNNs? 5. 33

RNNs have been extended for different purposes Architectural variants with different expressive power Deep RNNs Bi-Directional RNNs Recursive nets 34

RNNs have been extended for different purposes Architectural variants with different expressive power Deep RNNs Bi-Directional RNNs Recursive nets Solutions for dealing with long-term dependencies and memory RNNs with multiple time-scales LSTM memory units Sequence-to-sequence models Attention Memory nets / Neural Turing Machines 35

Deep RNNs Multiple RNN layers Additional MLP layer Additional MLP layer and skip connections Depth may also hurt, as the path from an event becomes longer → harder to learn long-term dependencies 36

Bi-directional RNNs consider information from two directions We don't always assume a causal left-to-right structure; sometimes the output depends on the whole input Bi-directional RNNs give more information to your network, but the full input sequence must be known ahead of time Extends to 2D 37

Recursive nets More general than an RNN chain: the computation graph can be a tree Have been used to process data structures as NN inputs, in NLP and in computer vision For a sequence of length N, depth is reduced from N (for an RNN) to O(log N) How to structure the tree is unclear: balanced binary? An external method (e.g. a parse tree for NLP)? 38

Long-term dependencies are hard to capture The hidden state of an RNN needs to remember a lot, which is especially burdensome with long sequences 39

Long-term dependencies are hard to capture The hidden state of an RNN needs to remember a lot, which is especially burdensome with long sequences Neural units that learn to remember some inputs can alleviate this 40

Long-term dependencies are hard to capture The hidden state of an RNN needs to remember a lot, which is especially burdensome with long sequences Neural units that learn to remember some inputs can alleviate this Echo-state networks (liquid state machines, reservoir computing) fix all weights except the final layer Weights are set so that the net is at the edge of stability (leading singular value of the state-to-state transition Jacobian J close to 1) 41

Long-term dependencies are hard to capture The hidden state of an RNN needs to remember a lot, which is especially burdensome with long sequences Neural units that learn to remember some inputs can alleviate this Echo-state networks (liquid state machines, reservoir computing) fix all weights except the final layer Weights are set so that the net is at the edge of stability (leading singular value of the state-to-state transition Jacobian J close to 1) Long short-term memory (LSTM): the first and most commonly used memory unit Can accumulate information, and forget it once it has been used and is no longer needed Better at long-term dependencies than normal RNNs Can be trained on tasks requiring memory over more than 200 steps Very successful, for instance, at text generation, handwriting recognition and speech recognition Other memory units exist, e.g. the GRU and memory units with multiple layers 42

Multiple time scales could be used 43

LSTMs are a common solution (figure: RNN vs. LSTM) 44

LSTMs are a common solution (figure: RNN vs. LSTM) There is a path from x_{t-1} to h_{t+1} with no non-linearities All gates are sigmoid units The remembered state is passed on Forget gate (scales the old cell value = reset) Input gate (scales the input to the cell = write) Output gate (scales the output from the cell = read) The state influences decisions at the next time step 45
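The gates written out as equations (one standard LSTM formulation; the exact variant in the figure may differ, e.g. in bias terms or peephole connections):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    forget gate: scale (reset) the old cell value
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)    input gate: scale (write) the new input
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)    output gate: scale (read) the cell output
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

The additive update of c_t is the path with no squashing non-linearity, which is what lets gradients survive over many steps.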

Some LSTM-cells are interpretable 46

An encoder-decoder (sequence-to-sequence) model can capture a relation between an input sequence and an output sequence of a different length 47

RNNs can be used with different kinds of sequences Vanilla mode, no RNN, e.g. image classification Sequence output, e.g. image captioning Sequence input, e.g. sentiment analysis Sequence input and output (encoder-decoder, sequence-to-sequence), e.g. translation, question answering Synced sequence input and output, e.g. label each video frame 48 Live: http://cs.stanford.edu/people/karpathy/recurrentjs/

Attention avoids having to memorize everything (1/2) The encoder RNN needs to store a large amount of information in a small state An attention mechanism creates an attention vector from all inputs When generating outputs, the mechanism learns to shift its attention at each step to the most relevant part of the input 49
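A minimal attention sketch in equations, assuming encoder states h_1, ..., h_T and decoder state s_{t-1}; the additive score function and the parameters W_a, U_a, v_a are one common choice, not taken from the slides:

e_{t,i} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i)
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}
c_t = \sum_i \alpha_{t,i} h_i

The weights \alpha_{t,i} are recomputed at every output step, so the decoder conditions on a fresh summary c_t of the most relevant inputs instead of a single fixed encoder state.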

Attention avoids having to memorize everything (2/2) 50

Memory networks / Neural Turing Machines (NTMs) can shift their attention and write to memory Neural nets are good at storing implicit knowledge, but bad at storing facts Humans have a working memory system Memory networks / NTMs have memory cells that can be read from (as in attention) and written to A cell stores a vector. Cells can be read by location ("access cell 347") and by content ("access the cell that has information about my dad") Current systems implement soft attention (reading from multiple cells), which is convenient when training with gradients. Active research currently on hard attention (reading from a specific cell) Successfully used e.g. to learn to sort values and to perform reasoning over simplified text 51

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 52

Quiz 1. 2. 3. 4. 5. How can neural networks learn to execute programs? 53

State-of-the-art RNNs can learn to predict how a (simple) program would execute LSTM, 2 layers Unrolled for 50 steps 400 units per layer Parameters initialized uniformly Clipped gradients Custom learning rate scheme 54

State-of-the-art RNNs can learn to predict conversation responses Sequence-to-sequence Interactions up to 400 words long Single-layer LSTM, 1024 units Gradient clipping Vocabulary of the 20K most common words 30M tokens, 3M in validation Larger recurrent networks trained with 30-40 GPU machines 55

Code-demo 56

Agenda Why Recurrent neural networks? Anatomy and basic training of an RNN (10.2, 10.2.1) Properties of RNNs (10.2.2, 8.2.6) Using RNNs (10.2.3, 10.7.7) RNN extensions (10.3 10.7) Demos Next steps & references 57

Quiz 1. Where can you use RNNs? 2. What algorithm can be used to train RNNs? 3. What are limitations of RNNs? 4. How can we capture long-term dependencies with RNNs? 5. How can neural networks learn to execute programs? 58

Exercises Read Chapter 10 (Sequence Modeling) Read Chapter 15 (Linear Factor Models and Auto-Encoders) Read the Theano tutorial on recurrent neural networks: http://deeplearning.net/tutorial/rnnslu.html For practical code examples, other sources may be useful, e.g. https://github.com/gwtaylor/theano-rnn Exercise: read MNIST column-wise, output the class at each step, and plot training performance as a function of columns read No lecture next week 59
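A minimal sketch of the data layout for the exercise, using tf.keras rather than the Theano code the lecture points to (an assumption; the layer size, optimizer and single epoch are arbitrary choices): each 28x28 image becomes a sequence of 28 column vectors, with the class predicted at every step.

import numpy as np
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0       # shape (60000, 28, 28): rows, then columns
x_train = np.transpose(x_train, (0, 2, 1))        # step through the image column by column
y_seq = np.repeat(y_train[:, None], 28, axis=1)   # the same class label as target at every step

model = keras.Sequential([
    keras.layers.SimpleRNN(128, return_sequences=True, input_shape=(28, 28)),
    keras.layers.TimeDistributed(keras.layers.Dense(10, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_seq, epochs=1, batch_size=128)
# per-column accuracy (performance as a function of columns read) can then be
# computed from model.predict(x_train) by comparing the argmax at each step to y_train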

References https://github.com/kjw0612/awesome-rnn http://arxiv.org/pdf/1507.01273.pdf http://karpathy.github.io/2015/05/21/rnn-effectiveness/ http://colah.github.io/posts/2015-08-understanding-lstms/ http://arxiv.org/pdf/1211.5063.pdf http://arxiv.org/abs/1506.02078 http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ http://arxiv.org/pdf/1506.03340.pdf http://arxiv.org/abs/1502.03044 http://arxiv.org/abs/1410.3916 http://arxiv.org/abs/1410.5401 http://arxiv.org/abs/1410.4615 http://arxiv.org/pdf/1506.05869.pdf 60