Recurrent Neural Networks


Javier Béjar. Deep Learning, 2018/2019 Fall. Master in Artificial Intelligence (FIB-UPC)

Introduction

Sequential data
Many problems are described by sequences:
- Time series
- Video/audio processing
- Natural Language Processing (translation, dialogue)
- Bioinformatics (DNA/protein)
- Process control
Modeling the problem means extracting the dependencies between the elements of the sequence.

Long time dependencies
Sequences can be modeled using non-sequential ML methods (e.g. sliding windows), but:
- All sequences must have the same length
- The order of the elements always matters
- We cannot model dependencies longer than the chosen sequence length
We need methods that explicitly model time dependencies and are capable of:
- Processing examples of arbitrary length
- Providing different mappings (one to many, many to one, many to many)

Input-Output mapping (diagram): usual NN vs. recurrent NN, with one-to-many, many-to-one, and many-to-many mappings

One to many - Image captioning
Example: "Black and white dog jumps over bar" (from Deep Visual-Semantic Alignments for Generating Image Descriptions, Andrej Karpathy and Li Fei-Fei)

Many to One - Sentiment Analysis
- "My flight was just delayed, s**t" → Negative
- "Never again BA, thanks for the dreadful flight" → Negative
- "We arrived on time, yeehaaa!" → Positive
- "Another day, another flight" → Neutral
- "Efficient, quick, delightful, always with BA" → Positive

Many to Many - Machine Translation
[How, many, programmers, for, changing, a, lightbulb, ?]
[Wie, viele, Programmierer, zum, Wechseln, einer, Glühbirne, ?]
[Combien, de, programmeurs, pour, changer, une, ampoule, ?]
[¿Cuántos, programadores, para, cambiar, una, bombilla, ?]
[Zenbat, bonbilla, bat, aldatzeko, programatzaileak, ?]

Recurrent Networks

Recurrent Neural Networks
- RNNs are feed-forward NNs with edges that span adjacent time steps (recurrent edges)
- At each time step, nodes receive input from the current data and from the previous state
- As a result, input data from previous time steps can influence the output at the current time step
- RNNs are universal function approximators (Turing complete)

Recurrent Node (diagram)

Recurrent Neural Networks
- The input (x) is a vector of values for time t
- The hidden node (h) stores the state
- Weights are shared through time
- At each step the computation uses the previous step's state:
  h_{t+1} = f(h_t, x_{t+1}; θ) = f(f(h_{t-1}, x_t; θ), x_{t+1}; θ) = ...
- We can think of an RNN as a deep network that stacks layers through time

Training RNN (unfolding) (diagram)

Activation Functions
There are different choices for the activation function used to compute the hidden state:
- The hyperbolic tangent (tanh) is a popular choice over the usual sigmoid function
- Good results are also achieved using the rectified linear unit (ReLU) instead

RNN computation
a_t = b + W h_{t-1} + U x_t   (1)
h_t = tanh(a_t)               (2)
y_t = c + V h_t               (3)
b and c are biases; an additional step on the output (e.g. a softmax) can be added depending on the task.
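Not part of the original slides: a minimal NumPy sketch of these three equations applied over a whole sequence. The names U, W, V, b, c follow the formulas above; shapes and the zero initial state are assumptions.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c, h0=None):
    """Vanilla RNN forward pass over a sequence x_seq of shape (T, n_input).
    U: (n_hidden, n_input), W: (n_hidden, n_hidden), V: (n_out, n_hidden)."""
    n_hidden = W.shape[0]
    h = np.zeros(n_hidden) if h0 is None else h0
    hs, ys = [], []
    for x_t in x_seq:
        a_t = b + W @ h + U @ x_t      # a_t = b + W h_{t-1} + U x_t
        h = np.tanh(a_t)               # h_t = tanh(a_t)
        y_t = c + V @ h                # y_t = c + V h_t
        hs.append(h)
        ys.append(y_t)
    return np.stack(hs), np.stack(ys)
```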

Training RNNs
- RNNs are usually trained using backpropagation
- The computation is unfolded through the sequence to propagate the activations and compute the gradient
- This is known as Backpropagation Through Time (BPTT)
- Usually the input is limited in length to reduce the computational cost; this is known as Truncated BPTT and assumes that the influence of the past is limited to a time horizon

Recurrent NN unfolded (diagram)

Recurrent NN (Regression/Classification) (diagram: unfolded network with the loss computed on the final output)

Recurrent NN (Sequence to Sequence) (diagram: unfolded network with a loss computed at every output step)

Gradient problems
Two main problems arise during training:
- Exploding gradient
- Vanishing gradient
These problems appear because of the sharing of weights: the recurrent edge weights, in combination with the activation function, magnify (W > 1) or shrink (W < 1) the gradient exponentially with the length of the sequence.
Clipping gradients and regularization are the usual remedies for the exploding gradient.
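As an illustration (not from the slides), Keras optimizers expose gradient clipping directly through the clipnorm/clipvalue arguments; a minimal sketch with an assumed learning rate:

```python
from tensorflow import keras

# Clip gradients by norm (1.0) before each update to mitigate exploding gradients
optimizer = keras.optimizers.RMSprop(learning_rate=1e-3, clipnorm=1.0)
# Alternative: clip each gradient component to the range [-0.5, 0.5]
# optimizer = keras.optimizers.RMSprop(learning_rate=1e-3, clipvalue=0.5)

# The optimizer is then passed to the model as usual:
# model.compile(optimizer=optimizer, loss="mse")
```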

Recurrent NN (Gradient problems) (diagram)

Recurrent NN (Gradient problems)
Applying the chain rule shows that, when propagating values forward and backward, the longer the sequence, the smaller the influence of the past:
∂f_t/∂W = (∂f_t/∂s_t)(∂s_t/∂W) = (∂f_t/∂s_t)(∂s_t/∂s_{t-1})(∂s_{t-1}/∂W) = ... = (∂f_t/∂s_t)(∂s_t/∂s_{t-1})···(∂s_2/∂s_1)(∂s_1/∂W)

Beyond Vanilla RNNs
- Learning long time dependencies is difficult for vanilla RNNs
- More sophisticated recurrent architectures allow reducing the gradient problems
- Gated RNNs introduce memory and gating mechanisms that control:
  - when to store information in the state
  - how much the new information changes the state

LSTMs

Long Short-Term Memory units (LSTMs)
- LSTMs specialize in learning long time dependencies
- They are composed of a memory cell and control gates
- Gates regulate how much the new information changes the state and how it flows to the next step:
  - Forget gate
  - Input gate
  - Update gate
  - Output gate

Long Short-Term Memory units (LSTMs)
- An LSTM propagates a hidden state (h_t) and a cell state (c_t)
- The hidden state acts as a short-term memory (large updates)
- The cell state acts as a long-term memory (small updates)
- Gates control the updates from layer to layer
- Information can flow between the short-term and the long-term memory

LSTMs (diagram from http://colah.github.io/posts/2015-08-understanding-lstms/)

Role of the activation functions
- Gates use the sigmoid and tanh activation functions
- Their role is to perform fuzzy decisions:
  - tanh squashes values to the range [-1, 1] (subtract, neutral, add)
  - sigmoid squashes values to the range [0, 1] (closed, open)

LSTM computations
f_t = σ(W_f [h_{t-1}, x_t])          (forget gate)
i_t = σ(W_i [h_{t-1}, x_t])          (input gate)
c̃_t = tanh(W_c [h_{t-1}, x_t])       (candidate cell state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t      (cell update)
o_t = σ(W_o [h_{t-1}, x_t])          (output gate)
h_t = o_t ⊙ tanh(c_t)                (new hidden state)
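For reference (not from the slides), a minimal NumPy sketch of a single LSTM step following the equations above, with W_f, W_i, W_c, W_o acting on the concatenation [h_{t-1}, x_t]; bias terms are omitted, as on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    """One LSTM step. Each W_* has shape (n_hidden, n_hidden + n_input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z)                 # forget gate
    i_t = sigmoid(W_i @ z)                 # input gate
    c_tilde = np.tanh(W_c @ z)             # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # cell update
    o_t = sigmoid(W_o @ z)                 # output gate
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t
```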

GRUs

Gated Recurrent Units
- GRUs reduce the complexity of LSTMs
- The forgetting and updating decisions are unified into a single update gate
- The update gate computes how the input and the previous state are combined
- A reset gate controls the access to the previous state:
  - close to one: the previous state has more effect
  - close to zero: the new (candidate) state has more effect

GRU computations
z_t = σ(W_z [h_{t-1}, x_t])          (update gate)
r_t = σ(W_r [h_{t-1}, x_t])          (reset gate)
h̃_t = tanh(W_h [r_t ⊙ h_{t-1}, x_t]) (candidate state)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
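Similarly (a reference sketch only, biases omitted), one GRU step following the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step. Each W_* has shape (n_hidden, n_hidden + n_input)."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in)                                      # update gate
    r_t = sigmoid(W_r @ z_in)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                     # mix old and new state
    return h_t
```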

Pros/Cons

Vanilla RNNs vs LSTMs vs GRUs
- Empirically, vanilla RNNs underperform on complex tasks
- LSTMs are widely used, while GRUs are much more recent (2014)
- There are not yet theoretical arguments settling the LSTMs vs GRUs question
- Empirical studies do not shed much light on the question either (see the references below)

Vanilla RNNs vs LSTMs vs GRUs - References
- Jozefowicz, R., Zaremba, W., Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2342-2350.
- Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Other RNN Variations

Bidirectional RNNs - Back from the future
- In some domains it is easier to learn dependencies if information flows in both directions
- For instance, domains where decisions depend on the whole sequence:
  - Part-of-speech tagging
  - Machine translation
  - Speech/handwriting recognition
- An RNN can be split in two, so the sequence is processed in both directions

Regular RNNs (only forward) (diagram)

Bidirectional RNNs (forward-backward) (diagram)

Bidirectional RNNs - Training
- The two directions are independent (no interconnections)
- Backpropagation must follow the graph dependencies, so not all weights can be updated at the same time
- Forward propagation: first compute the forward and backward RNN passes from the inputs, then the outputs
- Backward propagation: first compute the forward and backward RNN passes from the outputs, then the inputs
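Not in the slides, but as a practical note: in Keras this is a one-line change, since the Bidirectional wrapper runs one copy of the recurrent layer forward and one backward and concatenates their outputs. A minimal sketch; the input dimensionality (50) and number of tags (10) are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(None, 50)),                                  # variable-length sequences of 50-dim vectors
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),    # forward + backward passes, concatenated
    layers.TimeDistributed(layers.Dense(10, activation="softmax")),  # e.g. one tag per time step (POS tagging)
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```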

Sequence to sequence - Direct sequence association
- Input and output sequences have the same length
- All the outputs of the unfolded network (BPTT) are used for the output
- Training is straightforward
(diagram: input sequence of length N → RNN at each step → output sequence of length N)

Sequence to sequence - Encoder-decoder architecture
- Input and output sequences can have different lengths
- The encoder RNN summarizes the input in a coding state
- The decoder RNN generates the output from that state
- There are different options for connecting the encoder and decoder (direct, peeking, attention) and for training (teacher forcing)
- Inference: the sequence is generated element by element, using the output of the previous step or a beam search

Encoder-Decoder (Plain) (diagram): the encoder reads "Today it is raining"; the decoder, starting from the encoder state and the <start> token, generates "hoy está lloviendo <eos>"

Encoder-Decoder (Plain) - Inference
- When generating a sequence of discrete values, the greedy approach may not be the best policy
- A limited exploration of the branching alternatives is needed (beam search)
- The sequence with the largest joint probability is the one generated
(diagram: beam-search tree over candidate translations, e.g. "Hoy (0.6)" / "Ahora (0.3)" → "está" / "estaba" → "lloviendo (0.9)")
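A generic beam-search sketch (illustrative, not the lab code). The function step_probs(prefix) is an assumed stand-in for whatever returns the next-token distribution given the decoded prefix (e.g. one decoder step).

```python
import math

def beam_search(step_probs, start_token, end_token, beam_width=3, max_len=20):
    """Keep the beam_width partial sequences with the highest joint (log) probability."""
    beams = [([start_token], 0.0)]                      # (sequence, sum of log-probs)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                    # finished sequences are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, prob in step_probs(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                                  # best-scoring sequence
```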

Encoder-Decoder (Peeking) (diagram): "Today it is raining" → "hoy está lloviendo <eos>", with the encoder state also fed to the decoder at every step

Encoder-Decoder (Attention) (diagram): at each decoding step the decoder attends to the encoder states with learned weights (0.8, 0.1, 0.9, 0.95 in the example) while generating "hoy está lloviendo <eos>"

Encoder-Decoder (Teacher Forcing) (diagram): during training the decoder is fed the ground-truth previous tokens ("<start> hoy está lloviendo") instead of its own predictions, and is trained to produce "hoy está lloviendo <eos>" from the encoder state
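A minimal tf.keras sketch of this training setup, under assumptions not stated on the slides (vocabulary sizes, embedding/state dimension, token-id inputs): the encoder's final state initializes the decoder, and the decoder receives the target sequence shifted right as input (teacher forcing).

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_in, vocab_out, dim = 5000, 6000, 128   # assumed sizes

# Encoder: reads the source sentence and keeps only its final state
enc_inputs = keras.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_in, dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: receives the target sequence shifted right (<start> w1 w2 ...) as input
dec_inputs = keras.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_out, dim)(dec_inputs)
dec_out = layers.LSTM(dim, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
dec_probs = layers.Dense(vocab_out, activation="softmax")(dec_out)

model = keras.Model([enc_inputs, dec_inputs], dec_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([source_ids, target_ids_shifted], target_ids, ...)
```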

Augmented neural networks
- RNNs are Turing complete, but this is difficult to exploit in practice
- New architectures include:
  - Data structures to store information (read/write operations)
  - RNNs that control the operations
  - Attention mechanisms
References:
- Graves, Wayne, Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401.
- C. Olah, S. Carter. Attention and Augmented Recurrent Neural Networks. Distill, Sept 9, 2016.

Applications

Large variety of applications
- Time series prediction
- NLP (POS tagging, machine translation, question answering)
- Reasoning
- Multimodal (caption generation for images/video)
- Speech recognition/generation
- Reinforcement learning

Guided Laboratory

Sequence prediction
Sequence to value:
- Predicting the next step of a time series
- Classification of time series
- Predicting sentiment from tweets
- Text generation (predicting characters)
Sequence to sequence:
- Learning to add

Task 1: Air Quality prediction (time series regression)
- Dataset: Air Quality, one measurement every hour
- Goal: predict the wind speed for the next hour
- Setup: 35064 train / 8784 test examples, window of 6 lags
- Model: time series window → RNN (32 neurons, 1 layer) → Dense → value at t+1; 30 epochs, training time about 30 sec
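A sketch of what such a model could look like in Keras (the actual lab notebook may differ); dataset loading and windowing are assumed to produce x_train of shape (n, 6, n_features) and y_train of shape (n, 1), and the number of features per step is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

lag, n_features = 6, 1   # window of the 6 previous hours

model = keras.Sequential([
    layers.Input(shape=(lag, n_features)),
    layers.LSTM(32),     # 32 neurons, 1 recurrent layer
    layers.Dense(1),     # regression: wind speed at t+1
])
model.compile(optimizer="rmsprop", loss="mse")
# model.fit(x_train, y_train, epochs=30, ...)
```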

Task 2: Electric Devices (time series classification)
- Dataset: daily power consumption of household devices (7 classes, 96 attributes)
- Goal: predict the household device class
- Model: power consumption series → RNN (64 neurons, 2 layers) → Dense → device class; 30 epochs, training time about 1 min
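A comparable Keras sketch for the classification task (again an assumption about the exact lab code): inputs of shape (96, 1), one-hot targets over the 7 classes.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(96, 1)),              # one day of consumption, 96 readings
    layers.LSTM(64, return_sequences=True),   # first recurrent layer
    layers.LSTM(64),                          # second recurrent layer
    layers.Dense(7, activation="softmax"),    # one probability per device class
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=30, ...)
```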

Task 3: Sentiment analysis
- Dataset: tweets labeled negative, positive, or neutral, treated as sequences of words
- Preprocess: generate a vocabulary, recode the sequences, embed them into a more convenient space
- Model: tweet sequence → Embedding (5000 words, 40-dim) → RNN (64 neurons, 1 layer) → Dense → sentiment; 50 epochs, training time about 3 min
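A possible Keras version of this pipeline (illustrative; the padding length maxlen is an assumption, and tweets are assumed to be already recoded as padded word-id sequences):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, maxlen = 5000, 40, 50

model = keras.Sequential([
    layers.Input(shape=(maxlen,)),            # tweets as padded word-id sequences
    layers.Embedding(vocab_size, embed_dim),  # 5000-word vocabulary, 40-dim embedding
    layers.LSTM(64),                          # 64 neurons, 1 layer
    layers.Dense(3, activation="softmax"),    # negative / neutral / positive
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=50, ...)
```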

Task 4: Text Generation
- Dataset: poetry text (poetry1); character prediction from text windows
- Preprocess: sequences of characters, one-hot encoded
- Generation: text is produced by predicting characters iteratively
- Model: text window (50 characters, skip 3) → one-hot encoding → RNN (64 neurons, 1 layer) → Dense → next character; 10 iterations of 10 epochs, training time about 23 min
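A sketch of the character model and the iterative generation loop. This is hedged: the lab's alphabet size, sampling strategy (greedy argmax here), and the lookup tables char_to_idx/idx_to_char are assumptions; the seed text should be at least one window long.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window, n_chars = 50, 60   # 50-character windows; alphabet size is an assumption

model = keras.Sequential([
    layers.Input(shape=(window, n_chars)),        # one-hot encoded character windows
    layers.LSTM(64),
    layers.Dense(n_chars, activation="softmax"),  # distribution over the next character
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

def generate(seed, length, char_to_idx, idx_to_char):
    """Generate text by repeatedly predicting the next character and sliding the window."""
    text = seed
    for _ in range(length):
        x = np.zeros((1, window, n_chars))
        for i, ch in enumerate(text[-window:]):   # one-hot encode the last window of characters
            x[0, i, char_to_idx[ch]] = 1.0
        next_idx = int(np.argmax(model.predict(x, verbose=0)))
        text += idx_to_char[next_idx]
    return text
```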

Task 5: Learning to add (sequence to sequence)
- Goal: predict addition results from text; input sequence: NUMBER+NUMBER, output sequence: NUMBER
- Preprocess: sequences of characters, one-hot encoded
- Model: RNN encoder + RNN decoder (128 neurons, 1 layer); 50000 examples with 3-digit numbers, 50 epochs, training time about 5 min 30 sec

Task 5: Learning to add (architecture): one-hot encoded "XXX+XXX" → RNN (encoder) → RepeatVector → RNN (decoder) → TimeDistributed(Dense) → "XXXX"
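This pipeline maps almost directly onto Keras layers. A sketch under stated assumptions: 3-digit numbers, so inputs of length 7 ("XXX+XXX") and outputs of length 4 ("XXXX"), with a 12-symbol alphabet ("0123456789+ "); the exact lab code may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

in_len, out_len, n_symbols = 7, 4, 12   # "XXX+XXX" -> "XXXX", one-hot over "0123456789+ "

model = keras.Sequential([
    layers.Input(shape=(in_len, n_symbols)),
    layers.LSTM(128),                          # RNN encoder: summarizes the input string
    layers.RepeatVector(out_len),              # feed the summary at every output step
    layers.LSTM(128, return_sequences=True),   # RNN decoder
    layers.TimeDistributed(layers.Dense(n_symbols, activation="softmax")),  # one symbol per step
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=50, ...)
```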