RNNs as Directed Graphical Models

Sargur Srihari, srihari@buffalo.edu
This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676

10. Topics in Sequence Modeling: Overview
1. Unfolding Computational Graphs
2. Recurrent Neural Networks
3. Bidirectional RNNs
4. Encoder-Decoder Sequence-to-Sequence Architectures
5. Deep Recurrent Networks
6. Recursive Neural Networks
7. The Challenge of Long-Term Dependencies
8. Echo-State Networks
9. Leaky Units and Other Strategies for Multiple Time Scales
10. LSTM and Other Gated RNNs
11. Optimization for Long-Term Dependencies
12. Explicit Memory

10.2 Topics in Recurrent Neural Networks
10.2.0 Overview
10.2.1 Teacher forcing for output-to-hidden RNNs
10.2.2 Computing the gradient in an RNN
10.2.3 RNNs as Directed Graphical Models
10.2.4 Modeling Sequences Conditioned on Context with RNNs

10.2.3 Topics in RNNs as Directed Graphical Models
Motivation
What are directed graphical models?
RNNs as directed graphical models
Conditioning predictions on past values
Fully connected graphical model for a sequence
Hidden units as random variables
Efficiency of RNN parameterization
How to draw samples from the model

Motivation: From images to text
Combining ConvNet and Recurrent Net modules: a caption is generated by a recurrent neural network (RNN), where
1. a deep CNN produces a high-level representation of the image, and
2. the RNN is trained to translate that representation into a caption.

Better translation of images to captions
Attention gives the model a different focus for each word: lighter patches in the image receive more attention. As the model generates each word (shown in bold), it exploits this focus to achieve a better translation of images to captions.

What is a directed graphical model?
A directed graphical model (also called a Bayesian network) specifies a joint distribution p(x) = p(x_1, .., x_N) in terms of conditional distributions:
p(x) = ∏_{i=1}^{N} p(x_i | pa(x_i))
where the p(x_i | pa(x_i)) are conditional distributions and pa(x_i) denotes the parents of x_i in a directed graph.
Computational efficiency is obtained by omitting edges that do not correspond to strong interactions.
The model is generative: we can easily sample from a Bayesian network using ancestral sampling.
We will see here that RNNs have an interpretation as an efficient directed graphical model.
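As a concrete illustration of ancestral sampling (not part of the slides), the following sketch samples from a tiny two-variable Bayesian network p(a, b) = p(a) p(b | a); the variable names and probability tables are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPDs for a tiny Bayesian network p(a, b) = p(a) p(b | a)
p_a = np.array([0.6, 0.4])            # p(a): a takes the values 0 or 1
p_b_given_a = np.array([[0.9, 0.1],   # p(b | a=0)
                        [0.2, 0.8]])  # p(b | a=1)

def ancestral_sample():
    """Sample parents before children: first a, then b given a."""
    a = rng.choice(2, p=p_a)
    b = rng.choice(2, p=p_b_given_a[a])
    return a, b

print([ancestral_sample() for _ in range(5)])
```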

Example of a directed graphical model (Bayesian net)
Instead of the 2 × 2 × 3 × 2 × 2 = 48 entries of the full joint table, we need only 18 parameters using the directed graphical model (which makes certain conditional independence assumptions):
P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G)
For example,
P(i^1, d^0, g^2, s^1, l^0) = P(i^1) P(d^0) P(g^2 | i^1, d^0) P(s^1 | i^1) P(l^0 | g^2) = 0.3 × 0.6 × 0.08 × 0.8 × 0.4 = 0.004608
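A quick check of the slide's arithmetic, using only the CPD entries quoted above (the full tables of this student-network example are not reproduced here):

```python
# Joint probability of one assignment under the factorization
# P(D, I, G, S, L) = P(D) P(I) P(G|D,I) P(S|I) P(L|G),
# with only the CPD entries needed for this assignment.
p_i1 = 0.3               # P(i^1)
p_d0 = 0.6               # P(d^0)
p_g2_given_i1_d0 = 0.08  # P(g^2 | i^1, d^0)
p_s1_given_i1 = 0.8      # P(s^1 | i^1)
p_l0_given_g2 = 0.4      # P(l^0 | g^2)

joint = p_i1 * p_d0 * p_g2_given_i1_d0 * p_s1_given_i1 * p_l0_given_g2
print(joint)  # 0.004608
```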

The CPDs of the model depend on the RNN design pattern:
1. Recurrent connections between hidden units
2. Recurrent connections only from the output at one time step to the hidden units at the next time step

Loss function for RNNs
In feedforward networks our goal is to minimize
−E_{x∼p̂_data} log p_model(x)
which is the cross-entropy between the distribution of the training set (data) and the probability distribution defined by the model.
The cross-entropy between distributions p and q is defined as
H(p, q) = E_p[−log q] = H(p) + D_KL(p‖q)
For discrete distributions, H(p, q) = −Σ_x p(x) log q(x).
As with a feedforward network, we wish to interpret the output of the RNN as a probability distribution. With RNNs, the losses L^(t) are cross-entropies between the training targets y^(t) and the outputs o^(t).
Mean squared error is the cross-entropy loss associated with an output distribution that is a unit Gaussian.
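A minimal numpy check (not from the slides) of the identity H(p, q) = H(p) + D_KL(p‖q), with two made-up discrete distributions:

```python
import numpy as np

# Two made-up discrete distributions over three outcomes
p = np.array([0.5, 0.3, 0.2])   # "data" distribution
q = np.array([0.4, 0.4, 0.2])   # "model" distribution

cross_entropy = -np.sum(p * np.log(q))
entropy_p     = -np.sum(p * np.log(p))
kl_pq         =  np.sum(p * np.log(p / q))

# Cross-entropy decomposes as H(p) + D_KL(p || q)
print(cross_entropy, entropy_p + kl_pq)  # the two values agree
```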

Types of CPDs in an RNN
1. With a predictive log-likelihood training objective
L({x^(1), .., x^(τ)}, {y^(1), .., y^(τ)}) = Σ_t L^(t) = −Σ_t log p_model(y^(t) | {x^(1), .., x^(t)})
we train the RNN to estimate the conditional distribution of the next sequence element y^(t) given the past inputs; that is, we maximize the log-likelihood
log p(y^(t) | x^(1), .., x^(t))
2. Alternatively, if the model includes connections from the output at one time step to the next time step, we maximize
log p(y^(t) | x^(1), .., x^(t), y^(1), .., y^(t−1))
A numpy sketch of the first objective appears below.
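This is a minimal numpy sketch of the first objective (array names and shapes are my own, not from the slides): the per-time-step losses are cross-entropies between integer targets y^(t) and softmax-normalized outputs o^(t).

```python
import numpy as np

def sequence_nll(logits, targets):
    """Negative log-likelihood of a target sequence.

    logits:  shape (T, k), unnormalized output scores o^(t) over k classes
    targets: shape (T,), integer class indices y^(t)
    Returns the sum over t of -log p_model(y^(t) | inputs up to t).
    """
    # numerically stable log-softmax over the class dimension
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# Toy example: T = 3 time steps, k = 4 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))    # would come from the RNN's output layer
targets = np.array([1, 0, 3])
print(sequence_nll(logits, targets))
```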

Conditioning predictions on past y values
Decomposing the joint probability over a sequence of y values into a series of one-step probabilistic predictions is one way to capture the full joint distribution across the whole sequence.
1. If we do not feed past y values as inputs for the next-step prediction, the directed model contains no edges from any y^(i) in the past to the current y^(t). In this case the outputs y^(t) are conditionally independent given the x values.
2. If we do feed the actual y values (not their prediction, but the actual observed or generated values) back into the network, the directed PGM contains edges from all y^(i) values in the past to the current y^(t) value.

A simple example
An RNN models a sequence of scalar random variables Y = {y^(1), .., y^(τ)} with no additional inputs x; the input at time step t is simply the output at time step t−1. Such an RNN defines a directed graphical model over the y variables.
We parameterize the joint distribution of the observations using the chain rule of conditional probability:
P(Y) = P(y^(1), .., y^(τ)) = ∏_{t=1}^{τ} P(y^(t) | y^(t−1), y^(t−2), .., y^(1))
where the right-hand side of the bar is empty for t = 1.
Hence the negative log-likelihood of a set of values {y^(1), .., y^(τ)} according to such a model is
L = Σ_t L^(t), where L^(t) = −log P(y^(t) = y^(t) | y^(t−1), y^(t−2), .., y^(1))
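To make the chain-rule factorization concrete, here is a toy example (not an RNN; purely for illustration the conditional is a hand-made function of the full history) that accumulates log P(y^(t) | y^(t−1), .., y^(1)) step by step:

```python
import numpy as np

# Chain-rule factorization P(Y) = prod_t P(y^(t) | y^(<t)) for a binary sequence.
# The conditional below is a hand-made rule (the more 1s seen so far, the
# likelier the next value is 1), standing in for a learned model.
def p_next_is_one(history):
    return (1 + sum(history)) / (2 + len(history))

def log_joint(y_sequence):
    logp = 0.0
    history = []
    for y in y_sequence:
        p1 = p_next_is_one(history)
        logp += np.log(p1 if y == 1 else 1.0 - p1)
        history.append(y)
    return logp

print(log_joint([1, 0, 1, 1]))
```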

Fully connected graphical model for a sequence
Every past observation y^(i) may influence the conditional distribution of some y^(t) (for t > i). Parameterizing the model this way, with a separate table for each conditional, is inefficient. RNNs obtain the same full connectivity but with an efficient parameterization, as shown next.

Introducing the state variable in the PGM of an RNN
Introducing the state variable into the PGM of the RNN, even though it is a deterministic function of its inputs, helps us see how we obtain an efficient parameterization:
h^(t) = f(h^(t−1), x^(t); θ)
Every stage in the sequence (for h^(t) and y^(t)) involves the same structure and can share the parameters with the other stages.
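A minimal numpy sketch of this recurrence (sizes and initialization are arbitrary), showing that one set of parameters θ = (W, U, b) is reused at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3

# A single set of parameters theta = (W, U, b), reused at every time step
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden-to-hidden
U = rng.normal(scale=0.1, size=(n_hidden, n_input))   # input-to-hidden
b = np.zeros(n_hidden)

def f(h_prev, x_t):
    """One step of the recurrence h^(t) = f(h^(t-1), x^(t); theta)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

# The same f (same W, U, b) is applied at every stage of the sequence
xs = rng.normal(size=(5, n_input))   # a length-5 input sequence
h = np.zeros(n_hidden)
for x_t in xs:
    h = f(h, x_t)
print(h)
```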

Efficient parameterization in RNNs
If y can take on k different values, the tabular representation of the joint distribution would need O(k^τ) parameters. By comparison, because of parameter sharing, the number of parameters in the RNN is O(1) as a function of sequence length.
Even with this efficient parameterization of the PGM, some operations remain computationally challenging; for example, it is difficult to predict missing values in the middle of the sequence.
The price RNNs pay for the reduced number of parameters is that optimizing those parameters may be difficult.
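A back-of-the-envelope comparison with illustrative sizes only (a simple single-layer RNN is assumed; real architectures differ):

```python
# Rough comparison of parameter counts (illustrative numbers only)
k, tau, n_hidden = 10, 20, 64

# Tabular joint over a length-tau sequence of k-valued variables: O(k^tau)
tabular_params = k ** tau

# A simple RNN with shared weights: input-to-hidden, hidden-to-hidden and
# hidden-to-output matrices plus biases -- independent of tau
rnn_params = (k * n_hidden) + (n_hidden * n_hidden) + (n_hidden * k) \
             + n_hidden + k

print(f"tabular: {tabular_params:.3e}   rnn: {rnn_params}")
```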

Stationarity in RNNs
Parameter sharing in RNNs relies on the assumption that the same parameters can be used at different time steps; equivalently, that the CPD over the variables at time t+1 given the variables at time t is stationary.
In principle it is possible to use t as an extra input at each time step, letting the learner discover any time-dependence itself.

How to draw samples from the model
We simply sample from the conditional distribution at each time step. There is one additional complication: the RNN must have some mechanism for determining the length of the sequence.

Methods for predicting the end of the sequence
This can be achieved in various ways:
1. Add a special symbol for end of sequence. When the symbol is generated, the sampling process stops.
2. Introduce an extra Bernoulli output representing the decision to either continue or halt generation at each time step. This is more general than the previous method, since it can be added to any RNN rather than only to RNNs that output a sequence of symbols, e.g., an RNN that emits a sequence of real numbers.
3. Add an extra output to the model that predicts τ. The model can then sample a value of τ and sample τ steps' worth of data.
A minimal sampling loop using method 1 is sketched below.
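This sketch of an ancestral sampling loop with an end-of-sequence symbol is illustrative only: the next-symbol distribution is a hand-made stand-in for a trained RNN's softmax output, and the vocabulary is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "b", "c", "<eos>"]   # hypothetical symbol set with an end marker

def next_symbol_probs(history):
    """Stand-in for the RNN's per-step output distribution.

    A real model would condition on its hidden state; here the <eos>
    probability simply grows with the length of the history.
    """
    p_eos = min(0.1 * (len(history) + 1), 0.9)
    p_other = (1.0 - p_eos) / 3
    return np.array([p_other, p_other, p_other, p_eos])

def sample_sequence(max_len=50):
    history = []
    for _ in range(max_len):
        probs = next_symbol_probs(history)
        sym = vocab[rng.choice(len(vocab), p=probs)]
        if sym == "<eos>":   # method 1: stop when the end symbol is generated
            break
        history.append(sym)
    return history

print(sample_sequence())
```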