Deep Boltzmann Machines


Deep Boltzmann Machines Sargur N. Srihari srihari@cedar.buffalo.edu

Topics
1. Boltzmann machines
2. Restricted Boltzmann machines
3. Deep Belief Networks
4. Deep Boltzmann machines
5. Boltzmann machines for continuous data
6. Convolutional Boltzmann machines
7. Boltzmann machines for structured and sequential outputs
8. Other Boltzmann machines
9. Backpropagation through random operations
10. Directed generative nets
11. Drawing samples from autoencoders
12. Generative stochastic networks
13. Other generative schemes
14. Evaluating generative models
15. Conclusion

Topics in Deep Boltzmann Machines: What is a Deep Boltzmann machine (DBM)? Example of a Deep Boltzmann machine; DBM representation; DBM properties; DBM mean field inference; DBM parameter learning; layer-wise pretraining; jointly training DBMs.

What is a Deep Boltzmann Machine? It is a deep generative model. Like a Deep Belief Network (DBN) it is deep, but unlike a DBN it is an entirely undirected model. Whereas an RBM has only one hidden layer, a Deep Boltzmann Machine (DBM) has several hidden layers.

PGM for a DBM: Unlike a DBN, a DBM is an entirely undirected model. The example shown has one visible layer and two hidden layers. Connections exist only between units in neighboring layers. Like RBMs and DBNs, DBMs typically contain only binary units.

Parameterization of a DBM: The joint distribution is parameterized by an energy function E. For a DBM with one visible layer v and three hidden layers h^(1), h^(2), h^(3), the joint probability and the DBM energy function take the form shown below; bias parameters are omitted for simplicity. In comparison with the RBM energy function E(v,h) = -b^T v - c^T h - v^T W h, the DBM energy function includes connections between hidden units in the form of the weight matrices W^(2) and W^(3).
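For reference, the standard form of these expressions (a reconstruction of the formulas omitted from the transcription, with biases dropped as stated above) is:

\[
P(v, h^{(1)}, h^{(2)}, h^{(3)}) = \frac{1}{Z(\theta)} \exp\!\left(-E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta)\right)
\]

\[
E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -\, v^\top W^{(1)} h^{(1)} \;-\; h^{(1)\top} W^{(2)} h^{(2)} \;-\; h^{(2)\top} W^{(3)} h^{(3)}
\]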

Structure of a DBM: In comparison to fully connected Boltzmann machines (with every unit connected to every other unit), a DBM offers advantages similar to those offered by an RBM: the DBM layers can be organized into a bipartite graph.

DBM rearranged as a bipartite graph: With the odd layers on one side and the even layers on the other, the bipartite structure of the DBM becomes apparent.

Conditional distributions in a DBM: One implication of the bipartite structure is that we can use the same equations as for the conditionals of an RBM. Units within each layer are conditionally independent of each other given the values of the neighboring layers, so the distributions over these binary variables can be fully described by their Bernoulli parameters. For two hidden layers, the activation probabilities are given below.
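A reconstruction of the standard activation probabilities for the two-hidden-layer case (biases omitted, as in the energy function above):

\[
P(v_i = 1 \mid h^{(1)}) = \sigma\!\left(W_{i,:}^{(1)} h^{(1)}\right)
\]
\[
P(h_j^{(1)} = 1 \mid v, h^{(2)}) = \sigma\!\left(v^\top W_{:,j}^{(1)} + W_{j,:}^{(2)} h^{(2)}\right)
\]
\[
P(h_k^{(2)} = 1 \mid h^{(1)}) = \sigma\!\left(h^{(1)\top} W_{:,k}^{(2)}\right)
\]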

Gibbs sampling in a DBM: The bipartite structure makes Gibbs sampling efficient. The naive approach to Gibbs sampling is to update only one variable at a time. Just as RBMs allow all visible units to be updated in one block and all hidden units in a second block, a DBM with l layers allows all units to be updated in only two blocked iterations instead of l+1 updates. Efficient sampling is important for training with stochastic maximum likelihood.
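Below is a minimal sketch of one block-Gibbs sweep for a DBM with layers v, h^(1), h^(2). The function name and shapes are hypothetical, and biases are omitted to match the energy function above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_step(v, h2, W1, W2, rng):
    """One block-Gibbs sweep for a DBM with layers v - h1 - h2.

    Because the graph is bipartite (odd layers vs. even layers), h1 can be
    sampled in one block given (v, h2), and then v and h2 can be sampled
    together in a second block given h1.
    """
    # Block 1: sample the odd layer h1 given its neighbours v and h2.
    h1 = rng.binomial(1, sigmoid(v @ W1 + h2 @ W2.T))
    # Block 2: sample the even layers v and h2 given h1.
    v_new = rng.binomial(1, sigmoid(h1 @ W1.T))
    h2_new = rng.binomial(1, sigmoid(h1 @ W2))
    return v_new, h1, h2_new
```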

Interesting properties of DBMs: Computing P(h | v) is simpler for DBMs than for DBNs, and this simplicity allows richer approximations to the posterior. In the DBN case, classification is performed using a heuristic approximate inference procedure, in which the mean field expectation of the hidden units is provided by an upward pass in an MLP with sigmoid activations and the same weights as the original DBN. In a DBM, any Q(h) yields a variational lower bound on the log-likelihood, and the lack of intra-layer interactions makes it possible to use fixed-point equations to optimize this lower bound.

DBM and neuroscience: The use of proper mean-field inference allows the approximate inference procedure to capture the influence of top-down feedback interactions. This makes DBMs interesting from a neuroscience point of view: the human brain is known to use many top-down feedback connections, and for this reason DBMs have been used as computational models of real neuroscientific phenomena.

Sampling from a DBM is difficult: One unfortunate property is that sampling from a DBM is relatively difficult. This is in contrast with DBNs, which need to use MCMC sampling only in their top pair of layers; the other layers are used only at the end, in an efficient ancestral sampling pass. To generate a sample from a DBM, it is necessary to use MCMC across all layers, with every layer of the model participating in every Markov chain transition.

DBM mean field inference: The conditional distribution over one DBM layer given the neighboring layers is factorial. In the case of two hidden layers, these conditionals are P(v | h^(1)), P(h^(1) | v, h^(2)) and P(h^(2) | h^(1)). The distribution over all hidden layers jointly does not generally factorize, because of the interaction weights W^(2) between h^(1) and h^(2), which render these variables mutually dependent.

Approximating the DBM posterior: As with the DBN, we are left to seek out methods to approximate the DBM posterior distribution. Unlike the DBN, however, the DBM posterior over the hidden units (given the visible units) is easy to approximate variationally using mean field. In mean field we restrict the approximating distribution to fully factorial distributions; in DBMs, the mean field equations capture the bidirectional interactions between layers. The approximating family is the set of distributions in which the hidden units are conditionally independent.

Equations for mean-field inference: We develop the mean field approach for the example with two hidden layers. Let Q(h^(1), h^(2) | v) be the approximation to P(h^(1), h^(2) | v). The mean-field assumption means that Q(h^(1), h^(2) | v) = Π_j Q(h_j^(1) | v) Π_k Q(h_k^(2) | v). The mean field approximation attempts to find a member of this family that best fits the true posterior P(h^(1), h^(2) | v). Importantly, the inference process must be run again to find a distribution Q every time we use a new value of v.

Measuring how well Q(h | v) fits P(h | v): The variational approach is to minimize the KL divergence between Q and P (written out below). In general we do not have to provide a parametric form for the approximating distribution beyond enforcing the independence assumptions; the variational approximation is able to recover a functional form of the approximate distribution. However, in the case of mean field on binary hidden units, there is no loss of generality in fixing the parameterization in advance.
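The quantity being minimized (omitted from the transcription) is the KL divergence from the approximation to the true posterior:

\[
\mathrm{KL}\!\left(Q \,\|\, P\right) = \sum_{h} Q(h^{(1)}, h^{(2)} \mid v) \, \log \frac{Q(h^{(1)}, h^{(2)} \mid v)}{P(h^{(1)}, h^{(2)} \mid v)}
\]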

Parameterizing Q as a product of Bernoullis: We parameterize Q as a product of Bernoulli distributions, i.e., we associate the probability of each element of h^(1) and h^(2) with a parameter. Specifically, for each j, ĥ_j^(1) = Q(h_j^(1) = 1 | v), where ĥ_j^(1) ∈ [0,1], and for each k, ĥ_k^(2) = Q(h_k^(2) = 1 | v), where ĥ_k^(2) ∈ [0,1]. This gives the approximation to the posterior shown below.
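A reconstruction of the resulting factorial Bernoulli approximation (omitted from the transcription):

\[
Q(h^{(1)}, h^{(2)} \mid v) = \prod_j \left(\hat{h}_j^{(1)}\right)^{h_j^{(1)}} \left(1 - \hat{h}_j^{(1)}\right)^{1 - h_j^{(1)}}
\; \prod_k \left(\hat{h}_k^{(2)}\right)^{h_k^{(2)}} \left(1 - \hat{h}_k^{(2)}\right)^{1 - h_k^{(2)}}
\]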

Choosing a member of the family: In the mean field approximation, the optimal q(h_i | v) may be obtained by normalizing an unnormalized distribution; the general equations are derived by solving for where the derivatives of the variational lower bound are zero. Applying these general equations to the two-hidden-layer DBM, we obtain the update rules given below. They define an iterative algorithm in which we alternate updates of the two equations, and at a fixed point of this system of equations we have a local maximum of the variational lower bound L(Q).
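A reconstruction of the update rules referred to here, in their standard sigmoid fixed-point form for the two-hidden-layer case (biases omitted):

\[
\hat{h}_j^{(1)} = \sigma\!\left(\sum_i v_i W_{i,j}^{(1)} + \sum_{k'} W_{j,k'}^{(2)} \hat{h}_{k'}^{(2)}\right) \;\; \forall j,
\qquad
\hat{h}_k^{(2)} = \sigma\!\left(\sum_{j'} \hat{h}_{j'}^{(1)} W_{j',k}^{(2)}\right) \;\; \forall k
\]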

DBM parameter learning: Learning in a DBM must confront two intractabilities. First, an intractable posterior distribution: P(h | v) is intractable, and the solution is to construct a mean-field distribution Q(h | v). Second, an intractable partition function (since the DBM is undirected): the log-likelihood log P(v; θ) is intractable, and the solution is to maximize L(v, Q, θ), the variational lower bound on log P(v; θ).

Lower bound on the log-likelihood: For a DBM with two hidden layers, the lower bound takes the form shown below. The expression still contains the log partition function log Z(θ), so approximate methods are required: training the model requires the gradient of the log-partition function. DBMs are typically trained using stochastic maximum likelihood. Contrastive Divergence is slow for DBMs because they do not allow efficient sampling of the hidden units given the visible units; CD would require burning in a Markov chain every time a negative sample is required.
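A reconstruction of the two-hidden-layer lower bound referred to above (biases omitted), where H(Q) is the entropy of the mean-field distribution:

\[
\mathcal{L}(Q, \theta) = \sum_{i} \sum_{j'} v_i W_{i,j'}^{(1)} \hat{h}_{j'}^{(1)}
+ \sum_{j'} \sum_{k'} \hat{h}_{j'}^{(1)} W_{j',k'}^{(2)} \hat{h}_{k'}^{(2)}
- \log Z(\theta) + \mathcal{H}(Q)
\]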

Variational stochastic maximum likelihood algorithm for training a 2-layer DBM; a simplified sketch follows.
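A minimal sketch of one variational stochastic maximum likelihood (PCD) update for a two-hidden-layer DBM. The function names (`mean_field`, `sml_step`), shapes, and hyperparameters are hypothetical, and biases are omitted throughout; the positive phase uses the mean-field updates above, and the negative phase uses persistent block Gibbs chains:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=10):
    """Fixed-point mean-field updates for Q(h1, h2 | v) in a 2-layer DBM."""
    h1 = sigmoid(v @ W1)                 # initialize with an upward pass
    h2 = sigmoid(h1 @ W2)
    for _ in range(n_iters):
        h1 = sigmoid(v @ W1 + h2 @ W2.T)
        h2 = sigmoid(h1 @ W2)
    return h1, h2

def sml_step(v_data, chains, W1, W2, lr=1e-3, k=5, rng=None):
    """One variational stochastic-maximum-likelihood (PCD) update.

    Positive phase: mean-field expectations of the hidden units given data.
    Negative phase: k block-Gibbs sweeps on persistent Markov chains.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: variational expectations under Q(h | v).
    q1, q2 = mean_field(v_data, W1, W2)
    pos_W1 = v_data.T @ q1 / len(v_data)
    pos_W2 = q1.T @ q2 / len(v_data)

    # Negative phase: advance the persistent chains with block Gibbs.
    v_neg, h2_neg = chains
    for _ in range(k):
        h1_neg = rng.binomial(1, sigmoid(v_neg @ W1 + h2_neg @ W2.T))
        v_neg = rng.binomial(1, sigmoid(h1_neg @ W1.T))
        h2_neg = rng.binomial(1, sigmoid(h1_neg @ W2))
    h1_prob = sigmoid(v_neg @ W1 + h2_neg @ W2.T)
    neg_W1 = v_neg.T @ h1_prob / len(v_neg)
    neg_W2 = h1_prob.T @ h2_neg / len(v_neg)

    # Stochastic gradient ascent on the variational lower bound.
    W1 += lr * (pos_W1 - neg_W1)
    W2 += lr * (pos_W2 - neg_W2)
    return v_neg, h2_neg
```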

Layer-wise pretraining: Training a DBM using stochastic maximum likelihood from a random initialization usually results in failure, and various techniques have been developed to overcome this (including methods that permit joint training, discussed later). The most popular method is greedy layer-wise pretraining: each layer is trained in isolation as an RBM. The first layer is trained to model the input data, and each subsequent RBM is trained to model samples from the previous RBM's posterior distribution. All the RBMs are then combined to form a DBM, which may then be trained with PCD. PCD typically makes only a small change in the model's parameters and performance. A sketch of the greedy stacking loop follows.
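A hedged sketch of the greedy stacking loop described above. `train_rbm` is a hypothetical CD/PCD RBM trainer returning a weight matrix; details of how the learned weights are rescaled when assembling the DBM are omitted here:

```python
import numpy as np

def greedy_pretrain(data, hidden_sizes, train_rbm, rng=None):
    """Greedy layer-wise pretraining sketch for initializing a DBM.

    The first RBM models the input data; each subsequent RBM models samples
    of the previous RBM's hidden units. The returned weight matrices are
    used to initialize a DBM, which can then be fine-tuned with PCD.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights, inputs = [], data
    for n_hidden in hidden_sizes:
        W = train_rbm(inputs, n_hidden)   # hypothetical RBM trainer
        # Sample the hidden units to provide training data for the next RBM.
        probs = 1.0 / (1.0 + np.exp(-(inputs @ W)))
        inputs = rng.binomial(1, probs)
        weights.append(W)
    return weights
```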

DBM training to classify MNIST:
1. Train an RBM using CD to approximately maximize log p(v).
2. Train a second RBM that models h^(1) and the target class y using CD-k to approximately maximize log p(h^(1), y), where h^(1) is drawn from the first RBM's posterior conditioned on the data. Increase k from 1 to 20 during learning.
3. Combine the two RBMs into a DBM. Train it to approximately maximize log P(v, y) using stochastic maximum likelihood with k = 5.
4. Delete y from the model. Define a new set of features h^(1) and h^(2) obtained by running mean field inference in the model lacking y.
5. Use these features as input to an MLP whose structure is the same as an additional pass of mean field, with an additional output layer for the estimate of y. Initialize the MLP's weights to be the same as the DBM's weights. Train the MLP to approximately maximize log P(y | v) using SGD and dropout.

Jointly training DBMs: Classic DBMs require greedy unsupervised pretraining and, to perform classification well, require a separate MLP-based classifier on top of the hidden features they extract. This has undesirable properties. It is hard to track performance during training: we cannot evaluate the performance of the full DBM while training the first RBM, so we cannot tell how well the hyperparameters are working until late in training. In addition, software implementations of DBMs require many different components: CD training of the individual RBMs, PCD training of the full DBM, and the MLP-based classifier.

Resolving the joint training problem: There are two ways to resolve the joint training of DBMs. 1. The centered DBM, which reparameterizes the model in order to make the Hessian of the cost function better conditioned at the beginning of the learning process. 2. The multiprediction DBM (MP-DBM), which works by viewing the mean field equations as defining a family of recurrent networks for approximately solving every possible inference problem.

Multiprediction training of a DBM (describing the accompanying figure): Each row indicates a different example within a minibatch for the same training step, and each column represents a time step within the mean-field process. For each example, we sample a subset of the data variables to serve as inputs to the inference process; these are shaded black to indicate conditioning. We then run the mean field inference process, with arrows indicating which variables influence which other variables in the process.