Learning the Structure of Deep, Sparse Graphical Models


Learning the Structure of Deep, Sparse Graphical Models
Hanna M. Wallach, University of Massachusetts Amherst (wallach@cs.umass.edu)
Joint work with Ryan Prescott Adams & Zoubin Ghahramani

Deep Belief Networks

"Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. [...] The lower layers receive top-down, directed connections from the layer above. The states of the units in the lowest layer represent a data vector." (Geoff Hinton ('09), Scholarpedia)

Network Structure

Structural questions: how many units in each hidden layer? How many hidden layers? What network connectivity? What type(s) of unit behavior? Goal: learn the structure.

This Talk

A nonparametric Bayesian approach for learning the structure of a layered, directed, deep belief network. [photos: Mike Jordan, Yee Whye Teh, Geoff Hinton, Zoubin Ghahramani]

Outline

Background: finite single-layer networks. Infinite belief networks: learning the number of hidden units in each layer, learning the number of hidden layers, learning the type(s) of unit behavior. Experimental results.

Outline: Background: finite single-layer networks.

Finite Single-Layer Networks

[figure: hidden units connected to visible units] Use a binary matrix to represent the edge structure (connectivity) of a directed graph. A prior distribution on binary matrices then gives a prior distribution on single-layer belief networks.

Finite Single-Layer Networks

Can we let these binary matrices have an infinite number of columns? An infinite number of columns would mean an infinite number of hidden units. Yes: the Indian buffet process.
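
To make the finite model concrete, here is a minimal Python sketch (our own illustration, not code from the talk) that draws such a binary connectivity matrix from a standard finite beta-Bernoulli prior; letting the number of hidden units K go to infinity in this construction yields the Indian buffet process introduced next. Function and variable names are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_finite_connectivity(num_visible, K, alpha=2.0):
        # Each hidden unit k gets its own connection probability
        # pi_k ~ Beta(alpha / K, 1); the K -> infinity limit of this
        # finite beta-Bernoulli model is the Indian buffet process.
        pi = rng.beta(alpha / K, 1.0, size=K)
        # Z[i, k] = 1 means visible unit i has an edge from hidden unit k.
        Z = (rng.random((num_visible, K)) < pi).astype(int)
        return Z

    print(sample_finite_connectivity(num_visible=6, K=4))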

Outline: Background: finite single-layer networks. Infinite belief networks: learning the number of hidden units in each layer.

Infinitely-Wide Layers [Wood, Griffiths & Ghahramani, '06]

Use an Indian buffet process (IBP) as a prior on binary matrices with countably infinite columns: an unbounded number of hidden units. Posterior inference determines the subset of hidden units responsible for the observations. The IBP ensures that the matrices are extremely sparse: there is always a finite number of nonzero columns, and hence a finite number of actively-used hidden units.

The Indian Buffet Process [Ghahramani et al., '07]

[figure: binary matrix, rows = customers, columns = dishes] Parameters: α and β. The first customer tries Poisson(α) dishes. The n-th customer tries: previously-tasted dish k with probability n_k / (β + n − 1), and Poisson(αβ / (β + n − 1)) completely new dishes.
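
The restaurant metaphor above translates almost line-for-line into code. Below is a minimal sketch of a two-parameter IBP sampler; it is our own illustration of the stated process, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_ibp(num_customers, alpha=2.0, beta=1.0):
        counts = []   # n_k: number of customers who have tried dish k
        rows = []
        for n in range(1, num_customers + 1):
            row = []
            # Previously-tasted dish k with probability n_k / (beta + n - 1).
            for k, n_k in enumerate(counts):
                tasted = int(rng.random() < n_k / (beta + n - 1))
                row.append(tasted)
                counts[k] += tasted
            # Poisson(alpha * beta / (beta + n - 1)) completely new dishes;
            # for n = 1 this reduces to Poisson(alpha).
            new_dishes = rng.poisson(alpha * beta / (beta + n - 1))
            row.extend([1] * new_dishes)
            counts.extend([1] * new_dishes)
            rows.append(row)
        # Pack the ragged rows into a binary customers-by-dishes matrix.
        Z = np.zeros((num_customers, len(counts)), dtype=int)
        for i, row in enumerate(rows):
            Z[i, :len(row)] = row
        return Z

    print(sample_ibp(8))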

Properties of the IBP

For a finite number of customers, there will always be a finite number of dishes tasted. Rows and columns are infinitely exchangeable. There is a related stick-breaking construction. The IBP is popular for shared latent feature (hidden cause) models. Latent features can be added/removed from a model without dimensionality-altering MCMC methods.
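
The stick-breaking construction mentioned above can be sketched in a few lines for the one-parameter IBP: feature probabilities decay as running products of Beta(α, 1) draws. Again our own illustration, truncated at K features.

    import numpy as np

    rng = np.random.default_rng(0)

    def ibp_stick_breaking(alpha, K):
        # pi_k = nu_1 * nu_2 * ... * nu_k with nu_j ~ Beta(alpha, 1),
        # so the feature probabilities are strictly decreasing in k.
        nu = rng.beta(alpha, 1.0, size=K)
        return np.cumprod(nu)

    print(ibp_stick_breaking(alpha=2.0, K=10))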

Single-Layer Belief Networks [Wood, Griffiths & Ghahramani, '06]

[figure: the IBP matrix read as a bipartite network] Parameters: α and β. Customers = observed units; dishes = hidden units.

Single-Layer → Multi-Layer?

Single-layer belief networks have limited utility: hidden units are independent a priori. Deep networks have multiple hidden layers, so hidden units are dependent a priori. Goal: extend the IBP in order to construct deep belief networks with unbounded width and depth.

Outline: Background: finite single-layer networks. Infinite belief networks: learning the number of hidden units in each layer, learning the number of hidden layers.

Multi-Layer Belief Networks

[figure: a belief network built up layer by layer, each new hidden layer generated above the last]

Layered Belief Nets

We could use a finite number of IBPs. But what about using an infinite recursion, where every dish is a customer in another restaurant? This gives an unbounded number of layers, each of unbounded width. Remarkably, we will always hit a layer with zero units, so the recursion always stops at a finite but unbounded depth: we don't have to make an ad hoc choice of depth.

The Cascading IBP (CIBP)

A stochastic process which results in an infinite sequence of infinite binary matrices. Each matrix is exchangeable in both rows and columns. How do we know the CIBP converges? The number of dishes in one layer depends only on the number of customers in the previous layer, so layer widths form a Markov chain, and one can prove that this chain reaches an absorbing state (a layer with zero units) in finite time with probability one.
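
Reusing the sample_ibp sketch from earlier, the cascade itself is a short loop: the dishes of one layer become the customers of the next, and sampling stops when a layer introduces no dishes. The max_layers cap below is only a safety guard for this illustrative sketch; as stated above, the process reaches the absorbing state on its own with probability one.

    def sample_cibp(num_visible, alpha=1.5, beta=1.0, max_layers=100):
        # sample_ibp is the two-parameter IBP sampler sketched earlier.
        layers = []
        num_units = num_visible          # start from the visible units
        while num_units > 0 and len(layers) < max_layers:
            Z = sample_ibp(num_units, alpha, beta)
            layers.append(Z)             # customers-by-dishes connectivity
            num_units = Z.shape[1]       # dishes become the next layer's customers
        return layers

    # Widths of the successive hidden layers for one draw from the prior.
    print([Z.shape[1] for Z in sample_cibp(10)])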

CIBP Properties

For a unit in layer m+1: expected # of parents: α; expected # of children: c(β, K_m) = 1 + (K_m − 1) / (1 + β), where K_m is the width of layer m. In the limits, lim_{β→0} c(β, K_m) = K_m and lim_{β→∞} c(β, K_m) = 1. We do not want network properties to be constant at all depths (e.g., some levels should be sparser than others): each layer can have different IBP parameters α and β, so long as they are bounded from above.
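
A quick numeric sanity check (our own snippet) of the expected-children formula and its two limits:

    def expected_children(beta, K_m):
        # c(beta, K_m) = 1 + (K_m - 1) / (1 + beta)
        return 1.0 + (K_m - 1) / (1.0 + beta)

    print(expected_children(1e-9, 10))   # ~10 = K_m  as beta -> 0
    print(expected_children(1e9, 10))    # ~1         as beta -> infinity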

Samples from the CIBP Prior [figure]

Outline: Background: finite single-layer networks. Infinite belief networks: learning the number of hidden units in each layer, learning the number of hidden layers, learning the type(s) of unit behavior.

Learning Unit Behavior [Frey, '97; Frey & Hinton, '99]

Unit activations are weighted linear sums with biases. Nonlinear Gaussian belief network approach: output = sigmoid(activation + Gaussian noise with precision ν).
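
As a sketch, a single unit of such a nonlinear Gaussian belief network can be simulated as below; the variable names are ours, and this is an illustration of the described activation rule rather than the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def ngbn_unit(parent_states, weights, bias, nu):
        # Weighted linear sum with bias, corrupted by Gaussian noise
        # with precision nu, then squashed by a sigmoid.
        activation = weights @ parent_states + bias
        noise = rng.normal(0.0, 1.0 / np.sqrt(nu))
        return sigmoid(activation + noise)

    print(ngbn_unit(np.array([0.0, 1.0, 1.0]),
                    np.array([0.5, -1.2, 2.0]), bias=0.1, nu=4.0))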

Priors and Inference

Layer-wise Gaussian priors on weights and biases; layer-wise Gamma priors on noise precisions; layer-wise parameters tied via global hyperparameters. Markov chain Monte Carlo inference: parameter inference is easy given the network structure, and edges are added/removed using Metropolis-Hastings.
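
For intuition, a single Metropolis-Hastings edge move could look like the sketch below; log_joint is a hypothetical placeholder for the unnormalized log posterior (likelihood plus CIBP prior term), and the whole function is our own scaffold rather than the authors' sampler.

    import numpy as np

    rng = np.random.default_rng(0)

    def mh_toggle_edge(Z, i, k, log_joint):
        # Propose flipping the presence of edge (i, k); the proposal is
        # symmetric, so the acceptance ratio is just the posterior ratio.
        Z_prop = Z.copy()
        Z_prop[i, k] = 1 - Z_prop[i, k]
        log_accept = log_joint(Z_prop) - log_joint(Z)
        if np.log(rng.random()) < log_accept:
            return Z_prop
        return Z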

Outline: Background: finite single-layer networks. Infinite belief networks: learning the number of hidden units in each layer, learning the number of hidden layers, learning the type(s) of unit behavior. Experimental results.

Experimental Results

Olivetti Faces: 350+50 images of 40 faces; 64x64. Inferred: ~3 hidden layers; 70 units per hidden layer.
MNIST Digits: 50+10 images of 10 digits; 28x28. Inferred: ~3 hidden layers; 120, 100, 70 units.
Frey Faces: 1865+100 images of Brendan Frey; 20x28. Inferred: ~3 hidden layers; 260, 120, 35 units.

Olivetti: Reconstructions & Features [figure]

Olivetti: Fantasies & Activations [figure]

MNIST Digits [figure]

Frey Faces [figure]

Summary

Unified deep belief networks & Bayesian nonparametrics. Introduced the CIBP & proved convergence properties. Addressed 3 issues with deep belief networks: the number of units in each hidden layer, the number of hidden layers, and the type(s) of hidden unit behavior.

Thanks! Ryan Prescott Adams & Zoubin Ghahramani. Acknowledgements: Brendan Frey, David MacKay, Iain Murray, Radford Neal.