Learning Tractable Probabilistic Models Pedro Domingos


Learning Tractable Probabilistic Models Pedro Domingos Dept. Computer Science & Eng. University of Washington 1

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Other tractable models 2

The Hardest Part of Learning Is Inference Inference is a subroutine of: Learning undirected graphical models Learning discriminative graphical models Learning w/ incomplete data, latent variables Bayesian learning Deep learning Statistical relational learning Etc. 3

Goal: Large Joint Models Natural language Vision Social networks Activity recognition Bioinformatics Etc. 4

Example: Friends & Smokers

Inference Is the Bottleneck Inference is #P-complete It's tough to have #P as a subroutine Approximate inference and parameter optimization interact badly An intractable accurate model is in effect an inaccurate model What can we do about this? 6

One Solution: Learn Only Tractable Models Pro: Inference problem is solved Con: Insufficiently expressive Recent development: Expressive tractable models (theme of this tutorial) 7

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Other tractable models 8

Why Use Probabilistic Models? Correctly handle uncertainty and noise Learn with missing data Jointly infer multiple variables Do inference in any direction It's the standard Powerful, consistent set of techniques 9

Probabilistic Models Bayesian networks Markov networks Log-linear models Mixture models Logistic regression Hidden Markov models Cond. random fields Max. entropy models Probabilistic grammars Exponential family Markov random fields Gibbs distributions Boltzmann machines Deep architectures Markov logic Etc.

Markov Networks Undirected graphical models (example variables: Smoking, Cancer, Asthma, Cough) Potential functions defined over cliques:
P(x) = (1/Z) ∏_c Φ_c(x_c), with Z = Σ_x ∏_c Φ_c(x_c)
Example potential Φ(Smoking, Cancer): (False, False) 4.5; (False, True) 4.5; (True, False) 2.7; (True, True) 4.5

Log-Linear Models
P(x) = (1/Z) exp( Σ_i w_i f_i(x) ), where w_i is the weight of feature i and f_i is feature i.
Example: f_1(Smoking, Cancer) = 1 if ¬Smoking ∨ Cancer, 0 otherwise; w_1 = 0.51
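To make the last two slides concrete, here is a minimal Python sketch (mine, not from the tutorial) of the Smoking/Cancer model with the single feature ¬Smoking ∨ Cancer and weight 0.51; Z is computed by brute-force enumeration, which is exactly what becomes infeasible at scale:

```python
import itertools
import math

# One feature: f1(smoking, cancer) = 1 if (not smoking) or cancer, else 0
def f1(smoking, cancer):
    return 1.0 if (not smoking) or cancer else 0.0

w1 = 0.51  # weight from the slide; note exp(0.51) ~ 4.5 / 2.7 from the potential table

def unnormalized(smoking, cancer):
    # exp(sum_i w_i * f_i(x)) with a single feature
    return math.exp(w1 * f1(smoking, cancer))

# Partition function: sum over all states (exponential in general,
# which is exactly the problem the rest of the tutorial addresses).
Z = sum(unnormalized(s, c) for s, c in itertools.product([False, True], repeat=2))

def P(smoking, cancer):
    return unnormalized(smoking, cancer) / Z

print(P(True, False))   # the least likely state: a smoker without cancer
```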

Representation and Inference Bayesian networks, Markov networks, deep architectures. Example: Earthquake, Burglar, Alarm. Advantage: compact representation. Inference: P(Burglar | Alarm) = ? Need to sum out Earthquake. Inference cost is exponential in the treewidth of the graph. 13
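For concreteness, a brute-force version of that query; the CPT numbers below are invented for illustration (the slide gives none):

```python
import itertools

# Hypothetical CPTs for the Burglar-Earthquake-Alarm network (numbers made up).
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_A = {(True, True): 0.95, (True, False): 0.94,      # P(Alarm=1 | Burglar, Earthquake)
       (False, True): 0.29, (False, False): 0.001}

def joint(b, e, a):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return P_B[b] * P_E[e] * pa

# P(Burglar=1 | Alarm=1): sum out Earthquake, then normalize.
num = sum(joint(True, e, True) for e in (True, False))
den = sum(joint(b, e, True) for b, e in itertools.product((True, False), repeat=2))
print(num / den)
```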

Learning Graphical Models General idea: Empirical statistics = Predicted statistics Requires inference! Approximate inference is very unreliable No closed-form solution (except rare cases) Hidden variables No global optimum Result: Learning is very hard

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Other tractable models 15

Thin Junction Trees [Karger & Srebro, SODA-01; Bach & Jordan, NIPS-02; Narasimhan & Bilmes, UAI-04; Chechetka & Guestrin, NIPS-07] Junction tree: obtained by triangulating the Markov network Inference is exponential in the treewidth (one less than the size of the largest clique in the junction tree) Solution: Learn only low-treewidth models Problem: Too restricted (treewidth ≤ 3) 16

Very Large Mixture Models [Lowd & Domingos, ICML-05] Just learn a naive Bayes mixture model with lots of components (hundreds or more) Inference is linear in model size (no worse than scanning training set) Compared to Bayes net structure learning: Comparable data likelihood Better query likelihood Much faster & more reliable inference Problem: Curse of dimensionality 17
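A small sketch (mine, with made-up parameters) of why inference in such a mixture is linear in model size: P(x) = Σ_k π_k Π_j P(x_j | k) is a single pass over the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 300, 20                                  # components, binary variables

mix = rng.dirichlet(np.ones(K))                 # mixture weights pi_k
theta = rng.uniform(0.05, 0.95, size=(K, D))    # P(x_j = 1 | component k)

def log_prob(x):
    # log P(x) = logsumexp_k [ log pi_k + sum_j log P(x_j | k) ]
    # Cost is O(K * D): linear in the number of parameters.
    log_comp = np.log(mix) + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())

x = rng.integers(0, 2, size=D)
print(log_prob(x))
```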

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Other tractable models 18

Efficiently Summable Functions A function is efficiently summable iff its sum over any subset of its scope can be computed in time polynomial in the cardinality of the subset. 19

The Sum-Product Theorem If a function is: A sum of efficiently summable functions with the same scope, or A product of efficiently summable functions with disjoint scopes, Then it is also efficiently summable. 20
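A quick numerical check of the product case (my own toy functions): summing a product of functions with disjoint scopes factorizes into a product of small sums, so the cost is the sum of the parts rather than their product.

```python
import itertools

# Two functions with disjoint scopes: f over (x1, x2), g over (x3, x4).
f = lambda x1, x2: 1.0 + x1 + 2 * x2
g = lambda x3, x4: 2.0 + 3 * x3 * x4

# Brute force: sum the product over the joint scope -- 2^4 terms.
brute = sum(f(x1, x2) * g(x3, x4)
            for x1, x2, x3, x4 in itertools.product([0, 1], repeat=4))

# Factorized: (sum over f's scope) * (sum over g's scope) -- 2^2 + 2^2 terms.
factored = (sum(f(x1, x2) for x1, x2 in itertools.product([0, 1], repeat=2)) *
            sum(g(x3, x4) for x3, x4 in itertools.product([0, 1], repeat=2)))

assert abs(brute - factored) < 1e-9
print(brute, factored)
```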

Corollary Every low-treewidth distribution is efficiently summable, but not every efficiently summable distribution has low treewidth. 21

Compactly Representable Probability Distributions [Venn diagram: graphical models, sum-product models, and standard tractable models as subsets of the compactly representable distributions.]

Compactly Representable Probability Distributions [Same diagram, annotated: sum-product models allow linear-time exact inference.]

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Symmetric models Other tractable models 24

Arithmetic Circuits [Darwiche, JACM, 2003] Inference consists of sums and products Can be represented as an arithmetic circuit Complexity of inference = Size of circuit 25

Arithmetic Circuit Example distribution: P(X1=1, X2=1) = 0.4; P(X1=1, X2=0) = 0.2; P(X1=0, X2=1) = 0.1; P(X1=0, X2=0) = 0.3 [Figure: circuit over indicator leaves X1, ¬X1, X2, ¬X2 with parameters 0.4, 0.3, 0.2, 0.1] Rooted DAG of sums and products Leaves are indicator variables Computes marginals in linear time Graphical models can be compiled into ACs 26
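A minimal sketch of evaluating an AC for the distribution above; the circuit structure below is one possible circuit for that table, not necessarily the exact circuit in the figure:

```python
def evaluate(lam):
    # One possible AC for the table above:
    # P = lam_x1 * (0.4*lam_x2 + 0.2*lam_not_x2) + lam_not_x1 * (0.1*lam_x2 + 0.3*lam_not_x2)
    return (lam[("X1", 1)] * (0.4 * lam[("X2", 1)] + 0.2 * lam[("X2", 0)]) +
            lam[("X1", 0)] * (0.1 * lam[("X2", 1)] + 0.3 * lam[("X2", 0)]))

def indicators(evidence):
    # Indicators consistent with the evidence are 1, inconsistent ones are 0;
    # variables not mentioned keep both indicators at 1 (i.e. they are summed out).
    lam = {(v, b): 1.0 for v in ("X1", "X2") for b in (0, 1)}
    for var, val in evidence.items():
        lam[(var, 1 - val)] = 0.0
    return lam

print(evaluate(indicators({"X1": 1, "X2": 1})))  # joint P(X1=1, X2=1) = 0.4
print(evaluate(indicators({"X1": 1})))           # marginal P(X1=1) = 0.6
print(evaluate(indicators({})))                  # sums to 1.0
```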

Learning Bounded-Inference Graphical Models [Lowd & D., UAI-08] Use standard Bayes net structure learner (with context-specific independence) Key idea: Instead of using representation complexity as the regularizer, score(M, T) = log P(T | M) − k_p · n_p(M) (log-likelihood minus #parameters), use inference complexity: score(M, T) = log P(T | M) − k_c · n_c(M) (log-likelihood minus circuit size) 27
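A hedged sketch of the swap this slide describes; the constants and the circuit-size counters are placeholders, not the UAI-08 implementation:

```python
def score_parameter_penalty(log_likelihood, num_parameters, k_p=0.1):
    # Standard regularizer: penalize representation complexity.
    return log_likelihood - k_p * num_parameters

def score_circuit_penalty(log_likelihood, circuit_size, k_c=0.1):
    # Key idea of the slide: penalize inference complexity instead,
    # i.e. the size of the compiled arithmetic circuit.
    return log_likelihood - k_c * circuit_size

def accept_split(delta_log_likelihood, delta_circuit_size, k_c=0.1):
    # During greedy structure search, keep a candidate split only if it
    # improves the circuit-penalized score.
    return delta_log_likelihood - k_c * delta_circuit_size > 0
```

The design point is that circuit size, unlike parameter count, directly bounds the cost of answering queries in the learned model.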

Learning Bounded-Inference Graphical Models (contd.) Incrementally compile the circuit as structure is added (splits in decision trees) Compared to Bayes nets w/ Gibbs sampling: Comparable data likelihood Better query likelihood Much faster & more reliable inference Large treewidth (tens to hundreds) 28

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Symmetric models Other tractable models 29

Feature Trees [Gogate, Webb & D., NIPS-10] Thin junction tree learners work by repeatedly finding a subset of variables A such that P(B, C | A) ≈ P(B | A) P(C | A), where A, B, C is a partition of the variables LEM algorithm: Instead find a feature F s.t. P(B, C | F) ≈ P(B | F) P(C | F) and recurse on variables and instances Result is a tree of features 30
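To illustrate the test at the heart of this recursion, here is a toy check (my own data and helper, not the NIPS-10 code) that a candidate feature F renders two variable blocks approximately independent, via conditional mutual information:

```python
import numpy as np

def cond_mutual_info(data, B, C, F):
    """Empirical mutual information between variable blocks B and C,
    restricted to the instances where feature F holds."""
    rows = data[np.array([F(x) for x in data])]
    n = len(rows)
    def key(x, idx):                      # joint configuration of a block of variables
        return tuple(x[i] for i in idx)
    pB, pC, pBC = {}, {}, {}
    for x in rows:
        pB[key(x, B)] = pB.get(key(x, B), 0) + 1 / n
        pC[key(x, C)] = pC.get(key(x, C), 0) + 1 / n
        pBC[(key(x, B), key(x, C))] = pBC.get((key(x, B), key(x, C)), 0) + 1 / n
    return sum(p * np.log(p / (pB[b] * pC[c])) for (b, c), p in pBC.items())

# Toy data: x2 and x3 are dependent in general but independent given x1 = 1.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 5000)
x2 = np.where(x1 == 1, rng.integers(0, 2, 5000), x1)
x3 = np.where(x1 == 1, rng.integers(0, 2, 5000), x2)
data = np.stack([x1, x2, x3], axis=1)

F = lambda x: x[0] == 1                   # candidate feature: "x1 is true"
print(cond_mutual_info(data, B=[1], C=[2], F=F))               # ~0: split OK under F
print(cond_mutual_info(data, B=[1], C=[2], F=lambda x: True))  # clearly > 0
```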

A Feature Tree [Figure: an example feature tree; the root feature over x1, x2 branches on 0/1 into child features over x3, x4, x5, x6.] 31

Feature Trees (contd.) High treewidth, yet tractable thanks to context-specific independence More flexible than decision tree CPDs PAC-learning guarantees Outperforms thin junction trees and other algorithms for learning Markov networks More generally: feature graphs 32

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Symmetric models Other tractable models 33

A Univariate Distribution Is an SPN X Multinomial Gaussian Poisson...

A Product of SPNs over Disjoint Variables Is an SPN X Y

A Weighted Sum of SPNs over the Same Variables Is an SPN (weights w1, w2) Sums out a mixture variable [Figure: two SPNs over X and Y combined under a weighted sum.]

Recurse Freely... [Figure: a deep SPN built by recursively composing sums and products over indicator leaves for X1 ... X6 and their negations.]

All Marginals Are Computable in Linear Time [Figure: worked example over X and Y; for each evidence variable the indicator matching the observed value is set to 1 and the other to 0, both indicators of marginalized variables are set to 1, and a single bottom-up pass computes the marginal (0.26 in the example).]
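A runnable sketch of this mechanism on a toy SPN of my own (a mixture of two factorized distributions over binary X, Y), not the tutorial's example:

```python
class Leaf:                      # indicator leaf for "var == value"
    def __init__(self, var, value): self.var, self.value = var, value
    def value_at(self, ind): return ind[(self.var, self.value)]

class Sum:
    def __init__(self, weights, children): self.weights, self.children = weights, children
    def value_at(self, ind):
        return sum(w * c.value_at(ind) for w, c in zip(self.weights, self.children))

class Product:
    def __init__(self, children): self.children = children
    def value_at(self, ind):
        v = 1.0
        for c in self.children: v *= c.value_at(ind)
        return v

def indicators(evidence, variables=("X", "Y")):
    # Evidence variables get a 0/1 indicator pair; marginalized variables get (1, 1).
    ind = {(v, b): 1.0 for v in variables for b in (0, 1)}
    for var, val in evidence.items():
        ind[(var, 1 - val)] = 0.0
    return ind

# A tiny SPN over binary X, Y: a weighted sum of two fully factorized distributions.
x1, x0, y1, y0 = Leaf("X", 1), Leaf("X", 0), Leaf("Y", 1), Leaf("Y", 0)
comp1 = Product([Sum([0.6, 0.4], [x1, x0]), Sum([0.3, 0.7], [y1, y0])])
comp2 = Product([Sum([0.2, 0.8], [x1, x0]), Sum([0.9, 0.1], [y1, y0])])
root  = Sum([0.7, 0.3], [comp1, comp2])

print(root.value_at(indicators({"X": 1, "Y": 1})))  # joint P(X=1, Y=1)
print(root.value_at(indicators({"X": 1})))          # marginal P(X=1), Y summed out
print(root.value_at(indicators({})))                # sums to 1.0
```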

All MAP States Are Computable in Linear Time [Figure: same example, with sum nodes replaced by max nodes; a bottom-up pass computes the MAP value (0.12 in the example), and a top-down pass through the maximizing children recovers the MAP state.]
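The same toy SPN, written as nested tuples, with sum nodes evaluated as max nodes; an upward pass gives the MAP value and a downward pass through the maximizing children recovers the MAP state (a sketch of mine, not the tutorial's code):

```python
# Tiny SPN over binary X, Y as nested tuples; sum nodes become max nodes.
x1, x0, y1, y0 = ("leaf", "X", 1), ("leaf", "X", 0), ("leaf", "Y", 1), ("leaf", "Y", 0)
spn = ("max", [0.7, 0.3],
       [("prod", [("max", [0.6, 0.4], [x1, x0]), ("max", [0.3, 0.7], [y1, y0])]),
        ("prod", [("max", [0.2, 0.8], [x1, x0]), ("max", [0.9, 0.1], [y1, y0])])])

def indicators(evidence, variables=("X", "Y")):
    ind = {(v, b): 1.0 for v in variables for b in (0, 1)}
    for var, val in evidence.items():
        ind[(var, 1 - val)] = 0.0
    return ind

def max_value(node, ind):
    if node[0] == "leaf":
        return ind[(node[1], node[2])]
    if node[0] == "max":                          # upward pass: max of weighted children
        return max(w * max_value(c, ind) for w, c in zip(node[1], node[2]))
    out = 1.0                                     # product node
    for c in node[1]:
        out *= max_value(c, ind)
    return out

def map_state(node, ind, state):
    if node[0] == "leaf":
        if ind[(node[1], node[2])] > 0:
            state[node[1]] = node[2]              # indicator chosen for this variable
    elif node[0] == "max":                        # downward pass: follow the best child only
        best = max(zip(node[1], node[2]), key=lambda wc: wc[0] * max_value(wc[1], ind))
        map_state(best[1], ind, state)
    else:
        for c in node[1]:
            map_state(c, ind, state)
    return state                                  # (recomputes values for brevity; cache in practice)

ind = indicators({"Y": 1})                        # evidence Y = 1; find the most likely X
print(max_value(spn, ind))                        # MAP value
print(map_state(spn, ind, {}))                    # MAP state for X given Y = 1
```

As on the slide, "MAP" here treats the mixture variables summed out by the sum nodes as latent and maximizes over them along with the unobserved query variables.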

What Does an SPN Mean? Products = Features Sums = Clusters

Special Cases of SPNs Hierachical mixture models Thin junction trees (e.g.: hidden Markov models) Non-recursive probabilistic context-free grammars Etc.

Discriminative SPNs [Gens & D., NIPS-12; Best Student Paper Award] [Figure: an SPN over query variables Y1, Y2 and hidden variables, with evidence entering through non-negative feature functions f1(X), f2(X) at the leaves.]

Discriminative Training [Figure: the gradient compares the network evaluated with the correct label clamped against the network evaluated with the query variables free (the model's best guess); both passes are tractable.]

Backpropagation [Figure: derivatives flow down the SPN; for each child j of a sum node i, ∂S/∂S_j += w_ij ∂S/∂S_i.]

Backpropagation [Figure: for each child j of a product node i, ∂S/∂S_j += (∂S/∂S_i) ∏_{k≠j} S_k.]
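A sketch of these two derivative rules on the toy mixture SPN (mine, not the NIPS-12 code); a real implementation processes nodes in reverse topological order over a DAG, but a tree suffices here:

```python
class Node:
    def __init__(self, kind, children=(), weights=(), var=None, val=None):
        self.kind, self.children, self.weights = kind, list(children), list(weights)
        self.var, self.val = var, val
        self.value, self.deriv = 0.0, 0.0

def leaf(var, val): return Node("leaf", var=var, val=val)

def upward(node, ind):
    if node.kind == "leaf":
        node.value = ind[(node.var, node.val)]
    elif node.kind == "sum":
        for c in node.children: upward(c, ind)
        node.value = sum(w * c.value for w, c in zip(node.weights, node.children))
    else:                                           # product
        for c in node.children: upward(c, ind)
        node.value = 1.0
        for c in node.children: node.value *= c.value
    return node.value

def downward(root):
    root.deriv = 1.0
    order, stack = [], [root]                       # parents before children (tree)
    while stack:
        n = stack.pop()
        order.append(n)
        stack.extend(n.children)
    for n in order:
        if n.kind == "sum":                         # sum parent: dS/dS_j += w_ij * dS/dS_i
            for w, c in zip(n.weights, n.children):
                c.deriv += w * n.deriv
        elif n.kind == "prod":                      # product parent: times product of siblings
            for j, c in enumerate(n.children):
                sib = 1.0
                for k, other in enumerate(n.children):
                    if k != j: sib *= other.value
                c.deriv += n.deriv * sib

comp1 = Node("prod", children=[Node("sum", [leaf("X", 1), leaf("X", 0)], [0.6, 0.4]),
                               Node("sum", [leaf("Y", 1), leaf("Y", 0)], [0.3, 0.7])])
comp2 = Node("prod", children=[Node("sum", [leaf("X", 1), leaf("X", 0)], [0.2, 0.8]),
                               Node("sum", [leaf("Y", 1), leaf("Y", 0)], [0.9, 0.1])])
root = Node("sum", [comp1, comp2], [0.7, 0.3])

ind = {("X", 1): 1.0, ("X", 0): 0.0, ("Y", 1): 1.0, ("Y", 0): 0.0}  # full evidence
print(upward(root, ind))                            # S(x) = P(X=1, Y=1)
downward(root)
print(comp1.deriv, comp2.deriv)                     # equal the root weights 0.7 and 0.3
```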

Problem: Gradient Diffusion

Solution: Hard Inference Replace soft inference (marginals) with hard inference (MAP states)

Hard Gradient [Figure: the hard gradient is the difference between the max-product evaluation with the correct label and with the model's guess, over features f1(X), f2(X) and query variables Y1, Y2.]

Hard Gradient [Figure: each weight's hard gradient is the number of times its edge is used in the MAP state with the correct label minus the number with the model's guess.]

Empirical Evaluation: Object Recognition CIFAR-10 32x32 pixels 50k training exs. 10k test exs. STL-10 96x96 pixels 5k training exs. 8k test exs. 100k unlabeled exs.

Feature Extraction 6x6 patches → K-means (K centroids) → triangle encoding → max-pooling: 32x32 image → 27x27xK encoding → GxGxK pooled features [Coates et al., AISTATS 2011]
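A sketch of this pipeline: the triangle encoding follows Coates et al. (2011), f_k = max(0, mean(z) − z_k) with z_k the distance to centroid k; the image, the centroids, and the pooling grid size below are made up, and the patch sizes follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
K, patch = 50, 6
image = rng.random((32, 32))                        # toy grayscale image
centroids = rng.random((K, patch * patch))          # stand-in for learned K-means centroids

def triangle_encode(p):
    # Coates et al. (2011): f_k = max(0, mean_k(z) - z_k), z_k = ||p - c_k||
    z = np.linalg.norm(centroids - p, axis=1)
    return np.maximum(0.0, z.mean() - z)

# Dense 6x6 patches of a 32x32 image -> 27x27xK feature maps.
feats = np.zeros((27, 27, K))
for i in range(27):
    for j in range(27):
        feats[i, j] = triangle_encode(image[i:i+patch, j:j+patch].reshape(-1))

# Max-pool into a GxG grid of K-dimensional features (G = 3 here).
G = 3
pooled = np.zeros((G, G, K))
step = 27 // G
for a in range(G):
    for b in range(G):
        pooled[a, b] = feats[a*step:(a+1)*step, b*step:(b+1)*step].max(axis=(0, 1))

print(pooled.shape)   # (G, G, K)
```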

Architecture [Figure: SPN architecture over the GxGxK features — a sum over classes, products over parts, sums mixing part locations, down to WxWxK input windows.]

CIFAR-10 Results [Figure: comparison of the SPN against pooling + SVM, autoencoder, and RBM baselines on 4x4xK features.]

STL-10 Results without unlabeled data

Generative Weight Learning [Poon & D., UAI-11; Best Paper Award] Model joint distribution of all variables Algorithm: Online hard EM Sum node maintains counts for each child For each example Find MAP instantiation with current weights Increment count for each chosen child Renormalize to set new weights Repeat until convergence 56
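A sketch of this loop on the simplest possible SPN, a single sum node over fully factorized components (all data and initializations below are made up); the Bernoulli-leaf updates are the same counting rule applied to the leaves viewed as sums over their two indicators.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 2000, 8, 3
true_theta = rng.uniform(0.1, 0.9, size=(K, D))
z = rng.integers(0, K, N)
data = (rng.random((N, D)) < true_theta[z]).astype(int)   # toy binary data

# SPN = one sum node over K product nodes, each a product of Bernoulli leaves.
weights = np.full(K, 1.0 / K)
theta = rng.uniform(0.3, 0.7, size=(K, D))
counts_w = np.ones(K)                                     # smoothed counts at the sum node
counts_x1 = np.ones((K, D)); counts_n = np.full((K, D), 2.0)

for epoch in range(5):
    for x in data:
        # MAP instantiation of the (single) mixture variable under current weights:
        log_p = np.log(weights) + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
        k = int(np.argmax(log_p))
        # Increment the count of the chosen child and renormalize:
        counts_w[k] += 1
        counts_x1[k] += x; counts_n[k] += 1
        weights = counts_w / counts_w.sum()
        theta = counts_x1 / counts_n

print(weights)
```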

Empirical Evaluation: Image Completion Datasets: Caltech-101 and Olivetti Compared with DBNs, DBMs, PCA and NN SPNs reduce MSE by ~1/3 Orders of magnitude faster than DBNs, DBMs

Architecture [Figure: hierarchical decomposition of the whole image into regions, recursively down to individual pixels (x, y).] 58

Example Completions [Figure: original images and completions by SPN, DBM, DBN, PCA, and nearest neighbor.] 59

Weight Learning: Summary [Table: update rules (generative EM, generative gradient, discriminative gradient) crossed with inference type (soft inference / marginals vs. hard inference / MAP states).]

Structure Learning [Gens & D., ICML-13; no best paper award]

Empirical Evaluation 20 varied real-world datasets 10s-1000s of variables 1000s-100,000s of samples Compared with state-of-the-art Bayesian network and Markov random field learners Likelihood: typically comparable Query accuracy: much higher Inference: orders of magnitude faster

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Symmetric models Other tractable models 63

Tractable Markov Logic [D. & Webb, AAAI-12] Tractable representation for statistical relational learning Three types of weighted rules and facts Subclass: Is(Family,SocialUnit) Is(Smiths,Family) Subpart: Has(Family,Adult,2) Has(Smiths,Anna,Adult1) Relation: Parent(Family,Adult,Child) Married(Anna,Bob) 64

Restrictions One top class One top object (all others are subparts) Relations must be among subparts of some object Subclasses are mutually exclusive Objects do not share subparts

TML Semantics Sub-partition function Z(X, C) for object X and class C:
Z(X, C) = [ Σ_S e^{w_S} Z(X, S) ] × [ Π_P Z(P(X), C_P)^{n_P} ] × [ Π_R (1 + e^{w_R}) ]
(sum over subclasses S, product over subparts P with multiplicities n_P, product over relations R)
Z(KB) = Z(TopObject, TopClass)
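A direct transcription of this recursion into code on a made-up toy knowledge base; with no evidence, Z depends only on the class, so the object argument is dropped in this sketch, and a class with no subclasses contributes a factor of 1.

```python
import math

# Toy TML knowledge base (made up): classes with weighted subclasses,
# subparts (with multiplicities), and weighted relations.
kb = {
    "Family": {"subclasses": {"NuclearFamily": 0.7, "ExtendedFamily": 0.2},
               "subparts": [("Adult", "Person", 2), ("Child", "Person", 1)],
               "relations": {"Married": 1.1}},
    "NuclearFamily": {"subclasses": {}, "subparts": [], "relations": {}},
    "ExtendedFamily": {"subclasses": {}, "subparts": [], "relations": {}},
    "Person": {"subclasses": {"Smoker": 0.5, "NonSmoker": 0.0},
               "subparts": [], "relations": {}},
    "Smoker": {"subclasses": {}, "subparts": [], "relations": {}},
    "NonSmoker": {"subclasses": {}, "subparts": [], "relations": {}},
}

def Z(cls):
    c = kb[cls]
    # Sum over subclasses (1 if the class has none)...
    z = sum(math.exp(w) * Z(s) for s, w in c["subclasses"].items()) or 1.0
    # ...times a product over subparts, each raised to its multiplicity...
    for _, part_cls, n in c["subparts"]:
        z *= Z(part_cls) ** n
    # ...times (1 + e^w) for each possible relation.
    for w in c["relations"].values():
        z *= 1.0 + math.exp(w)
    return z

print(Z("Family"))   # Z(KB) = Z of the top class
```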

Tractability Theorem: The partition function of every TML knowledge base can be computed in time and space polynomial in the size of the knowledge base. Time = Space = O(#rules × #objects) 67

Why TML Is Tractable KB structure is isomorphic to Z computation: Parts = Products Classes = Sums

Expressiveness The following can be compactly represented in TML: Junction trees Sum-product networks Probabilistic context-free grammars Probabilistic inheritance hierarchies Etc. 71

Learning Tractable MLNs Alternate between: Dividing / aggregating the domain into subparts Inducing class hierarchies over similar subparts 72

Other Sum-Product Models Relational sum-product networks Tractable probabilistic knowledge bases Tractable probabilistic programs Etc. 73

What If This Is Not Enough? Use variational inference, with the most expressive tractable representation available as the approximating family [Lowd & D., NIPS-10] 74

Outline Motivation Probabilistic models Standard tractable models The sum-product theorem Bounded-inference graphical models Feature trees Sum-product networks Tractable Markov logic Other tractable models 75

Other Tractable Models Symmetry Liftable models Exchangeable models Submodularity Determinantal point processes Etc. Several papers at ICML-14 Workshop on Thursday 76

Summary Intractable inference is the bane of learning Tractable models avoid it Standard ones are too limited We have powerful new tractable classes Sum-product theorem Symmetry Etc. 77

Summary [Figure: tractability vs. expressiveness, with graphical models marked.]

Summary [Same figure: the new tractable classes push the tractability-expressiveness frontier outward relative to graphical models.]