Inference Complexity as Learning Bias. Don't use model complexity as your learning bias; use inference complexity.

Transcription:

Inference Complexity as Learning Bias. Daniel Lowd, Dept. of Computer and Information Science, University of Oregon. Don't use model complexity as your learning bias; use inference complexity. Joint work with Pedro Domingos.

The Goal. Data -> Learning algorithms -> Model (a small Bayesian network over A, B, C) -> Inference algorithms -> Answers! The actual task. This talk: how to learn accurate and efficient models by tightly integrating learning and inference. Experiments: exact inference in < 100 ms in models with treewidth > 100.

Outline. Standard solutions (and why they fail). Background: learning with Bayesian networks; inference with arithmetic circuits. Learning arithmetic circuits: scoring, search, efficiency. Experiments. Conclusion.

Solution 1: Exact inference. Data -> Model (A, B, C) -> Jointree -> Answers! SLOW.

Solution 2: Approximate inference. Data -> Model (A, B, C) -> Approximation -> Answers! SLOW. Approximations are often too inaccurate; more accurate algorithms tend to be slower.

Solution 3: Learn a tractable model. Related work: thin junction trees [e.g., Chechetka & Guestrin, 2007]. Data -> Jointree -> Answers! Learning is polynomial in the data, but still exponential in treewidth. Thin junction trees are thin (the slide contrasts their junction trees with ours): their maximum effective treewidth is 2-5, while we have learned models with treewidth > 100.
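To make the treewidth argument concrete, here is a quick back-of-the-envelope sketch (mine, not the talk's): jointree inference must materialize a clique table of roughly d^(treewidth+1) entries, which is why thin junction trees cap treewidth at 2-5 while treewidth above 100 is far out of reach for an explicit jointree.

```python
# Back-of-the-envelope illustration (not from the talk) of why jointree
# inference breaks down: the largest clique in a junction tree has
# treewidth + 1 variables, so its table holds d ** (w + 1) entries for
# d-valued variables.

def largest_clique_entries(treewidth: int, domain_size: int = 2) -> float:
    """Number of entries in the biggest clique table of a jointree."""
    return float(domain_size) ** (treewidth + 1)

for w in [5, 20, 35, 100]:
    print(f"treewidth {w:>3}: ~{largest_clique_entries(w):.1e} table entries")

# treewidth   5: ~6.4e+01   (thin junction trees live here)
# treewidth  35: ~6.9e+10
# treewidth 100: ~2.5e+30   (hopeless for an explicit jointree, yet reachable
#                            when circuit size, not treewidth, is what is bounded)
```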

Solution 3: Learn a tractable model. Our work: arithmetic circuits with a penalty on circuit size. Data -> Circuit (sums and products) / Model (A, B, C) -> Answers!

Outline. Standard solutions (and why they fail). Background: learning with Bayesian networks; inference with arithmetic circuits. Learning arithmetic circuits: overview, optimizations. Experiments. Conclusion.

Bayesian networks. Problem: compactly represent a probability distribution over many variables. Solution: conditional independence. For the network A -> B, A -> C, B -> D, C -> D: P(A,B,C,D) = P(A) P(B|A) P(C|A) P(D|B,C).

With decision-tree CPDs. Problem: the number of parameters is exponential in the maximum number of parents. Solution: context-specific independence. The slide's example represents P(D|B,C) as a decision tree: a split on B leads to the leaf 0.2 on one branch, and on the other branch a split on C leads to the leaves 0.5 and 0.7.
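As a small illustration of context-specific independence, here is a sketch I am adding (the branch-to-value assignment is an assumption; the slide only shows the leaf values 0.2, 0.5, 0.7): the decision tree for P(D | B, C) needs only three parameters instead of the four a full table would require.

```python
# Sketch of the slide's decision-tree CPD for P(D=1 | B, C).
# Which branch carries which leaf value is an assumption; the leaf
# probabilities 0.2, 0.5, 0.7 are the ones shown on the slide.

def p_d_given_bc(b: int, c: int) -> float:
    """Context-specific independence: once B = 1, D ignores C entirely."""
    if b == 1:
        return 0.2                   # leaf reached without ever testing C
    return 0.5 if c == 1 else 0.7    # the B = 0 branch splits on C

# 3 leaf parameters vs. 4 rows in a full tabular CPD; the saving grows
# exponentially with the number of parents when such structure exists.
print(p_d_given_bc(1, 0), p_d_given_bc(1, 1), p_d_given_bc(0, 1), p_d_given_bc(0, 0))
```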

Arithmetic circuits. Problem: inference is exponential in treewidth. Solution: compile to arithmetic circuits. (The slide shows a small Bayesian network compiled into a circuit of sum and product nodes over indicator and parameter leaves for A and B.) An arithmetic circuit is a directed, acyclic graph with a single root; leaf nodes are inputs; interior nodes are additions or multiplications. It can represent any distribution, inference is linear in model size, it is never larger than the junction tree, and it can exploit local structure to save time and space.

ACs for inference. Bayesian network: P(A,B,C) = P(A) P(B) P(C|A,B). Network polynomial: the sum over all joint states of the product of indicator variables and parameters, i.e. the sum over a, b, c of lambda_a lambda_b lambda_c theta_a theta_b theta_{c|a,b}. Arbitrary marginal queries can be computed by evaluating the network polynomial. Arithmetic circuits (ACs) offer efficient, factored representations of this polynomial, and can take advantage of local structure such as context-specific independence.

BN structure learning [Chickering et al., 1996]. Start with an empty network; greedily add splits to the decision trees one at a time, enforcing acyclicity. score(C, T) = log P(T | C) - k_p n_p(C) (accuracy - number of parameters).
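To show what "evaluate the network polynomial" means for the slide's P(A,B,C) = P(A) P(B) P(C|A,B), here is a brute-force sketch with made-up parameter values; an arithmetic circuit is a compact, factored form of exactly this sum, so it computes the same quantities in time linear in circuit size rather than by exhaustive enumeration.

```python
# Brute-force evaluation of the network polynomial for
# P(A,B,C) = P(A) P(B) P(C|A,B) with binary variables (toy parameters).
# Indicators are 1 for states consistent with the evidence, 0 otherwise,
# so the polynomial's value is the probability of the evidence.
from itertools import product

pA = {0: 0.6, 1: 0.4}
pB = {0: 0.7, 1: 0.3}
pC = {(a, b): {1: 0.2 + 0.3 * a + 0.2 * b, 0: 0.8 - 0.3 * a - 0.2 * b}
      for a, b in product([0, 1], repeat=2)}

def network_polynomial(evidence):
    total = 0.0
    for a, b, c in product([0, 1], repeat=3):
        consistent = all(evidence.get(var, val) == val
                         for var, val in (("A", a), ("B", b), ("C", c)))
        if consistent:                       # indicator product equals 1
            total += pA[a] * pB[b] * pC[(a, b)][c]
    return total

print(network_polynomial({}))          # 1.0: everything summed out
print(network_polynomial({"C": 1}))    # marginal P(C=1)
print(network_polynomial({"A": 1, "C": 1}) / network_polynomial({"A": 1}))  # P(C=1 | A=1)
```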

Key idea. For an arithmetic circuit C on an i.i.d. training sample T, the typical cost function is score(C, T) = log P(T | C) - k_p n_p(C) (accuracy - number of parameters). Our cost function is score(C, T) = log P(T | C) - k_p n_p(C) - k_e n_e(C) (accuracy - number of parameters - circuit size).

Basic algorithm. Following Chickering et al. (1996), we induce our statistical models by greedily selecting splits for the decision-tree CPDs. Our approach has two key differences: 1. we optimize a different objective function, and 2. we return a Bayesian network that has already been compiled into a circuit.

Efficiency. Compiling each candidate circuit C from scratch at each step is too expensive. Instead, we incrementally modify C as we add splits. (The slide shows an example circuit before a split.)
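The scoring change is small enough to state directly in code. Below is my paraphrase, not the authors' implementation: circuit.prob, num_parameters, num_edges, and the penalty weights are hypothetical stand-ins.

```python
import math

# Paraphrase of the talk's cost functions (hypothetical circuit interface):
#   standard: score(C, T) = log P(T | C) - k_p * n_p(C)
#   proposed: score(C, T) = log P(T | C) - k_p * n_p(C) - k_e * n_e(C)
# The only new ingredient is the edge-count term, which charges a candidate
# structure for the exact-inference cost it would impose.

def log_likelihood(circuit, data):
    """Assumes circuit.prob(example) returns P(example) under the model."""
    return sum(math.log(circuit.prob(x)) for x in data)

def standard_score(circuit, data, k_p=1.0):
    return log_likelihood(circuit, data) - k_p * circuit.num_parameters()

def inference_aware_score(circuit, data, k_p=1.0, k_e=0.1):
    return (standard_score(circuit, data, k_p)
            - k_e * circuit.num_edges())   # circuit size = inference complexity
```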

(The slide shows the example circuit after the split.)

Algorithm. Create the initial product-of-marginals circuit and the initial split list. Until convergence: for each split in the list, apply the split to the circuit, score the result, and undo the split; then apply the highest-scoring split to the circuit, add its new child splits to the list, and remove inconsistent splits from the list.

How to split a circuit. Let D be the parameter nodes to be split, V the indicators for the splitting variable, and M the first mutual ancestors of D and V. For each indicator in V, copy all nodes between M and D or V, conditioned on that indicator. For each m in M, replace the children of m that are ancestors of D or V with a sum over the copies of those ancestors, each copy multiplied by the indicator it was conditioned on.

Optimizations. We avoid rescoring splits every iteration by: 1. noting that a split's likelihood gain never changes, only the number of edges it adds; 2. evaluating splits with higher likelihood gain first, since likelihood gain is an upper bound on the score; 3. re-evaluating the number of edges added only when another split may have affected it (AC-Greedy); 4. assuming the number of edges added by a split only increases as the algorithm progresses (AC-Quick).
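The algorithm fits in a short loop. The sketch below is schematic: apply_split, undo_split, child_splits, and is_consistent are hypothetical callbacks standing in for the incremental circuit surgery described above, and this naive version rescores every split each round, ignoring the AC-Greedy / AC-Quick optimizations.

```python
# Schematic version of the greedy search loop (helper callbacks are
# hypothetical placeholders for the circuit operations described above).
# Note: this naive version rescores every split each iteration; the
# AC-Greedy / AC-Quick optimizations exist to avoid exactly that.

def learn_circuit(data, circuit, splits, score,
                  apply_split, undo_split, child_splits, is_consistent):
    splits = list(splits)
    best_score = score(circuit, data)
    while True:
        best_split, best_gain = None, 0.0
        for split in splits:
            apply_split(circuit, split)        # tentatively modify the circuit
            gain = score(circuit, data) - best_score
            undo_split(circuit, split)         # restore it
            if gain > best_gain:
                best_split, best_gain = split, gain
        if best_split is None:                 # convergence: no split helps
            return circuit
        apply_split(circuit, best_split)       # commit the winner
        best_score += best_gain
        splits.remove(best_split)
        splits.extend(child_splits(best_split))                 # add child splits
        splits = [s for s in splits if is_consistent(s, circuit)]
```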

Experiments. Datasets: KDD-Cup 2000 (65 variables), Anonymous MSWeb (294 variables), EachMovie subset (500 variables). Algorithms: WinMine toolkit + Gibbs sampling, and AC + exact inference. Queries: generated from the test data with varying numbers of evidence and query variables. (Slides show plots of learning time, inference time, and accuracy on EachMovie.)

(Slides show accuracy plots for MSWeb and KDD Cup.)

Our method learns complex, accurate models.

Treewidth of learned models (note: 2^35 = 34 billion):
            AC-Greedy  AC-Quick  WinMine
EachMovie      35         54       281
KDD Cup        38         38        53
MSWeb         114        127       118

Log-likelihood of test data:
            AC-Greedy  AC-Quick  WinMine
EachMovie    -55.7      -54.9     -53.7
KDD Cup      -2.16      -2.16     -2.16
MSWeb        -9.85      -9.85     -9.69

Conclusion. Problem: learning accurate intractable models = learning inaccurate models. Solution: use inference complexity as the learning bias. Algorithm: learn arithmetic circuits with a penalty on circuit size. Result: much faster and more accurate inference than standard Bayes net learning.