BAYESIAN NETWORKS STRUCTURE LEARNING

BAYESIAN NETWORKS STRUCTURE LEARNING Xiannian Fan Uncertainty Reasoning Lab (URL) Department of Computer Science Queens College/City University of New York http://url.cs.qc.cuny.edu 1/52

Overview: Bayesian networks; Bayesian network structure learning. 2/52

Understanding uncertain relations Many research problems involve understanding the uncertain relations between a set of random variables. Ignorance: limited knowledge. Laziness: details ignored. 3/52

Bayesian networks A Bayesian network [Pearl 1988] is a directed acyclic graph (DAG) consisting of two parts: the qualitative part, encoding a domain's variables (nodes) and the probabilistic (usually causal) influences among them (arcs); and the quantitative part, encoding a joint probability distribution over these variables. 4/52
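
Concretely, the quantitative part factorizes the joint distribution according to the graph: P(X1, ..., Xn) = ∏_{i=1..n} P(Xi | PAi), where PAi denotes the parents of Xi in the DAG.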

Bayesian networks: Numerical parameters Prior probability distributions for nodes without predecessors (obesity, alcoholism, ...); conditional probability distributions for nodes with predecessors (fatigue, jaundice, ...). 5/52

Inference in Bayesian networks Inference: compute the probability of a hypothesis (e.g., a diagnosis) given new evidence (e.g., medical findings, test results). Belief updating: P(Chronic hepatitis | Jaundice, Ascites)? Which disease is most likely? Maximum a posteriori assignment (MAP): what is the most likely joint state of the diseases? 6/52

Why learn Bayesian networks? They provide compact and intuitive graphical representations of the uncertain relations between random variables, and offer well-understood principles for solving many challenging tasks, e.g., incorporating prior knowledge with observational data, handling missing data, and knowledge discovery from data. 7/52

Applications of Bayesian networks Computer vision, natural language processing, computer security, bioinformatics, data mining, user modeling, robotics, medical and machine diagnosis/prognosis, information retrieval, planning, fraud detection. 8/52

Learning Bayesian networks Very often we have data sets, and we can extract knowledge from these data: both the network structure and the numerical parameters. 9/52

Why struggle for accurate structure? Missing an arc: cannot be compensated for by fitting parameters, and implies wrong assumptions about the domain structure. Adding an arc: increases the number of parameters to be estimated, and likewise implies wrong assumptions about the domain structure. 10/52

Major learning approaches Score-based structure learning: find the highest-scoring network structure. Constraint-based structure learning: find a network that best explains the dependencies and independencies in the data. Hybrid approaches: integrate constraint-based and score-based structure learning. 11/52

Score-based learning Define a scoring function that evaluates how well a structure matches the data, then search for a structure that optimizes that scoring function. 12/52

Scoring functions Minimum Description Length (MDL), Bayesian Dirichlet family (BD), factorized Normalized Maximum Likelihood (fNML). MDL consistently outperforms other scoring functions in recovering the underlying Bayesian network structures. [Liu, Malone, Yuan, BMC-2012] 13/52

Minimum Description Length (MDL) MDL views learning as data compression. The MDL score consists of two components: model encoding, and data encoding using the model. We search for a structure that optimizes the score. 14/52
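
In one common form (the form used in the papers cited here, assuming N data points and ri states for variable Xi), the MDL score decomposes over variables: MDL(G) = Σ_i MDL(Xi | PAi), with MDL(Xi | PAi) = Ĥ(Xi | PAi) + (log₂ N / 2) · K(Xi | PAi), where Ĥ(Xi | PAi) is the negative log-likelihood of the data for Xi given its parents and K(Xi | PAi) = (ri − 1) · ∏_{Xj ∈ PAi} rj counts the free parameters; lower scores are better.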

Decomposability All the scoring functions we mentioned (MDL, BDeu, fNML) are decomposable: each can be expressed as a sum of local scores over the individual variables. This property is called decomposability and will be quite important for structure learning. [Heckerman 1995, etc.] 15/52

Score-based learning Define a scoring function that evaluates how well a structure matches the data, then search for a structure that optimizes that scoring function. 16/52

Search space of DAGs The number of possible DAGs containing n nodes grows super-exponentially in n. 17/52
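
The count is given by Robinson's recurrence: a(n) = Σ_{k=1..n} (−1)^{k+1} · C(n,k) · 2^{k(n−k)} · a(n−k), with a(0) = 1. This gives 1, 3, 25, 543, 29281 DAGs for n = 1..5, and roughly 4.2 × 10^18 for n = 10, so exhaustive enumeration is hopeless beyond a handful of variables.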

Search strategies Local search strategies: hill climbing, max-min hill climbing, others (e.g., stochastic local search, genetic algorithms). Optimal search strategies: branch-and-bound, dynamic programming, integer linear programming, heuristic search. 18/52

Local search: Hill climbing Start from an initial state; iteratively update by adding, deleting, or reversing an edge. (Figure: a sequence of three-node DAGs produced by such moves.) 19/52
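
A minimal sketch of score-based hill climbing, assuming a local_score(x, parents) function is available (e.g., per-variable MDL, lower is better); the function and variable names are illustrative, not from the slides.

```python
from itertools import permutations

def has_cycle(parents):
    """Check whether the parent-set map describes a cyclic graph (DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in parents}

    def visit(v):
        color[v] = GRAY
        for p in parents[v]:
            if color[p] == GRAY or (color[p] == WHITE and visit(p)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in parents)

def total_score(parents, local_score):
    return sum(local_score(x, pa) for x, pa in parents.items())

def hill_climb(variables, local_score):
    """Greedy add/delete/reverse hill climbing over DAGs (lower score = better)."""
    parents = {v: frozenset() for v in variables}      # start from the empty graph
    best = total_score(parents, local_score)
    improved = True
    while improved:
        improved = False
        for x, y in permutations(variables, 2):        # candidate edge y -> x
            for op in ("add", "delete", "reverse"):
                cand = {v: set(pa) for v, pa in parents.items()}
                if op == "add" and y not in cand[x]:
                    cand[x].add(y)
                elif op == "delete" and y in cand[x]:
                    cand[x].remove(y)
                elif op == "reverse" and y in cand[x]:
                    cand[x].remove(y); cand[y].add(x)
                else:
                    continue
                cand = {v: frozenset(pa) for v, pa in cand.items()}
                if has_cycle(cand):
                    continue
                s = total_score(cand, local_score)
                if s < best:                            # keep any improving move
                    best, parents, improved = s, cand, True
    return parents, best
```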

Local search: Max-min hill climbing First stage: max-min parent children (MMPC) uses statistical tests of conditional independence to find the probable sets of parents and children for each variable x in the domain. Second stage: greedy local search in the space of DAGs, restricted to the skeleton identified by the first-stage MMPC. 20/52

Search strategies Local search strategies: hill climbing, max-min hill climbing, others (e.g., stochastic local search, genetic algorithms). Optimal search strategies: branch-and-bound, dynamic programming, integer linear programming, heuristic search. 21/52

General form For decomposable scores, the task is to minimize Σ_x s(x, PA_x), subject to the PA_x forming a DAG, where PA_x is the parent set of x in an optimal BN. Begin by calculating optimal parent sets for all variables. Number of candidate parent sets s(x, ·): 2^{n−1} scores for each variable. Most scores can be pruned because they cannot be optimal in any BN: bounds on parent sets for MDL [Tian, UAI-00]; properties of decomposable score functions [de Campos and Ji, JMLR-11]; sparse parent graphs [Yuan and Malone, UAI-12]. A pruning sketch follows below. 22/52
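
A minimal sketch of the basic pruning rule for decomposable scores such as MDL (a parent set can never be optimal if some subset of it already scores at least as well); the function name and data layout are illustrative assumptions, not from the slides.

```python
from itertools import combinations

def prune_parent_sets(scores):
    """scores: dict mapping frozenset(parent set) -> local score (lower is better).
    Keep only parent sets that score strictly better than every one of their subsets."""
    kept = {}
    for pa, s in scores.items():
        subsets = (frozenset(sub) for r in range(len(pa))
                   for sub in combinations(pa, r))
        if all(scores.get(sub, float("inf")) > s for sub in subsets):
            kept[pa] = s
    return kept

# Illustrative use for one variable with parents drawn from {A, B}:
scores_x = {frozenset(): 10.0, frozenset({"A"}): 8.0,
            frozenset({"B"}): 11.0, frozenset({"A", "B"}): 8.5}
print(prune_parent_sets(scores_x))   # keeps only frozenset() and frozenset({'A'})
```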

Input: potential optimal parent sets Example: the potential optimal parent sets s(X1, PA) for variable X1, after pruning useless local scores. 23/52

Input: Potential Optimal Parent Sets (POPS) POPS for another 8-variable example. [Yuan, Malone, Wu, IJCAI-11] 24/52

Optimal search: Branch-and-bound Begin by calculating optimal parent sets for all variables. These sets are represented as a directed graph G that may contain cycles. 25/52

Optimal search: Branch-and-bound Search over all possible directed graphs that are subgraphs of G, removing one edge at a time, and use bounds to prune the search space. The method was claimed to be optimal [de Campos and Ji, JMLR-11]; actually, it is not optimal. 26/52

Optimal search: Dynamic programming Dynamic programming exploits the decomposition: an optimal Bayesian network = a best leaf with optimal parents + an optimal subnetwork over the remaining variables. [Singh & Moore 2005, Silander & Myllymaki 2006] 27/52
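
A minimal sketch of this dynamic program over variable subsets, assuming a best_score(x, candidates) function that returns the best local score for x using parents drawn from candidates (lower is better); names are illustrative, not from the cited papers.

```python
from itertools import combinations

def dp_optimal_network(variables, best_score):
    """score[S] = cost of an optimal network over subset S.
    Recurrence: score[S] = min over x in S of score[S - {x}] + best_score(x, S - {x})."""
    variables = tuple(variables)
    score = {frozenset(): 0.0}
    leaf = {}                                   # which leaf was chosen for each subset
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            S = frozenset(subset)
            best = None
            for x in S:
                rest = S - {x}
                cand = score[rest] + best_score(x, rest)
                if best is None or cand < best:
                    best, leaf[S] = cand, x
            score[S] = best
    # Recover an optimal variable ordering (last-added leaf comes last).
    order, S = [], frozenset(variables)
    while S:
        x = leaf[S]
        order.append(x)
        S = S - {x}
    return score[frozenset(variables)], list(reversed(order))
```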

Optimal search: Integer linear programming Formulate the learning task as an integer linear programming (ILP) problem. Encode any graph by creating a binary ILP variable I(PA → x) for each variable x and each candidate parent set PA: I(PA → x) = 1 if and only if PA is the parent set of x in an optimal BN. Each I(PA → x) has a corresponding local score s(x, PA). (Table: coding of the family variables.) [Yuan, Malone, Wu, IJCAI-11] 28/52

Optimal search: Integer linear programming Learning formulation: instantiate the I(PA → x) to maximize Σ_{x ∈ V} Σ_{PA} s(x, PA) · I(PA → x), subject to the I(PA → x) representing a DAG: Σ_{PA} I(PA → x) = 1 for each x ∈ V (each variable gets exactly one parent set); cluster constraint: Σ_{x ∈ C} Σ_{PA: PA ∩ C = ∅} I(PA → x) ≥ 1 for every subset C ⊆ V, because any subset C of nodes in a DAG must contain at least one node whose parents all lie outside C. Learning method: GOBNILP, using cutting planes. [Cussens, etc., UAI-2011] 29/52
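
A minimal sketch of this ILP on a toy instance, assuming the open-source PuLP package as the solver interface (an assumption for illustration); unlike GOBNILP, which adds cluster constraints lazily as cutting planes, this sketch simply enumerates them all, which is only feasible for very small n.

```python
from itertools import chain, combinations
import pulp

def learn_bn_ilp(variables, scores):
    """scores[(x, PA)]: local score for x with parent set PA (here: higher is better)."""
    prob = pulp.LpProblem("bn_structure_learning", pulp.LpMaximize)
    I = {(x, pa): pulp.LpVariable(f"I_{x}_{'_'.join(sorted(pa)) or 'empty'}", cat="Binary")
         for (x, pa) in scores}
    # Objective: total score of the selected parent sets.
    prob += pulp.lpSum(scores[k] * I[k] for k in I)
    # Each variable picks exactly one parent set.
    for x in variables:
        prob += pulp.lpSum(I[(y, pa)] for (y, pa) in I if y == x) == 1
    # Cluster constraints, enumerated exhaustively only for this toy-sized example.
    clusters = chain.from_iterable(combinations(variables, r)
                                   for r in range(2, len(variables) + 1))
    for C in clusters:
        C = set(C)
        prob += pulp.lpSum(I[(x, pa)] for (x, pa) in I
                           if x in C and not set(pa) & C) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return {x: pa for (x, pa) in I if pulp.value(I[(x, pa)]) > 0.5}

# Toy example with 3 variables; parent sets are tuples of variable names.
scores = {("A", ()): -5.0, ("A", ("B",)): -4.0,
          ("B", ()): -6.0, ("B", ("A",)): -7.0,
          ("C", ()): -8.0, ("C", ("A", "B")): -3.0}
print(learn_bn_ilp(["A", "B", "C"], scores))   # {'A': ('B',), 'B': (), 'C': ('A', 'B')}
```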

Optimal search: Heuristic search Formulate the learning task as a shortest-path problem: the shortest-path solution to the corresponding graph search problem corresponds to an optimal Bayesian network. [Yuan, Malone, Wu, IJCAI-11] 30/52

Motivation Observation: every DAG has at least one leaf, so an optimal network decomposes into a leaf with its optimal parents plus an optimal subnetwork over the remaining variables. 31/52

Search graph (order graph) Formulation: search space: variable subsets; start node: the empty set ϕ; goal node: the complete set of variables; edges: select parents, with edge U → U ∪ {X} having cost BestMDL(X, U); task: find the shortest path between the start and goal nodes. (Figure: the order graph over four variables, from ϕ to {1,2,3,4}, with one path corresponding to the ordering 1,4,3,2 highlighted.) [Yuan, Malone, Wu, IJCAI-11] 32/52

Simple heuristic A* search expands the nodes in order of promisingness f = g + h, where g(U) = MDL(U) and h(U) = Σ_{X ∈ V\U} BestMDL(X, V\{X}), i.e., each remaining variable chooses optimal parents from among all other variables. (Figure: computing h({2,3}) in the order graph.) [Yuan, Malone, Wu, IJCAI-11] 33/52
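
A minimal sketch of A* over the order graph with this simple heuristic, assuming the same best_score(x, candidates) function as before (lower is better); names are illustrative and this is not the URLearning implementation.

```python
import heapq
from itertools import count

def astar_order_graph(variables, best_score):
    V = frozenset(variables)

    def h(U):
        # Each remaining variable picks optimal parents from all other variables
        # (relaxes acyclicity, hence an admissible lower bound).
        return sum(best_score(x, V - {x}) for x in V - U)

    tie = count()                                 # tiebreaker so the heap never compares sets
    start = frozenset()
    frontier = [(h(start), next(tie), 0.0, start)]
    best_g = {start: 0.0}
    while frontier:
        f, _, g, U = heapq.heappop(frontier)
        if g > best_g.get(U, float("inf")):
            continue                              # stale queue entry
        if U == V:
            return g                              # cost of an optimal network
        for x in V - U:
            g2 = g + best_score(x, U)             # add x with parents chosen from U
            U2 = U | {x}
            if g2 < best_g.get(U2, float("inf")):
                best_g[U2] = g2
                heapq.heappush(frontier, (g2 + h(U2), next(tie), g2, U2))
    return None
```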

Properties of the simple heuristic Theorem: the simple heuristic function h is admissible (an optimistic estimate). Theorem: h is also consistent (monotonic f values). Consistency implies admissibility. 34/52

BFBnB algorithm Breadth-first branch and bound search (BFBnB). Motivation: the order and parent graphs are exponential in size. Observation: the order graph has a natural layered structure. Solution: search one layer at a time. [Malone, Yuan, Hansen, UAI-11] 35/52

BFBnB algorithm (Figures, slides 36-39/52: the order graph over four variables is generated and expanded one layer at a time, from ϕ down to {1,2,3,4}.) [Malone, Yuan, Hansen, UAI-11]

Pruning in BFBnB For pruning, estimate an upper-bound solution before search; this can be done using greedy local search. Prune a node when its f-cost exceeds the upper bound. [Malone, Yuan, Hansen, UAI-11] 40/52

Critiquing the simple heuristic Drawback: the simple heuristic lets each variable choose optimal parents from all the other variables, which completely relaxes the acyclicity constraint. (Figure: a Bayesian network vs. the heuristic estimate obtained by this relaxation, which introduces cycles.) 41/52

Potential solution Break the cycles to obtain a tighter heuristic. For the pair {1,2}, for example, instead of letting both variables pick parents freely (which allows the cycle 1 ⇄ 2), take c({1,2}) = min( BestMDL(1, {2,3,4}) + BestMDL(2, {3,4}), BestMDL(2, {1,3,4}) + BestMDL(1, {3,4}) ), so that at least one of the two variables cannot use the other as a parent. [Yuan, Malone, UAI-12] 42/52

k-cycle conflict heuristic Compute costs for all 2-variable groups; each group, called a pattern, has a tighter score. Based on dynamically partitioned pattern databases [Felner et al. 2004]. The idea generalizes: avoid cycles within all 3-variable groups, and in general within variable groups of size up to k. [Yuan, Malone, UAI-12] 43/52

Computing the k-cycle conflict heuristic Compute the pattern database with a backward breadth-first search in the order graph for k layers. Reverse g-cost: g_r({3,4}) = min(P1, P2), where P1 = BestMDL(1, {2,3,4}) + BestMDL(2, {3,4}) and P2 = BestMDL(2, {1,3,4}) + BestMDL(1, {3,4}); note that g_r({3,4}) equals c({1,2}). [Yuan, Malone, UAI-12] 44/52

Computing heuristic values using the dynamic PD To compute the heuristic for a node, find non-overlapping patterns and sum their costs, e.g. h({1}) = c({2,3}) + c({4}), or c({2,4}) + c({3}), or c({3,4}) + c({2}). This can be formulated as a maximum-weight matching problem: O(N^3) for k = 2 [Papadimitriou & Steiglitz 1982], but NP-hard for k ≥ 3 [Garey & Johnson 1979]. Greedy method: order the patterns based on their improvements and greedily take the best available pattern at each step, as sketched below. [Yuan, Malone, UAI-12] 45/52
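
A minimal sketch of the greedy pattern selection under stated assumptions (patterns are given with their tightened costs, and a pattern's improvement is measured against the sum of its members' simple single-variable costs); the structure and names are illustrative, not the paper's exact procedure.

```python
def greedy_pattern_heuristic(remaining, pattern_cost, simple_cost):
    """remaining: set of variables still to be added; pattern_cost: dict frozenset -> cost;
    simple_cost: dict variable -> cost of the 1-variable pattern (lower-bound scores)."""
    def improvement(p):
        # How much tighter (larger) the pattern cost is than the naive sum of simple costs.
        return pattern_cost[p] - sum(simple_cost[v] for v in p)

    candidates = sorted((p for p in pattern_cost if p <= remaining and len(p) > 1),
                        key=improvement, reverse=True)
    h, covered = 0.0, set()
    for p in candidates:
        if not (p & covered):                  # only patterns that overlap nothing chosen yet
            h += pattern_cost[p]
            covered |= p
    h += sum(simple_cost[v] for v in remaining - covered)   # singletons for what is left
    return h
```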

Statically partitioned pattern database Calculate full pattern databases for non-overlapping static groups, e.g. split {1,2,3,4,5,6} into {1,2,3} and {4,5,6} and build a separate pattern database (a small order graph) for each group. [Yuan, Malone, UAI-12] 46/52

Computing heuristic values using the static PD Sum the costs from the pattern databases according to the static grouping, e.g. h({1,5,6}) = c({2,3}) + c({4}) = g_r({1}) + g_r({5,6}). [Yuan, Malone, UAI-12] 47/52

Properties of the k-cycle conflict heuristic Theorem: the k-cycle conflict heuristic is admissible, in both its dynamic and static versions. Theorem: the static k-cycle conflict heuristic is consistent. Theorem: the dynamic k-cycle conflict heuristic is consistent, given that an optimal maximum-matching algorithm is used; the consistency property is lost if an approximation algorithm is used. [Yuan, Malone, UAI-12] 48/52

Comparison of optimal search algorithms Branch-and-bound: searches all subgraphs of the directed (possibly cyclic) graph G obtained from the optimal parent sets; prunes by bound; does not guarantee an optimal solution. Dynamic programming: searches all DAGs using DP techniques. Integer linear programming (ILP): performance depends on the number of family variables. Heuristic search: searches all subsets of variables; performance depends on the tightness of the heuristic function. Heuristic search and ILP are much more efficient. [Yuan, Malone, JAIR-12] 49/52

Comparison of optimal search algorithms 50/52

Summary A survey of the problem of Bayesian network structure learning. Topics covered: basics of Bayesian networks; scoring functions; local search strategies (hill climbing, MMHC); branch-and-bound; dynamic programming; integer linear programming; heuristic search (A* search, BFBnB, improved heuristics). Efficient packages/software: GOBNILP, using integer linear programming; URLearning, using heuristic search. 51/52
