Building Classifiers using Bayesian Networks

Building Classifiers using Bayesian Networks. Nir Friedman and Moises Goldszmidt, 1997. Presented by Brian Collins and Lukas Seitlinger.

Paper Summary The Naive Bayes classifier has reasonable performance compared to more sophisticated methods. Naive Bayes classifiers can be represented by Bayesian networks. The paper explores the application of Bayesian networks to classification tasks. This could lead to better performance, but is computationally expensive. It proposes the Tree Augmented Naive Bayes (TAN) form of restricted Bayesian networks, which performs better than Naive Bayes in most cases. An efficient algorithm for learning TAN networks is provided. Extensive empirical results are presented comparing different classification methods on 22 different datasets. TAN appears to have the highest overall performance.

Naive Bayes Classification Task: determine $P(C \mid A_1, \ldots, A_n)$ for data instances $\{c, a_1, \ldots, a_n\}$. Assume attributes are conditionally independent given the class label $c$. Formally: $P(C, A_1, \ldots, A_n) = P(C) \prod_{i=1}^{n} P(A_i \mid C)$. This strong independence assumption does not hold for many data sets. The conditional distribution of each attribute given the class is modelled. Continuous distributions such as Gaussians can be used, but discrete representations are used in this paper. Naive Bayes requires minimal storage and computation compared to more sophisticated methods.
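
To make the factorisation concrete, here is a minimal sketch (not from the paper) of a discrete naive Bayes classifier fitted by frequency counts. The function names and the `eps` fallback for unseen attribute values are illustrative choices; prediction works in log space so the product of conditionals becomes a sum and small probabilities do not underflow.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate P(C) and P(A_i | C) from discrete data by frequency counts."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    cond = {}  # cond[(i, c)] maps an attribute value to P(A_i = value | C = c)
    for c in classes:
        Xc = X[y == c]
        for i in range(X.shape[1]):
            vals, counts = np.unique(Xc[:, i], return_counts=True)
            cond[(i, c)] = dict(zip(vals, counts / counts.sum()))
    return prior, cond

def predict_naive_bayes(x, prior, cond, eps=1e-9):
    """Pick argmax_c log P(c) + sum_i log P(a_i | c)."""
    best_c, best_score = None, -np.inf
    for c, p_c in prior.items():
        score = np.log(p_c)
        for i, a in enumerate(x):
            score += np.log(cond[(i, c)].get(a, eps))  # eps for unseen values
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Illustrative usage on a tiny toy dataset:
X = np.array([["sunny", "hot"], ["rain", "mild"], ["rain", "hot"]])
y = np.array(["no", "yes", "no"])
prior, cond = fit_naive_bayes(X, y)
print(predict_naive_bayes(["rain", "mild"], prior, cond))  # -> "yes"
```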

Bayesian Networks Provide an efficient framework for representing independence assertions. They are directed acyclic graphs (DAGs) representing the joint probability distribution of a set of random variables (nodes); edges represent direct correlations. Figure: Naive Bayes classifier as a Bayesian network

Bayesian Networks for Classification Allow arbitrary connections between the class C and the attributes. Each node stores the conditional distribution of the corresponding random variable given its parents. For a fixed network structure, this is trivial to extract for discrete data when no data is missing: simply calculate the frequencies in the data. To classify a data instance, use Bayes rule to calculate the posterior probability of each class and choose the class with the highest value: $P(c \mid a_1, \ldots, a_n) \propto P(a_1, \ldots, a_n \mid c)\, P(c)$. Figure: general Bayesian network for classification
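
The following sketch shows how a fixed discrete network could be used for classification in this way. The `parents`/`cpts` data structures are assumptions made for illustration, not the paper's representation; the class variable is treated as just another node in the network.

```python
import numpy as np

def bn_log_joint(assignment, parents, cpts):
    """log P(assignment) = sum over variables v of log P(v | parents(v)).

    assignment: dict var -> value
    parents:    dict var -> tuple of parent variable names (class has an empty tuple)
    cpts:       dict var -> {(parent_values, value): probability}
    """
    logp = 0.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[p] for p in pa)
        logp += np.log(cpts[var][(pa_vals, assignment[var])])
    return logp

def bn_classify(attrs, class_var, class_values, parents, cpts):
    """Bayes rule: pick the class maximising P(c, a_1..a_n), i.e. the posterior up to a constant."""
    scores = {}
    for c in class_values:
        assignment = dict(attrs, **{class_var: c})
        scores[c] = bn_log_joint(assignment, parents, cpts)
    return max(scores, key=scores.get)
```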

Learning Bayesian Networks Similar to unsupervised learning, since we are trying to learn the probability distribution of the data while treating the class value like any other attribute. Finding the best network structure is hard; the first requirement is a scoring criterion to determine which network is best. Log likelihood of the data: $LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i, a_1^i, \ldots, a_n^i)$. Parameters for a fixed network structure that maximise the log likelihood are easy to compute: simply store the conditional probability of each variable given its parents, estimated as frequencies in the data.
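
A minimal sketch of both steps, assuming the same dictionary-based representation as in the sketch above: maximum-likelihood parameters are just conditional frequencies, and the log likelihood is a sum over instances.

```python
import numpy as np
from collections import Counter

def fit_cpts(data, parents):
    """Maximum-likelihood CPTs for a fixed structure: conditional frequencies in the data.

    data:    list of dicts, each mapping variable name -> discrete value
    parents: dict var -> tuple of parent variable names
    """
    joint, marg = Counter(), Counter()
    cpts = {var: {} for var in parents}
    for row in data:
        for var, pa in parents.items():
            key = (var, tuple(row[p] for p in pa))
            joint[key + (row[var],)] += 1
            marg[key] += 1
    for (var, pa_vals, val), n in joint.items():
        cpts[var][(pa_vals, val)] = n / marg[(var, pa_vals)]
    return cpts

def log_likelihood(data, parents, cpts):
    """LL(B|D) = sum over instances of log P_B(c, a_1, ..., a_n)."""
    return sum(
        np.log(cpts[var][(tuple(row[p] for p in parents[var]), row[var])])
        for row in data for var in parents
    )
```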

Learning Bayesian Networks (contd.) A fully connected network will always have the highest log likelihood on the training data, but overfitting tends to occur and the learned parameters will have extremely high variance (when trained on different datasets). This would not be a problem if very large amounts of training data were available. Finding the best network structure is intractable: there is no known polynomial-time algorithm, and exhaustive search seems to be required. The number of possible network structures is exponential in the number of attributes. Greedy search over network structures is used in the paper; edges are added, deleted, or reversed in each step, and changes are kept if the scoring criterion improves.
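
A sketch of the greedy search loop described here, under the assumption that the caller supplies the scoring function (for example an MDL-based score, negated so that higher is better). This only shows the search skeleton, not the paper's actual implementation.

```python
import itertools

def greedy_structure_search(variables, score, initial_edges=()):
    """Greedy hill climbing over structures: repeatedly try adding, deleting, or
    reversing a single edge and keep the change whenever score(edges) improves."""
    def is_acyclic(edges):
        # Kahn-style check: repeatedly remove nodes with no incoming edges.
        remaining, nodes = set(edges), set(variables)
        while nodes:
            sources = {v for v in nodes if not any(dst == v for _, dst in remaining)}
            if not sources:
                return False  # every remaining node has an incoming edge -> cycle
            nodes -= sources
            remaining = {(u, v) for (u, v) in remaining if u not in sources}
        return True

    edges, best = set(initial_edges), score(set(initial_edges))
    improved = True
    while improved:
        improved = False
        for u, v in itertools.permutations(variables, 2):
            if (u, v) in edges:
                candidates = [edges - {(u, v)}, (edges - {(u, v)}) | {(v, u)}]  # delete, reverse
            else:
                candidates = [edges | {(u, v)}]                                 # add
            for cand in candidates:
                s = score(cand)
                if is_acyclic(cand) and s > best:
                    edges, best, improved = cand, s, True
        # loop again from the updated structure until no single-edge move improves it
    return edges, best
```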

Minimum Description Length Trade-off between log likelihood and network complexity. Based on information theory: represents the minimum number of bits needed to transmit the network parameters and the data. Defined as: $MDL(B \mid D) = \frac{\log N}{2} |B| - LL(B \mid D)$, where $|B|$ is the number of network parameters and $N$ the number of data instances. The first term represents the theoretical minimum number of bits needed to represent the parameters, and the negative log likelihood represents the minimum number of bits required to encode the data under the model. MDL would indicate the best solution if we had infinite training data. When training data is limited, MDL does not always indicate the best network for classification tasks, particularly when there are more than about 20 attributes. MDL might give better results for general inference tasks in networks.
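
A sketch of the MDL score, reusing the `log_likelihood` sketch above and assuming the usual parameter-counting convention that each variable contributes $(|X|-1)$ free parameters per parent configuration; lower scores are better.

```python
import numpy as np

def mdl_score(data, parents, cpts, cardinalities):
    """MDL(B|D) = (log N / 2) * |B| - LL(B|D).

    cardinalities maps each variable to the number of values it can take;
    log_likelihood is the sketch shown earlier."""
    N = len(data)
    n_params = sum(
        (cardinalities[var] - 1) * int(np.prod([cardinalities[p] for p in pa]))
        for var, pa in parents.items()
    )
    return 0.5 * np.log(N) * n_params - log_likelihood(data, parents, cpts)
```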

Other Scoring Functions Similar scoring functions, such as the Bayesian scoring function, have similar problems finding the best network for the classification task. Cross-validation is a computationally expensive alternative, but may provide a better indication of performance. Potential solution: modify the scoring function to suit the classification task, using the conditional log likelihood.

Conditional Log Likelihood The log likelihood can be decomposed as follows: $LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i, a_1^i, \ldots, a_n^i) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \ldots, a_n^i) + \sum_{i=1}^{N} \log P_B(a_1^i, \ldots, a_n^i)$. The first term represents how well the network estimates the probability of the class given the attributes; the second term represents the joint distribution of the attributes. Only the first term affects classification performance, so define the conditional log likelihood based on the first term: $CLL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \ldots, a_n^i)$. Unfortunately, there is no known closed-form solution that maximises the CLL for a fixed network structure; EM or gradient descent methods are needed. A conditional MDL (CMDL) could be defined by replacing LL with CLL in the MDL equation: $CMDL(B \mid D) = \frac{\log N}{2} |B| - CLL(B \mid D)$. Evaluating CMDL therefore requires much more computation than MDL.
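
A sketch of evaluating the CLL for a fixed discrete network, reusing the `bn_log_joint` sketch above; the conditional is obtained by normalising the joint over the class values.

```python
import numpy as np

def conditional_log_likelihood(data, class_var, class_values, parents, cpts):
    """CLL(B|D) = sum_i log P_B(c^i | a_1^i, ..., a_n^i)."""
    cll = 0.0
    for row in data:
        log_joint = {
            c: bn_log_joint(dict(row, **{class_var: c}), parents, cpts)
            for c in class_values
        }
        log_norm = np.logaddexp.reduce(list(log_joint.values()))
        cll += log_joint[row[class_var]] - log_norm
    return cll
```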

Empirical Results: Naive Bayes v. Bayesian Networks (with best MDL scores) Results for 22 different datasets. Separate test and training sets for the larger datasets; 5-fold cross-validation for the smaller datasets.

Unrestricted Bayesian Network Summary Bayesian networks are a very powerful tool. The best network would perform no worse than the naive Bayes classifier. Exhaustively searching for the best network structure is intractable. Scoring functions do not always indicate the best network for the classification task. Scoring functions specialised for classification are harder to optimise for a fixed network structure.

Restricted Bayesian Networks Based on the naive Bayes network structure: every attribute has the class as a parent. Attributes are additionally allowed to be connected with correlation edges, so two attributes need no longer be conditionally independent given the class.

Learning the restricted Network Learning a restricted network, even when based on the naive Bayes structure, is still an intractable problem; essentially we are trying to learn a Bayesian network over all the attributes. So add more restrictions: we construct a directed spanning tree over the attributes, i.e. any node may have at most one correlation edge pointing to it from another attribute. We call this the Tree Augmented Naive Bayes (TAN). An efficient algorithm for constructing this network exists (Chow & Liu).

Construction of a maximal log likelihood TAN structure Compute the mutual information between each pair of attributes: $I(X_i; X_j) = \sum_{x_i, x_j} P_D(x_i, x_j) \log \frac{P_D(x_i, x_j)}{P_D(x_i)\, P_D(x_j)}$. This measures the information gained about one attribute when knowing the value of another, and it is zero for independent attributes. For our purposes (classification) we introduce the conditional mutual information: $I(X_i; X_j \mid C) = \sum_{x_i, x_j, c} P_D(x_i, x_j, c) \log \frac{P_D(x_i, x_j \mid c)}{P_D(x_i \mid c)\, P_D(x_j \mid c)}$.
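
A sketch of estimating $I(X_i; X_j \mid C)$ from frequency counts; the dictionary-based data interface (`xi`, `xj`, `class_var` as column names) is an assumption for illustration, not the paper's code.

```python
import numpy as np
from collections import Counter

def conditional_mutual_information(data, xi, xj, class_var):
    """Empirical I(X_i; X_j | C) computed from frequency counts in the data."""
    N = len(data)
    joint = Counter((row[xi], row[xj], row[class_var]) for row in data)
    xi_c = Counter((row[xi], row[class_var]) for row in data)
    xj_c = Counter((row[xj], row[class_var]) for row in data)
    c_counts = Counter(row[class_var] for row in data)
    cmi = 0.0
    for (a, b, c), n_abc in joint.items():
        p_abc = n_abc / N
        # P(x_i, x_j | c) / (P(x_i | c) P(x_j | c)) expressed with raw counts
        ratio = (n_abc * c_counts[c]) / (xi_c[(a, c)] * xj_c[(b, c)])
        cmi += p_abc * np.log(ratio)
    return cmi
```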

Construction (contd.) Build a fully connected undirected graph with a vertex for each attribute, and set the weight of each edge to the conditional mutual information of the two attributes. Now build the maximum weighted spanning tree of the graph: a spanning tree whose sum of edge weights is greater than or equal to that of any other spanning tree of the graph. Convert the undirected tree to a directed tree by choosing a root node and setting the direction of edges to be outward from it.
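
A sketch of this tree-building step: a Prim-style maximum weighted spanning tree over the attributes (the graph is complete, so Prim's greedy choice is valid), followed by a breadth-first pass that directs edges away from an arbitrarily chosen root. The `weight` callback is assumed to return the conditional mutual information, e.g. the sketch after the previous slide.

```python
from collections import deque

def build_tan_tree(attributes, weight):
    """Return directed attribute edges (parent, child) for a TAN structure sketch."""
    root = attributes[0]
    in_tree, undirected = {root}, []
    while len(in_tree) < len(attributes):
        # Pick the heaviest edge crossing from the current tree to the remaining attributes.
        u, v = max(
            ((a, b) for a in in_tree for b in attributes if b not in in_tree),
            key=lambda e: weight(*e),
        )
        undirected.append((u, v))
        in_tree.add(v)
    # Orient edges away from the root with a breadth-first traversal.
    neighbours = {a: set() for a in attributes}
    for u, v in undirected:
        neighbours[u].add(v)
        neighbours[v].add(u)
    directed, queue, visited = [], deque([root]), {root}
    while queue:
        u = queue.popleft()
        for v in neighbours[u] - visited:
            directed.append((u, v))
            visited.add(v)
            queue.append(v)
    # In the TAN model each attribute additionally has the class C as a parent.
    return directed
```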

Time complexity of the construction algorithm Overall time complexity is $O(n^2 N)$. Computing the mutual information terms is $O(n^2 N)$, while construction of the maximum spanning tree is $O(n^2 \log n)$. In general $N > \log n$, hence the overall complexity.

Adjusting the parameters When assigning the parameters $\theta_{x \mid \Pi_x}$ to the network we estimate conditional frequencies of the form $P_D(X \mid \Pi_X)$, where $\Pi_X$ are the parents of $X$. To compute these we partition the data according to the possible values of $\Pi_X$ before computing frequencies. This gives at least twice as many partitions as in naive Bayes, which partitions on the class variable only, and it reduces the reliability of the estimates where few data instances are available.

Adjusting parameters (contd.) In order to deal with unreliable estimates due to few instances in a partition, introduce a smoothed estimate with a bias towards the marginal probability of an attribute $X$: $\theta^s(x \mid \Pi_x) = \alpha\, P_D(x \mid \Pi_x) + (1 - \alpha)\, P_D(x)$, where $\alpha = \frac{N \cdot P_D(\Pi_x)}{N \cdot P_D(\Pi_x) + s}$ and $s$ is the smoothing parameter (see Dirichlet priors). Applying this to the TAN algorithm gives the smoothed TAN algorithm.
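
A sketch of the smoothed estimate as reconstructed above, written in terms of raw counts (note that $N \cdot P_D(\Pi_x)$ is just the count of instances in the partition); the default value of `s` is illustrative, not the paper's choice.

```python
def smoothed_estimate(n_x_given_pa, n_pa, n_x, n_total, s=5.0):
    """theta_s(x | pa) = alpha * P_D(x | pa) + (1 - alpha) * P_D(x),
    with alpha = N * P_D(pa) / (N * P_D(pa) + s).

    n_x_given_pa: count of instances with X = x and parents = pa
    n_pa:         count of instances with parents = pa
    n_x:          count of instances with X = x
    n_total:      total number of instances N
    s:            smoothing weight (plays the role of a Dirichlet prior strength)
    """
    marginal = n_x / n_total
    if n_pa == 0:
        return marginal  # no data in this partition: fall back to the marginal
    conditional = n_x_given_pa / n_pa
    alpha = n_pa / (n_pa + s)  # N * P_D(pa) equals the raw count n_pa
    return alpha * conditional + (1 - alpha) * marginal
```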

Experimental results Smoothed TAN performs at least as well as, and in many cases better than, unsmoothed TAN. Comparison of naive Bayes, unsupervised Bayesian networks, TAN, C4.5 (decision trees), and the selective naive Bayes classifier on 22 datasets. TAN performs competitively with all of the other classifiers, and when it performs better it occasionally does so by a large margin. For evaluation, 5-fold cross-validation is used on the majority of the data sets.

Comparison of TAN to C4.5 and Naive Bayes

THE END Questions?