Building Classifiers using Bayesian Networks

Nir Friedman and Moises Goldszmidt, 1997
Presented by Brian Collins and Lukas Seitlinger

Paper Summary
The Naive Bayes classifier performs reasonably well compared to more sophisticated methods, and it can be represented as a Bayesian network. The paper explores applying general Bayesian networks to classification tasks. This could lead to better performance, but is computationally expensive. The paper proposes Tree Augmented Naive Bayes (TAN), a restricted form of Bayesian network that performs better than Naive Bayes in most cases, and provides an efficient algorithm for learning TAN networks. Extensive empirical results compare different classification methods on 22 datasets. TAN appears to have the highest overall performance.

Naive Bayes
Classification task: determine $p(C \mid A_1, \dots, A_n)$ for data instances $\{c, a_1, \dots, a_n\}$. Assume the attributes are conditionally independent given the class label $c$. Formally:
$$p(C, A_1, \dots, A_n) = p(C) \prod_{i=1}^{n} p(A_i \mid C)$$
This strong independence assumption does not hold for many data sets. The conditional distribution of each attribute given the class is modelled. Continuous distributions such as Gaussians can be used, but this paper uses discrete representations. Naive Bayes requires minimal storage and computation compared to more sophisticated methods.
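A minimal sketch of a discrete Naive Bayes classifier, assuming plain relative-frequency estimates with no smoothing. The function names `train_naive_bayes` and `classify` and the toy data are made up for illustration and do not come from the paper.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate p(C) and p(A_i | C) by relative frequencies (no smoothing)."""
    n = len(rows)
    class_counts = Counter(labels)
    # cond_counts[(i, c)][v] = number of instances of class c with A_i = v
    cond_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond_counts[(i, c)][v] += 1
    prior = {c: k / n for c, k in class_counts.items()}

    def cond(i, v, c):
        # p(A_i = v | C = c); zero if the combination was never observed
        return cond_counts[(i, c)][v] / class_counts[c]

    return prior, cond

def classify(prior, cond, row):
    """Return the class maximising p(c) * prod_i p(a_i | c)."""
    def score(c):
        p = prior[c]
        for i, v in enumerate(row):
            p *= cond(i, v, c)
        return p
    return max(prior, key=score)

# Toy data (hypothetical): two binary attributes, two classes.
rows = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
labels = ["yes", "yes", "no", "no", "yes"]
prior, cond = train_naive_bayes(rows, labels)
prediction = classify(prior, cond, (1, 0))
```

Only the class prior and one small table per attribute and class are stored, which is why the storage and computation costs stay low.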

Bayesian Networks
Bayesian networks provide an efficient framework for representing independence assertions. They are directed acyclic graphs (DAGs) that represent the joint probability distribution of a set of random variables (the nodes). Edges represent direct correlations between variables.
Figure: Naive Bayes classifier as a Bayesian network.

Bayesian Networks for Classification
General Bayesian networks allow arbitrary connections between the class $C$ and the attributes. Each node stores the conditional distribution of the corresponding random variable given its parents. For a fixed network structure, this is trivial to extract for discrete data when no data is missing: simply calculate the frequencies in the data. To classify a data instance, use Bayes' rule to calculate the posterior probability of each class and choose the class with the highest value:
$$P(c \mid A_1, \dots, A_n) \propto p(A_1, \dots, A_n \mid c)\, p(c)$$
Figure: general Bayesian network for classification.

Learning Bayesian Networks
Learning a Bayesian network is similar to unsupervised learning: we try to learn the probability distribution of the data while treating the class value like any other attribute. Finding the best network structure is hard. The first requirement is a scoring criterion to determine which network is best. One candidate is the log likelihood of the data:
$$LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i, a_1^i, \dots, a_n^i)$$
The parameters that maximise the log likelihood for a fixed network structure are easy to compute: simply store the conditional probability of each variable given its parents, as observed in the data.
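The following sketch illustrates fitting maximum-likelihood parameters for a fixed structure and evaluating the log likelihood, assuming complete discrete data. The data layout (a list of dicts), the `parents` mapping, and the function names are assumptions chosen for this example.

```python
import math
from collections import Counter

def fit_cpts(data, parents):
    """MLE parameters: relative frequency of each value given its parents.

    data: list of dicts mapping variable name -> value.
    parents: dict mapping variable name -> tuple of parent names (the DAG).
    """
    joint, marg = Counter(), Counter()
    for row in data:
        for var, pa in parents.items():
            pa_vals = tuple(row[p] for p in pa)
            joint[(var, row[var], pa_vals)] += 1
            marg[(var, pa_vals)] += 1
    return {k: joint[k] / marg[(k[0], k[2])] for k in joint}

def log_likelihood(data, parents, cpts):
    """LL(B | D): sum over instances of the log of the factored joint."""
    ll = 0.0
    for row in data:
        for var, pa in parents.items():
            pa_vals = tuple(row[p] for p in pa)
            ll += math.log(cpts[(var, row[var], pa_vals)])
    return ll

# Toy data (hypothetical): class C and two binary attributes.
data = [
    {"C": 0, "A1": 0, "A2": 0},
    {"C": 0, "A1": 0, "A2": 1},
    {"C": 1, "A1": 1, "A2": 1},
    {"C": 1, "A1": 1, "A2": 1},
]
naive_bayes = {"C": (), "A1": ("C",), "A2": ("C",)}
no_edges = {"C": (), "A1": (), "A2": ()}
ll_nb = log_likelihood(data, naive_bayes, fit_cpts(data, naive_bayes))
ll_empty = log_likelihood(data, no_edges, fit_cpts(data, no_edges))
```

On the training data the structure with more edges never scores lower, which is the reason log likelihood alone cannot be used to choose between structures.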

Learning Bayesian Networks (contd.)
A fully connected network will always have the highest log likelihood for the training data, but overfitting tends to occur and the learned parameters will have extremely high variance (when trained on different datasets). This would not be a problem if very large amounts of training data were available. Finding the best network structure is intractable: there is no known polynomial-time algorithm, and exhaustive search seems to be required. The number of possible network structures is exponential in the number of attributes. The paper uses greedy search over network structures: in each step an edge is added, deleted, or reversed, and the change is kept if the score improves.

Minimum Description Length
MDL is a trade-off between log likelihood and network complexity. It is based on information theory and represents the minimum number of bits needed to transmit the network parameters and the data. It is defined as:
$$MDL(B \mid D) = \frac{1}{2} \log N \, |B| - LL(B \mid D)$$
where $|B|$ is the number of network parameters and $N$ is the number of data instances. The first term represents the theoretical minimum number of bits needed to represent the parameters, and the negative log likelihood represents the minimum number of bits required to encode the data under the model. MDL would indicate the best network if we had infinite training data. When training data is limited, MDL does not always indicate the best network for classification tasks, particularly when there are more than about 20 attributes. MDL might give better results for general inference tasks in networks.
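As a rough illustration, the MDL score can be computed from a log likelihood and a parameter count. This sketch assumes that the number of parameters counts free parameters only (one fewer than the number of values, for each parent configuration) and uses natural logarithms throughout. The function names and the `cardinality` mapping are hypothetical.

```python
import math

def num_parameters(cardinality, parents):
    """Free parameters: (r_X - 1) * product of parent cardinalities, per node."""
    total = 0
    for var, pa in parents.items():
        parent_configs = math.prod(cardinality[p] for p in pa)  # 1 if no parents
        total += (cardinality[var] - 1) * parent_configs
    return total

def mdl_score(log_lik, n_params, n_instances):
    """MDL = (log N / 2) * (number of parameters) - LL; lower is better."""
    return 0.5 * math.log(n_instances) * n_params - log_lik

# Hypothetical example: binary class and two binary attributes.
cardinality = {"C": 2, "A1": 2, "A2": 2}
naive_bayes = {"C": (), "A1": ("C",), "A2": ("C",)}
fully_connected = {"C": (), "A1": ("C",), "A2": ("C", "A1")}
k_nb = num_parameters(cardinality, naive_bayes)
k_full = num_parameters(cardinality, fully_connected)
```

The denser structure pays a larger penalty, so it is preferred only if its log likelihood improves by more than the extra penalty.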

Other Scoring Functions
Similar scoring functions, such as the Bayesian scoring function, have similar problems finding the best network for the classification task. Cross-validation is a computationally expensive alternative, but may provide a better indication of performance. Potential solution: modify the scoring function to suit the classification task by using the conditional log likelihood.

Conditional Log Likelihood
The log likelihood can be decomposed as follows:
$$LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i, a_1^i, \dots, a_n^i) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \dots, a_n^i) + \sum_{i=1}^{N} \log P_B(a_1^i, \dots, a_n^i)$$
The first term represents how well the network estimates the probability of the class given the attributes. The second term represents the joint distribution of the attributes. Only the first term affects classification performance, so we define the conditional log likelihood based on the first term:
$$CLL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \dots, a_n^i)$$
Unfortunately, there is no known closed-form solution for maximising the CLL for a fixed network structure, so EM or gradient descent methods are needed. A conditional MDL (CMDL) score could be defined by replacing LL with CLL in the MDL equation, but evaluating CMDL requires much more computation than MDL:
$$CMDL(B \mid D) = \frac{1}{2} \log N \, |B| - CLL(B \mid D)$$
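The decomposition can be checked numerically. This sketch assumes the model is available as a callable `joint(c, a)` returning the joint probability of a class and an attribute tuple. That interface, the function name `ll_and_cll`, and the toy probability table are made up for illustration.

```python
import math

def ll_and_cll(data, classes, joint):
    """Return (LL, CLL) for instances (c, a), where joint(c, a) = P_B(c, a)."""
    ll = cll = 0.0
    for c, a in data:
        p_joint = joint(c, a)
        # P_B(a): marginalise the class out of the joint
        p_attrs = sum(joint(k, a) for k in classes)
        ll += math.log(p_joint)
        cll += math.log(p_joint / p_attrs)  # log P_B(c | a)
    return ll, cll

# Hypothetical joint distribution over one binary class and one binary attribute.
table = {
    (0, (0,)): 0.4, (0, (1,)): 0.1,
    (1, (0,)): 0.2, (1, (1,)): 0.3,
}
data = [(0, (0,)), (1, (1,)), (1, (0,))]
ll, cll = ll_and_cll(data, [0, 1], lambda c, a: table[(c, a)])
ll_attrs = ll - cll  # the second term: sum of log P_B(a)
```

Computing the CLL needs the joint for every class value per instance, which hints at why scores built on it cost more to evaluate than the plain log likelihood.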

Empirical Results: Naive Bayes vs. Bayesian Networks (with best MDL scores)
Results are shown for 22 different datasets. The larger datasets use separate training and test sets; the smaller datasets use 5-fold cross-validation.

Unrestricted Bayesian Network Summary
Bayesian networks are a very powerful tool. The best network would perform no worse than the naive Bayes classifier. Exhaustively searching for the best network structure is intractable. Scoring functions do not always indicate the best network for the classification task. Scoring functions specialised for classification are harder to optimise for a fixed network structure.

Restricted Bayesian Networks
Restricted networks are based on the naive Bayes network structure: every attribute has the class as a parent. Attributes may additionally be connected to each other with correlation edges, so two attributes need no longer be conditionally independent given the class.

Learning the Restricted Network
Learning a restricted network, even when based on the naive Bayes structure, is still an intractable problem: essentially we are trying to learn a Bayesian network over all the attributes. So we add more restrictions: the correlation edges must form a directed spanning tree over the attributes, i.e. each attribute has at most one correlation edge pointing to it from another attribute. We call this the Tree Augmented Naive Bayes (TAN). The network is constructed with an algorithm based on Chow & Liu.

Construction of a maximal log likelihood TAN structure
Compute the mutual information between each pair of variables:
$$I(X_i; X_j) = \sum_{x_i, x_j} P_D(x_i, x_j) \log \frac{P_D(x_i, x_j)}{P_D(x_i)\, P_D(x_j)}$$
This measures the information gained about one attribute when the value of another is known. It is zero for independent attributes. For our purposes (classification) we introduce the conditional mutual information:
$$I(X_i; X_j \mid C) = \sum_{x_i, x_j, c} P_D(x_i, x_j, c) \log \frac{P_D(x_i, x_j \mid c)}{P_D(x_i \mid c)\, P_D(x_j \mid c)}$$
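A small sketch of the conditional mutual information estimated from empirical counts, assuming three equal-length sequences of discrete values. The function name and argument layout are chosen for this example.

```python
import math
from collections import Counter

def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X; Y | C) in nats from paired observations."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), k in n_xyc.items():
        # P(x, y | c) / (P(x | c) P(y | c)) written in terms of counts
        ratio = k * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (k / n) * math.log(ratio)
    return total

# Y copies X within each class: one full bit (log 2 nats) of shared information.
dependent = conditional_mutual_information([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1])
# X and Y independent within a single class: zero.
independent = conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0])
```

Only observed value combinations contribute to the sum, so no special handling of zero probabilities is needed.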

Construction (contd.)
Build a fully connected undirected graph with a vertex for each attribute, and set the weight of the edge between two vertices to the (conditional) mutual information of the two attributes. Now build the maximum weighted spanning tree (MaxST) of the graph. A MaxST is a spanning tree whose sum of edge weights is greater than or equal to that of any other spanning tree of the graph. Convert the undirected tree to a directed tree by choosing a root node and directing all edges outward from it. Finally, add the class as a parent of every attribute.
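The tree-building step might look like the sketch below, assuming the pairwise weights have already been computed and stored in a dict keyed by `(i, j)` with `i < j`. It uses a simple, unoptimised form of Prim's algorithm, which is a swapped-in choice for brevity; growing the tree from the chosen root directs the edges outward as a side effect. The function name `tan_tree` and the weights are hypothetical.

```python
def tan_tree(weights, n, root=0):
    """Return parent[i] for each attribute index (None for the root).

    weights[(i, j)] with i < j holds the weight of the undirected edge i-j.
    """
    def w(i, j):
        return weights[(min(i, j), max(i, j))]

    parent = {root: None}
    while len(parent) < n:
        # Prim's step: pick the heaviest edge from the tree to a new vertex
        u, v = max(
            ((u, v) for u in parent for v in range(n) if v not in parent),
            key=lambda e: w(*e),
        )
        parent[v] = u  # edge u -> v points away from the root
    return parent

# Hypothetical weights for three attributes.
weights = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.5}
tree = tan_tree(weights, n=3, root=0)
```

In the full TAN network each attribute would have the class as a parent in addition to the attribute parent returned here.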

Time complexity of the construction algorithm
The overall time complexity is $O(n^2 N)$. Computing the mutual information is $O(n^2 N)$, while constructing the maximum spanning tree is $O(n^2 \log n)$. In general $N > \log n$, so the mutual information step dominates, which gives the overall complexity stated.

Adjusting the parameters
When assigning the parameters $\theta_{x \mid \Pi_x}$ to the network we estimate conditional frequencies of the form $P_D(X \mid \Pi_X)$, where $\Pi_X$ denotes the parents of $X$. To estimate these conditional probabilities we partition the data according to the possible values of $\Pi_X$ before computing probabilities. This gives at least twice as many partitions as in Naive Bayes, which partitions on the class variable only. It reduces the reliability of the estimates where few data instances are available.

Adjusting parameters (contd.)
In order to deal with unreliable estimates due to fewer instances in a partition, introduce a smoothing factor with a bias towards the marginal probability of an attribute $X$:
$$\theta^s(x \mid \Pi_x) = \alpha \, P_D(x \mid \Pi_x) + (1 - \alpha) \, P_D(x)$$
where
$$\alpha = \frac{N \, P_D(\Pi_x)}{N \, P_D(\Pi_x) + s}$$
and $s$ is the smoothing parameter (see Dirichlet prior). Applying this to the existing TAN algorithm gives us the smoothed TAN algorithm.
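The smoothing rule can be sketched directly in terms of counts, since the number of instances times the empirical probability of a parent configuration is simply the number of instances matching that configuration. The function name, argument names, and the default value of `s` are assumptions for illustration.

```python
def smoothed_estimate(n_x_and_pa, n_pa, n_x, n_total, s=5.0):
    """Blend the conditional frequency with the marginal frequency of x.

    n_x_and_pa: instances with X = x and the given parent configuration
    n_pa:       instances with the given parent configuration
    n_x:        instances with X = x
    n_total:    all instances (N)
    s:          smoothing parameter (default is an assumed value)
    """
    p_cond = n_x_and_pa / n_pa if n_pa > 0 else 0.0
    p_marg = n_x / n_total
    alpha = n_pa / (n_pa + s)  # close to 1 when the partition is large
    return alpha * p_cond + (1 - alpha) * p_marg

# One matching instance out of one: the raw estimate would be 1.0,
# but the smoothed estimate is pulled strongly towards the marginal 0.5.
sparse = smoothed_estimate(1, 1, 50, 100, s=5.0)
# A large partition keeps the estimate close to the conditional frequency 0.9.
dense = smoothed_estimate(900, 1000, 50, 100, s=5.0)
```

With an empty partition the weight `alpha` is zero and the estimate falls back to the marginal frequency.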

Experimental results
Smoothed TAN performs at least as well as unsmoothed TAN, and in many cases better. The comparison covers Naive Bayes, unsupervised Bayesian networks, TAN, C4.5 (decision tree) and the selective naive Bayesian classifier on 22 datasets. TAN performs competitively with all other classifiers, and where it performs better it occasionally does so by a large margin. For evaluation, 5-fold cross-validation is used on the majority of the datasets.

Comparison of TAN to C4.5 and Naive Bayes

THE END Questions?