A Probabilistic Relaxation Framework for Learning Bayesian Network Structures from Data


A Probabilistic Relaxation Framework for Learning Bayesian Network Structures from Data

by Ahmed Mohammed Hassan

A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE

FACULTY OF ENGINEERING, CAIRO UNIVERSITY
April 28, 2007

Abstract

Graphical models are promising tools that can effectively model uncertainty, causal relationships, and conditional distributions among random variables. This work proposes a new probabilistic method for learning Bayesian network structures from data. In the proposed method, the existence of an edge in the network is no longer treated as a hard, deterministic decision; instead, each edge is assigned a probability of existence. The proposed method uses a global optimization approach, originally developed for clustering and classification problems, to find the set of edge probabilities that leads to the best network structure. The experimental results show that the proposed approach achieves very promising results compared to other structure learning approaches.

Acknowledgments

All praises and thanks are due to Allah, The Most Gracious, The Most Merciful, for providing me with the strength and patience to complete this work. I am very grateful to my supervisors, Dr. Amir Atiya and Dr. Ihab Talkhan, for their guidance, advice, and encouragement toward the successful completion of this work. Further thanks are due to Dr. Amir Atiya for his valuable effort, patience, and support. I would also like to thank our Department Chair, Dr. Nevin Darwish, for her helpful comments and her help with formal issues. I would also like to express my utmost gratitude to my parents for their continuous encouragement. Finally, I would like to dedicate this work to my wife and my sweet little daughter.

Contents

1 Introduction

2 An Introduction to Graphical Models and Bayesian Networks
    Introduction
    Graphical Models
    Bayesian Networks
    An Introductory Example
    Bayesian Networks Learning
    Parameter Learning
    Full Observability
    Partial Observability
    Structure Learning
    Scoring Methods
    Search Algorithms
    Unknown Structure Partial Observability
    Inference
    Exact Inference
    Variable Elimination
    Pearl's Belief Propagation
    The Junction Tree Algorithm
    2.6.2 Approximate Inference
    Sampling (Monte Carlo) methods
    Variational methods
    Loopy belief propagation
    A Case Study
    Areas of Applications for Bayesian Networks
    Computer Troubleshooting
    Medical Diagnosis
    Computer Vision
    Agriculture
    Information Processing
    Summary

3 An Overview of Probabilistic Relaxation Methods
    Introduction
    Relaxation Labeling
    Deterministic Annealing
    Deterministic Annealing for Clustering
    Derivation of Deterministic Annealing
    Summary

4 Structure Learning Approaches
    Introduction
    Constraint Based Methods
    A Low Order Independence Tests Based Approach
    A Mutual Information Based Approach
    Search and Score Methods
    4.3.1 The K2 Algorithm
    A Genetic Algorithms Based Approach
    An Evolutionary Programming Based Approach
    A Simulated Annealing Based Approach
    The Sparse Candidate Algorithm
    The Ant Colonies Algorithm
    Summary

5 A Probabilistic Framework for Learning Bayesian Networks
    Introduction
    Summary of the Approach
    Restricting the Search Space
    Representation
    Local Search Algorithm
    Evaluating PDAGs
    Caching
    Calculating Entropy
    An Illustrative Example
    Summary

6 Experimental Evaluation
    Introduction
    Datasets
    Comparisons
    Performance Measures
    Results
    Summary

7 Conclusion and Future Work
    Conclusion
    Future Work

Bibliography

List of Figures

2.1 A Directed Graphical Model (probabilities omitted)
2.2 An Undirected Graphical Model (probabilities omitted)
2.3 A Bayesian network for the credit card fraud problem
2.4 A Bayesian network for the credit card fraud problem (with probabilities)
2.5 The Sprinkler Network
The most likely network structures without hidden variables
The most likely network structures with hidden variables
Example of a Bayesian network to illustrate genetic representation
Summary of approach
The effect of varying β on the optimization process
The relation between entropy and iterations
Varying the edge probability distribution against β
An illustrative example (step 1)
An illustrative example (step 2)
An illustrative example (step 3)
An illustrative example (step 4)
An illustrative example (step 5)
An illustrative example (final structure)
6.1 The ASIA network
The ALARM network
The INSURANCE network
Comparison between the K2 and the proposed method for the ASIA network
Comparison between the K2 and the proposed method for the ALARM network
Comparison between the K2 and the proposed method for the INSURANCE network

List of Algorithms

1 Outline of the DA method
Description of the algorithm presented in [13]
The drifting phase of the method in [5]
The thickening phase of the method in [5]
The thickening phase of the method in [5]
The K2 Algorithm
An outline of the GA algorithm for learning Bayesian networks
TLSA for learning Bayesian networks
Outline of the Sparse Candidate Algorithm
Outline of the ant colonies algorithm for learning Bayesian networks
The probabilistic relaxation approach to learning Bayesian networks
The local search algorithm
The algorithm of scoring PDAGs

List of Tables

6.1 Results for the original graphs of ASIA, INSURANCE, and ALARM networks
Results for the ASIA network
Results for the INSURANCE network
Results for the ALARM network

Chapter 1 Introduction

Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering, uncertainty and complexity, and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms [33]. There are two kinds of graphical models: undirected graphical models, also known as Markov Random Fields (MRFs), and directed graphical models, also known as Bayesian Networks (BNs).

Bayesian networks are a graphical representation of a multivariate joint probability distribution that exploits the dependency structure of distributions to describe them in a compact and natural manner [46]. A BN is a directed acyclic graph in which nodes correspond to domain variables and edges correspond to direct probabilistic dependencies between them. The network structure represents a set of conditional independence assertions about the distribution. Informally, an edge from a variable A to another variable B can be read as indicating that A causes B. The conditional independence assertions encoded in the network structure are the key to the ability of Bayesian networks to provide a general-purpose representation for complex probability distributions. However, designing BNs poses many challenges. In the majority of applications, the BN has to be learned from data, which is a hard and computationally demanding problem. Learning Bayesian networks from data has two main constituents: learning the network structure, and learning the network parameters given its structure.

Recent years have witnessed an ever increasing interest in the automatic induction of Bayesian network structures from data. There are two main approaches for learning the structure of Bayesian networks. The first poses learning as a constraint-based problem. In this approach, the properties of conditional independence among attributes are estimated using several statistical tests. The second approach poses learning as an optimization problem where several standard heuristic search techniques, such as greedy hill-climbing and simulated annealing, are utilized to find high-scoring structures according to some structure fitness measure. Such local search procedures may sometimes work well, yet they often get stuck in a local maximum rather than finding a global one.

Probabilistic relaxation methods have been successfully applied in different areas of computation. For example, the relaxation labeling technique [47] has been used for solving several computer vision problems, and the deterministic annealing approach [49] has been used for solving clustering problems. Deterministic annealing (DA) is derived within a probabilistic framework from basic information theoretic principles (e.g., maximum entropy and random coding). The application-specific cost is minimized subject to a constraint on the randomness (Shannon entropy) of the solution, which is gradually lowered. DA is able to escape local minima; at the same time, it is a deterministic method that is guaranteed to quickly reach a global minimum on the surface of the cost function without randomly wandering in the search space.

In this work, we propose an approach for the automatic induction of Bayesian network structures from data. The approach poses the problem in a probabilistic framework: the existence of an edge is not treated as a hard 0/1 decision; rather, we assign a probability p to the existence of each edge. The approach depends on maximizing a network scoring function subject to a constraint on the randomness (Shannon entropy) of the solution. It proceeds iteratively while gradually lowering the level of randomness until it converges to a global solution, making it immune to getting stuck at a local maximum. The approach presents a hybrid two-tier solution that restricts the network space by using a node order or by employing some statistical dependence measures. It proceeds iteratively, finding a better solution at different levels of entropy until it converges to a global solution. The proposed framework bears similarity to the relaxation labeling approach and the deterministic annealing approach, and borrows several ideas from the latter to guide the search process.

The rest of the thesis is organized as follows. In Chapter 2, we present an introduction to the basic concepts of graphical models and Bayesian networks. We first give a brief introduction to the main concepts. Then, we discuss the most popular methods for learning and inference in Bayesian networks. Finally, we present a case study and elaborate on the main areas of application for Bayesian networks. In Chapter 3, we present a review of probabilistic relaxation techniques. We present the relaxation labeling method and the deterministic annealing approach. We further elaborate on how the deterministic annealing approach was proposed as an optimization method for clustering problems, and on its derivation. Chapter 4 presents an overview of different structure learning approaches. Different categories of learning methods are explained and several examples are discussed. The motivation and details of the proposed approach are presented in Chapter 5. Results and details of the experimental environment are presented in Chapter 6. Finally, the discussion and conclusion are presented in Chapter 7.

Chapter 2 An Introduction to Graphical Models and Bayesian Networks

2.1 Introduction

Graphical models [33, 32] are a combination of graph theory and probability theory. They are very useful for dealing with uncertainty. Graphical models can be either directed or undirected. Directed graphical models are also called Bayesian networks, while undirected graphical models are called Markov random fields. This chapter presents an introduction to the important concepts of graphical models in general, and Bayesian networks in particular. Then, an overview of the learning and inference techniques is given. This is followed by a case study and an overview of the areas of application of Bayesian networks. The chapter concludes with a brief summary.

2.2 Graphical Models

The following quotation, from [33], provides a very concise introduction to graphical models.

"Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering, uncertainty and complexity, and in particular they are playing an increasingly important role in the design and analysis of machine learning algorithms. Fundamental to the idea of a graphical model is the notion of modularity - a complex system is built by combining simpler parts. Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data. The graph theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly-interacting sets of variables as well as a data structure that lends itself naturally to the design of efficient general-purpose algorithms. Many of the classical multivariate probabilistic systems studied in fields such as statistics, system engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism - examples include mixture models, factor analysis, hidden Markov models, Kalman filters and Ising models. The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism. This view has many advantages - in particular, specialized techniques that have been developed in one field can be transferred between research communities and exploited more widely. Moreover, the graphical model formalism provides a natural framework for the design of new systems."

A graphical model is a method of illustrating and representing conditional independencies among a set of variables. Two variables are conditionally independent if, once the conditioning variables are known, neither has any further impact on the other's value. For example, A is conditionally independent of C given B if $P(A \mid B, C) = P(A \mid B)$. Graphical models are graphs in which nodes represent random variables, and the lack of arcs represents conditional independence assumptions.

Undirected graphical models, also called Markov Random Fields (MRFs), have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given a third set, C, if all paths between the nodes in A and B are separated by a node in C. By contrast, directed graphical models (which cannot have directed cycles), also called Bayesian Networks or Belief Networks (BNs), have a more complicated notion of independence, which takes into account the directionality of the arcs.
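The separation criterion for undirected models can be checked mechanically with a graph search. The following is a minimal sketch, not part of the original text: the function name `separated`, the adjacency-dictionary representation, and the small example graph are all assumptions made for illustration.

```python
from collections import deque


def separated(adj, a_nodes, b_nodes, c_nodes):
    """Return True if every path from a_nodes to b_nodes passes through c_nodes.

    adj maps each node to a list of its neighbours (undirected graph).
    """
    blocked = set(c_nodes)
    targets = set(b_nodes) - blocked
    start = [n for n in a_nodes if n not in blocked]
    seen = set(start)
    queue = deque(start)
    # Breadth-first search that is never allowed to step onto a node in C.
    while queue:
        node = queue.popleft()
        if node in targets:
            return False  # found a path from A to B that avoids C
        for neighbour in adj.get(node, ()):
            if neighbour not in blocked and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return True


# Hypothetical graph: a chain A - B - C with an extra edge B - D.
adj = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
print(separated(adj, ["A"], ["C"], ["B"]))  # conditioning on B
print(separated(adj, ["A"], ["C"], ["D"]))  # conditioning on D
```

In this hypothetical graph, conditioning on B blocks the only path between A and C, while conditioning on D does not.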

Figure 2.1: A Directed Graphical Model (probabilities omitted)

Figure 2.2: An Undirected Graphical Model (probabilities omitted)

The notion of independence in directed graphical models is more complicated than in undirected models, yet directed models do have several advantages. The most important advantage is that one can regard an arc from A to B as indicating that A causes B. This makes directed models capable of learning causal relationships, and hence they can be used to gain understanding about a problem domain. A graphical model consists of a set of nodes $N = \{1, 2, \ldots, n\}$ representing random variables, a set of edges E encoding dependencies between variables, and a set P of probability distribution functions, one for each variable. Figures 2.1 and 2.2 illustrate examples of a directed and an undirected graph, respectively.

2.3 Bayesian Networks

A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. Bayesian networks are a specific type of graphical model represented by a directed acyclic graph (DAG). A DAG is a graph in which all edges are directed and which has no cycles (i.e., there is no way to start from any node, travel along a set of directed edges in the correct direction, and arrive back at the starting node). The Bayesian network illustrated in figure 2.1 consists of a set of nodes $N = \{A, B, C\}$ and a set of edges $E = \{(B, A), (B, C)\}$. The edges in this network encode a particular factorization of the joint distribution of the random variables. We can see from the network that A depends only on B, and hence $P(A \mid B, C) = P(A \mid B)$. Likewise, C depends only on B, and $P(C \mid A, B) = P(C \mid B)$. B has no parents, so it contributes its marginal $P(B)$. Hence the joint probability distribution can be expressed as follows:

$$P(A, B, C) = P(A \mid B)\, P(B)\, P(C \mid B) \tag{2.1}$$

To generalize, for any set of random variables $X_i$, $i = 1, \ldots, n$, the joint probability distribution over all variables is:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid Pa(X_i)) \tag{2.2}$$

where $Pa(X_i)$ are the parents of node $X_i$.

2.4 An Introductory Example

A Bayesian network for a set of variables $X = \{X_1, \ldots, X_n\}$ consists of a network structure S that encodes a set of conditional independence assertions about the variables in X, and a set P of local probability distributions associated with each variable. Together, those two components represent the joint probability distribution for X. The network structure S is a directed acyclic graph. The nodes in S are in one-to-one correspondence with the variables X. $X_i$ is used to denote both the variable and its corresponding node, and $Pa_i$ is used to denote the parents of node $X_i$ in S as well as the variables corresponding to those parents.

[25] outlines an example to illustrate how to construct a Bayesian network for the problem of detecting credit card fraud. The first step in building a Bayesian network is determining the variables that will be modeled. One possible choice of variables for our problem is Fraud (F), Gas (G), Jewelry (J), Age (A), and Sex (S). The variables represent the following:

- Fraud: whether or not the current purchase is fraudulent,
- Gas: whether or not there was a gas purchase in the last 24 hours,
- Jewelry: whether or not there was a jewelry purchase in the last 24 hours,
- Age: the age of the card holder, and
- Sex: the sex of the card holder.

The states of the variables are:

- Fraud: Yes - No
- Gas: Yes - No
- Jewelry: Yes - No
- Age: < >50
- Sex: male - female

There are several tasks that have to be addressed during the initial stage of building a Bayesian network. Those tasks are not unique to Bayesian networks, but rather are common to several other approaches. [25] summarizes those tasks in the following points:

1. Identifying the goals of modeling.
2. Identifying possible observations that might be relevant to the problem.
3. Determining the subset of those observations that would be used for modeling.
4. Deducing a set of mutually exclusive variables from the observations.

Figure 2.3: A Bayesian network for the credit card fraud problem

In the next step towards constructing a Bayesian network, we build a directed acyclic graph encoding the assertions of conditional independence between variables. The structure of a Bayesian network may be learned from data or deduced from prior knowledge. Several approaches to finding the structure of a Bayesian network require a node ordering. All the nodes of the network should be ordered such that each node's parents precede it in the node ordering. Using the node ordering (F, A, S, G, J) and prior knowledge about the problem, we reach the following conditional independencies:

$$\begin{aligned} p(a \mid f) &= p(a) \\ p(s \mid f, a) &= p(s) \\ p(g \mid f, a, s) &= p(g \mid f) \\ p(j \mid f, a, s, g) &= p(j \mid f, a, s) \end{aligned} \tag{2.3}$$
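The independencies in (2.3) fix the parent set of every node, and the factorization of the joint distribution then follows from (2.2). The sketch below is an illustration only: the helper names `check_ordering` and `factorization` are hypothetical, and the parent sets are read off (2.3) (F, A and S have no parents, G has parent F, and J has parents F, A and S).

```python
def check_ordering(order, parents):
    """True if every node's parents precede it in the node ordering."""
    position = {v: i for i, v in enumerate(order)}
    return all(position[p] < position[v] for v in order for p in parents[v])


def factorization(order, parents):
    """Write the joint distribution as a product of local terms, as in (2.2)."""
    terms = []
    for v in order:
        pa = parents[v]
        terms.append(f"p({v}|{','.join(pa)})" if pa else f"p({v})")
    return " ".join(terms)


# Node ordering (F, A, S, G, J) and the parent sets implied by (2.3).
order = ["f", "a", "s", "g", "j"]
parents = {"f": [], "a": [], "s": [], "g": ["f"], "j": ["f", "a", "s"]}

print(check_ordering(order, parents))
print(factorization(order, parents))  # p(f) p(a) p(s) p(g|f) p(j|f,a,s)
```

The printed product is the joint distribution p(f, a, s, g, j) written in terms of the five local distributions that remain to be assessed.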

Figure 2.4: A Bayesian network for the credit card fraud problem (with probabilities)

Thus we obtain the structure illustrated in figure 2.3. The last step of constructing a Bayesian network is assessing the local probability distributions $p(x_i \mid pa_i)$. In the credit card fraud example, all variables are discrete. Hence, one distribution for $X_i$ needs to be assessed for every configuration of $Pa_i$. The credit card fraud Bayesian network after assessing local probabilities is illustrated in figure 2.4.

2.5 Bayesian Networks Learning

A Bayesian network B is fully described by the graph structure (topology) of the network and the parameters of each Conditional Probability Distribution (CPD). Both the network structure and the parameters can be learned from data, although one of them, typically the network structure, may be given. Learning the network structure is in general harder than learning its parameters, mainly due to the very large space of structures that one has to explore [4]. Another distinction is whether the data is fully observable or not. If the data is not fully observable (i.e., part of the data is missing), the learning process becomes harder. Those distinctions give rise to four cases:

- Known structure, full observability,
- Known structure, partial observability,
- Unknown structure, full observability, and
- Unknown structure, partial observability.

Each of the cases above gives rise to a specific learning algorithm. The known structure, full observability case has a closed form solution using Maximum Likelihood Estimation. The known structure, partial observability case is usually solved using the Expectation Maximization algorithm. In the unknown structure, full observability case, a search through the structure space is conducted to find the best structure; the parameters are then learned as in the known structure, full observability case. The last case, which deals with unknown structure and partial observability, is solved with the Structural Expectation Maximization algorithm, or with Expectation Maximization coupled with a search in the model space.

In the following subsections, we briefly elaborate on the different methods of learning Bayesian networks. In section 2.5.1, we discuss the parameter learning algorithms in the case of known structure. In section 2.5.2, we discuss the structure learning procedure given full data. Finally, the case of unknown structure and partial observability is discussed in section 2.5.3.

2.5.1 Parameter Learning

Full Observability

In this section, we show how to learn the local probability distributions of a Bayesian network given data. We also assume that the structure is already known. Hence, the goal of learning in this case is to find the values of the parameters of each CPD which maximize the likelihood of the training data, which contains M cases (assumed to be independent). The normalized log-likelihood of the training set D is a sum of terms, one for each node:

$$L = \frac{1}{M} \sum_{m=1}^{M} \log \Pr(D_m \mid G) = \frac{1}{M} \sum_{i=1}^{n} \sum_{m=1}^{M} \log P(X_i \mid Pa(X_i), D_m) \tag{2.4}$$

where $Pa(X_i)$ are the parents of $X_i$. We notice that the log-likelihood scoring function decomposes according to the structure of the graph. This decomposition allows us to maximize the contribution of each node to the log-likelihood independently. This holds only under the assumption that the parameters of each node are independent of those of the other nodes.

Consider the water sprinkler network illustrated in figure 2.5. If we want to estimate the conditional probability table at node W using a set of training data, we can simply count the number of times the grass is wet when it is raining and the sprinkler is on, #(W = 1, S = 1, R = 1), the number of times the grass is wet when it is raining and the sprinkler is off, #(W = 1, S = 0, R = 1), etc. Given these counts (which are the sufficient statistics for a multinomial distribution), we can find the maximum likelihood estimate of the conditional probability table as follows:

$$P_{ML}(W = w \mid S = s, R = r) = \frac{\#(W = w, S = s, R = r)}{\#(S = s, R = r)} \tag{2.5}$$

where the denominator is #(S = s, R = r) = #(W = 0, S = s, R = r) + #(W = 1, S = s, R = r). If a certain event does not occur in the training set, it will get a probability of 0. This can be avoided by using a Dirichlet prior. To use a Dirichlet prior, we just need to add pseudo counts to the original counts. For example, the MAP estimate for W when using a uniform Dirichlet prior becomes:

$$P_{MAP}(W = w \mid S = s, R = r) = \frac{\#(W = w, S = s, R = r) + 1}{\#(S = s, R = r) + 2} \tag{2.6}$$
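The counting estimates (2.5) and (2.6) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `estimate_cpt`, the `pseudo_count` parameter, and the toy data set are hypothetical and do not come from the original text. Setting `pseudo_count=0` gives the maximum likelihood estimate, and `pseudo_count=1` gives the estimate with a uniform Dirichlet prior for binary W.

```python
from collections import Counter
from itertools import product


def estimate_cpt(samples, pseudo_count=0.0):
    """Estimate P(W = w | S = s, R = r) from (w, s, r) tuples of 0/1 values."""
    joint = Counter(samples)  # joint[(w, s, r)] = #(W = w, S = s, R = r)
    cpt = {}
    for s, r in product((0, 1), repeat=2):
        parent_total = joint[(0, s, r)] + joint[(1, s, r)]  # #(S = s, R = r)
        denom = parent_total + 2 * pseudo_count             # W has two states
        if denom == 0:
            continue  # the ML estimate is undefined for unseen parent states
        for w in (0, 1):
            cpt[(w, s, r)] = (joint[(w, s, r)] + pseudo_count) / denom
    return cpt


# Hypothetical training set of (w, s, r) observations.
data = ([(1, 1, 1)] * 3 + [(1, 1, 0)] * 2 + [(0, 1, 0)] * 2
        + [(1, 0, 1)] * 4 + [(0, 0, 0)] * 5)

ml = estimate_cpt(data)                     # counts only, as in (2.5)
smoothed = estimate_cpt(data, pseudo_count=1)  # pseudo counts, as in (2.6)
print(ml[(0, 1, 1)], smoothed[(0, 1, 1)])
```

In this toy data set the event (W = 0, S = 1, R = 1) never occurs, so the count-only estimate assigns it probability 0, while the estimate with pseudo counts assigns it 1/5.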

Partial Observability

Figure 2.5: The Sprinkler Network

In the partial observability case [7, 14, 17, 19], some of the nodes are hidden or not observed. We still assume that the structure is already known. The structure may be given, or learned using the methods mentioned in section 2.5.2. Finding the values of the parameters of each CPD (Conditional Probability Distribution) which maximize the likelihood of the training data, as in the previous section, is no longer appropriate because of the missing data. Exact computation of the posterior distribution for the parameters is intractable. Thus, an approximation method has to be used for incomplete data. Several approximation methods may be used to solve this problem, including Monte Carlo methods, Gaussian approximation, and Expectation Maximization.

One class of approximation methods is based on Monte Carlo methods. Monte Carlo methods are accurate, yet they may need a long time to converge. One commonly used Monte Carlo method is Gibbs sampling. To approximate the expectation of some function f(x), we do the following:

1. Choose an initial random state for all the variables in X.
2. Pick some variable $x_i$, and unassign its state.
3. Compute the probability distribution of $x_i$ given the states of the rest of the variables.
4. Sample a state for $x_i$ based on the computed distribution.
5. Repeat the previous steps until convergence.

Another approximation that is commonly used, and that is more efficient, is the Gaussian approximation [35]. The idea behind the Gaussian approximation is that, for large amounts of data, the posterior probability of the parameters given the data can be approximated by a multivariate Gaussian distribution.

The EM (Expectation Maximization) algorithm can also be used to find a locally optimal Maximum Likelihood Estimate of the parameters. The main idea behind using EM is that if we knew the values of all nodes, learning would be straightforward, as in the previous section. Hence, in the Expectation step, we compute the expected values of all the nodes using an inference algorithm. Then, in the Maximization step, we treat the expected values as if they were observed, and go on with learning the parameters. In other words, we replace the observed counts of the events with the number of times we expect to see each event. Consider the water sprinkler problem as an example. To estimate the probability of the variable W taking the value w, given that the variables S and R took the values s and r respectively, we write:

$$P(W = w \mid S = s, R = r) = \frac{E\#(W = w, S = s, R = r)}{E\#(S = s, R = r)} \tag{2.7}$$

where E#(e) is the expected number of times event e occurs in the whole training set, given the current guess of the parameters. The EM algorithm proceeds by computing the expected counts, maximizing the parameters given the counts, then recomputing the expected counts, and so on. This iterative procedure is guaranteed to converge to a local maximum of the likelihood surface.
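The Gibbs sampling steps listed earlier in this section can be sketched on the sprinkler network. The sketch below is only an illustration under stated assumptions: it samples the hidden nodes with fixed parameters rather than the parameter posterior, the node C (cloudy) and all conditional probability values are hypothetical choices made for this example and are not taken from the text, and the function names are invented. An exact enumeration is included for comparison.

```python
import random

# Assumed (hypothetical) parameters for a network C -> S, C -> R, (S, R) -> W.
P_C = 0.5                                   # P(C = 1)
P_S = {0: 0.5, 1: 0.1}                      # P(S = 1 | C = c)
P_R = {0: 0.2, 1: 0.8}                      # P(R = 1 | C = c)
P_W = {(0, 0): 0.0, (1, 0): 0.9,
       (0, 1): 0.9, (1, 1): 0.99}           # P(W = 1 | S = s, R = r)


def bern(p, x):
    """Probability that a Bernoulli(p) variable takes the value x."""
    return p if x == 1 else 1.0 - p


def joint(c, s, r, w):
    """Joint probability from the factorization over parents."""
    return (bern(P_C, c) * bern(P_S[c], s) * bern(P_R[c], r)
            * bern(P_W[(s, r)], w))


def gibbs_posterior_rain(n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(R = 1 | W = 1) by Gibbs sampling over C, S and R."""
    rng = random.Random(seed)
    state = {"c": 1, "s": 1, "r": 1, "w": 1}  # w is observed, never resampled
    rain_hits = 0
    for sweep in range(n_sweeps + burn_in):
        for var in ("c", "s", "r"):
            # Distribution of one variable given all the others is
            # proportional to the joint with the others held fixed.
            weights = []
            for value in (0, 1):
                state[var] = value
                weights.append(joint(state["c"], state["s"],
                                     state["r"], state["w"]))
            p_one = weights[1] / (weights[0] + weights[1])
            state[var] = 1 if rng.random() < p_one else 0
        if sweep >= burn_in:
            rain_hits += state["r"]
    return rain_hits / n_sweeps


def exact_posterior_rain():
    """P(R = 1 | W = 1) by summing the joint over the hidden variables."""
    num = sum(joint(c, s, 1, 1) for c in (0, 1) for s in (0, 1))
    den = sum(joint(c, s, r, 1)
              for c in (0, 1) for s in (0, 1) for r in (0, 1))
    return num / den


estimate = gibbs_posterior_rain()
exact = exact_posterior_rain()
print(estimate, exact)
```

With enough sweeps the sampled frequency should approach the enumerated value; the number of sweeps and the burn-in length used here are arbitrary choices.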

2.5.2 Structure Learning

A Bayesian network B over a set of random variables $X_i$, $i = 1, \ldots, n$, is a directed acyclic graph that represents the joint probability distribution over all the $X_i$:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid Pa(X_i)) \tag{2.8}$$

where $Pa(X_i)$ are the parents of node $X_i$. The Bayesian network B is a pair $\langle G, \theta \rangle$. G is a directed acyclic graph $G = \langle V, E \rangle$, where the set of nodes $V = \{X_1, X_2, X_3, \ldots, X_n\}$ represents the random variables, and E is the set of edges encoding the dependence relations among the variables. $\theta$ represents a set of parameters for the variables in V that defines their conditional probability distributions. For each variable $X_i \in V$, we have a family of conditional distributions $P(X_i \mid Pa(X_i))$. Those conditional distributions allow us to recover the joint distribution over V. A well-known example of a Bayesian network is illustrated in figure 2.5.

The problem of learning a Bayesian network structure can be stated as follows: given a training set T of instances of $(X_1, \ldots, X_n)$, find a DAG G that best matches T. A common approach is to introduce a scoring function that measures how well the DAG fits the data, and to use it to search for the best network. The most commonly used scoring functions are the Bayesian scoring metric and the Minimal Description Length (MDL) metric. These metrics are described in detail in [8] and [38], respectively.

There are two main approaches for learning the structure of Bayesian networks. The first poses learning as a constraint satisfaction problem. In this approach, the properties of conditional independence among attributes are estimated using several statistical tests. The second approach poses learning as an optimization problem where several standard heuristic search techniques, such as greedy hill-climbing and simulated annealing, are utilized to find high-scoring structures according to some structure fitness measure. Such local search procedures may sometimes work well, yet they usually get stuck in a local maximum rather than finding a global one.

Scoring Methods

The Bayesian metric, also called the K2 metric, is based on applying Bayesian principles. It assumes that a complete database $D$ of sample cases over a set of attributes that accurately models the network is given. Given a dataset generated according to a Bayesian network, we compute from the network structure $B_S$ and a set of conditional probabilities $B_P$ the probability of the dataset, i.e., we compute $P(D \mid B_S, B_P)$. Integrating over all possible sets of conditional probabilities $B_P$ for the given structure $B_S$, we get $P(B_S, D)$:

$$P(B_S, D) = \int_{B_P} P(D \mid B_S, B_P)\, P(B_S)\, dB_P \tag{2.9}$$

This expression can be rewritten as:

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{2.10}$$

where $r_i$ is the number of possible values of the variable $x_i$, $q_i$ is the number of possible configurations (instantiations) of the variables in $Pa(x_i)$, $N_{ijk}$ is the number of cases in $D$ in which variable $x_i$ has its $k$th value and $Pa(x_i)$ is instantiated to its $j$th value, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. Assuming a uniform prior for $P(B_S)$ and using $\log P(B_S, D)$ instead of $P(B_S, D)$, we get the K2 metric:

$$F_B(B_S : D) = \sum_{i=1}^{n} F_B\big(x_i, Pa(x_i) : N_{x_i, Pa(x_i)}\big) \tag{2.11}$$

where $N_{x_i, Pa(x_i)}$ are the statistics of the variable $x_i$ and $Pa(x_i)$ in $D$, and

$$F_B\big(x_i, Pa(x_i) : N_{x_i, Pa(x_i)}\big) = \sum_{j=1}^{q_i} \left( \log \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} + \sum_{k=1}^{r_i} \log(N_{ijk}!) \right) \tag{2.12}$$

The MDL metric is based on the MDL principle, which holds that the best model for a set of data items is the one that minimizes the sum of the length of the encoding of the model and the length of the encoding of the data given the model. To apply this principle to Bayesian networks, we need to specify how the network itself and the raw data given the network are encoded. A simple approximation of the MDL metric, which is also equivalent to the Bayesian information criterion (BIC), is:

$$L(B_S, D) = \log P(B_S) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{1}{2} \sum_{i=1}^{n} q_i (r_i - 1) \log N \tag{2.13}$$

where $N$ is the number of cases in $D$ and the variables $r_i$, $x_i$, $q_i$, $Pa(x_i)$, $N_{ijk}$, and $N_{ij}$ are as defined above. Assuming a uniform prior for $P(B_S)$, we get:

$$L(B_S, D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{1}{2} \sum_{i=1}^{n} q_i (r_i - 1) \log N \tag{2.14}$$

In the rest of this section, we briefly elaborate on some of the details of the search-and-score methods of learning Bayesian network structures from data.

Search Algorithms

Having defined a scoring function, we still need a way to search for the highest scoring graph. Unfortunately, the number of DAGs on $n$ variables is superexponential in $n$. (There is no closed form formula for this, but to give an idea, there are 543 DAGs on 4 nodes, and $O(10^{18})$ DAGs on 10 nodes.) The usual approach is therefore to use local search algorithms (e.g., greedy hill climbing, possibly with multiple restarts) or perhaps branch and bound techniques to search through the space of graphs. Algorithms must also make sure that the graphs have no directed cycles. The fact that the scoring function is a product of local terms makes local search more efficient: to compute the relative score of two models that differ by only a few arcs (i.e., neighbors in the space), it is only necessary to compute the terms which they do not have in common, since the other terms cancel each other when taking the ratio.

Unknown Structure, Partial Observability

The final and hardest case is the one with unknown structure and partial observability. In this case we not only do not have the network structure, but we also have hidden nodes and/or missing data. The main problem here is that the marginal likelihood is intractable, hence direct methods do not work well. One approach to handle this is to use a Laplace approximation to the posterior of the parameters [7], which leads to the following expression:

$$\log P(D \mid G) \approx \log P(D \mid G, \hat{\theta}_G) - \frac{d}{2} \log M \tag{2.15}$$

where $M$ is the number of samples, $\hat{\theta}_G$ is the ML estimate of the parameters (computed using EM), and $d$ is the dimension of the model. This is called the Bayesian Information Criterion (BIC), and is equivalent to the Minimum Description Length (MDL) approach. The first term is just the likelihood and the second term is a penalty for model complexity. BIC is a decomposable score, which makes local search algorithms such as hill climbing applicable. The problem with using such local search is computational complexity, as we have to run EM at each step to compute $\hat{\theta}_G$. [18] proposes a method for learning both the parameters and the structure from the data. This algorithm is called the Structural Expectation Maximization (SEM) algorithm. SEM suggests doing the local search steps inside the M step of EM. [43] give a summary of the structural EM algorithm as follows:

1. add a new node to the network representing a hidden variable,
2. for this given set of nodes, find the best network connections,
3. continue as long as the network keeps improving.
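To make the two decomposable scores of this section concrete, the following minimal sketch computes the per-variable K2 term of Eq. (2.12) and the per-variable BIC/MDL term of Eq. (2.14) from a table of counts $N_{ijk}$. The function names, the count layout (one row per parent configuration $j$, one column per value $k$), the toy counts, and the use of natural logarithms are assumptions made for illustration only.

```python
import math


def k2_family_score(counts):
    """Log K2 term of one variable, as in Eq. (2.12).

    counts[j][k] holds N_ijk: the number of cases in which the parents
    take their j-th configuration and the variable takes its k-th value.
    """
    score = 0.0
    for row in counts:
        r_i = len(row)
        n_ij = sum(row)
        # log((r_i - 1)! / (N_ij + r_i - 1)!), using lgamma(n + 1) = log(n!)
        score += math.lgamma(r_i) - math.lgamma(n_ij + r_i)
        score += sum(math.lgamma(n_ijk + 1) for n_ijk in row)
    return score


def bic_family_score(counts, n_cases):
    """BIC/MDL term of one variable, as in Eq. (2.14)."""
    q_i = len(counts)
    r_i = len(counts[0])
    log_likelihood = 0.0
    for row in counts:
        n_ij = sum(row)
        for n_ijk in row:
            if n_ijk > 0:  # the term 0 * log(0) is taken to be 0
                log_likelihood += n_ijk * math.log(n_ijk / n_ij)
    penalty = 0.5 * q_i * (r_i - 1) * math.log(n_cases)
    return log_likelihood - penalty


# Hypothetical counts for a binary variable over 20 cases:
# scored once with a binary parent and once with no parent.
with_parent = [[8, 2], [1, 9]]
without_parent = [[9, 11]]

print(k2_family_score(with_parent), k2_family_score(without_parent))
print(bic_family_score(with_parent, 20), bic_family_score(without_parent, 20))
```

Because both scores are sums of such per-variable terms, a local search step that changes the parents of one variable only needs to recompute that variable's term.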

2.6 Inference

After constructing the Bayesian network from prior knowledge, data, or both, we are usually interested in determining various probabilities from the model. The main goal of inference is to estimate the values of hidden nodes, given the values of the observed nodes. This is usually used for either diagnosis or prediction. In diagnosis, we observe the effects and try to infer the hidden causes. In prediction, we observe the causes and try to predict the effects [45, 28]. For example, consider the water sprinkler network in figure 2.5, and suppose we observe the fact that the grass is wet, which we denote by $W = 1$. There are two possible causes for this: either it is raining, or the sprinkler is on. To determine which of the two causes is more likely, we can use Bayes rule to compute the posterior probability of each of them. Bayes rule states that:

$$\text{posterior} = \frac{\text{conditional likelihood} \times \text{prior}}{\text{likelihood}} \tag{2.16}$$

For the water sprinkler example, this may be rewritten as:

$$P(S = 1 \mid W = 1) = \frac{P(S = 1, W = 1)}{P(W = 1)} = \frac{\sum_{c, r} P(C = c, S = 1, R = r, W = 1)}{P(W = 1)} \tag{2.17}$$

and

$$P(R = 1 \mid W = 1) = \frac{P(R = 1, W = 1)}{P(W = 1)} = \frac{\sum_{c, s} P(C = c, S = s, R = 1, W = 1)}{P(W = 1)} \tag{2.18}$$

where

$$P(W = 1) = \sum_{c, s, r} P(C = c, S = s, R = r, W = 1) \tag{2.19}$$

is a normalization constant, equal to the likelihood of the data. Evaluating these expressions with the network's conditional probability tables shows that it is more likely that the grass is wet because it is raining than because of the sprinkler.

Exact Inference

Computing posterior estimates by direct application of Bayes rule is computationally intractable, especially when the number of variables is large or the variables are continuous. Several researchers have developed probabilistic inference algorithms for Bayesian networks that use the conditional independence assumptions encoded in the graph to speed up exact inference.

Variable Elimination

The key idea behind the variable elimination algorithm is to push the sums in as far as possible. Consider the water sprinkler network again. To calculate $P(W = w)$, we may write:

$$P(W = w) = \sum_{c} \sum_{s} \sum_{r} P(C = c, S = s, R = r, W = w) \tag{2.20}$$

$$P(W = w) = \sum_{c} \sum_{s} \sum_{r} P(C = c)\, P(S = s \mid C = c)\, P(R = r \mid C = c)\, P(W = w \mid S = s, R = r) \tag{2.21}$$

If we push the sums in as far as we can, we get:

$$P(W = w) = \sum_{c} P(C = c) \sum_{s} P(S = s \mid C = c) \sum_{r} P(R = r \mid C = c)\, P(W = w \mid S = s, R = r) \tag{2.22}$$

When we perform the innermost sum, we get:

$$P(W = w) = \sum_{c} P(C = c) \sum_{s} P(S = s \mid C = c)\, T_1(c, w, s) \tag{2.23}$$

where

$$T_1(c, w, s) = \sum_{r} P(R = r \mid C = c)\, P(W = w \mid S = s, R = r) \tag{2.24}$$

Performing the second sum, we get:

$$P(W = w) = \sum_{c} P(C = c)\, T_2(c, w) \tag{2.25}$$

where

$$T_2(c, w) = \sum_{s} P(S = s \mid C = c)\, T_1(c, w, s) \tag{2.26}$$

The amount of work is governed by the size of the largest intermediate term (such as $T_1$) created along the way. Choosing a summation (elimination) ordering to minimize this size is NP-hard, although greedy algorithms work well in practice.

Pearl's Belief Propagation

Belief propagation is the action of updating the beliefs in each variable when observations are given for some other variables [46]. Belief propagation proceeds as follows. Let $e$ be the set of values for all observed variables. For any variable $X$, $e$ can be split into two subsets: $e_X^-$, which represents all of the observed variables that are descendants of $X$, and $e_X^+$, which represents all of the other observed variables. The impact of the observed variables on $X$ can be represented by the following two values:

$$\lambda(X) = P(e_X^- \mid X) \tag{2.27}$$

$$\pi(X) = P(X \mid e_X^+) \tag{2.28}$$

$\lambda(X)$ and $\pi(X)$ are actually vectors whose elements are associated with each of the discrete values of $X$:

$$\lambda(X) = [\lambda(X = x_1), \lambda(X = x_2), \ldots, \lambda(X = x_n)] \tag{2.29}$$

$$\pi(X) = [\pi(X = x_1), \pi(X = x_2), \ldots, \pi(X = x_n)] \tag{2.30}$$

The posterior probability of $X$ given the observations is:

$$P(X \mid e) = \alpha\, \lambda(X)\, \pi(X) \tag{2.31}$$

where $\alpha$ is a normalizing constant and the product is taken element by element. The vector $\lambda(X)$ is computed by:

$$\lambda(X) = \prod_{c \in \text{children}(X)} \sum_{v \in c} \lambda(v)\, P(v \mid X) \tag{2.32}$$

Similarly, $\pi(X)$ is computed as:

$$\pi(X) = \sum_{y \in \text{parents}(X)} P(X \mid y)\, \pi(y) \tag{2.33}$$

The $\lambda$ and $\pi$ values are passed between neighboring nodes in any order.
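The $\lambda$ and $\pi$ computations above can be illustrated on a very small tree. The sketch below assumes a hypothetical network $A \to X$, $X \to B$, $X \to C$ with binary variables and made-up conditional probability tables; $B$ and $C$ are observed, $A$ is not. It combines $\lambda(X)$ and $\pi(X)$ as in Eqs. (2.31) to (2.33) and checks the result against brute-force enumeration of the joint distribution. All names and numbers are illustrative assumptions, not values from this work.

```python
from itertools import product

# Hypothetical CPTs for a small tree: A -> X, X -> B, X -> C (all binary).
P_A = [0.7, 0.3]                          # P(A = a)
P_X_GIVEN_A = [[0.9, 0.1], [0.2, 0.8]]    # [a][x] = P(X = x | A = a)
P_B_GIVEN_X = [[0.8, 0.2], [0.3, 0.7]]    # [x][b] = P(B = b | X = x)
P_C_GIVEN_X = [[0.6, 0.4], [0.1, 0.9]]    # [x][c] = P(C = c | X = x)


def normalize(vec):
    total = sum(vec)
    return [v / total for v in vec]


def posterior_by_messages(b_obs, c_obs):
    """P(X | B = b_obs, C = c_obs) from the lambda and pi vectors."""
    # pi(X): the parent A is unobserved, so its pi vector is its prior.
    pi_x = [sum(P_X_GIVEN_A[a][x] * P_A[a] for a in range(2)) for x in range(2)]
    # The lambda vectors of observed leaves are indicator vectors.
    lam_b = [1.0 if b == b_obs else 0.0 for b in range(2)]
    lam_c = [1.0 if c == c_obs else 0.0 for c in range(2)]
    # lambda(X): product over children of the summed-out child vector (Eq. 2.32).
    lam_x = [
        sum(lam_b[b] * P_B_GIVEN_X[x][b] for b in range(2))
        * sum(lam_c[c] * P_C_GIVEN_X[x][c] for c in range(2))
        for x in range(2)
    ]
    # Posterior: normalized element-wise product (Eq. 2.31).
    return normalize([lam_x[x] * pi_x[x] for x in range(2)])


def posterior_by_enumeration(b_obs, c_obs):
    """The same posterior, computed by summing the full joint over A."""
    joint = [0.0, 0.0]
    for a, x in product(range(2), repeat=2):
        joint[x] += (P_A[a] * P_X_GIVEN_A[a][x]
                     * P_B_GIVEN_X[x][b_obs] * P_C_GIVEN_X[x][c_obs])
    return normalize(joint)


print(posterior_by_messages(1, 0))
print(posterior_by_enumeration(1, 0))
```

The two printed vectors should agree, since the example graph is a tree and message passing is exact in that case.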

The Junction Tree Algorithm

[9] gives an introduction to a more general method of inference in Bayesian networks, namely the junction tree algorithm. Pearl's belief propagation method runs into problems on DAGs that contain cycles once edge directions are removed. The junction tree algorithm transforms the DAG of nodes into a tree of cliques. Any given node may be a member of one or more cliques in the tree. The junction tree algorithm proceeds as follows:

1. Moralize the Bayesian network.
2. Triangulate the moralized graph.
3. Let the cliques of the triangulated graph be the nodes of a tree (the junction tree).
4. Propagate the $\lambda$ and $\pi$ values throughout the junction tree to do inference.

More details about the junction tree algorithm may be found in [9].

Approximate Inference

Exact inference in an arbitrary Bayesian network is NP-hard. The source of difficulty lies in the existence of undirected cycles in the Bayesian network. Undirected cycles are cycles in the graph when we ignore edge direction. Exact inference is also hard when the number of variables is large, or when continuous variables exist. For the cases where exact inference is intractable, approximate inference techniques are used. Here we provide a brief summary of some popular approximate inference methods.

Sampling (Monte Carlo) methods

The simplest kind is importance sampling, where we draw random samples $x$ from $P(X)$, the (unconditional) distribution on the hidden variables, and then weight the samples by their likelihood, $P(y \mid x)$, where $y$ is the evidence. A more efficient approach in high dimensions is called Markov Chain Monte Carlo (MCMC), and includes as special cases Gibbs sampling and the Metropolis-Hastings algorithm.

Variational methods

Variational methods provide an approach for the design of approximate inference algorithms [34, 33]. The simplest example of variational methods is the mean field approximation, which exploits the law of large numbers to approximate large sums of random variables by their means. In particular, we essentially decouple all the nodes, and introduce a new parameter, called a variational parameter, for each node. We then iteratively update these variational parameters so as to minimize the cross-entropy (KL divergence) between the approximate and true probability distributions. Updating the variational parameters becomes a proxy for inference. Variational methods have been successfully used for inference and learning of graphical models and Bayesian networks [31].

Loopy belief propagation

In loopy belief propagation [63], we apply Pearl's belief propagation algorithm to the original graph, even if it has loops (undirected cycles). When loops are present, the network is no longer singly connected. When we ignore the loops and allow nodes to continue communicating, messages may circulate around the loop. It was noticed that the process may reach a semi-stable state after a few iterations. Pearl suggested using this method as an approximate inference technique.

2.7 A Case Study

In this section, we discuss a case study investigating factors that influence the intention of high school students to attend college. The case study was originally presented in [25]. The application comes originally from a study by [55]. [55] measured the following variables for 10,318 Wisconsin high school seniors:

• Sex (SEX): male, female;
• Socioeconomic Status (SES): low, lower middle, upper middle, high;
• Intelligence Quotient (IQ): low, lower middle, upper middle, high;
• Parental Encouragement (PE): low, high; and
• College Plans (CP): yes, no.

The data consists of 10,318 records. Each record represents some particular configuration of the variables mentioned above. For example, a record may contain SEX=male, SES=low, IQ=low, PE=low, and CP=yes.

[25] first analyzed the data assuming no hidden variables. Exhaustive search was used for structure learning. For structure priors, structures where SEX and/or SES had parents, and/or CP had children, were excluded, and all other network structures were assumed equally likely. Because the data set was complete, the parameter learning techniques for complete data discussed earlier were used. The two most likely network structures found after an exhaustive search over all structures are shown in figure 2.6.

Figure 2.6: The most likely network structures without hidden variables

[25] analyzed the results and came up with some interesting notes:

• Both graphs indicate a causal influence of socioeconomic status and IQ on college plans.
• From either graph we conclude that sex influences college plans only indirectly through parental influence.
• The two graphs differ only by the orientation of the arc between PE and IQ. Either causal relationship is plausible.
• The most suspicious result is the suggestion that socioeconomic status has a direct influence on IQ.

To question this last result, [25] considered new models obtained assuming that there is a hidden variable H that is never observed. Among the models considered, the one with the highest posterior probability is shown in figure 2.7. The model suggests that some hidden variable is affecting both IQ and SES. This hidden variable may correspond to some measure of parent quality.

Figure 2.7: The most likely network structures with hidden variables

2.8 Areas of Applications for Bayesian Networks

Graphical models in general, and Bayesian networks in particular, have been employed in several real world applications. In this section, we highlight some of the areas where Bayesian networks have been employed. The list of applications is by no means complete. It is just an attempt to elaborate on some of the domains where Bayesian networks are used.

Computer Troubleshooting

The Windows PC operating system uses a system based on Bayesian networks to help users troubleshoot printing problems. The user is required to indicate his observations about the problem through a graphical interface. The Bayesian network based system then tries to figure out the most probable cause of the problem.

Medical Diagnosis

Several medical applications employ Bayesian networks. The main idea is to model the diseases as the causes and the symptoms as the effects. By observing the symptoms, a Bayesian network can give the probability of the different diseases that could be causing those symptoms. Examples of such systems include: a system that helps in the diagnosis of congenital heart diseases, a system for obtaining a preliminary diagnosis of neuromuscular diseases on the basis of electromyographic findings, a system that assists community pathologists with the diagnosis of lymph node pathology, and a system for insulin dose adjustment for diabetes patients.

Computer Vision

Several image processing and computer vision applications that employ Bayesian networks have been developed. Examples of those applications include: Bayesian networks for interpretation of images, Bayesian networks for 3D inference from 2D data, Bayesian networks for segmentation of computed radiographs, and Bayesian networks for image compression.

Agriculture

Examples of agriculture systems that use Bayesian networks include: a system which helps in the verification of the parentage of Jersey cattle through blood type identification, and a system for mildew control in winter.

Information Processing

Several information retrieval systems that use Bayesian networks have been developed. NASA had another system, used when launching space shuttles, for filtering and displaying information on the propulsion system.

2.9 Summary

Graphical models, and Bayesian networks in particular, are effective tools for modeling uncertainty. A Bayesian network may be represented by a directed graph encoding the independence assumptions between the variables, and conditional probability tables encoding the local probability distribution for each variable. Several algorithms and techniques exist for learning both the structure and the parameters of a Bayesian network given the data. The goal of modeling with Bayesian networks is to efficiently encode the joint probability distribution of several variables. Researchers have developed several algorithms for both exact and approximate inference in Bayesian networks.

Chapter 3

An Overview of Probabilistic Relaxation Methods

3.1 Introduction

Probabilistic relaxation methods have been successfully applied to various areas of computation. Two probabilistic relaxation methods have been widely used in different areas: relaxation labeling and deterministic annealing. Relaxation labeling [47] is an image processing methodology. Its goal is to associate a label with image features (points, edges, etc.). Features are associated with all labels, with a certain probability for each label. The labeling process proceeds iteratively, adjusting those probabilities until it reaches the best label assignment.

The deterministic annealing approach to clustering and its extensions has demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression [49]. Deterministic annealing (DA) is derived within a probabilistic framework from basic information theoretic principles (e.g., maximum entropy and random coding). The application-specific cost is minimized subject to a constraint on the randomness (Shannon entropy) of the solution, which is gradually lowered. DA is able to escape many poor local minima; at the same time, it is a deterministic method that aims at the global minimum of the cost function without randomly wandering in the search space.

The rest of this chapter proceeds as follows: Section 3.2 briefly introduces the relaxation labeling technique. Section 3.3 introduces the deterministic annealing (DA) approach, discusses how it is used for solving clustering problems, and presents its derivation. Finally, we conclude with a summary in Section 3.4.

3.2 Relaxation Labeling

Many areas of computer vision have benefited from relaxation labeling techniques. Relaxation techniques have also been applied to other computational areas, particularly to the solution of simultaneous nonlinear equations. The basic elements of the relaxation labeling method are a set of features belonging to some object and a set of labels. In computer vision terminology, features denote points, edges and surfaces. The labeling scheme used is probabilistic in the sense that it assigns a certain probability or weight to each feature/label pair. For each feature, weights or probabilities are assigned to each label in the set, giving an estimate of the likelihood that the particular label is the correct one for that feature.

The labeling process starts with an initial, and perhaps arbitrary, assignment of probabilities for each label for each feature. The algorithm then proceeds iteratively, adjusting that set of probabilities according to some relaxation schedule. This process is repeated until the labeling method converges or stabilizes, which occurs when little or no change occurs between successive sets of probability values.

Relaxation methods have also been applied to several computer vision problems, such as:

• Edge linking: The probabilities of edge points lying on particular edges are determined by considering neighboring edge points. Different labels are used for each edge, and a relaxation schedule is then used to find the appropriate label for each edge point.
• Line labeling techniques: Line labeling techniques assign a certain class to each line (occluding, concave or convex). Probabilities can be assigned to each type of labeling fairly easily. Then, relaxation techniques are applied to find the best set of those probabilities.
• Segmentation: Segmentation can be viewed as the process of labeling regions of an image as belonging to recognized physical entities. Region labels may be assigned with probabilities to allow the application of relaxation techniques.

3.3 Deterministic Annealing

The deterministic annealing approach to clustering and its extensions has demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications including compression, estimation, pattern recognition and classification, and statistical regression [50]. Deterministic annealing (DA) has three important features:

1. the ability to avoid many poor local optima;
2. applicability to many different structures/architectures; and
3. the ability to minimize the right cost function even when its gradients vanish almost everywhere, as in the case of the empirical classification error.

DA is derived within a probabilistic framework from basic information theoretic principles (e.g., maximum entropy and random coding). The application-specific cost is minimized subject to a constraint on the randomness (Shannon entropy) of the solution, which is gradually lowered [50].
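As a rough illustration of how the entropy constraint is "gradually lowered", the following sketch runs deterministic annealing on a toy one-dimensional clustering problem. It is a minimal sketch under stated assumptions, not the implementation used in this work: the distortion is taken to be the squared distance, $\beta$ denotes the inverse temperature, the soft assignments take the Gibbs form $p(y \mid x) \propto e^{-\beta d(x, y)}$ commonly used in the DA literature, and the function name, the geometric schedule, the small deterministic nudge that lets coincident centers separate, and the data are all hypothetical.

```python
import math


def deterministic_annealing_1d(points, n_clusters, beta0=1e-3, beta_max=10.0,
                               growth=1.5, inner_iters=50, nudge=1e-3):
    """Soft-to-hard clustering of 1-D points by deterministic annealing."""
    mean = sum(points) / len(points)
    centers = [mean] * n_clusters  # at high temperature all centers coincide
    beta = beta0
    while beta <= beta_max:
        # Deterministic nudge so that coincident centers are able to split.
        centers = [c + nudge * (k - (n_clusters - 1) / 2)
                   for k, c in enumerate(centers)]
        for _ in range(inner_iters):
            # Association probabilities p(y|x), proportional to exp(-beta * d).
            assoc = []
            for x in points:
                d = [(x - c) ** 2 for c in centers]
                d_min = min(d)  # shift exponents for numerical stability
                w = [math.exp(-beta * (dk - d_min)) for dk in d]
                z = sum(w)
                assoc.append([wk / z for wk in w])
            # Each center moves to the probability-weighted mean of the points.
            centers = [
                sum(p[k] * x for p, x in zip(assoc, points))
                / sum(p[k] for p in assoc)
                for k in range(n_clusters)
            ]
        beta *= growth  # lower the temperature
    return sorted(centers)


data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(deterministic_annealing_1d(data, 2))
```

At small $\beta$ every point is shared almost equally among the clusters and the centers stay near the data mean; as $\beta$ grows the assignments harden and the centers move toward the two groups of points.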

Simulated annealing, or stochastic relaxation, is a known technique for avoiding local minima of non-convex optimization problems. In simulated annealing, a sequence of random moves is generated. The decision whether to accept or reject a move depends on the probability of the resulting configuration, and hence on the current temperature. Thus, moves that do not reduce the cost may be accepted. This makes the process able to escape local minima. However, this requires very slow annealing schedules that are not realistic for many practical applications. An excellent review of the theory and applications of simulated annealing can be found in [40].

The concept of deterministic annealing was originally proposed for solving the clustering problem. Unlike simulated annealing, DA is a deterministic process that is strongly based on principles of information theory. At the same time, it is also strongly motivated by the physical analogy. The analogy to statistical physics is based on the fact that DA is an annealing process that avoids many shallow local minima of the specified cost and, at the limit of zero temperature, produces a nonrandom (hard) solution. The annealing process is equivalent to the computation of Shannon's rate-distortion function, and the annealing temperature is inversely proportional to the slope of the curve.

DA tries to enjoy the best of two different worlds. On the one hand, it is deterministic: DA does not wander randomly in the search space trying to make incremental progress. On the other hand, it is still an annealing method that aims at the global minimum and tries not to get stuck at a local minimum. One can view DA as replacing stochastic simulations by the use of expectation. An effective energy function, which is parametrized by a pseudo temperature, is derived through expectation and is deterministically optimized at successively reduced temperatures [50]. The DA algorithm can be stated as follows:

Algorithm 1 Outline of the DA method

Inputs:
1. Input data points $X = \{x_1, \ldots, x_n\}$
2. Output classes (labels) $Y = \{y_1, \ldots, y_n\}$
3. A distance function $d(x, y)$ that measures the distance between $x$ and $y$
4. Annealing schedule $S$

Outputs:
1. The best assignment configuration, $C$, for $X$

Procedure:
1. Initialize $\beta = 0$
2. Define $H = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)$
3. Define $D = \sum_{x} \sum_{y} p(x, y)\, d(x, y)$
4. Loop until $H(C) < Thresh$
   (a) Let $F = H - \beta D$
   (b) Search for $C_{max}$ that maximizes $F$
   (c) $C = C_{max}$
   (d) Increase $\beta$
5. End loop

Deterministic Annealing for Clustering

Deterministic annealing (DA) is a global optimization technique originally proposed as a clustering algorithm. It has the advantages of being able to escape poor local optima and of being applicable to several problem architectures [51, 52, 53, 50, 42]. DA is derived within a probabilistic framework from basic information theoretic principles (e.g., maximum entropy and random coding). The application-specific cost is minimized subject to a constraint on the randomness (Shannon entropy) of the solution, which is gradually lowered. It starts out with a high degree of exploration of the space that gradually gives way to honing in on the minimum. Unlike simulated annealing, DA is a purely deterministic method.

The deterministic annealing (DA) approach was originally presented as a solution for clustering optimization problems and their extensions. We start by introducing DA with the simplest nontrivial problem instance in order to obtain a clear understanding of the essentials. We therefore start with a simple clustering problem that seeks the optimal partition of several data points into a prescribed number of subsets, minimizing the average cluster variance or the mean squared error (MSE). Let us denote the cost function by $D$, where $D$ is defined as:

$$D = \sum_{x} \sum_{y} p(x, y)\, d(x, y) \tag{3.1}$$

where $x$ is a source vector, $y$ is its best reproducing cluster, $p(x, y)$ is the probability that the source vector $x$ is assigned to the cluster $y$, and $d(x, y)$ is the Euclidean distance between $x$ and the center of the cluster $y$. The iterative procedure of DA is monotone non-increasing in the cost function. Hence, convergence to a local minimum of the cost function is ensured.

Derivation of Deterministic Annealing

In this derivation of deterministic annealing, a probabilistic framework for clustering is defined. In this framework, the partitions are randomized by assigning inputs to clusters in probability. This probability is called the association probability. Assigning inputs to clusters in probability bears similarity to fuzzy clustering, where each data point has partial membership in clusters. This formulation is different from fuzzy clustering because it is purely probabilistic. Hence, no tools or methods from fuzzy set theory are used in this derivation. The cost function, or the expected clustering distortion, can be written as:

$$D = \sum_{x} \sum_{y} p(x, y)\, d(x, y) \tag{3.2}$$

$$D = \sum_{x} p(x) \sum_{y} p(y \mid x)\, d(x, y) \tag{3.3}$$

where $p(x, y)$ is the joint probability distribution, and the conditional probability $p(y \mid x)$ is the association probability relating input vector $x$ with code vector $y$. Minimizing $D$ with respect to $\{y, p(y \mid x)\}$ would produce a hard clustering solution. Instead of directly minimizing $D$, we recast the optimization problem as minimizing $D$ subject to a specified level of randomness. We measure the level of randomness by the Shannon entropy

$$H = -\sum_{x} \sum_{y} p(x, y) \log p(x, y) \tag{3.4}$$

The problem can now be redefined as an optimization problem, where we want to maximize:

$$F = H - \beta D \tag{3.5}$$

where $\beta = 1/T$ is the Lagrange multiplier, $D$ is the cost function given by (3.2), and $H$ is the Shannon entropy given by (3.4). We start out with a large temperature and gradually lower it during the course of the iterations. When $T$ is large, we mainly attempt to maximize the entropy. As $T$ gets lower, we trade entropy for reduction in the cost function, and as $T$ approaches zero, we minimize the cost function directly to obtain a hard solution.

3.4 Summary

Various areas of computation have used probabilistic relaxation methods. Two probabilistic relaxation methods have been widely used in different areas: relaxation labeling and deterministic annealing. The deterministic annealing approach to clustering has achieved substantial performance improvements over other standard clustering techniques. Deterministic annealing (DA) is derived within a probabilistic framework from basic information theoretic principles, yet bears an obvious analogy with statistical physics. DA is able to avoid many poor local minima and aims at a global solution without randomly wandering in the search space. Unlike simulated annealing, DA is a deterministic method that replaces stochastic simulations by the use of expectation.
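As a worked step for the derivation of Section 3.3, the following sketch shows what maximizing $F = H - \beta D$ over the association probabilities yields for fixed cluster centers. It assumes $p(x, y) = p(x)\, p(y \mid x)$ with $p(x) > 0$ and introduces one multiplier $\mu_x$ per input for the normalization constraint $\sum_y p(y \mid x) = 1$; the symbols $\mu_x$ and $Z_x$ are notation introduced here for illustration.

```latex
% Stationarity of F = H - \beta D in p(y|x), with a multiplier \mu_x
% for each normalization constraint \sum_y p(y|x) = 1.
\begin{align*}
\frac{\partial}{\partial p(y \mid x)}
  \Big[ H - \beta D + \sum_{x'} \mu_{x'} \Big( \sum_{y'} p(y' \mid x') - 1 \Big) \Big]
  &= -p(x) \big[ \log p(x, y) + 1 \big] - \beta\, p(x)\, d(x, y) + \mu_x = 0 \\
\Rightarrow \quad
p(y \mid x) &= \frac{e^{-\beta d(x, y)}}{Z_x},
\qquad Z_x = \sum_{y'} e^{-\beta d(x, y')}
\end{align*}
```

This agrees with the limiting behavior described above: for $\beta \to 0$ (large $T$) the association probabilities become uniform, which maximizes the entropy, and for $\beta \to \infty$ ($T \to 0$) each input is assigned with probability one to its nearest cluster, which is the hard solution.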

Chapter 4

Structure Learning Approaches

4.1 Introduction

A Bayesian network $B$ is a pair $\langle G, \theta \rangle$. $G = \langle V, E \rangle$ is a directed acyclic graph, where the set of nodes $V = \{X_1, X_2, X_3, \ldots, X_n\}$ represents the random variables, and $E$ is the set of edges encoding the dependence relations among the variables. $\theta$ represents a set of parameters for the variables in $V$ that defines their conditional probability distributions. For each variable $X_i \in V$, we have a family of conditional distributions $P(X_i \mid Pa(X_i))$. Those conditional distributions allow us to recover the joint distribution over $V$ as $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$.

The problem of learning a Bayesian network structure can be stated as follows: given a training set $T$ of instances of $(X_1, \ldots, X_n)$, find a DAG $G$ that best matches $T$. A common approach is to introduce a scoring function that measures how well the DAG fits the data, and use it to search for the best network. The most commonly used scoring functions are the Bayesian scoring metric and the Minimum Description Length (MDL) metric. The two metrics are described in detail in [8] and [38], respectively.

There are two main approaches for learning the structure of Bayesian networks. The first poses learning as a constraint satisfaction problem. In this approach, the properties of conditional independence among attributes are estimated using several statistical tests. The second approach poses learning as an optimization problem where several standard heuristic search techniques, such as greedy hill-climbing and simulated annealing, are utilized to find high-scoring structures according to some structure fitness measure.

In this chapter, we briefly explain and give examples of the two approaches. In section 4.2, we discuss the constraint based methods. Section 4.3 discusses the search and score methods. Finally, we conclude with a summary.

4.2 Constraint Based Methods

In the first category of Bayesian network structure learning methods, learning is posed as a constraint satisfaction problem. Constraint satisfaction problems are those problems where one must find states that satisfy a number of constraints or criteria. Those algorithms try to discover conditional dependence/independence relationships between different attributes in the data. Later on, they attempt to construct the network that represents most of the independencies discovered from the data. Discovering dependencies/independencies usually relies on statistical tests such as the $\chi^2$ test, or on information theoretic tests such as the mutual information metric. Examples of this approach include [22, 21, 46, 5, 12, 13]. Constraint satisfaction methods have the disadvantage that repeated independence tests are sensitive to failures and lose statistical power.

In the following subsections, we take two of the learning algorithms that belong to the constraint-based category as examples and give a brief description of them.

A Low Order Independence Tests Based Approach

An approach that belongs to the family of independence-based (also called constraint-based) algorithms is presented in [13]. In this approach, conditional independence tests of low order are used as much as possible. The algorithm works as follows:

Algorithm 2 Description of the algorithm presented in [13].

1. First, zero-order and first-order conditional independence tests are performed.
2. The results of those tests are used to construct a prior graphical model by removing, from a complete graph, the edges connecting nodes that were found independent using the tests.
3. The problem is now reduced to looking for subgraphs of the prior graph that reflect as many of the independence statements as possible.
4. Using zero-order and first-order conditional independence tests again, the algorithm tries to remove more edges from the graph.
5. Finally, the algorithm uses a quadratic or smaller number of higher order independence tests to further remove edges from the graph.
6. To reduce the computational complexity, the algorithm uses a predefined order of nodes. The order can be provided by an expert, or automatically learned from data.

A Mutual Information Based Approach

Another method for learning Bayesian networks from data using constraint-based methods is presented in [5]. In this method, the computation of the mutual information of attribute pairs is used to guide the construction process. The algorithm requires a node ordering of the attributes, and consists of three phases: drifting, thickening, and thinning. The three phases are best described by the following algorithm listings from [5].
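Before turning to those listings, the pairwise mutual information that guides this method can be sketched as follows. The snippet estimates $I(v_i, v_j)$ from empirical frequencies of discrete data and returns the pairs whose mutual information exceeds a small threshold, sorted from large to small. The function names, the data layout (a dictionary from attribute name to a list of values), the threshold, and the toy data are assumptions made for illustration; this is not the implementation of [5].

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def ranked_pairs(data, epsilon):
    """Attribute pairs with mutual information above epsilon, largest first."""
    scored = [((u, v), mutual_information(data[u], data[v]))
              for u, v in combinations(data, 2)]
    kept = [item for item in scored if item[1] > epsilon]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical data: B copies A, while C is independent of both.
toy = {
    "A": [0, 0, 1, 1, 0, 0, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(ranked_pairs(toy, epsilon=0.01))
```

In this toy data only the pair (A, B) survives the threshold, so it would be the only candidate arc passed on to the construction phases.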

52 Algorithm 3 The drifting phase of the method in [5] 1. Initiate a graph G(V, E) where V = allthenodesof adataset, E =. Initiate two empty ordered set S, R. 2. For each pair of nodes (v i, v j ) where v i, v j V, compute mutual information I(v i, v j ). For the pairs of nodes that have mutual information greater than a certain small value e, sort them by their mutual information from large to small and put them into an ordered set S. 3. Get the first two pairs of nodes in S and remove them from S. Add the corresponding arcs to E. (the direction of the arcs in this algorithm is determined by the previously available nodes ordering.) 4. Get the first pair of nodes remained in S and remove it from S. If there is no open path between the two nodes (these two nodes are d-separated given empty set), add the corresponding arc to E; Otherwise, add the pair of nodes to the end of an ordered set R. 5. Repeat step 4 until S is empty. Algorithm 4 The thickening phase of the method in [5] 1. Get the first pair of nodes in R and remove it from R. 2. Find a block set that blocks each open path between these two nodes by a set of minimum number of nodes. Conduct a CI test. If these two nodes are still dependent on each other given the block set, connect them by an arc. 3. Go to step 1 until R is empty. Algorithm 5 The thickening phase of the method in [5] 1. For each arc in E, if there are open paths between the two nodes besides this arc, remove this arc from E temporarily and repeat step 2 from the thickening phase. Conduct a CI test on the condition of the block set. If the two nodes are dependent, add this arc back to E; otherwise remove the arc permanently. 4.3 Search and Score Methods In the second category, structure learning is posed as optimization problem. In this approach, a statistically motivated score that describes the quality of the 41

structure, or its fitness to the training data, is defined. Exhaustive search for the best network structure is NP-hard [6]. Hence, a stochastic optimization method is typically employed to search for the best network structure according to the scoring function.

Two scoring metrics have been widely used in the literature. The Bayesian score [8, 26] is equivalent to the marginal likelihood of the model given the data. The BIC (Bayesian Information Criterion) score, which is also equivalent to the Minimum Description Length (MDL) of a model [38, 60], is based on the model's likelihood given the data and on its complexity. This metric explicitly adds a complexity-penalizing factor to the likelihood.

Most of the algorithms used for learning structures are stochastic optimization algorithms. Some examples include the K2 algorithm [8], the Structure EM algorithm [18], the Hill-Climbing algorithm [?, 54], the Simulated Annealing algorithm [61], the Sparse Candidate algorithm [18], the Ant Colonies algorithm [11], and so on. Other approaches use evolutionary algorithms [41] and Genetic Algorithms [44, 39, 56].

As the space of Bayesian network structures is exponential, some preprocessing steps may be applied to restrict the search space and hence make the learning process easier. There are several types of such space restriction steps:
- [5, 8, 1, 27, 59] use an ordering among the variables in the model.
- [59, 37] make use of the fact that some variables are already causally connected.
- [10, 25] use information about the structure of the model to be recovered.
- [1, 57, 58] combine conditional independence and scoring metrics to find the best structure.

In the following subsections, we select several examples of the learning algorithms that belong to the search and score category and give a brief description of them.

The K2 Algorithm

K2 [8] is one of the most popular and most frequently used structure learning algorithms. K2 uses the K2 metric for scoring networks. The K2 metric is based on applying Bayesian principles. It assumes that a complete database D of sample cases over a set of attributes that accurately models the network is given. Given

a dataset generated according to the given Bayesian network, we compute from the given network structure $B_S$ and a set of conditional probabilities $B_P$ the probability of the dataset, i.e., we compute $P(D \mid B_S, B_P)$. The K2 metric is given by:

$$F_B(B_S : D) = \sum_{i=1}^{n} F_B\big(x_i, Pa(x_i) : N_{x_i, Pa(x_i)}\big) \tag{4.1}$$

where $N_{x_i, Pa(x_i)}$ are the statistics of the variable $x_i$ and $Pa(x_i)$ in D, and

$$F_B\big(x_i, Pa(x_i) : N_{x_i, Pa(x_i)}\big) = \sum_{j=1}^{q_i} \left( \log\frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} + \sum_{k=1}^{r_i} \log(N_{ijk}!) \right) \tag{4.2}$$

Here $r_i$ is the number of values of $x_i$, $q_i$ is the number of configurations of $Pa(x_i)$, $N_{ijk}$ is the number of cases in D in which $x_i$ takes its $k$-th value while its parents take their $j$-th configuration, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.

The K2 algorithm assumes that an ordering on the variables is available and that, a priori, all structures are equally likely. It searches, for every node, for the set of parent nodes that maximizes the K2 metric. The different steps of the K2 algorithm are given in the following algorithm listing:

Algorithm 6 The K2 Algorithm
1. For i = 1 to n Do
   (a) $\pi_i = \emptyset$
   (b) $P_{old} = g(i, \pi_i)$
   (c) Proceed = TRUE
   (d) While (Proceed) AND $|\pi_i| < u$ Do
       i. Let Z be the node in $Pred(x_i) \setminus \pi_i$ that maximizes $g(i, \pi_i \cup \{Z\})$
       ii. $P_{new} = g(i, \pi_i \cup \{Z\})$
       iii. If ($P_{new} > P_{old}$) Then
            A. $P_{old} = P_{new}$
            B. $\pi_i = \pi_i \cup \{Z\}$
       iv. Else
            A. Proceed = FALSE
   (e) Write("Parents of $X_i$ are $\pi_i$")
2. End K2

A Genetic Algorithms Based Approach

A structure learning algorithm based on genetic algorithms is presented in [39]. The Bayesian network structure is usually represented by an $n \times n$ connectivity matrix $C$, where the element $c_{ij}$ indicates whether an edge exists between node $i$ and node $j$. The individual of the population for the genetic representation is given by the string:

$$c_{11} c_{12} \ldots c_{1n}\; c_{21} c_{22} \ldots c_{2n}\; \ldots\; c_{n1} c_{n2} \ldots c_{nn}$$

For example, the simple network in figure 4.1 has the following connectivity matrices and and the following strings and respectively. Cross-over and mutation operators are used to generate new structures from

the existing structures.

Figure 4.1: Example of a Bayesian network to illustrate genetic representation

The algorithm uses a predefined node ordering, and rejects individuals that do not adhere to the ordering. The algorithm proceeds as follows:

Algorithm 7 An outline of the GA algorithm for learning Bayesian networks
1. Generate an initial population at random
2. Loop
   (a) Select parents from the population
   (b) Apply the mutation operator to get new children
   (c) Apply the cross-over operator to get new children
   (d) Add good children to the population
   (e) Reduce the population by removing bad parents
3. Output the best individuals found

An Evolutionary Programming Based Approach

Evolutionary Programming (EP) was first proposed as an evolutionary approach to artificial intelligence; it has recently been applied successfully to many numerical and combinatorial optimization problems. EP differs from classical GA in several aspects. Firstly, there is no constraint on the representation. In classical GA, individuals have to be represented by fixed-length binary strings. While in

EP, the representation is simply chosen according to the problem. Secondly, EP may apply mutation operators only and ignore cross-over operators. It may only define new operators. Thirdly, the mutation operators change a predefined aspect of the instances, rather than just flipping a bit in the string representation of GA.

A structure learning algorithm based on EP was presented in [41]. In this approach, network structures are simply represented as matrices. Based on the chosen representation, several operators are defined:
- Intersection: results in a child having the edges common to its parents.
- Union: results in a child having the union of the edges of its parents.
- Simple Mutation: randomly adds or removes an edge from the structure.
- Mutual Information Guided Mutation: randomly adds or removes an edge from the structure according to its mutual information.
- Arc Reversion: randomly selects an edge and reverses its direction.
- Parent Shift: randomly selects an edge and changes its starting point.
- Child Shift: randomly selects an edge and changes its ending point.

To ensure that the resulting children are valid network structures, three repair operators are introduced: DAG Repair, Max-Parents Repair, and Partial Order Repair.

A Simulated Annealing Based Approach

Simulated annealing [29, 30] is an optimization technique that is usually used in search problems. It was originally adapted from the physical process of annealing. In such a process, physical substances are melted, or changed to a state of higher energy, then gradually cooled to reach a solid state, or lower energy. It is desirable to reach a state of minimal energy; however, there is a probability that a transition to a higher energy state is made, given by:

$$\rho = e^{-\Delta E / kT} \tag{4.3}$$

where $\Delta E$ is the positive change in the energy level, $T$ is the temperature, and $k$ is Boltzmann's constant. Therefore, a large energy increase is less probable than a small one, and the probability also decreases as the temperature declines. The annealing process is very sensitive to the cooling rate, called the annealing schedule. A rapidly cooled substance will exhibit large solid stable regions (but not necessarily the lowest energy content, hence a local minimum), while a slower schedule will need a long time to converge.

Learning Bayesian network structure from data can be viewed as a search problem in the network space. Hence, simulated annealing can be employed to search for the network structure. The drawback of the algorithm is that after a large number of iterations, the temperature drops to a low value and the local optimizer reaches its stable state. Hence, the search will stop at a locally optimal solution. Another variation of simulated annealing that was used for learning Bayesian network structures is two-level simulated annealing (TLSA) [62, 61]. The following listing from [61] describes the algorithm:

The Sparse Candidate Algorithm

Another algorithm for learning Bayesian network structures, called the Sparse Candidate Algorithm, is presented in [20]. The general idea behind the algorithm is quite straightforward. Statistical cues from the data are used to restrict the search space: the set of possible parents for each node is restricted to a specified set. After the space restriction, a search method is employed to find the best network structure, according to some quality measure, that obeys the constraints imposed by the space restriction. Space restriction processes are usually very risky. The algorithm might fail to reach the best network if any mistakes occurred in the space restriction phase. To solve this problem, the sparse candidate algorithm uses the network found in the search stage to find better candidate parents. Better networks can then be

Algorithm 8 TLSA for learning Bayesian networks
1. Set $T$ to its initial value.
2. Set $x_{old}$ to an initial feasible solution.
3. Compute $x_{old}$.
4. Set $f_{old} = f(x_{old})$; $x_{best} = x_{old}$ and $f_{best} = f_{old}$
5. Repeat
   (a) For $i = 1$ to $m$ ($m$ is the number of iterations)
       i. $x_{new} = perturbation(x_{old})$
       ii. $f_{new} = f(x_{new})$
       iii. Generate a random number $r$
       iv. if ($f_{new} < f_{old}$) or ($r \le e^{(f(x_{old}) - f(x_{new}))/T}$) then
           A. $x_{old} = x_{new}$
           B. $f_{old} = f_{new}$
           C. if ($f_{old} < f_{best}$) then $x_{best} = x_{new}$, $f_{best} = f_{new}$
   (b) $T = p \cdot T$
6. Until (stopping criteria is met)
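The acceptance rule and the cooling step of the listing above can be sketched in a few lines. The code below is plain single-level simulated annealing for minimization, not the two-level TLSA method of [61]; the function name `simulated_annealing`, its parameters, the temperature-based stopping rule, and the toy objective are all assumptions made for illustration.

```python
import math
import random


def simulated_annealing(f, perturb, x0, t0=1.0, p=0.9, m=50, t_min=1e-3, rng=None):
    """Minimize f, starting from x0, with geometric cooling T <- p * T.

    `perturb(x, rng)` returns a neighbouring solution. The search stops
    when the temperature falls below `t_min` (an assumed stopping rule).
    """
    rng = rng or random.Random(0)
    x_old, f_old = x0, f(x0)
    x_best, f_best = x_old, f_old
    t = t0
    while t > t_min:
        for _ in range(m):
            x_new = perturb(x_old, rng)
            f_new = f(x_new)
            # Always accept improvements; accept a worse solution with
            # probability exp(-(f_new - f_old) / T).
            if f_new < f_old or rng.random() <= math.exp((f_old - f_new) / t):
                x_old, f_old = x_new, f_new
                if f_old < f_best:
                    x_best, f_best = x_old, f_old
        t *= p  # cooling step
    return x_best, f_best


if __name__ == "__main__":
    # Toy problem: walk on the integers towards the minimum of (x - 3)^2.
    best = simulated_annealing(
        f=lambda x: (x - 3) ** 2,
        perturb=lambda x, rng: x + rng.choice([-1, 1]),
        x0=0,
    )
    print(best)
```

For structure learning, `x` would be a network structure, `perturb` an edge addition, removal, or reversal that keeps the graph acyclic, and `f` the negated network score; none of these are modelled here.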

searched for with respect to the new restriction. The algorithm proceeds iteratively in this manner until convergence. The algorithm is best described in the following algorithm listing from [20]:

Algorithm 9 Outline of the Sparse Candidate Algorithm
Input:
1. A data set $D = \{x_1, \ldots, x_N\}$
2. An initial network $B_0$
3. A decomposable score $Score(B \mid D) = \sum_i Score(X_i \mid Pa^{B}(X_i), D)$
4. A parameter $k$
Output:
1. A network $B$
Procedure:
1. Loop for $n = 1, 2, \ldots$ until convergence
   (a) Restrict: Based on $D$ and $B_{n-1}$, select for each variable $X_i$ a set $C_i^n$ ($|C_i^n| \le k$) of candidate parents. This defines a directed graph $H_n = (X, E)$, where $E = \{X_j \to X_i \mid \forall i, j,\ X_j \in C_i^n\}$ (note that $H_n$ is usually cyclic)
   (b) Maximize: Find a network $B_n = \langle G_n, \theta_n \rangle$ maximizing $Score(B_n \mid D)$ among networks that satisfy $G_n \subset H_n$ (i.e., $\forall X_i,\ Pa^{G_n}(X_i) \subseteq C_i^n$)
2. Return $B_n$

The Ant Colonies Algorithm

Ant algorithms [16, 15] are based on the cooperative behavior of real ant colonies, which are able to find the shortest path from a food source to their nest. While walking, real ants deposit a chemical substance called pheromone on the ground. Ants can smell pheromone and, when choosing their way, they tend to choose, in a probabilistic way, paths marked by strong pheromone concentrations. In the absence of pheromone, ants choose randomly, but after a transitory period shortest

paths will be more frequently visited and pheromone will accumulate faster on them, which in turn causes more ants to use these paths. This positive feedback effect means that all the ants will eventually use the shortest path. So, although a single ant is capable of building a solution (i.e., a path), the optimal solution comes about solely as a result of the cooperative behavior of the ant colony.

An approach based on Ant Colonies optimization for learning Bayesian networks is presented in [11]. The problem is represented with a graph. The states of the problem are DAGs with $n$ nodes. Thus, a state $G_i$ is a directed acyclic graph. The ant's incremental construction of the solution starts from the empty graph $G_0$ and proceeds by adding edges to the graph one at a time until it reaches the final solution. Edges to include in the graph are selected using a heuristic decomposable metric $f$, which might be any of the decomposable Bayesian network structure metrics discussed earlier. Let us denote the gain of adding an edge $x_j \to x_i$ by $\eta_{ij}$, which is given by:

$$\eta_{ij} = f(x_i, Pa(x_i) \cup \{x_j\}) - f(x_i, Pa(x_i)) \tag{4.4}$$

An outline of the algorithm, from [11], is presented in the following listing:

Algorithm 10 Outline of the ant colonies algorithm for learning Bayesian networks
1. Initialization:
   (a) for $i = 1$ to $n$ do: $Pa(x_i) = \emptyset$
   (b) for $i = 1$ and $j = 1$ to $n$ do: if ($i \ne j$) then $\eta_{ij} = f(x_i, x_j) - f(x_i, \emptyset)$
2. Loop:
   (a) repeat
       i. Select two indexes $i$ and $j$
       ii. if ($\eta_{ij} > 0$) then $Pa(x_i) = Pa(x_i) \cup \{x_j\}$
       iii. $\eta_{ij} = -\infty$
       iv. for all $x_a \in Ancestors(x_i) \cup \{x_i\}$ and $x_b \in Descendants(x_i) \cup \{x_i\}$ do: $\eta_{ab} = -\infty$
       v. for $k = 1$ to $n$ do: if ($\eta_{ik} > -\infty$) then $\eta_{ik} = f(x_i, Pa(x_i) \cup \{x_k\}) - f(x_i, Pa(x_i))$
       vi. $\tau_{ij} = (1 - \rho) \cdot \tau_{ij} + \rho \cdot \tau_0$
   (b) until $\forall i, j$ ($\eta_{ij} \le 0$ or $\eta_{ij} = -\infty$)

4.4 Summary

The problem of learning Bayesian network structures from data has received a great deal of attention from researchers, and different methods have been proposed for solving it. The methods can be divided into two categories. The first poses learning as a constraint satisfaction problem. In this approach, the properties of conditional independence among attributes are estimated using several statistical tests. The second approach poses learning as an optimization problem, where standard heuristic search techniques, such as greedy hill-climbing and simulated annealing, are utilized to find high-scoring structures according to some structure fitness measure. In this chapter we described some of the approaches belonging to both categories. We began with the constraint-based methods, and gave a brief description of

two methods that belong to this category. We then moved to the search and score methods, and elaborated on several approaches belonging to this category, such as the K2 algorithm [8], the Hill-Climbing algorithm, the Simulated Annealing algorithm [61], the Sparse Candidate algorithm [18], and the Ant Colonies algorithm [11].

Chapter 5 A Probabilistic Framework for Learning Bayesian Networks

5.1 Introduction

In the proposed method, we introduce a new structure learning algorithm for Bayesian networks based on the concept of probabilistic relaxation. In the proposed framework, the existence of an edge between two nodes is no longer a hard decision. Rather, an edge always exists with some probability. The problem of learning the Bayesian network structure is thus transformed into finding the best probability assignment for each possible edge. The algorithm advances in this soft manner without producing any intermediate hard solutions. Hard decisions about the existence of edges are taken only to produce the final solution.

The proposed approach is a two-tier approach. In the first tier, we apply ideas that restrict the search space for the purpose of focusing the search and reducing the computational requirements. Several techniques may be used for this purpose and are outlined below. In the second tier, we apply the proposed probabilistic relaxation technique to the structure learning problem.

The rest of the chapter proceeds as follows. In Section 5.2, we present an outline of the proposed method. In Section 5.3, we discuss some of the ideas used to restrict the search space. The network representation will be discussed in

Section 5.4. The details of the local search algorithm and the network evaluation will be presented in Section 5.5 and Section 5.6, respectively. In Section 5.7, we present the caching mechanism employed, while in Section 5.8, we elaborate on how to measure the level of randomness (entropy). Finally, we present a summary at the end of the chapter.

Figure 5.1: Summary of approach

5.2 Summary of the Approach

The proposed approach solves the problem of learning Bayesian network structures from data within a probabilistic relaxation framework. The approach depends on maximizing a network scoring function subject to a constraint on the randomness (Shannon entropy) of the solution. It proceeds iteratively while gradually lowering the level of randomness until it converges to a more global solution, which makes it more robust against getting stuck at a local maximum. As a preprocessing step, the approach tries to restrict the search space using domain knowledge and/or statistical dependency tests (figure 5.1).

Below, an outline of the algorithm is presented. A detailed discussion of several challenges that needed to be addressed follows in the subsequent sections.

Algorithm 11 The probabilistic relaxation approach to learning Bayesian networks
Goal:
1. Optimizing the edge probabilities $P(e_{ij})$ to find the best network $B$
Input:
1. A dataset $T$
2. A decomposable scoring function $D$
3. A function $H()$ that calculates solution entropy
4. A threshold on entropy $T_H$
Output:
1. A network $B$
Procedure:
1. Restrict the search space
2. Create an initial network $B_{init}$
3. For each possible edge $e_{ij}$: $P(e_{ij}) = 0.5$
4. Loop until $H(B) < T_H$
   (a) $B_{score} = Score(B)$
   (b) $F = H + \beta D$
   (c) Search for $B_{max}$ that maximizes $F$
   (d) $B = B_{max}$
   (e) Increase $\beta$
5. For each edge $e_{ij}$: if $P(e_{ij}) > 0.5$, add $e_{ij}$ to the final network

We notice from the above algorithm listing that we maximize $F = H + \beta D$ instead of maximizing $D$ directly as ordinary methods do. This procedure is repeated for several iterations. The output of each iteration is used as the initial solution for the next iteration. As the iterations advance, the value of $\beta$ is increased. At $\beta = 0$, we are only maximizing randomness. As $\beta$ gets higher, we trade randomness maximization for scoring function maximization. At very high values of $\beta$, we essentially maximize the scoring function directly, as ordinary methods do. The effect of varying $\beta$ on the optimization process is illustrated in figure 5.2. At the first iteration, the level of randomness, or entropy, is rather high. As the iterations proceed, the entropy begins to drop. When the entropy drops below a predefined threshold, the iterations stop. This entropy-iteration relation is illustrated in figure 5.3.

Alternatively, we can maximize $F = D + TH$, starting with a very high value of $T$ and gradually lowering it. With $T = 1/\beta$, this formulation is equivalent to $F = H + \beta D$, since $D + TH = T(H + \beta D)$ and a positive factor does not change the maximizer. However, there are minor differences between the two. The temperature formulation is easier to understand and interpret: it allows the problem to be represented as constrained optimization of $D$ with a constraint on $H$, which makes $T$ the Lagrange multiplier. Meanwhile, the $\beta$ formulation has the advantage that the free parameter $\beta$ can be initialized to zero and increased indefinitely until the method converges to a solution, whereas in the temperature formulation some large value for $T$ must be selected, and care must be taken that the method converges before $T$ reaches zero.

5.3 Restricting the Search Space

Restricting or reducing the search space is a crucial step in the structure learning problem due to the exponential nature of the search space. The search space is very large even for a small number of variables. This arises from the fact that an increase in the number of nodes leads to a super-exponential increase in the number of possible structures.
Finding the optimal Bayesian network in the search space of all possible structures was shown to be NP-hard. Even when using optimization methods to get the best possible structure, the search space remains considerably large. Hence, it is important to employ some techniques to reduce the search space. Several approaches may be employed to

Figure 5.2: The effect of varying β on the optimization process

Figure 5.3: The relation between entropy and iterations

accomplish this purpose. Some of them are described in the following paragraphs.

The first approach uses an independence measure to filter out edges linking independent nodes. Filtering out edges based on an independence test is rather risky, because any correct edge wrongly excluded here will never appear in the final solution. Hence, we suggest using two different independence tests and filtering out only those edges that fail to pass a rather high threshold in both tests. We use two of the most frequently used tests for this purpose: the $\chi^2$ test and the mutual information test. Some variants of the mutual information test that may also be employed are described in [20], for example the discrepancy mutual information test and the shielding mutual information test. Another safeguard is to allow the search algorithm to select a non-candidate edge for altering with some low probability.

The second approach assumes that an ordering of the nodes is present. In many applications the ordering can be deduced using domain knowledge, for example knowledge about the causal relationships of the variables. The parents of any node must precede it in the ordering. This allows us to reduce the number of candidate edges by half.

The last approach allows the algorithm to make use of any domain knowledge suggesting that some nodes are independent. This knowledge is usually made available with the help of domain experts. The algorithm respects such knowledge by removing edges between independent nodes from the candidate set.

5.4 Representation

Before moving on to describe the details of the algorithm, we elaborate on the representation of the network. The most convenient way of representing a graph is the matrix representation. The Bayesian network will be represented as an $n \times n$ matrix, where $n$ is the number of variables. Each position $G(i, j)$ will hold a real number between 0 and 1 that corresponds to the probability of

existence of an edge between node $i$ and node $j$. Initially, all candidate edges will have a probability of 0.5. Entries corresponding to edges removed from the candidate set by the search space restriction phase will be set to 0. This matrix will be called a Probabilistic Directed Acyclic Graph (PDAG). Along with the matrix, there will be a vector with the indexes of the candidate edges. Initially this vector will contain all possible edges. Later on, edges will be removed from the candidate edges vector in the search space restriction phase. Take as an example the sprinkler network of figure 2.5. If we decide to restrict the search space based on the node order (C, R, S, W), the initial PDAG will be as follows: (5.1)

5.5 Local Search Algorithm

The local search algorithm used is a simple greedy hill-climbing algorithm augmented with a simple Tabu list [23, 24]. In its simplest form, a Tabu list contains recently visited solutions (those visited less than $n$ moves ago). Instead of applying the best local change, the algorithm selects the best local change that results in a solution not present in the Tabu list. The local search procedure terminates when a certain number of local changes fail to yield an improvement over the current solution.

The local moves performed by the search algorithm select one of the candidate edges and alter its probability. The probability is altered by sampling a new value from the edge probability distribution. The edge distribution follows a softmax distribution parametrized by the temperature $T$, where $T = 1/\beta$. Parametrizing the distribution by the temperature makes it more discriminating as the temperature is lowered: the lower the temperature, the nearer the probabilities are to 0 or 1. This can be seen in figure 5.4.

Figure 5.4: Varying the edge probability distribution against β

Let $u$ be a random number sampled from a zero-mean Gaussian distribution; then the edge probability is obtained as follows:

$$P(e) = \frac{e^{\frac{1}{T} u}}{e^{\frac{1}{T} u} + e^{\frac{1}{T}(1 - u)}} \tag{5.2}$$

The local search procedure is described in the following listing:
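Separately from that listing, and only as an illustration of equation 5.2, the sketch below evaluates the edge probability for a given $u$ and temperature. It relies on the identity $P(e) = 1 / (1 + e^{(1 - 2u)/T})$, which follows from dividing the numerator and denominator of equation 5.2 by $e^{u/T}$. The function names and the standard deviation `sigma` of the Gaussian are assumptions, since the text only states that the distribution has zero mean.

```python
import math
import random


def edge_probability(u, temperature):
    """Equation 5.2, rewritten as a logistic function of (2u - 1) / T."""
    z = (2.0 * u - 1.0) / temperature
    # Numerically stable logistic: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sample_edge_probability(temperature, sigma=1.0, rng=None):
    """Draw u from a zero-mean Gaussian (sigma is assumed) and map it to P(e)."""
    rng = rng or random.Random(0)
    u = rng.gauss(0.0, sigma)
    return edge_probability(u, temperature)


if __name__ == "__main__":
    for t in (1.0, 0.1, 0.01):
        # The same u values are pushed closer to 0 or 1 as T is lowered.
        print(t, [round(edge_probability(u, t), 4) for u in (0.1, 0.5, 0.9)])
```

The loop shows the discriminating effect of the temperature described in the text: at $u = 0.5$ the probability stays at 0.5, while values of $u$ on either side are driven towards 0 or 1 as $T$ decreases.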
