Improving Acyclic Selection Order-Based Bayesian Network Structure Learning

Walter Perez Urcia, Denis Deratani Mauá
Instituto de Matemática e Estatística, Universidade de São Paulo, Brazil
wperez@ime.usp.br, denis.maua@usp.br

Abstract. An effective approach for learning Bayesian network structures in large domains is to perform a local search on the space of topological orderings. As with most local search approaches, the quality of the procedure depends on the initialization strategy. Usually, a simple random initialization is adopted. Perez and Mauá developed initialization heuristics that were empirically shown to improve the overall performance of order-based structure learning. Recently, Scanagatta et al. proposed replacing the search for a directed acyclic graph in order-based learning with a procedure that also considers order-incompatible structures. Their procedure covers a larger space of structures with small computational overhead, which often leads to improved performance. As with standard order-based learning, Scanagatta et al. recommended initializing their algorithm with a randomly generated ordering. A natural improvement for this approach is thus to consider better initialization heuristics. In this work we propose a new initialization heuristic that takes into account the idiosyncrasies of Scanagatta et al.'s approach. Experiments with real-world data sets indicate that the combination of this new heuristic and Scanagatta et al.'s order-based search outperforms other order-based methods.

1. Introduction

Bayesian networks are compact representations of multivariate probability distributions [Pearl 1988]. They are defined by two components: a directed acyclic graph (DAG), and a collection of local conditional probability distributions, one for each variable given its parents. Manually specifying a Bayesian network is an error-prone and time-consuming task, and practitioners often resort to automatically learning (i.e., inferring) the model from a data set of observations. This learning is often performed in two steps. In the structure learning step, one obtains a DAG (called the structure) representing the relations between the variables in the data. Then, in the parameter learning step, the associated conditional probabilities are estimated assuming the learned DAG to be true. When the data are complete (i.e., there are no missing values), the latter step amounts to computing relative frequencies from the data, and can thus be performed efficiently.

An effective approach for Bayesian network structure learning is to search the space of DAGs guided by a score function [Cooper and Herskovits 1992, Lam and Bacchus 1994, Friedman et al. 1999, Chickering 2002].

Motivated by the fact that the search over DAGs is tractable when an ordering over the variables is fixed and the node in-degree is bounded [Buntine 1991], Teyssier and Koller (2005) proposed searching in the space of topological orderings by associating each ordering with a compatible DAG. Their order-based search (OBS) performs a local search in the space of complete variable orderings, where two orderings are neighbors if they differ in at most one comparison. Each ordering is associated with a compatible DAG found by searching for the score-maximizing parents of each node that respect that ordering and the in-degree bound.

As with most local search approaches, the choice of an initial solution (i.e., an ordering) is crucial to the quality of the solution produced by OBS. The search is usually initialized with an ordering sampled uniformly at random. While this guarantees a fair coverage of the search space, it can lead to poor local optima, slow convergence and ultimately poor performance. To alleviate these issues, we have recently advocated the use of informed heuristics for the generation of initial solutions [Perez and Mauá 2015]. We developed two heuristics based on the solution of the relaxed version of the structure learning problem in which cycles are permitted. This relaxed solution can be obtained efficiently by greedily and independently selecting the parents of each node so as to maximize the score. The first heuristic generates an ordering by a depth-first search (DFS) traversal of the relaxed solution, using the in-degree of nodes to break ties. The second heuristic considers a weighted version of the relaxed solution where the weight of an arc measures the decrease in the overall score incurred by removing that arc. A standard greedy algorithm is used to find a minimum-cost feedback arc set (FAS) and obtain a DAG. This DAG is finally used to generate a consistent topological ordering. Empirical results with a large collection of data sets showed that both heuristics improve the performance of OBS, with the FAS-based heuristic consistently outperforming the DFS-based heuristic.

Recently, Scanagatta et al. (2015) proposed a different approach for associating a DAG with an ordering. Their approach, called acyclic selection order-based search (ASOBS), differs from OBS as follows: given a fixed ordering, a DAG is selected by iterating from the last to the first node in the ordering, selecting the score-maximizing parent set for the current node that does not introduce cycles in the incumbent DAG. Thus, the associated DAG need not respect the given ordering. They reported improved performance over OBS on a collection of real-world data sets.

In this work, we develop a new initialization heuristic that is based on a relaxed version of the optimization solved by ASOBS. Thus, unlike the FAS- and DFS-based heuristics, our heuristic is tailored to ASOBS. We compare the heuristics and search techniques on six real-world data sets containing from 64 to 1556 variables. The results suggest that this new heuristic very often finds higher-scoring DAGs than the previous heuristics, including the standard implementation of ASOBS (which uses randomly generated orderings).

The rest of this document is organized as follows. We begin in Section 2 with a review of the Bayesian network structure learning problem and of OBS. We then briefly explain ASOBS in Section 3. In Section 4, we describe the initialization heuristics we developed in previous work. Our new heuristic is presented in Section 5. An empirical analysis of the heuristics is shown in Section 6. We conclude the paper in Section 7.
2. Order-Based Bayesian Network Structure Learning

A Bayesian network specification contains a DAG G = (V, E), where V = {X_1, X_2, ..., X_n} is the set of (categorical) random variables, and a collection of conditional probability distributions P(X_i | Pa^G_i), i = 1, ..., n, where Pa^G_i denotes the parents of X_i in G and P(X_i | ∅) = P(X_i).

The Bayesian network is assumed to induce a joint probability distribution over all the variables through the equation

P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i | Pa^G_i).

The number of parameters required to specify a Bayesian network with DAG G is

size(G) = \sum_{i=1}^{n} (r_i - 1) \prod_{X_j \in Pa^G_i} r_j,

where r_k denotes the number of states variable X_k can assume.

A score function sc(G) assigns a real value to any DAG, indicating its goodness in representing a given data set.¹ Typical score functions combine a data likelihood component with a model complexity penalization to prevent overfitting [Akaike 1974, Schwarz 1978, Suzuki 1996, Heckerman et al. 1995]. Notable examples are the Bayesian Information Criterion (BIC) [Schwarz 1978], the Akaike Information Criterion (AIC) [Akaike 1974], the Minimum Description Length (MDL) [Suzuki 1996] and the Bayesian Dirichlet score (BD) [Heckerman et al. 1995]. In particular, the BIC and MDL scores were shown to outperform other scores at recovering the true structure [Liu et al. 2012]. We assume the score function is decomposable, meaning that it can be written as a sum of local scores, sc(G) = \sum_{i=1}^{n} sc_i(Pa^G_i). Most score functions used in structure learning satisfy this property, including BIC, MDL and BD [Chickering et al. 2004]. For example, the BIC score function is given by

BIC(G) = LL(G) - \frac{\ln N}{2} size(G) = \sum_{i=1}^{n} \left[ \sum_{j} \sum_{k} N_{ijk} \ln \frac{N_{ijk}}{N_{ij}} - \frac{\ln N}{2} (r_i - 1) \prod_{X_j \in Pa^G_i} r_j \right],

where LL(G) is the data log-likelihood, N_{ijk} is the number of instances where variable X_i takes its kth value and its parents take their jth configuration (for some arbitrary fixed ordering of the configurations of the parents' values), and N_{ij} = \sum_k N_{ijk}.

The Bayesian network structure learning problem is to find a DAG G* that satisfies

sc(G*) = \max_{G \text{ is a DAG}} \sum_{i=1}^{n} sc_i(Pa^G_i). (1)

Since a directed graph is acyclic if and only if it admits a topological ordering, Equation (1) can be rewritten as

sc(G*) = \max_{<} \sum_{i=1}^{n} \max_{Y \subseteq \{X_j : X_j < X_i\}} sc_i(Y), (2)

where the first maximization is performed over the space of (total) orderings of the nodes in V. The search space of this formulation is considerably smaller than the search space in Equation (1) [Teyssier and Koller 2005]. When the in-degree of G is assumed to be bounded by an integer k, the inner maximizations can be performed by systematic search in time O(n^k) [Scanagatta et al. 2015].

¹ The dependence of the scoring function on the data set is usually left implicit, as for most of this explanation we can assume a fixed data set. We assume here that the data set contains no missing values.
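To make these definitions concrete, the sketch below (illustrative Python, not the authors' implementation) computes the local BIC score sc_i(Pa_i) from a complete data set and performs the inner maximization of Equation (2) for a single variable, i.e., the search over parent sets drawn from its predecessors with cardinality at most k. The function and variable names are assumptions made only for this example.

import math
from itertools import combinations
from collections import Counter

def local_bic(data, i, parents, arities):
    """Local BIC score sc_i(Pa_i) computed from complete data.
    data: list of tuples of ints (one per instance); arities[j]: number of states of X_j."""
    N = len(data)
    # N_ijk: counts of (parent configuration j, child value k); N_ij: counts of configuration j.
    n_ijk = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    # Log-likelihood term: sum_j sum_k N_ijk ln(N_ijk / N_ij).
    ll = sum(c * math.log(c / n_ij[j]) for (j, _), c in n_ijk.items())
    # Penalty term: (ln N / 2) (r_i - 1) prod_{X_j in Pa_i} r_j.
    q = 1
    for p in parents:
        q *= arities[p]
    return ll - 0.5 * math.log(N) * (arities[i] - 1) * q

def best_parent_set(data, i, predecessors, arities, k):
    """Inner maximization of Equation (2): best parent set for X_i among the
    subsets of its predecessors with at most k elements (O(n^k) candidates)."""
    best = (local_bic(data, i, (), arities), ())
    for size in range(1, k + 1):
        for pa in combinations(predecessors, size):
            best = max(best, (local_bic(data, i, pa, arities), pa))
    return best

Summing the scores returned by best_parent_set over all variables under a fixed ordering yields the score of that ordering used by the order-based methods discussed next.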

In particular, for the BIC and MDL score functions, the (local) score-maximizing parent set of any variable (under any fixed ordering) has at most log N variables, where N is the size of the data set [de Campos and Ji 2011]. Hence, a small bound k is often a reasonable simplification. Pruning rules can also be used to speed up parent set selection [de Campos and Ji 2011, Yuan and Malone 2013].

Teyssier and Koller (2005) used a local search procedure to solve the outer optimization in Equation (2). Define the score of an ordering < as

sc(<) = \sum_{i=1}^{n} \max_{Y \subseteq \{X_j : X_j < X_i\},\ |Y| \le k} sc_i(Y).

At each iteration, Teyssier and Koller's local search evaluates all candidate orderings <' that differ from < in a single comparison, and either selects the one with the highest increase in score (if it exists) or halts (if none is found). Given two orderings < and <' that differ in a single comparison, such that X < Y and Y <' X, the relative increase in score is

sc(<') - sc(<) = \left[ \max_{Z \subseteq \{X_j : X_j <' X\},\ |Z| \le k} sc_X(Z) + \max_{Z \subseteq \{X_j : X_j <' Y\},\ |Z| \le k} sc_Y(Z) \right] - \left[ \max_{Z \subseteq \{X_j : X_j < X\},\ |Z| \le k} sc_X(Z) + \max_{Z \subseteq \{X_j : X_j < Y\},\ |Z| \le k} sc_Y(Z) \right].

Thus, the best neighbor ordering can be selected efficiently, which makes OBS highly effective.

3. Acyclic Selection Order-Based Search

As discussed, given an ordering < over the variables, OBS performs parent set selection by independently searching, for each variable X, for the score-maximizing subset of the variables smaller than X. While this decomposition of the parent selection step ensures linear time (for bounded in-degree), it imposes unnecessary constraints and reduces the effectively searched space. To see this, consider a simple example where three variables are ordered such that A < B < C, and suppose that the best parent set for B is the empty set. Then considering B as a candidate parent of A still leads to a valid solution (a DAG), while increasing the coverage of DAGs. In other words, once a choice of parent set is made for some of the variables, the constraint imposed by the order on the remaining variables can be partially relaxed.

Acyclic selection order-based search (ASOBS) builds on this idea to improve the performance of OBS [Scanagatta et al. 2015]. Instead of maximizing parent sets independently, ASOBS selects parents sequentially, from the last to the first variable in the ordering. More precisely, fix an ordering X_1 < X_2 < ... < X_n and start with a graph with no arcs. Then, for i = n down to i = 1, ASOBS searches for the best parent set for X_i that does not induce a cycle in the incumbent graph, and adds the corresponding arcs to it. This procedure can be implemented in linear time, thus matching the asymptotic performance of OBS [Scanagatta et al. 2015]. Scanagatta et al. showed that ASOBS outperforms OBS on a large collection of real-world data sets.
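The acyclic selection step can be illustrated with the following minimal Python sketch (not the authors' implementation): candidate_sets[v] is assumed to hold the pre-computed candidate parent sets of variable v sorted by decreasing local score, and acyclicity is checked with a plain reachability test. An efficient implementation would instead maintain ancestor sets (e.g., as bitsets).

def asobs_select(order, candidate_sets):
    """Acyclic selection: visit variables from last to first in `order` and greedily
    pick the best-scoring parent set that keeps the incumbent graph acyclic.

    order:          list of variables (a total ordering)
    candidate_sets: dict mapping variable -> list of (score, parent_tuple) sorted
                    by decreasing score (the empty set should be among the candidates)
    Returns a dict mapping each variable to its chosen parent tuple.
    """
    parents = {v: () for v in order}   # incumbent graph, initially without arcs

    def reaches(src, dst):
        # True if dst is reachable from src following arcs parent -> child.
        stack, seen = [src], set()
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            if u in seen:
                continue
            seen.add(u)
            stack.extend(w for w in parents if u in parents[w])  # children of u
        return False

    for v in reversed(order):          # from the last to the first variable
        for score, pa in candidate_sets[v]:
            # Adding the arcs pa -> v creates a cycle iff v already reaches some p in pa.
            if all(not reaches(v, p) for p in pa):
                parents[v] = pa
                break
    return parents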

4. Informed Initialization Heuristics

The generation of a good initial solution is crucial for avoiding convergence to poor local maxima in order-based structure learning. Traditionally, this is attempted by randomly generating initial orderings using standard techniques such as the Fisher-Yates algorithm [Knuth 1998]. While this guarantees a good coverage of the search space when sufficiently many restarts are performed, in large domains it can lead to poor solutions and require many iterations until a local optimum is reached. In previous work, we proposed using the information provided by the relaxed version of the problem,

H*_k = \arg\max_{H} \sum_{i=1}^{n} sc_i(Pa^H_i), (3)

where H ranges over all directed graphs (possibly containing cycles) with in-degree at most k, to guide the generation of initial solutions (variable orderings) [Perez and Mauá 2015]. Since acyclicity is not enforced, H*_k can be obtained efficiently by independently selecting, for each variable, the parent set of cardinality at most k that maximizes its local score. Based on this relaxed solution, we developed two heuristics, which we review next.

4.1. Depth-First Based Heuristic

The idea behind this heuristic is to perform a depth-first search in the graph H*_k, using the in-degree of child nodes to select which child to visit next. To understand why this may generate good orderings, consider a graph H*_k with nodes X_i and X_j such that X_i is the single parent of X_j and has no parents itself. Then there is an optimal ordering starting with X_i (this can easily be shown by contradiction). We can delete X_i from the graph and repeat the argument to conclude that there is an optimal ordering starting with X_i < X_j. Now consider a case where there are two or more selectable nodes (in the sense of the previous argument) in the graph H*_k. Instead of picking a selectable node at random, we define the goodness of a node as

goodness(X_i) = \sum_{X_j \in Ch^H_i \cap unvisited} |Pa^H_j \cap unvisited|, (4)

where Ch^H_i is the set of X_i's children in H*_k and unvisited is the set of nodes not yet visited. Small values of goodness mean that removing X_i from the graph makes more nodes selectable. Ties are resolved by picking one of the best selectable nodes uniformly at random. The heuristic is described more precisely in the pseudo-code of Algorithm 1.

Algorithm 1: DFS-based ordering generation.
  Function DFS(Graph H):
    unvisited ← all nodes
    L ← empty list
    while unvisited is not empty do
      O ← unvisited nodes ordered by unvisited in-degree and goodness
      B ← best nodes from O
      if B has more than one node then
        select a node X_r from B uniformly at random
      else
        select the unique node X_r from B
      append X_r to L
      unvisited ← unvisited \ {X_r}
    return L

Figure 1. An example of a weighted parent set graph over the variables A-F.

For example, in the graph in Figure 1, we can safely constrain the orderings to start with A, since it has no parents, and then remove it from the graph. At this point, we have three selectable nodes, B, C and F, each with the same in-degree but with different goodness values. Since F has the smallest goodness value, we select it. Repeating these steps, we obtain the candidate optimal orderings A, F, C, E, B, D and A, F, C, E, D, B. Note that this is a significant decrease from the full space of 6! = 720 possible orderings. This difference is likely to increase considerably as the number of variables grows and as the best parent sets become sparser.
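A compact Python rendering of Algorithm 1 follows. It is a sketch rather than the authors' implementation, and assumes the relaxed graph H*_k is given as a dict mapping each node to the set of its parents.

import random

def dfs_ordering(parents):
    """DFS-based ordering generation (Algorithm 1 sketch).
    parents: dict mapping node -> set of parents in the relaxed graph H*_k."""
    children = {v: set() for v in parents}
    for v, pa in parents.items():
        for p in pa:
            children[p].add(v)

    unvisited = set(parents)
    order = []
    while unvisited:
        def key(v):
            indeg = len(parents[v] & unvisited)           # unvisited in-degree
            goodness = sum(len(parents[c] & unvisited)    # Equation (4)
                           for c in children[v] & unvisited)
            return (indeg, goodness)
        best = min(key(v) for v in unvisited)
        candidates = [v for v in unvisited if key(v) == best]
        chosen = random.choice(candidates)                # break ties uniformly at random
        order.append(chosen)
        unvisited.discard(chosen)
    return order

Nodes with no unvisited parents are selected first; among them, preference goes to the node whose removal makes the largest number of other nodes selectable (smallest goodness).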

4.2. Feedback-Arc Set Heuristic

The DFS-based approach can be seen as removing edges from the graph H*_k so as to make it a DAG (more specifically, a tree), and then extracting a consistent topological ordering. The selection of an edge to remove considers only the qualitative structure of the graph. An arguably better approach is to use the score function to assess the relevance of each edge, and to consider the removal of edges globally (not only within a local neighborhood). The relevance of an edge X_j → X_i in a graph H is assessed by

W_{ji} = sc_i(Pa^H_i) - sc_i(Pa^H_i \setminus \{X_j\}). (5)

The weight W_{ji} represents the cost of removing X_j from the set Pa^H_i, and it is always a positive number if H is the best parent set graph (Equation (3)), since Pa^H_i maximizes the score for X_i. A small value of W_{ji} suggests that the parent X_j is not very relevant to X_i. For instance, in the weighted graph in Figure 1, the edge C → D is less relevant than the edge B → D, which in turn is less relevant than the edge A → D.

The feedback-arc set heuristic (FAS) penalizes orderings that violate an edge X_i → X_j in H by its associated cost W_{ij}. Given a directed graph H = (V, E), a set F ⊆ E is called a feedback arc set (FAS) if every (directed) cycle of H contains at least one edge in F. In other words, F is an edge set whose removal makes the graph H acyclic [Demetrescu and Finocchi 2003]. If we assume that the cost of an ordering of H is the sum of the weights of the violated (or removed) edges, we can formulate the problem of finding a minimum-cost ordering of H as a minimum-cost feedback arc set problem (min-cost FAS): given the weighted directed graph H with weights W_{ij}, find

F* = \arg\min_{F \subseteq E:\ H - F \text{ is a DAG}} \sum_{X_i \to X_j \in F} W_{ij}. (6)
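The weighted graph handed to the min-cost FAS procedure can be assembled directly from the pre-computed local scores. Below is a small Python sketch, under the assumption that scores[i] maps each candidate parent tuple of X_i (stored as a sorted tuple) to its local score; the names are illustrative, not the authors' API.

def relaxed_graph(scores):
    """Best (possibly cyclic) parent set graph H*_k of Equation (3): each variable
    independently picks its highest-scoring candidate parent set."""
    return {i: max(cand, key=cand.get) for i, cand in scores.items()}

def arc_weights(scores, parents):
    """Weights W_ji = sc_i(Pa_i) - sc_i(Pa_i \\ {X_j}) of Equation (5) for every arc
    X_j -> X_i of the relaxed graph. Assumes the reduced parent set is also among
    the scored candidates (which may not hold after aggressive pruning)."""
    weights = {}
    for i, pa in parents.items():
        for j in pa:
            reduced = tuple(p for p in pa if p != j)
            weights[(j, i)] = scores[i][pa] - scores[i][reduced]
    return weights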

The min-cost FAS problem has been proved to be NP-complete [Gavril 1977], but there are efficient and effective approximation algorithms [Eades et al. 1993, Eades and Lin 1995, Demetrescu and Finocchi 2003], such as the one shown in Algorithm 2, which runs in time O(nm), where m is the number of edges in the graph.

Algorithm 2: Minimum-cost FAS approximation.
  Input: weighted directed graph H
  Output: feedback arc set F
  F ← ∅
  while there is a cycle C in H do
    W_min ← min_{(u,v) ∈ C} W_uv
    for each (u, v) ∈ C do
      W_uv ← W_uv − W_min
      if W_uv = 0 then
        remove (u, v) from H and add it to F
  for each (u, v) ∈ F do
    if adding (u, v) back to H does not create a cycle then
      H ← H + (u, v)
      F ← F \ {(u, v)}

In short, the FAS heuristic is: take the weighted graph H*_k with weights W_ij as input and find an (approximate) min-cost FAS F; remove the edges in F from H*_k and return a topological ordering of the DAG H*_k − F (this can be done by a depth-first search traversal starting at the root nodes).
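For concreteness, the following Python sketch implements the approximation of Algorithm 2 (an illustrative rendering, not the authors' code). The graph is represented as a dict mapping each arc (u, v) to its positive weight, and cycles are found with a simple recursive DFS.

def find_cycle(arcs):
    """Return the list of arcs of some directed cycle, or None if the graph is acyclic.
    arcs: dict mapping (u, v) -> weight."""
    adj = {}
    for (u, v) in arcs:
        adj.setdefault(u, []).append(v)

    def dfs(u, stack, on_stack, visited):
        visited.add(u)
        on_stack.add(u)
        stack.append(u)
        for w in adj.get(u, ()):
            if w in on_stack:                  # back arc found: extract the cycle
                i = stack.index(w)
                cyc = stack[i:] + [w]
                return [(cyc[j], cyc[j + 1]) for j in range(len(cyc) - 1)]
            if w not in visited:
                res = dfs(w, stack, on_stack, visited)
                if res:
                    return res
        on_stack.discard(u)
        stack.pop()
        return None

    visited = set()
    for node in list(adj):
        if node not in visited:
            res = dfs(node, [], set(), visited)
            if res:
                return res
    return None

def approx_min_cost_fas(weights):
    """Greedy approximation of a minimum-cost feedback arc set (Algorithm 2 sketch).
    weights: dict mapping arc (u, v) -> positive weight W_uv. Returns a set of arcs."""
    residual = dict(weights)
    fas = set()
    cycle = find_cycle(residual)
    while cycle is not None:
        w_min = min(residual[a] for a in cycle)        # cheapest arc on the cycle
        for a in cycle:
            residual[a] -= w_min
            if residual[a] <= 0:                       # zero residual: move arc to the FAS
                del residual[a]
                fas.add(a)
        cycle = find_cycle(residual)
    # Reinsertion pass: put back removed arcs that no longer close a cycle.
    for a in sorted(fas, key=lambda a: -weights[a]):
        if find_cycle({**residual, a: weights[a]}) is None:
            residual[a] = weights[a]
            fas.discard(a)
    return fas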

5. A New Initialization Heuristic for ASOBS: A Best-First Based Approach

Whereas the DFS- and FAS-based heuristics provide a significant improvement in the quality of the solutions found by OBS within a fixed amount of time, they yield only a marginal gain in performance when ASOBS is adopted. One possible explanation is that ASOBS performs parent set selection under a (dynamically chosen) variable ordering, hence biasing the search towards specific orderings can actually hurt performance.

Motivated by this explanation, we propose the Best-First-Based initialization heuristic (BFT), described in Algorithm 3. The algorithm takes a collection of candidate parent sets for each variable (e.g., for each X_i, the subsets of all variables other than X_i with cardinality at most a given k). The heuristic initially labels all nodes as not visited (Line 1) and represents an ordering as a list L (initially empty; Line 2). Say that a parent set is valid if it does not contain visited nodes. The loop in Lines 3 to 12 then generates an ordering as follows. In Line 4, the best valid parent set, with score

bestscore^visited_i = \max_{Pa_i \subseteq \{X_1, ..., X_n\} \setminus visited} sc_i(Pa_i),

is found for each non-visited variable X_i, and the variables are ranked by these scores in decreasing order (Line 5). A variable is then drawn with probability proportional to the reciprocal of its rank (Lines 6 to 9), and the ordering and the set of visited nodes are updated (Lines 10 and 11, respectively).

Algorithm 3: Best-First-Based ordering generation.
  Function BestFirst(candidate parent sets C_i):
  1   visited ← ∅
  2   L ← empty list of size n
  3   for r = n down to 1 do
  4       S ← {(X_i, bestscore^visited_i) : X_i ∉ visited}
  5       sort S by bestscore^visited_i in decreasing order
  6       for j = 1 to |S| do
  7           Prob_j ← 1/j
  8       end
  9       X_r ← variable drawn from S according to the (normalized) distribution Prob
  10      L[r] ← X_r
  11      visited ← visited ∪ {X_r}
  12  end
  13  return L

The time complexity of the procedure is dominated by the selection of the best valid parent set for each variable in Line 4. Assuming pre-computed scores of parent sets with in-degree bounded by k, and using the efficient bitset-based implementation of bestscore^visited_i developed by [Malone 2012], the complexity is as follows. Line 4 is performed in worst-case time O(nk), Line 5 in time O(n log n), the loop in Lines 6 to 8 in time O(n), and Lines 9 to 11 in constant time. Since all steps are repeated at each of the n iterations, the overall complexity is O(n²k).
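The following Python sketch illustrates the BFT heuristic (again an illustrative rendering under assumed data structures, not the authors' bitset-based implementation): scores[v] maps each candidate parent tuple of variable v to its local score.

import random

def bft_ordering(scores, variables):
    """Best-First-Based ordering generation (Algorithm 3 sketch).

    scores:    dict variable -> dict {parent_tuple: local score}
    variables: list of all variables
    Returns a list representing the generated ordering (first to last position).
    """
    def best_valid_score(v, visited):
        # Best score among candidate parent sets containing no visited variable.
        valid = [s for pa, s in scores[v].items()
                 if not any(p in visited for p in pa)]
        return max(valid) if valid else float("-inf")

    visited = set()
    chosen_last_to_first = []                    # positions are filled from the end
    for _ in range(len(variables)):
        ranked = sorted((v for v in variables if v not in visited),
                        key=lambda v: best_valid_score(v, visited),
                        reverse=True)
        # Draw a variable with probability proportional to 1/rank.
        probs = [1.0 / (j + 1) for j in range(len(ranked))]
        chosen = random.choices(ranked, weights=probs, k=1)[0]
        chosen_last_to_first.append(chosen)
        visited.add(chosen)
    return list(reversed(chosen_last_to_first))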

6. Experiments, Results and Discussion

We evaluate the performance of (acyclic selection) order-based structure learning algorithms with different initialization heuristics: the standard random generation (RND), the depth-first search (DFS) and feedback arc set (FAS) based heuristics [Perez and Mauá 2015], and the new heuristic (BFT). We use a collection of real-world data sets whose characteristics are shown in Table 1. The algorithms were implemented in C++, using a few utilities from the URLearning package for learning Bayesian networks. For each data set we generated all parent sets in increasing order of cardinality, with a time limit of 2 minutes per variable. We used pruning rules to discard parent sets that are provably suboptimal (under the optimal ordering) [de Campos and Ji 2011]. This significantly reduces the number of candidate parent sets and improves performance. The number of non-pruned candidate parent sets generated is shown in column Nps of Table 1.

Table 1. Data set characteristics. n: number of variables, N: number of instances, Nps: number of unpruned parent sets. The six data sets are Kdd, Tretail, Msweb, Cr52, Bbc and Ad.

We ran each search algorithm (ASOBS or OBS) with 100 different initial orderings obtained by the different heuristics. The maximum and average relative scores for each data set and algorithm appear in Table 2. The relative score of a DAG G is RC(G) = (sc(G) − sc(∅)) / |sc(∅)|, where sc(∅) is the score of the empty DAG.

Looking at the results in Table 2, we see that BFT outperforms the other approaches w.r.t. the maximum best score on 5 out of 6 data sets. The Ad data set, on which FAS outperforms BFT, has a relatively small search space (see its Nps in Table 1), which might explain the superior performance of FAS+ASOBS there. W.r.t. the average best score, FAS and BFT perform fairly similarly and are superior to the other approaches; again, FAS outperforms BFT on the Ad data set. Overall, ASOBS+BFT achieves higher scores than OBS+FAS, and the improvement is more noticeable as the number of variables increases.

To verify whether the performance differences are statistically significant, we performed a Friedman test, a non-parametric hypothesis test with multiple comparison correction [Demsar 2006]. According to the computed p-values, there is a statistically significantly better method for the first criterion (maximum best score) at the adopted significance level. To obtain insight into the relative performance of the methods, we performed a post-hoc Nemenyi test, which performs pairwise comparisons with multiple comparison correction [Demsar 2006]. The results for maximum and average best score are depicted in Figures 2 and 3, respectively. Each point represents the average ranking of an approach, and the intervals represent confidence intervals. In these figures, a method A is considered statistically significantly better than a method B if A has a smaller average ranking than B and their intervals do not overlap.

Figure 2. Maximum Best Score (average rankings of BFT, RND, DFS and FAS with confidence intervals).
Figure 3. Average Best Score (average rankings of BFT, RND, DFS and FAS with confidence intervals).

We see from Figure 2 that, w.r.t. the maximum best score, BFT ranks better on average than the other methods, but the difference is statistically significant only when compared to DFS (the confidence intervals overlap for the other heuristics). Figure 3 shows that, w.r.t. the average best score, BFT and FAS rank better than the other methods, but these differences are not statistically significant.
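The relative score and the per-heuristic summary statistics reported in Table 2 are straightforward to compute; a tiny Python sketch (with assumed variable names) follows.

def relative_score(score, empty_score):
    """RC(G) = (sc(G) - sc(empty)) / |sc(empty)|, where empty_score = sc(empty DAG)."""
    return (score - empty_score) / abs(empty_score)

def summarize(run_scores, empty_score):
    """Maximum and average relative score over the results of the 100 restarts."""
    rcs = [relative_score(s, empty_score) for s in run_scores]
    return max(rcs), sum(rcs) / len(rcs)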

Table 2. Performance of the order-based search algorithms (ASOBS and OBS) for each data set (Kdd, Tretail, Msweb, Cr52, Bbc, Ad) and initialization heuristic (RND, DFS, FAS, BFT), reporting the maximum and the average (± standard deviation) best relative scores; the best method for each data set/criterion is shown in bold.

7. Conclusions and Future Work

Learning Bayesian networks from data is a notably difficult problem, and practitioners often resort to approximate solutions. A state-of-the-art approach for large domains is order-based structure learning, which performs a local search in the space of variable orderings. Acyclic selection is a scalable order-based algorithm that aims to be the new state of the art. As with many local search approaches, the quality of the solutions produced by order-based learning strongly depends on the initialization strategy.

In this work, we proposed a new informed heuristic for generating initial solutions for order-based structure learning. The heuristic, called BFT (from Best-First), mimics the way acyclic selection performs structure search given an initial ordering. Experiments with 6 real-world data sets containing from 64 to 1556 variables demonstrate that the new heuristic often significantly improves the accuracy of acyclic selection order-based search. In summary, these results indicate the advantage of using informed approaches for generating initial orderings in large domains.

In the future, we intend to compare the heuristics on a much larger collection of data sets, in order to obtain a clearer (and statistically significant) picture of their relative performance. Our new informed initialization heuristic can also be used in other methods that search the space of orderings, such as tabu search [Glover 1989], simulated annealing [Granville et al. 1994] or data perturbation [Elidan et al. 2002]. This is also left as future work.

8. Acknowledgements

We thank Mauro Scanagatta for kindly providing us with the data sets. This work benefited from the computing resources provided by the Superintendência de Informação of Universidade de São Paulo. The first author was partially supported by CAPES. The second author was partially supported by the São Paulo Research Foundation (FAPESP) grant #2016/.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6).

Buntine, W. (1991). Theory refinement on Bayesian networks. In Proceedings of the 7th Annual Conference on Uncertainty in Artificial Intelligence.

Chickering, D. M. (2002). Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2.

Chickering, D. M., Heckerman, D., and Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5(1).

Cooper, G. F. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4).

de Campos, C. P. and Ji, Q. (2011). Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12.

Demetrescu, C. and Finocchi, I. (2003). Combinatorial algorithms for feedback problems in directed graphs. Information Processing Letters, 86(3).

Demsar, J. (2006). Statistical comparison of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30.

Eades, P. and Lin, X. (1995). A new heuristic for the feedback arc set problem. Australian Journal of Combinatorics, 12.

Eades, P., Lin, X., and Smyth, W. F. (1993). A fast and effective heuristic for the feedback arc set problem. Information Processing Letters, 47(6).

Elidan, G., Ninio, M., Friedman, N., and Schuurmans, D. (2002). Data perturbation for escaping local maxima in learning. In Proceedings of the Eighteenth National Conference on Artificial Intelligence.

Friedman, N., Nachman, I., and Peér, D. (1999). Learning Bayesian network structure from massive datasets: The sparse candidate algorithm. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.

Gavril, F. (1977). Some NP-complete problems on graphs. In Proceedings of the 11th Conference on Information Sciences and Systems.

Glover, F. (1989). Tabu search - part I. Operations Research Society of America Journal on Computing, 1(3).

Granville, V., Krivanek, M., and Rasson, J. P. (1994). Simulated annealing: A proof of convergence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6).

Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3).

Knuth, D. (1998). The Art of Computer Programming, Volume 2. Boston: Addison-Wesley.

Lam, W. and Bacchus, F. (1994). Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10(4).

Liu, Z., Malone, B., and Yuan, C. (2012). Empirical evaluation of scoring functions for Bayesian network model selection. BMC Bioinformatics, 13(Suppl. 15):S14.

Malone, B. M. (2012). Learning optimal Bayesian networks with heuristic search. PhD thesis, Mississippi State University, Mississippi, USA.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Perez, W. and Mauá, D. D. (2015). Initialization heuristics for greedy Bayesian network structure learning. In Proceedings of the 3rd Symposium on Knowledge Discovery, Mining and Learning.

Scanagatta, M., de Campos, C. P., Corani, G., and Zaffalon, M. (2015). Learning Bayesian networks with thousands of variables. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2).

Suzuki, J. (1996). Learning Bayesian belief networks based on the minimum description length principle. In Proceedings of the Thirteenth International Conference on Machine Learning.

Teyssier, M. and Koller, D. (2005). Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI 2005).

Yuan, C. and Malone, B. (2013). Learning optimal Bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48.


More information

Survey of contemporary Bayesian Network Structure Learning methods

Survey of contemporary Bayesian Network Structure Learning methods Survey of contemporary Bayesian Network Structure Learning methods Ligon Liu September 2015 Ligon Liu (CUNY) Survey on Bayesian Network Structure Learning (slide 1) September 2015 1 / 38 Bayesian Network

More information

Constraint Satisfaction Problems

Constraint Satisfaction Problems Constraint Satisfaction Problems CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2013 Soleymani Course material: Artificial Intelligence: A Modern Approach, 3 rd Edition,

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

An Exact Approach to Learning Probabilistic Relational Model

An Exact Approach to Learning Probabilistic Relational Model JMLR: Workshop and Conference Proceedings vol 52, 171-182, 2016 PGM 2016 An Exact Approach to Learning Probabilistic Relational Model Nourhene Ettouzi LARODEC, ISG Sousse, Tunisia Philippe Leray LINA,

More information

arxiv: v2 [cs.ds] 25 Jan 2017

arxiv: v2 [cs.ds] 25 Jan 2017 d-hop Dominating Set for Directed Graph with in-degree Bounded by One arxiv:1404.6890v2 [cs.ds] 25 Jan 2017 Joydeep Banerjee, Arun Das, and Arunabha Sen School of Computing, Informatics and Decision System

More information

COS 513: Foundations of Probabilistic Modeling. Lecture 5

COS 513: Foundations of Probabilistic Modeling. Lecture 5 COS 513: Foundations of Probabilistic Modeling Young-suk Lee 1 Administrative Midterm report is due Oct. 29 th. Recitation is at 4:26pm in Friend 108. Lecture 5 R is a computer language for statistical

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Information Criteria Methods in SAS for Multiple Linear Regression Models

Information Criteria Methods in SAS for Multiple Linear Regression Models Paper SA5 Information Criteria Methods in SAS for Multiple Linear Regression Models Dennis J. Beal, Science Applications International Corporation, Oak Ridge, TN ABSTRACT SAS 9.1 calculates Akaike s Information

More information

Randomized rounding of semidefinite programs and primal-dual method for integer linear programming. Reza Moosavi Dr. Saeedeh Parsaeefard Dec.

Randomized rounding of semidefinite programs and primal-dual method for integer linear programming. Reza Moosavi Dr. Saeedeh Parsaeefard Dec. Randomized rounding of semidefinite programs and primal-dual method for integer linear programming Dr. Saeedeh Parsaeefard 1 2 3 4 Semidefinite Programming () 1 Integer Programming integer programming

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational

More information

4 INFORMED SEARCH AND EXPLORATION. 4.1 Heuristic Search Strategies

4 INFORMED SEARCH AND EXPLORATION. 4.1 Heuristic Search Strategies 55 4 INFORMED SEARCH AND EXPLORATION We now consider informed search that uses problem-specific knowledge beyond the definition of the problem itself This information helps to find solutions more efficiently

More information

Advances in Learning Bayesian Networks of Bounded Treewidth

Advances in Learning Bayesian Networks of Bounded Treewidth Advances in Learning Bayesian Networks of Bounded Treewidth Siqi Nie Rensselaer Polytechnic Institute Troy, NY, USA nies@rpi.edu Cassio P. de Campos Queen s University Belfast Belfast, UK c.decampos@qub.ac.uk

More information

Name: UW CSE 473 Midterm, Fall 2014

Name: UW CSE 473 Midterm, Fall 2014 Instructions Please answer clearly and succinctly. If an explanation is requested, think carefully before writing. Points may be removed for rambling answers. If a question is unclear or ambiguous, feel

More information

An efficient approach for finding the MPE in belief networks

An efficient approach for finding the MPE in belief networks 342 Li and D'Ambrosio An efficient approach for finding the MPE in belief networks Zhaoyu Li Department of Computer Science Oregon State University Corvallis, OR 97331 Bruce D'Ambrosio Department of Computer

More information

Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far:

Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: I Strength of formulations; improving formulations by adding valid inequalities I Relaxations and dual problems; obtaining

More information

Limitations of Matrix Completion via Trace Norm Minimization

Limitations of Matrix Completion via Trace Norm Minimization Limitations of Matrix Completion via Trace Norm Minimization ABSTRACT Xiaoxiao Shi Computer Science Department University of Illinois at Chicago xiaoxiao@cs.uic.edu In recent years, compressive sensing

More information

Graphical Analysis of Value of Information in Decision Models

Graphical Analysis of Value of Information in Decision Models From: FLAIRS-01 Proceedings. Copyright 2001, AAAI (www.aaai.org). All rights reserved. Graphical Analysis of Value of Information in Decision Models Songsong Xu Kim-Leng Poh Department of lndustrial &

More information

An Efficient Approximation for the Generalized Assignment Problem

An Efficient Approximation for the Generalized Assignment Problem An Efficient Approximation for the Generalized Assignment Problem Reuven Cohen Liran Katzir Danny Raz Department of Computer Science Technion Haifa 32000, Israel Abstract We present a simple family of

More information

7. Decision or classification trees

7. Decision or classification trees 7. Decision or classification trees Next we are going to consider a rather different approach from those presented so far to machine learning that use one of the most common and important data structure,

More information