Survey of contemporary Bayesian Network Structure Learning methods
1 Survey of contemporary Bayesian Network Structure Learning methods
Ligon Liu (CUNY), September 2015 (38 slides)
2 Bayesian Network Definition
Let V be a set of variables. A Bayesian Network is comprised of a discrete structure part and a continuous parameter part:
- structural part: a Directed Acyclic Graph (V, E), with V the random variables and E ⊆ V × V
- parameter part: the conditional probability of every variable given its parents in the DAG
Example: the Y-shaped Bayesian Network with V = {0, 1, 2, 3}, E = {(0, 2), (1, 2), (2, 3)}
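A minimal sketch of this two-part representation for the Y-shaped example (the CPT values are illustrative, not taken from the slides):

```python
# Structure part: DAG over V = {0, 1, 2, 3} with E = {(0,2), (1,2), (2,3)}
V = [0, 1, 2, 3]
E = {(0, 2), (1, 2), (2, 3)}
parents = {v: sorted(u for (u, w) in E if w == v) for v in V}

# Parameter part: P(v | Pa(v)) as tables keyed by parent assignments.
# Variables are binary; each entry is P(v = 1 | parent values).
cpt = {
    0: {(): 0.3},
    1: {(): 0.6},
    2: {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.99},
    3: {(0,): 0.2, (1,): 0.9},
}

def joint_prob(assign):
    """P(x0, x1, x2, x3) as the product of the conditional probabilities."""
    p = 1.0
    for v in V:
        p1 = cpt[v][tuple(assign[u] for u in parents[v])]
        p *= p1 if assign[v] == 1 else 1.0 - p1
    return p
```

Because the parameters are exactly the conditionals P(v | Pa(v)), the products over all joint assignments sum to 1.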
3 Bayesian Network Example
Conditional Probability Tables of the Y-shaped network: Pr(x0), Pr(x1), Pr(x2 | x0, x1), Pr(x3 | x2).
4 BN Structure Learning Problem
Given a counted indexed relation dataset (V, R, c) and a scoring function s(V, R, c, E), abbreviated s(V, E). Usually s(V, E) is required to be decomposable:
s(V, E) = Σ_{v ∈ V} S(v, Pa(v))
Find the DAG (V, E) that maximizes s(V, E).
5 Clusters of surveyed articles
- Conditional Independence (C.I.) constraint-based algorithms [16], [1], [11], [12], [20]
- Ordering-based search [10], [19], [15]
- Branch and bound [4], [14]
- Parent Graph shortest path [22], [5, 7, 6]
- Integer Linear Programming and LP-relaxation-based approximate algorithms [8], [9], [17, 18], [2, 3]
6 Conditional Independence and its testing
Definition: Let P be a distribution over variable set V and X, Y, M ⊆ V. X and Y are said to be Conditionally Independent given M, written CI(X, Y | M), if
P(X, Y | M) = P(X | M) P(Y | M)
Conditional Independence conclusions can be tested, or inferred from known conditional independences.
Testing (for discrete variables), e.g.:
- χ² test
- G² test
- Monte Carlo permutation test
Inferring, e.g. by the semi-graphoid rules:
(1) Symmetry: CI(A, B | C) ⇒ CI(B, A | C)
(2) Decomposition: CI(A, B ∪ D | C) ⇒ CI(A, B | C)
(3) Weak union: CI(A, B ∪ D | C) ⇒ CI(A, B | C ∪ D)
(4) Contraction: CI(A, B | C) and CI(A, D | B ∪ C) ⇒ CI(A, B ∪ D | C)
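The G² test above can be sketched in a few lines (a sketch only; representing `data` as a list of dicts is an assumption for illustration, not from the slides):

```python
import math
from collections import Counter

def g2_ci_test(data, x, y, cond):
    """G^2 (log-likelihood-ratio) test of x ⟂ y | cond on discrete samples.

    data: list of dicts mapping variable name -> value.
    Returns (G2 statistic, degrees of freedom); compare G2 against a
    chi-square quantile with that dof to accept or reject independence.
    """
    g2 = 0.0
    key = lambda r: tuple(r[c] for c in cond)
    strata = Counter(key(r) for r in data)          # counts per value of M
    for m, n_m in strata.items():
        sub = [r for r in data if key(r) == m]
        nxy = Counter((r[x], r[y]) for r in sub)
        nx = Counter(r[x] for r in sub)
        ny = Counter(r[y] for r in sub)
        for (vx, vy), n in nxy.items():
            expected = nx[vx] * ny[vy] / n_m        # count under independence
            g2 += 2.0 * n * math.log(n / expected)
    lx = len({r[x] for r in data})
    ly = len({r[y] for r in data})
    dof = (lx - 1) * (ly - 1) * len(strata)
    return g2, dof
```

On exactly independent counts the statistic is 0; strong dependence drives it up, which is what the test thresholds against.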
7 Conditional Independence and Bayesian Network DAG
Notation: Let (V, E) be a Directed Acyclic Graph. For v ∈ V, the parent set of v is denoted Pa(v), i.e. Pa(v) = {u | (u, v) ∈ E}.
Lemma (local Markov property): P(v | nondescendants of v) = P(v | Pa(v)).
Definition: Let (V, E) be a DAG. Vertices u, v ∈ V are said to be d-separated given M ⊆ V if every path between u and v is blocked by M: some non-collider on the path is in M, or some collider on the path has neither itself nor any of its descendants in M.
On a Bayesian Network DAG, u, v being d-separated by M implies u ⟂ v | M in the BN probability distribution.
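d-separation can be checked mechanically via the classic moralization criterion (a sketch; this equivalent formulation, via the moral graph of the ancestral subgraph, is a standard result and not spelled out on the slide):

```python
def d_separated(V, E, u, v, M):
    """u ⟂ v | M iff u and v are disconnected in the moralized graph of the
    ancestral subgraph induced by {u, v} ∪ M, after deleting M."""
    parents = {x: {a for (a, b) in E if b == x} for x in V}
    # 1. ancestral set of {u, v} ∪ M
    anc, stack = set(), [u, v, *M]
    while stack:
        x = stack.pop()
        if x not in anc:
            anc.add(x)
            stack.extend(parents[x])
    # 2. moralize: keep edges inside anc, undirected, and marry co-parents
    und = {frozenset((a, b)) for (a, b) in E if a in anc and b in anc}
    for x in anc:
        ps = [p for p in parents[x] if p in anc]
        und |= {frozenset((a, b)) for a in ps for b in ps if a != b}
    # 3. delete M, then test reachability from u to v
    seen, stack = set(), [u]
    while stack:
        x = stack.pop()
        if x in seen or x in M:
            continue
        seen.add(x)
        stack.extend(y for e in und if x in e for y in e - {x})
    return v not in seen
```

On the Y-shaped network this reproduces the expected answers: 0 and 1 are independent marginally but not given the collider 2, while 2 screens 0 off from 3.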
8 Early Conditional Independence-based algorithms
C.I.-based algorithms rest on the following facts:
- d-separation on a Bayesian Network DAG implies Conditional Independence.
- Existence of an undirected edge u — v can be inferred from at least one of many conditional independences; testing more C.I. triples (u, v, M) may increase confidence.
- C.I. tests are computationally expensive to perform on datasets, so the number of tests should be minimized.
9 Early Conditional Independence-based algorithms
- SGS (the first C.I.-based algorithm to learn BNs)
- PC (PC*, Stable-PC, and Conservative-PC)
- Grow-Shrink, IAMB, SRS
10 PC Algorithm brief
PC is an iterative algorithm to learn a Bayesian Network from C.I. tests, with the graph edges E as a variable:
- Start with the complete undirected graph.
- Edge elimination: for each pair of variables u, v, do C.I. tests, starting with the size-0 (unconditional) set Ø, then size-1 condition sets {i}, {j}, ..., then size-2 condition sets {i, j}, {i, k}, {j, k}, ..., and larger condition sets, until a conditional independence u ⟂ v | M with M ⊆ V ∖ {u, v} is found. Eliminate any edge between two variables that are conditionally independent given any condition set. For any pair of variables, the PC algorithm only tests condition sets drawn from variables on some path between the pair.
- Direct the edges by the unshielded collider rule and the loop removal rule.
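The edge-elimination phase above can be sketched as follows (a sketch; `ci_test(u, v, M)` is an assumed supplied predicate, e.g. a chi-square test on data or a d-separation oracle):

```python
from itertools import combinations

def pc_skeleton(variables, ci_test, max_cond=None):
    """Edge-elimination (skeleton) phase of the PC algorithm.

    Returns the adjacency sets of the undirected skeleton and the
    separating set found for each removed edge."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    size = 0
    while any(len(adj[u]) - 1 >= size for u in variables):
        for u in variables:
            for v in list(adj[u]):
                # condition only on current neighbours of u, excluding v
                for M in combinations(sorted(adj[u] - {v}), size):
                    if ci_test(u, v, set(M)):
                        adj[u].discard(v)
                        adj[v].discard(u)
                        sepset[frozenset((u, v))] = set(M)
                        break
        size += 1
        if max_cond is not None and size > max_cond:
            break
    return adj, sepset
```

Restricting candidate condition sets to current neighbours is what keeps PC polynomial on sparse graphs, as the next slides note.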
11 Edge direction rules
Unshielded Collider Rule: if two variables u, w are not directly connected but are connected as u → v — w, orient v — w as v → w, to avoid forming the unshielded collider u → v ← w.
Loop Removal Rule: if two variables u and v are connected both by an undirected edge and by a directed path from u to v, orient the undirected edge as u → v.
12 Advantages of PC based algorithms
1. Fast: on sparse graphs, the computation time of PC is polynomial.
2. Compared to SGS, C.I. constraint propagation by the semi-graphoid rules saves many C.I. tests. In addition, if a parallel machine is available, it is possible to do redundant C.I. tests to improve confidence [1].
3. By testing smaller condition sets M first, the conditional independence tests have higher confidence on high-dimensional datasets.
13 Robustness of PC based algorithms
The robustness of C.I.-based algorithms has been doubted by researchers [11], [1]. Factors that undermine the robustness of PC on high-dimensional datasets:
- Sampling loss: when the marginal dataset is relatively small with respect to the graph complexity, local C.I. tests are usually less accurate; this is also called non-faithfulness of the C.I. relations to the distribution.
- C.I. testing order: when earlier independence tests happen to have lower confidence, they can prevent tests that would generate higher-confidence contradictory C.I. results.
Two algorithms, Conservative-PC [11] and Stable-PC [1], were invented to overcome the instability over C.I. testing order. They use redundant C.I. testing to detect unfaithfulness and a voting mechanism to find the most likely C.I.
14 Markov blanket
Definition: Let (V, E) be a DAG. The Markov Blanket of v ∈ V, denoted MB(v), is the set of vertices not d-separated from v by any variable set, i.e. the set of nodes composed of v's parents, children, and children's parents in the DAG.
Theorem: v is d-separated from V ∖ ({v} ∪ MB(v)) by MB(v).
Definition: Let (V, E) be a DAG. The Moral Graph (V, F) of (V, E) is formed by connecting nodes that have a common child and then making all edges undirected, i.e.
F = {{u, v} | (u, v) ∈ E or (v, u) ∈ E or ∃w : (u, w) ∈ E and (v, w) ∈ E}
15 Markov blanket
Corollary: Let (V, E) be a DAG and (V, F) its Moral Graph. The Markov Blanket of v ∈ V is the set of neighbors of v in (V, F).
(Figure: a DAG, its Moral Graph, and the Markov Blanket of E.)
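The corollary gives a direct way to compute Markov blankets when the DAG is known (a minimal sketch):

```python
def moral_graph(V, E):
    """Moral graph of a DAG: undirect every edge and marry co-parents."""
    F = {frozenset((u, v)) for (u, v) in E}
    parents = {x: {a for (a, b) in E if b == x} for x in V}
    for x in V:
        F |= {frozenset((a, b)) for a in parents[x]
              for b in parents[x] if a != b}
    return F

def markov_blanket(V, E, v):
    """By the corollary, MB(v) is v's neighbour set in the moral graph."""
    return {w for e in moral_graph(V, E) if v in e for w in e - {v}}
```

On the Y-shaped network, MB(2) is all of {0, 1, 3} (parents plus child), while MB(0) is {1, 2} (its child 2 and co-parent 1, married by moralization).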
16 Grow-Shrink and IAMB algorithms
Definition: Let (V, E) be a directed graph. The Markov Blanket of v ∈ V is the set of vertices not d-separated from v by any variable set, i.e. the set of nodes composed of v's parents, children, and children's parents. Equivalently, a Markov Blanket M is a minimal subset of V satisfying
∀U ⊆ V ∖ ({v} ∪ M) : v ⟂ U | M
Observation: finding every variable's Markov Blanket is equivalent to finding the DAG's Moral Graph.
- Grow-Shrink algorithm
- IAMB algorithm: greedy ordering of the condition sets of Grow-Shrink
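Grow-Shrink can be sketched in two phases (a sketch; `ci_test(u, v, M)` is an assumed supplied predicate accepting u ⟂ v | M):

```python
def grow_shrink(v, variables, ci_test):
    """Grow-Shrink Markov-blanket discovery."""
    mb = set()
    # Grow: add any variable still dependent on v given the current blanket.
    changed = True
    while changed:
        changed = False
        for u in variables:
            if u != v and u not in mb and not ci_test(u, v, mb):
                mb.add(u)
                changed = True
    # Shrink: remove false positives made independent by later additions.
    for u in list(mb):
        if ci_test(u, v, mb - {u}):
            mb.discard(u)
    return mb
```

IAMB keeps the same grow/shrink skeleton but greedily orders the candidates in the grow phase to admit fewer false positives.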
17 Multiple Markov Blankets
Since the Bayesian Network decomposition is usually not unique for a data distribution, one variable may have multiple different Markov Blankets: both M1 and M2 may satisfy
∀U ⊆ V ∖ ({v} ∪ M) : v ⟂ U | M
Definition: Let (V, E) be a directed graph. A variable u ∈ V is called Strongly Relevant to v ∈ V if and only if
∀S ⊆ V ∖ {v, u} : P(v | S) ≠ P(v | S ∪ {u})
A variable u ∈ V is called Weakly Relevant to v ∈ V if and only if
∃S ⊆ V ∖ {v, u} : P(v | S) ≠ P(v | S ∪ {u})
and u is not Strongly Relevant to v.
18 Selection via Representative Sets (SRS) Algorithm
Definition: Let V be the variable set. A representative set of v ∈ V consists of a variable u in v's Markov blanket together with u's corresponding correlated features.
Proposition: u is strongly relevant to v if and only if u belongs to the set of parents and children of variable v in a faithful Bayesian Network.
SRS Algorithm:
Step 1: G_v ← GetPC(v) (PC meaning Parent & Child); for each u in G_v: G_u ← {u} ∪ GetPC(u)
Step 2: Search for a group of strongly relevant variables' Parent-Child sets {G_i} ⊆ {G_u | u ∈ SR(v)} such that ∪_i G_i is a best Markov Blanket under the given measure.
19 Decomposable Scoring Function
Let V be the variables and s(V, E) a scoring function mapping edge sets E ⊆ V × V to R. s is called decomposable if and only if
s(V, E) = Σ_{v ∈ V} S(v, Pa(v)), where Pa(v) = {u | (u, v) ∈ E}
Commonly used decomposable scoring functions: Log-Likelihood (AIC, BIC), BD (BDe, BDeu).
BN learning is then defined as finding the E for V that maximizes s(V, E).
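A local BIC score can be sketched as follows (a sketch for discrete data as a list of dicts, an assumed representation; the network score is the sum of these local terms, per the decomposability equation above):

```python
import math
from collections import Counter

def bic_local_score(data, v, parents):
    """Decomposable BIC local score S(v, Pa(v)) for discrete data."""
    n = len(data)
    # counts of (parent configuration, value of v) and of configurations
    joint = Counter((tuple(r[p] for p in parents), r[v]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    # penalty: (|states of v| - 1) free parameters per parent configuration
    r_v = len({r[v] for r in data})
    q = 1
    for p in parents:
        q *= len({r[p] for r in data})
    penalty = 0.5 * math.log(n) * (r_v - 1) * q
    return loglik - penalty
```

When a parent genuinely determines v, the likelihood gain outweighs the penalty and the larger parent set scores higher.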
20 Ordering Search
Given an order O of the variables, if the scoring function s is decomposable, the best DAG consistent with O can be found in time polynomial in the number of variables, simply by finding the best parents among earlier-ordered variables for each variable, from sink to source [15].
Modern ordering search algorithms use propagation of constraints, inferred from the scoring function's properties and from background knowledge, to reduce the search space.
Algorithms: branch and bound search, A* heuristic search.
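The "best DAG given an ordering" step can be sketched directly (a sketch; `local_score(v, parents)` is an assumed supplied decomposable score, and the exponential scan over predecessor subsets is shown only for small candidate sets):

```python
from itertools import combinations

def best_dag_for_order(order, local_score, max_parents=None):
    """Best DAG consistent with a variable ordering, for a decomposable score.

    Each variable independently picks its best parent set among its
    predecessors in the order; decomposability makes these choices
    independent of one another."""
    dag = {}
    for i, v in enumerate(order):
        preds = order[:i]
        limit = len(preds) if max_parents is None else min(max_parents, len(preds))
        candidates = [set(c) for k in range(limit + 1)
                      for c in combinations(preds, k)]
        dag[v] = max(candidates, key=lambda P: local_score(v, frozenset(P)))
    return dag
```

Acyclicity is automatic: every edge points from an earlier variable to a later one.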
21 Dynamic Programming of Parent Sets
Lemma [13]: Let v ∈ V and Q ⊆ V with v ∉ Q. Then
max_{P ⊆ Q} S(v, P) = max( S(v, Q), max_{u ∈ Q} max_{P ⊆ Q ∖ {u}} S(v, P) )
This enables DP propagation of argmax_{P ⊆ Q} S(v, P) over all subsets Q of V ∖ {v}. This is the step all dynamic programming algorithms use to get the optimal parent sets of every variable.
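The lemma translates into a subset-lattice DP (a sketch; `local_score` is an assumed supplied S(v, P)):

```python
from itertools import combinations

def best_parent_tables(v, others, local_score):
    """best[Q] = (max_{P ⊆ Q} S(v, P), its argmax) for every Q ⊆ others.

    Per the lemma, the best over Q is the better of S(v, Q) itself and
    the best over each Q ∖ {u}."""
    best = {frozenset(): (local_score(v, frozenset()), frozenset())}
    for k in range(1, len(others) + 1):
        for Q in map(frozenset, combinations(others, k)):
            score, parents = local_score(v, Q), Q
            for u in Q:
                sub_score, sub_parents = best[Q - {u}]
                if sub_score > score:
                    score, parents = sub_score, sub_parents
            best[Q] = (score, parents)
    return best
```

Processing subsets in order of size guarantees every `best[Q - {u}]` is already available.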
22 The 2006 ordering search algorithm [15]
Variable set: V = {1, ..., n}; variable v's parent candidate set: Pa(v) ⊆ V ∖ {v}.
1. Calculate the local scores for all n · 2^{n−1} different (v, Pa(v)) pairs: [s(v, Pa(v)) | v ∈ V, Pa(v) ⊆ V ∖ {v}].
2. For every v and every candidate set G ⊆ V ∖ {v}, find the best parent set within G: bp(v, G) = argmax_{P ⊆ G} s(v, P).
3. Find the best sink for all 2^n variable subsets: sink(W) = argmax_{s ∈ W} score(W, s), W ⊆ V.
4. Using the best sinks, find a best ordering of the variables: O_i = sink(V ∖ ∪_{j=i+1}^{n} {O_j}).
5. Compute the best network from the best parents, best sinks, and best ordering.
23 Ordering by Sink Score
Lemma [15]: Let W ⊆ V. k is the last variable (called the sink) in an optimal order of W if and only if
k = argmax_{k ∈ W} ( max_{P ⊆ W ∖ {k}} S(k, P) + s*(W ∖ {k}) )
where s*(W ∖ {k}) denotes the optimal score of a network over W ∖ {k}. This enables DP computation of the optimal sink. The quantity max_{P ⊆ W ∖ {k}} S(k, P) + s*(W ∖ {k}) is called the sink score.
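The sink-score recursion gives another subset DP (a sketch; `best_parent_score(v, W)` = max over P ⊆ W of S(v, P), assumed precomputed, e.g. by the parent-set DP of the previous slide):

```python
from itertools import combinations

def optimal_sinks(variables, best_parent_score):
    """DP over subsets using the sink-score lemma.

    Returns, for every subset W, its optimal network score opt[W] and best
    sink sink[W], plus an optimal ordering (sources first)."""
    opt = {frozenset(): 0.0}
    sink = {}
    for k in range(1, len(variables) + 1):
        for W in map(frozenset, combinations(variables, k)):
            score, best = max(
                (best_parent_score(v, W - {v}) + opt[W - {v}], v)
                for v in W)
            opt[W], sink[W] = score, best
    # unwind the sinks into an ordering
    order, W = [], frozenset(variables)
    while W:
        order.append(sink[W])
        W = W - {sink[W]}
    order.reverse()
    return opt, sink, order
```
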
24 Example of Optimal Parents and Optimal Sink (graphic example to be added)
25 Optimization techniques
- AD-Tree [19]
- If U′ ⊂ U and s(v, U′) ≥ s(v, U), remove U from the candidates [19]
- Partition parent sets by size: reduces the space to 2^n (3/4)^p n^{O(1)}, where p is the parent degree bound [14]
26 Structural constraints
Optimal DAGs under common scoring functions (MDL, BDeu) have common structural constraints [4] that can be used to prune:
- Hard limits on in-degrees. Corollary: using BIC or AIC as the criterion, the optimal graph (V, E) has at most log₂ N parents per node.
- Upper bounds on the optimal parent set score (various heuristics).
- Upper bounds on the optimal parent set.
27 Upper bound of optimal parent set [4]
Theorem: Let N be the total count of (V, R, c) and L_U = ∏_{u ∈ U} L_u. With BIC or AIC as the score function, if L_{Pa(v)} > N w log(L_v) / L_v, then any proper superset of Pa(v) is not the parent set of vertex v in an optimal structure.
28 Upper bound of optimal parent set score [22, 21]
Theorem: Given a BD score S and two parent sets Pa′(v) and Pa(v) for a node v such that Pa′(v) ⊂ Pa(v), if S(v, Pa′(v)) exceeds an upper bound computed from the instantiation counts K_vj and the hyperparameters α_vjk of Pa(v), then Pa(v) is not an optimal parent set of v.
29 Order Graph
Definition: Let V = {1, ..., n} be the index set of the variables. The order graph is the graph whose vertex set is the powerset P(V) and whose edge set is {(X, Y) | X, Y ∈ P(V), X ⊂ Y, |X| + 1 = |Y|}. Obviously, any order graph is a DAG.
Example: the order graph of V = {1, 2, 3, 4}. (Figure omitted.)
30 Shortest Path Formulation of the Optimal Parent Set Problem
Let S(v, Pa(v)) be the scoring function term for v ∈ V and its parents. Finding the optimal BN is equivalent to finding a shortest path on the order graph from Ø to V, if we define the length of edge (X, Y) to be
d(X, Y) = min_{Pa ⊆ X} S(Y ∖ X, Pa)
Advantages:
- Shortest path on a directed graph has well-studied algorithms (Dijkstra, BFBnB, A*, etc.)
- Generally does not require pre-generation of the whole graph; vertices and edges can be computed dynamically.
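The formulation can be sketched with plain Dijkstra on the order graph (a sketch; `cost(v, X)` is the assumed-supplied, non-negative edge length, i.e. the negated best local score of v with parents drawn from X; order-graph vertices are generated on the fly, as the slide notes):

```python
import heapq

def shortest_path_bn(variables, cost):
    """Dijkstra from Ø to the full variable set on the order graph.

    Returns the total cost and an order in which the variables were added
    (each variable's parents come from the set already present)."""
    start = ()
    goal = tuple(sorted(variables))
    dist = {start: (0.0, [])}          # subset -> (distance, addition order)
    heap = [(0.0, start)]
    while heap:
        d, X = heapq.heappop(heap)
        if X == goal:
            return d, dist[X][1]
        if d > dist[X][0]:
            continue                   # stale heap entry
        present = set(X)
        for v in variables:
            if v in present:
                continue
            Y = tuple(sorted(present | {v}))
            nd = d + cost(v, present)
            if Y not in dist or nd < dist[Y][0]:
                dist[Y] = (nd, dist[X][1] + [v])
                heapq.heappush(heap, (nd, Y))
    return None
```
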
31 Shortest Path Example: shortest path on the order graph ⟷ optimal parent set (example to be added)
32 A* best-first search algorithm
A heuristics-enhanced variation of Dijkstra's algorithm that uses a priority function to decide the next step of the search: finding the shortest path from vertex x to y on (V, E), with the length d(u, v) of each edge (u, v) ∈ E computable at a fixed time cost. The priority function on vertex v ∈ V is
f(v) = d(x, v) + h(v, y)
where d(x, v) is the already-computed distance from x to v and h(v, y) is the heuristically estimated distance from v to y.
33 One A* heuristic function for d(X, Y) = min_{Pa ⊆ X} S(Y ∖ X, Pa)
Definition: Let (V, E) be the order graph of variable set V and U ⊆ V. The heuristic function used in [22], denoted h(U), is defined by
h(U) = Σ_{v ∈ V ∖ U} min_{Pa ⊆ V ∖ {v}} S({v}, Pa)
Remark: h(U) is obtained by using the best parent set for each vertex in V ∖ U, regardless of whether the resulting graph is a DAG.
Theorem: h(U) is monotonic. [22]
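This relaxation is easy to sketch (a sketch; `cost(v, Pa)` is the assumed-supplied non-negative local cost being minimized; the brute-force subset scan stands in for the precomputed best-parent tables a real implementation would use):

```python
from itertools import chain, combinations

def heuristic_h(U, variables, cost):
    """h(U): every variable still outside U takes its best-scoring parent set
    among ALL other variables, ignoring acyclicity, so h never overestimates
    the true remaining cost (it is admissible)."""
    total = 0.0
    for v in (w for w in variables if w not in U):
        others = [u for u in variables if u != v]
        subsets = chain.from_iterable(
            combinations(others, k) for k in range(len(others) + 1))
        total += min(cost(v, frozenset(s)) for s in subsets)
    return total
```
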
34 Integer-valued Multisets (imsets)
Definition: Let V be a set of integers. An integer-valued multiset (imset) is a mapping from P(V) to the set of integers Z.
Example: an imset u over V = {a, b, c}:
u = δ_{b} − δ_{a,b} − δ_{b,c} + δ_{a,b,c}
(δ is the Kronecker delta imset defined on the following slide.)
35 Arithmetic notation of imsets
Definition: Let V be a set of integers and U ⊆ V. The Kronecker delta imset of U, denoted δ_U, is defined by
δ_U(X) = 1 if X = U, and δ_U(X) = 0 if X ≠ U
Definition: Let V be a set of integers and a, b imsets P(V) → Z. Then
(a + b)(X) = a(X) + b(X)
and similarly for subtraction.
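Imsets are naturally represented as sparse mappings (a sketch, using `frozenset` keys as an assumed encoding of subsets of V):

```python
def delta(U):
    """Kronecker delta imset δ_U as a sparse mapping P(V) -> Z."""
    return {frozenset(U): 1}

def imset_add(a, b, sign=1):
    """(a ± b)(X) = a(X) ± b(X), dropping zero entries."""
    out = dict(a)
    for X, k in b.items():
        out[X] = out.get(X, 0) + sign * k
        if out[X] == 0:
            del out[X]
    return out
```

The example imset of the previous slide is then built by three additions.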
36 DAG to {0,1} to [0,1]
Family Variable Vector: {φ_{vU} = 1 if Pa(v) = U, 0 otherwise}
Standard Imset: u_{(V,E)} = δ_V − δ_Ø + Σ_{v ∈ V} (δ_{Pa(v)} − δ_{{v} ∪ Pa(v)})
Characteristic Imset: c_{(V,E)}(U) = 1 − Σ_{W : U ⊆ W ⊆ V} u_{(V,E)}(W)
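The standard imset of a DAG follows directly from its formula (a sketch, with the imset again stored as a sparse dict):

```python
def standard_imset(V, parents):
    """Standard imset u_G = δ_V − δ_Ø + Σ_v (δ_{Pa(v)} − δ_{{v}∪Pa(v)})."""
    u = {}
    def bump(X, k):
        X = frozenset(X)
        u[X] = u.get(X, 0) + k
        if u[X] == 0:
            del u[X]          # keep the mapping sparse
    bump(V, 1)
    bump(set(), -1)
    for v in V:
        bump(parents[v], 1)
        bump(set(parents[v]) | {v}, -1)
    return u
```

The coefficients of any standard imset sum to zero, since each δ appears once with +1 and once with −1.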
37 Linear Program of the Family Variable Vector
Family Variable Vector: {φ_{vU} = 1 if Pa(v) = U, 0 otherwise}
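One common ILP over the family variable vector can be sketched as follows (a sketch in the spirit of the cluster-constraint ILP approaches cited in the cluster list; the slide itself does not spell the program out):

```latex
\max_{\phi} \;\; \sum_{v \in V} \sum_{U \subseteq V \setminus \{v\}} s(v, U)\, \phi_{vU}
\quad \text{s.t.} \quad
\sum_{U \subseteq V \setminus \{v\}} \phi_{vU} = 1 \quad \forall v \in V,
\]
\[
\sum_{v \in C} \; \sum_{U : U \cap C = \emptyset} \phi_{vU} \ge 1
\quad \forall\, \emptyset \ne C \subseteq V,
\qquad
\phi_{vU} \in \{0, 1\}.
```

The first constraint makes each variable pick exactly one parent set; the cluster constraints enforce acyclicity (every cluster C contains some vertex with no parent inside C). The LP relaxation of the title replaces {0, 1} by [0, 1].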
38 References
[1] Diego Colombo and Marloes H. Maathuis. Order-independent constraint-based causal structure learning. The Journal of Machine Learning Research, 15(1).
[2] James Cussens. Integer programming for Bayesian network structure learning.
[3] James Cussens, David Haws, and Milan Studeny. Polyhedral aspects of score equivalence in Bayesian network structure learning. arXiv preprint.
[4] Cassio P. de Campos and Qiang Ji. Efficient structure learning of Bayesian networks using constraints. The Journal of Machine Learning Research, 12.
[5] Xiannian Fan, Brandon Malone, and Changhe Yuan. Finding optimal Bayesian network structures with constraints learned from data.
More informationA Note on Polyhedral Relaxations for the Maximum Cut Problem
A Note on Polyhedral Relaxations for the Maximum Cut Problem Alantha Newman Abstract We consider three well-studied polyhedral relaxations for the maximum cut problem: the metric polytope of the complete
More informationBayesian Machine Learning - Lecture 6
Bayesian Machine Learning - Lecture 6 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 2, 2015 Today s lecture 1
More informationLecture 3: Graphs and flows
Chapter 3 Lecture 3: Graphs and flows Graphs: a useful combinatorial structure. Definitions: graph, directed and undirected graph, edge as ordered pair, path, cycle, connected graph, strongly connected
More informationFinding Optimal Bayesian Network Given a Super-Structure
Journal of Machine Learning Research 9 (2008) 2251-2286 Submitted 12/07; Revised 6/08; Published 10/08 Finding Optimal Bayesian Network Given a Super-Structure Eric Perrier Seiya Imoto Satoru Miyano Human
More informationV8 Molecular decomposition of graphs
V8 Molecular decomposition of graphs - Most cellular processes result from a cascade of events mediated by proteins that act in a cooperative manner. - Protein complexes can share components: proteins
More informationNotes on Minimum Cuts and Modular Functions
Notes on Minimum Cuts and Modular Functions 1 Introduction The following are my notes on Cunningham s paper [1]. Given a submodular function f and a set S, submodular minimisation is the problem of finding
More information6. Lecture notes on matroid intersection
Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm
More informationLearning Equivalence Classes of Bayesian-Network Structures
Journal of Machine Learning Research 2 (2002) 445-498 Submitted 7/01; Published 2/02 Learning Equivalence Classes of Bayesian-Network Structures David Maxwell Chickering Microsoft Research One Microsoft
More informationApproximation slides 1. An optimal polynomial algorithm for the Vertex Cover and matching in Bipartite graphs
Approximation slides 1 An optimal polynomial algorithm for the Vertex Cover and matching in Bipartite graphs Approximation slides 2 Linear independence A collection of row vectors {v T i } are independent
More informationOn the Complexity of the Policy Improvement Algorithm. for Markov Decision Processes
On the Complexity of the Policy Improvement Algorithm for Markov Decision Processes Mary Melekopoglou Anne Condon Computer Sciences Department University of Wisconsin - Madison 0 West Dayton Street Madison,
More informationMesh segmentation. Florent Lafarge Inria Sophia Antipolis - Mediterranee
Mesh segmentation Florent Lafarge Inria Sophia Antipolis - Mediterranee Outline What is mesh segmentation? M = {V,E,F} is a mesh S is either V, E or F (usually F) A Segmentation is a set of sub-meshes
More information1 Linear programming relaxation
Cornell University, Fall 2010 CS 6820: Algorithms Lecture notes: Primal-dual min-cost bipartite matching August 27 30 1 Linear programming relaxation Recall that in the bipartite minimum-cost perfect matching
More informationRandomized rounding of semidefinite programs and primal-dual method for integer linear programming. Reza Moosavi Dr. Saeedeh Parsaeefard Dec.
Randomized rounding of semidefinite programs and primal-dual method for integer linear programming Dr. Saeedeh Parsaeefard 1 2 3 4 Semidefinite Programming () 1 Integer Programming integer programming
More informationLecture 6: Linear Programming for Sparsest Cut
Lecture 6: Linear Programming for Sparsest Cut Sparsest Cut and SOS The SOS hierarchy captures the algorithms for sparsest cut, but they were discovered directly without thinking about SOS (and this is
More informationCOMP 251 Winter 2017 Online quizzes with answers
COMP 251 Winter 2017 Online quizzes with answers Open Addressing (2) Which of the following assertions are true about open address tables? A. You cannot store more records than the total number of slots
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationCS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees
CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees Professor Erik Sudderth Brown University Computer Science September 22, 2016 Some figures and materials courtesy
More informationECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning
ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics Markov Random Fields: Inference Exact: VE Exact+Approximate: BP Readings: Barber 5 Dhruv Batra
More informationMeasures of Clustering Quality: A Working Set of Axioms for Clustering
Measures of Clustering Quality: A Working Set of Axioms for Clustering Margareta Ackerman and Shai Ben-David School of Computer Science University of Waterloo, Canada Abstract Aiming towards the development
More informationHybrid Correlation and Causal Feature Selection for Ensemble Classifiers
Hybrid Correlation and Causal Feature Selection for Ensemble Classifiers Rakkrit Duangsoithong and Terry Windeatt Centre for Vision, Speech and Signal Processing University of Surrey Guildford, United
More informationV,T C3: S,L,B T C4: A,L,T A,L C5: A,L,B A,B C6: C2: X,A A
Inference II Daphne Koller Stanford University CS228 Handout #13 In the previous chapter, we showed how efficient inference can be done in a BN using an algorithm called Variable Elimination, that sums
More informationLearning Bayesian networks with ancestral constraints
Learning Bayesian networks with ancestral constraints Eunice Yuh-Jie Chen and Yujia Shen and Arthur Choi and Adnan Darwiche Computer Science Department University of California Los Angeles, CA 90095 {eyjchen,yujias,aychoi,darwiche}@cs.ucla.edu
More informationCS 580: Algorithm Design and Analysis. Jeremiah Blocki Purdue University Spring 2018
CS 580: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 2018 Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved.
More informationChapter 2 PRELIMINARIES. 1. Random variables and conditional independence
Chapter 2 PRELIMINARIES In this chapter the notation is presented and the basic concepts related to the Bayesian network formalism are treated. Towards the end of the chapter, we introduce the Bayesian
More informationLecture Note: Computation problems in social. network analysis
Lecture Note: Computation problems in social network analysis Bang Ye Wu CSIE, Chung Cheng University, Taiwan September 29, 2008 In this lecture note, several computational problems are listed, including
More informationNon-convex Multi-objective Optimization
Non-convex Multi-objective Optimization Multi-objective Optimization Real-world optimization problems usually involve more than one criteria multi-objective optimization. Such a kind of optimization problems
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2016 Lecturer: Trevor Cohn 21. Independence in PGMs; Example PGMs Independence PGMs encode assumption of statistical independence between variables. Critical
More informationOn Local Optima in Learning Bayesian Networks
On Local Optima in Learning Bayesian Networks Jens D. Nielsen, Tomáš Kočka and Jose M. Peña Department of Computer Science Aalborg University, Denmark {dalgaard, kocka, jmp}@cs.auc.dk Abstract This paper
More informationGraphical Models and Markov Blankets
Stephan Stahlschmidt Ladislaus von Bortkiewicz Chair of Statistics C.A.S.E. Center for Applied Statistics and Economics Humboldt-Universität zu Berlin Motivation 1-1 Why Graphical Models? Illustration
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More information22 Elementary Graph Algorithms. There are two standard ways to represent a
VI Graph Algorithms Elementary Graph Algorithms Minimum Spanning Trees Single-Source Shortest Paths All-Pairs Shortest Paths 22 Elementary Graph Algorithms There are two standard ways to represent a graph
More informationDesigning robust network topologies for wireless sensor networks in adversarial environments
Designing robust network topologies for wireless sensor networks in adversarial environments Aron Laszka a, Levente Buttyán a, Dávid Szeszlér b a Department of Telecommunications, Budapest University of
More informationFaster parameterized algorithms for Minimum Fill-In
Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht
More informationCMPSCI 311: Introduction to Algorithms Practice Final Exam
CMPSCI 311: Introduction to Algorithms Practice Final Exam Name: ID: Instructions: Answer the questions directly on the exam pages. Show all your work for each question. Providing more detail including
More informationBelief propagation in a bucket-tree. Handouts, 275B Fall Rina Dechter. November 1, 2000
Belief propagation in a bucket-tree Handouts, 275B Fall-2000 Rina Dechter November 1, 2000 1 From bucket-elimination to tree-propagation The bucket-elimination algorithm, elim-bel, for belief updating
More informationTopic: Local Search: Max-Cut, Facility Location Date: 2/13/2007
CS880: Approximations Algorithms Scribe: Chi Man Liu Lecturer: Shuchi Chawla Topic: Local Search: Max-Cut, Facility Location Date: 2/3/2007 In previous lectures we saw how dynamic programming could be
More informationMVE165/MMG630, Applied Optimization Lecture 8 Integer linear programming algorithms. Ann-Brith Strömberg
MVE165/MMG630, Integer linear programming algorithms Ann-Brith Strömberg 2009 04 15 Methods for ILP: Overview (Ch. 14.1) Enumeration Implicit enumeration: Branch and bound Relaxations Decomposition methods:
More informationNumber Theory and Graph Theory
1 Number Theory and Graph Theory Chapter 6 Basic concepts and definitions of graph theory By A. Satyanarayana Reddy Department of Mathematics Shiv Nadar University Uttar Pradesh, India E-mail: satya8118@gmail.com
More informationSteiner Trees and Forests
Massachusetts Institute of Technology Lecturer: Adriana Lopez 18.434: Seminar in Theoretical Computer Science March 7, 2006 Steiner Trees and Forests 1 Steiner Tree Problem Given an undirected graph G
More informationSub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
Journal of Machine Learning Research 14 (2013) 1563-1603 Submitted 11/10; Revised 9/12; Published 6/13 Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion Rami Mahdi
More informationStrongly Connected Components. Andreas Klappenecker
Strongly Connected Components Andreas Klappenecker Undirected Graphs An undirected graph that is not connected decomposes into several connected components. Finding the connected components is easily solved
More informationCS200: Graphs. Prichard Ch. 14 Rosen Ch. 10. CS200 - Graphs 1
CS200: Graphs Prichard Ch. 14 Rosen Ch. 10 CS200 - Graphs 1 Graphs A collection of nodes and edges What can this represent? n A computer network n Abstraction of a map n Social network CS200 - Graphs 2
More informationProbabilistic Graphical Models
Overview of Part One Probabilistic Graphical Models Part One: Graphs and Markov Properties Christopher M. Bishop Graphs and probabilities Directed graphs Markov properties Undirected graphs Examples Microsoft
More informationOptimization I : Brute force and Greedy strategy
Chapter 3 Optimization I : Brute force and Greedy strategy A generic definition of an optimization problem involves a set of constraints that defines a subset in some underlying space (like the Euclidean
More information