Bayesian Belief Networks for Data Mining. Harald Steck and Volker Tresp. Siemens AG, Corporate Technology. Information and Communications


Harald Steck and Volker Tresp
Siemens AG, Corporate Technology, Information and Communications, Munich, Germany
Also with: Technical University of Munich, Department of Computer Science, Munich, Germany

Abstract

In this paper we present a novel constraint based structural learning algorithm for causal networks. A set of conditional independence and dependence statements (CIDS) is derived from the data which describes the relationships among the variables. Although we implicitly assume that there exists a perfect map for the true, yet unknown, distribution, there does not need to be a perfect map for the CIDSs derived from the limited data. The reason is that the distribution of limited data might differ from the true probability distribution due to sampling noise. We derive a necessary condition for the existence of a perfect map given a set of CIDSs and utilize it to check for inconsistencies. If an inconsistency is detected, the algorithm finds all Bayesian networks with a minimum number of edges such that a maximum number of CIDSs is represented in each of the multiple solutions. The advantages of our approach are illustrated using the alarm network data set.

1 INTRODUCTION

A Bayesian belief network represents conditional independences in the underlying probability distribution of the data in the form of a directed acyclic graph (DAG). The DAG $A \rightarrow B \leftarrow C$ states, for example, that A and C are independent if B is not known, but A and C become dependent as soon as the state of B is fixed. If all conditional independences in the probability distribution are represented in the DAG and vice versa, then the Bayesian network is said to be a perfect map of the probability distribution, and is called a causal network [Pearl 1988]. Typically a Bayesian network is constructed from expert knowledge, although recently there has been an increasing interest in learning the structure of Bayesian networks from data for data mining applications.
The basic idea is to display probabilistic dependences and independences derived from the data in a concise way by a Bayesian network, which has the potential of providing much more information about a domain than visualizations solely based on correlations and distance measures. Bayesian networks can be used to display causal models which can be understood by humans more intuitively than models with undirected edges like Markov networks.

Model uncertainty is a serious problem in structural learning of Bayesian network models and comes into play if there does not exist a perfect map of the probability distribution of the data. If a limited data set is given, its probability distribution might differ from the true one due to sampling noise. For structural learning we assume that the true probability distribution is such that there exists a perfect map. It is important to present the structural uncertainties truthfully to the user, since otherwise he or she might draw incorrect conclusions from the presented structure.

In a Bayesian approach, structural uncertainty can be represented to the user by presenting multiple solutions which have obtained a high score. The disadvantage is that one can never be sure that one has found all solutions with a high score. Furthermore, it is very difficult for a user to study multiple solutions visually and to draw meaningful conclusions. In addition, the computational cost of finding the K-best solutions is even higher than for finding only the network with the best score. Therefore we have focused on an approximate way of structural learning without losing model uncertainty in the result.

In the next section, we give a short summary of structural learning. Then we present a proposition which serves as a necessary condition for the existence of a perfect map given a set of conditional independence and dependence statements (CIDS). This proposition will be used to check for inconsistencies among the estimated CIDSs, which then might lead to multiple solutions. Then we present the algorithm utilizing this proposition when checking the CIDSs for consistency and comment on its complexity. We close with results of the computer experiments.

2 STRUCTURAL LEARNING

There are two main approaches to structural learning of Bayesian belief networks. In the Bayesian approach [Cooper and Herskovits 1992, Heckerman et al. 1994, Heckerman 1995] a global cost function, the posterior probability of the (entire) network, is maximized. This approach requires an involved search for the best structure and is therefore computationally very expensive and not directly applicable to data mining applications.

In this paper we pursue a constraint based approach similar to the ones in [Wermuth and Lauritzen 1983, Fung and Crawford 1990, Spirtes et al. 1993, Suzuki 1996, Cheng et al. 1997]. The basic idea of those algorithms is to derive a set of CIDSs from the data without taking into account the Bayesian network structure. The Bayesian network is then constructed from the CIDSs in a later step. This is done in standard algorithms by removing an edge in the Bayesian network whenever the corresponding pair of variables is found conditionally independent.

There can, however, occur inconsistencies among the CIDSs derived from limited data when reconstructing the entire network (up to equivalence), i.e. not all CIDSs can be represented in a perfect map simultaneously. These inconsistencies reflect model uncertainty, i.e. there might exist more than one Bayesian network model describing the probability distribution of the data well (according to some measure). This indicates how reliable one particular structure learned from the database is. These multiple solutions usually have many edges in common and differ in the presence of a few edges only, which we call inconsistent edges, since they are related to inconsistencies among the CIDSs.
The set of inconsistent edges can further be divided into subsets such that the presence or absence of edges belonging to different subsets is independent of each other in the multiple solutions. Such a subset we call an ambiguous region. Hence, we visualize the set of solutions in a single graph (cf. Figure 3), in which the edges common to all networks are sketched by solid lines, whereas the edges belonging to the same ambiguous region are depicted by dashed lines of the same style. The possible structures in each of the ambiguous regions cannot be depicted in such a graph. They are hence stated in the figure caption. The directions of the edges are specified in a later stage by this kind of constraint based approach.

These possibly multiple solutions for the network structures do not belong to the same equivalence class, since they differ in the presence or absence of certain edges. So the multiple solutions have to be distinguished from networks belonging to the same equivalence class.

The advantage of our constraint based approach is that it can systematically construct the set of all such Bayesian networks. Furthermore, it can take advantage of the fact that the multiple solutions typically have many edges in common and differ only in a few inconsistent edges which can be partitioned into independent ambiguous regions. Regarding the derivation of the CIDSs from the data set, the computation time of this approach relative to the global Bayesian approach is appreciably shorter for sparse network structures, i.e. when the probability distribution exhibits many conditional independences. For dense networks, this may not hold, but Bayesian network models might not be the most suitable model in such a situation, anyway. The computation time can additionally be reduced by carrying it out in parallel. This can be realized in a simple master and slave scheme.
3 NECESSARY PATH CONDITION

The algorithm we describe in this paper makes use of the following proposition as a necessary condition for the existence of a perfect map given a set of CIDSs derived from the data. Essentially, the proposition requires that if two variables a, b (see footnote 1) only become independent by conditioning on an additional variable c, then there must be a path connecting them via c in the graph. Otherwise the dependence without conditioning on c would be unexplained. If there were no necessary condition at all, then one would arrive at the SGS or PC algorithm [Spirtes et al. 1993].

Proposition: Let $V$ be the set of variables and $P$ a probability distribution with the perfect map $G$. If two variables $a, b \in V$ are conditionally independent in $P$ given a set of variables $S \subseteq V \setminus \{a, b\}$ and they are dependent given any (proper) subset $S' \subset S$, then there is no edge between a and b in $G$ and there exist (undirected) paths between each $s \in S$ and a not crossing b as well as between each $s \in S$ and b not crossing a.

Footnote 1: For brevity, we use nodes and variables synonymously and denote both with small letters, since there is a one-to-one correspondence.
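The graph-side requirement of the proposition can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function names `has_path_avoiding` and `path_condition_holds` and the representation of the skeleton as a list of node pairs are assumptions made for illustration only.

```python
from collections import deque


def has_path_avoiding(edges, start, goal, blocked):
    """Return True if an undirected path start..goal exists that never visits `blocked`."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    seen, queue = {start}, deque([start])
    while queue:  # plain breadth-first search
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt != blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def path_condition_holds(edges, a, b, cond_set):
    """Check what the proposition demands of the skeleton for I(a, b | cond_set)."""
    if frozenset((a, b)) in {frozenset(e) for e in edges}:
        return False  # a and b must not be adjacent
    return all(
        has_path_avoiding(edges, s, a, blocked=b)
        and has_path_avoiding(edges, s, b, blocked=a)
        for s in cond_set
    )


chain = [("a", "b"), ("b", "c")]  # hypothetical skeleton a - b - c
print(path_condition_holds(chain, "a", "c", {"b"}))         # paths b..a and b..c exist
print(path_condition_holds([("a", "b")], "a", "c", {"b"}))  # no path from b to c
```

The check only covers the necessary condition on edges and paths; it says nothing about whether the statistical statements themselves are correct.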

Proof: This proposition follows immediately from the definitions of the perfect map and of d-separation (see for instance [Pearl 1988]). Given two variables a and b and a set $S \subseteq V \setminus \{a, b\}$, assume that there is a variable $v \in S$ such that there is no (undirected) path between a and v not crossing b in the perfect map $G$. Then, if a and b are d-separated given $S$, they are also d-separated given $S \setminus \{v\}$. Hence, considering the probability distribution $P$, if a and b are independent conditional on $S$, then they are also independent conditional on $S \setminus \{v\}$. QED.

Necessary Path Condition: In general one cannot expect that an arbitrary set of CIDSs can be represented in a perfect map, since a perfect map implies certain relations among the CIDSs to hold. Therefore it is desirable to have necessary and sufficient conditions at hand for efficiently checking the CIDSs for consistency, i.e. whether all the CIDSs can simultaneously be displayed in one perfect map.

[Figure 1. Panels: (A) original network over the variables a, b and c; (B) learned networks (equivalence classes) for the sets of CIDSs (I), (II) and (III); (C) SGS and PC algorithm, result as a function of sample size; (D) proposed algorithm, result as a function of sample size. The sets of CIDSs are:
(I): $D(a, b)$, $D(a, b \mid \{c\})$, $D(a, c)$, $I(a, c \mid \{b\})$, $D(b, c)$, $D(b, c \mid \{a\})$
(II): $D(a, b)$, $D(a, b \mid \{c\})$, $D(a, c)$, $I(a, c \mid \{b\})$, $D(b, c)$, $I(b, c \mid \{a\})$
(III): $D(a, b)$, $D(a, b \mid \{c\})$, $I(a, c)$, $I(a, c \mid \{b\})$, $I(b, c)$, $I(b, c \mid \{a\})$]

Figure 1: From the original network (A) we randomly generate data sets of various sizes (sample size between 10 and 10000). The distribution of the variables is given by $b = \varepsilon_b$, $a = m_{ab}\, b + \varepsilon_a$ and $c = m_{cb}\, b + \varepsilon_c$ with the parameters $m_{ab} = 3$, $m_{cb} = 0.3$ and with Gaussian distributed noise $\varepsilon_i$ for $i \in \{a, b, c\}$ of unit variance and zero mean. For comparison with the SGS and PC algorithms we used the test on vanishing partial correlations with a significance level of 0.01 in this example in order to derive the conditional independence and dependence statements (CIDSs) from the data sets. For details see Section 3.
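The toy model of Figure 1 is easy to simulate. The sketch below draws samples from the three linear Gaussian equations and tests for vanishing (partial) correlations. The use of Fisher's z-transform for the test, the function names and the seed are assumptions made here for illustration; the source only states that a test on vanishing partial correlations with significance level 0.01 was used.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def is_dependent(r, n, n_cond, alpha=0.01):
    """Fisher z-test (assumed here): True if r differs significantly from zero."""
    z = math.atanh(r) * math.sqrt(n - n_cond - 3)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def sample_toy_model(n, m_ab=3.0, m_cb=0.3, seed=0):
    """Draw n samples from b = e_b, a = m_ab * b + e_a, c = m_cb * b + e_c."""
    rng = random.Random(seed)
    b = [rng.gauss(0, 1) for _ in range(n)]
    a = [m_ab * bi + rng.gauss(0, 1) for bi in b]
    c = [m_cb * bi + rng.gauss(0, 1) for bi in b]
    return a, b, c


n = 10000
a, b, c = sample_toy_model(n)
print("D(a, c):      ", is_dependent(pearson(a, c), n, 0))
print("D(a, c | {b}):", is_dependent(partial_corr(a, c, b), n, 1))
```

Rerunning the last lines with smaller `n` gives a feel for how the derived statements change with sample size, which is the effect the figure illustrates.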
Since we would like to retain the property that the learning algorithm proceeds sequentially, i.e. it first learns the skeleton of the Bayesian network (the presence of its edges), then the orientations of the edges and finally the parameters of the model, a sufficient condition for the existence of a perfect map cannot be available when constructing the skeleton. The above proposition provides, however, a necessary condition for the existence of a perfect map, which means that it can be used to check for inconsistencies among the CIDSs in the sense that if an inconsistency is found there cannot exist a perfect map representing all the CIDSs.

In networks involving a large number of variables, there might be several estimated conditional independence statements (CIS) for each pair of variables a, b. Therefore we propose the following Necessary Path Condition to be used to check for inconsistencies in the algorithm: For each absent edge $[a, b]$ in the network, at least one (not all) CIS $I(a, b \mid S)$, with the corresponding $D(a, b \mid S')$ for all $S' \subset S$, has to be represented in the perfect map according to the above proposition. This is less strict than the proposition itself, but makes the algorithm more robust, since an inconsistency is only detected if there is an edge for which no CIDS can be found consistent.

Example (cf. Figure 1): Figure 1 shows, in a toy model of three variables a, b and c, that it is essential to check the set of CIDSs derived from limited data for consistency with the Bayesian network model, since each conditional independence statement (CIS) or conditional dependence statement (CDS) is derived from the data without taking into account the model at all. In (B) we compare the resulting networks constructed from (possibly inconsistent) sets of CIDSs by two different algorithms: first, the SGS and PC algorithms (dashed arrows), which assume an edge absent in the model whenever a conditional independence is derived from the data, without checking for consistency; second, the proposed algorithm (solid arrows), which checks for consistency.
In the first scenario (cf. network (I) in (B)), assume that the only conditional independence statement derived from the data is $I(a, c \mid \{b\})$ (together with the (conditional) dependence statements $D(a, b)$, $D(a, c)$, $D(b, c)$, $D(a, b \mid \{c\})$, $D(b, c \mid \{a\})$). According to the proposition this requires that there must exist a path between a and b as well as between b and c, whereas the edge $[a, c]$ (see footnote 2) must be absent. Apparently, there exists a perfect map, namely the one comprising the edges $[a, b]$ and $[b, c]$, i.e. the set of CIDSs is consistent.

In the second scenario (cf. networks (II) in (B)), assume the two CISs $I(a, c \mid \{b\})$ and $I(b, c \mid \{a\})$, and the CDSs $D(a, b)$, $D(a, c)$, $D(b, c)$, $D(a, b \mid \{c\})$, are found from the data. The edge $[a, b]$ is required by these CIDSs, but the CIS $I(a, c \mid \{b\})$ together with $D(a, c)$ additionally requires the presence of the edge $[b, c]$ and the absence of the edge $[a, c]$, whereas the other CIS requires just the contrary. Hence, there cannot exist a perfect map fulfilling all the CIDSs simultaneously. For example, the set of CIDSs corresponding to a perfect map containing only the edge $[a, b]$ (as found by the SGS or PC algorithm) comprises $I(a, c)$ and $I(b, c)$, which violates both $D(a, c)$ and $D(b, c)$.

If one assumes now that the inconsistencies among the CIDSs are solely due to sampling noise in the limited data set, it makes sense to search for all possible Bayesian networks with a minimum number of edges, each of which is constructed from a maximal consistent subset of the CIDSs derived from the data. In this example, it is apparent that exactly one of the two CISs cannot be represented in a perfect map. Consequently, one finds two possible network structures: the edge $[a, b]$ is common to both, and they contain edge $[a, c]$ or $[b, c]$, respectively. Neither of the two networks represents all the CISs which were derived from the data. Note that the correct causal network, i.e. the perfect map of the (in general unknown) true set of CIDSs, is among the multiple solutions in this example.
Finally, the set of CIDSs comprising the unconditional CISs $I(a, c)$ and $I(b, c)$ is consistent and leads to a network with only the edge $[a, b]$ being present (cf. network (III) in (B)), i.e. the original network structure could not be recovered from these CIDSs, which tend to be derived from small data sets. In (C) and (D) the resulting network structures are shown as a function of the sample size. While our approach yields multiple solutions (cf. (II) in (D)) for "medium sized" data sets (sample size between ca. 70 and ca. 700), the SGS or PC algorithms find the network comprising the single edge $[a, b]$ (cf. (C)).

Footnote 2: In this notation for undirected edges, the order of the variables does not matter, i.e. $[a, b] = [b, a]$.

4 ALGORITHM

Before we go into details of the proposed algorithm, we give a short overview of its main steps. After the set of conditional independence and dependence statements (CIDS) has been derived from the data, the proposed algorithm applies the Necessary Path Condition presented above in order to check whether a perfect map can exist given the set of CIDSs. If inconsistencies are detected, the algorithm searches for all possible Bayesian networks as described in detail in Section 4.3. For each of the multiple solutions, the directions of the edges are fixed in the last step. All these networks contain the same number of edges. Among those are all the edges resulting from the SGS algorithm (see footnote 3). Each of the minimal solutions found by the algorithm represents an equivalence class, and is a candidate for being a perfect map of the (unknown) true probability distribution of the data set.

4.1 DERIVING THE CIDSs

The algorithm builds up the set of conditional independence and dependence statements from a given data set in the first step. This can efficiently be done by relying on asymptotic results, which appear reasonably accurate in our experiments, for example by calculating the Bayesian Information Criterion or a test statistic for each pair of variables given a subset of the remaining variables.
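As an illustration of such a test statistic for discrete variables, the sketch below computes the likelihood-ratio statistic $G^2$ from counts. It is a simplified example under stated assumptions, not the procedure of the paper: the function name `g_squared`, the row-of-dicts data layout, and the convention of counting one block of degrees of freedom per observed configuration of the conditioning set are all choices made here. The decision threshold is left to the caller; 6.635 is the chi-square critical value for one degree of freedom at level 0.01.

```python
import math
from collections import Counter


def g_squared(rows, x, y, cond=()):
    """Return (G^2, degrees of freedom) for testing x independent of y given cond.

    rows: list of dicts mapping variable name -> discrete state.
    """
    n_xyz, n_xz, n_yz, n_z = Counter(), Counter(), Counter(), Counter()
    for row in rows:
        z = tuple(row[v] for v in cond)
        n_xyz[(row[x], row[y], z)] += 1
        n_xz[(row[x], z)] += 1
        n_yz[(row[y], z)] += 1
        n_z[z] += 1
    g2 = 0.0
    for (xv, yv, z), observed in n_xyz.items():
        # expected count under conditional independence within stratum z
        expected = n_xz[(xv, z)] * n_yz[(yv, z)] / n_z[z]
        g2 += 2.0 * observed * math.log(observed / expected)
    x_states = {row[x] for row in rows}
    y_states = {row[y] for row in rows}
    dof = (len(x_states) - 1) * (len(y_states) - 1) * len(n_z)
    return g2, dof


CHI2_CRIT_1DOF_001 = 6.635  # chi-square critical value, 1 degree of freedom, level 0.01

independent = [{"x": i % 2, "y": (i // 2) % 2} for i in range(100)]
copied = [{"x": i % 2, "y": i % 2} for i in range(100)]
for name, data in (("independent", independent), ("copied", copied)):
    g2, dof = g_squared(data, "x", "y")
    verdict = "dependent" if g2 > CHI2_CRIT_1DOF_001 else "independent"
    print(name, round(g2, 3), dof, verdict)
```

Only cells with non-zero counts contribute to the sum, so empty cells need no special handling.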
Considering independence tests, it is apparent that CDSs derived from the data are more reliable than the CISs. This is because the null hypothesis of independence can only be falsified but not verified by a test. This indicates that network structures learned by a constraint based approach tend to remove too many edges, in particular given small data sets. As will be shown below, this tendency of removing too many edges can largely be reduced by checking the CIDSs for consistency. For convenience we do not denote the conditional dependence statements, i.e. if a statement is not in the set of CISs (and cannot be inferred from it by combining some of the CISs according to the Faithfulness and Markov Conditions as described in [Spirtes et al. 1993]), it is understood that this implies a (conditional) dependence.

Footnote 3: In this paper, we focus on how to construct the skeleton of a Bayesian network from the CIDSs rather than how to derive the CIDSs themselves. The CIDSs derived by the SGS and PC algorithms can differ, since the latter makes use of heuristics. Here, we compare our algorithm with the SGS algorithm, because the experiments carried out here are based on the CIDSs derived by the SGS algorithm. Of course, our algorithm can also be applied to the CIDSs derived by the PC algorithm, and the same statements apply as for the SGS algorithm.

Figure 2: The alarm network contains 37 variables and 46 edges. The numbering of the variables is chosen as in [Cheng et al. 1997].

4.2 RULES

For each given CIS of the form $I(a, b \mid S)$ with $D(a, b \mid S')\ \forall S' \subset S$, the proposed Necessary Path Condition requires the absence of the edge $[a, b]$ and the presence of the paths between a and each variable $s \in S$ as well as between b and each variable $s \in S$. We represent this constraint on the absence or presence of certain edges by a rule. Such a rule is of the form $X \leftarrow \mathcal{Y}$, where $X$ denotes an edge and $\mathcal{Y}$ is a (possibly empty) set of edges. According to the Necessary Path Condition, the above CIS with the CDSs is translated into a rule as follows: $X = [a, b]$ and $\mathcal{Y} = \bigcup_{s \in S} \{[a, s], [b, s]\}$. It can be interpreted in the way that edge $X$ can only be absent in the Bayesian network if the edges in the set $\mathcal{Y}$ are present.

Since the above proposition requires certain paths rather than certain edges to be present, this is absorbed in the fact that new rules can be generated by substituting rules into each other as follows: Given two rules $X \leftarrow \mathcal{Y}$ and $W \leftarrow \mathcal{Z}$, a new rule can be generated for edge $X$ if $W \in \mathcal{Y}$ and $X \notin \mathcal{Z}$, namely $X \leftarrow (\mathcal{Z} \cup \mathcal{Y} \setminus \{W\})$. Therefore an edge can be absent if there is at least one rule fulfilled (cf. Necessary Path Condition).

Once the set of rules is derived from the set of CIDSs, the associated perfect map can be constructed. If there is a Bayesian network such that for all edges a rule is fulfilled, then there might exist a perfect map associated with the estimated probability distribution. If there are some edges for which none of the given rules can be fulfilled by a Bayesian network, we call the set of rules and the set of CIDSs inconsistent. In this case, there does not exist a perfect map associated with the estimated probability distribution.
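The translation of a CIS into a rule and the substitution step can be sketched in a few lines. This is an illustrative sketch only: the names `edge`, `rule_from_cis` and `substitute`, and the representation of a rule as a pair (edge, frozenset of edges), are assumptions made here, and the union in the substitution is read as $\mathcal{Z} \cup (\mathcal{Y} \setminus \{W\})$.

```python
def edge(u, v):
    """Undirected edge; the order of the endpoints does not matter."""
    return frozenset((u, v))


def rule_from_cis(a, b, cond_set):
    """Translate I(a, b | cond_set) into a rule (X, Y): X may be absent only if all of Y are present."""
    required = set()
    for s in cond_set:
        required.add(edge(a, s))
        required.add(edge(b, s))
    return edge(a, b), frozenset(required)


def substitute(rule_x, rule_w):
    """Combine X <- Y and W <- Z into X <- (Z | (Y - {W})) if W is in Y and X is not in Z."""
    x, y = rule_x
    w, z = rule_w
    if w in y and x not in z:
        return x, frozenset(z | (y - {w}))
    return None  # substitution not applicable


# Hypothetical statements I(a, d | {b}) and I(b, d | {c}):
r_ad = rule_from_cis("a", "d", {"b"})   # [a,d] <- {[a,b], [b,d]}
r_bd = rule_from_cis("b", "d", {"c"})   # [b,d] <- {[b,c], [c,d]}
x, body = substitute(r_ad, r_bd)
print(sorted(x), sorted(sorted(e) for e in body))

# Second scenario of Figure 1: the two rules refer to each other, so no substitution is possible.
r_ac = rule_from_cis("a", "c", {"b"})
r_bc = rule_from_cis("b", "c", {"a"})
print(substitute(r_ac, r_bc))
```

In the first case the substituted rule requires the path a, b, c, d instead of the single edge between b and d, which is how paths rather than edges enter the rules.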
4.3 MULTIPLE SOLUTIONS

Our algorithm finds multiple solutions when there does not exist a perfect map of the estimated CIDSs according to the Necessary Path Condition. If inconsistencies among the CIDSs are found by the algorithm, it makes sense to consider these inconsistencies to be present due to sampling noise in the limited data set and to retain the assumption that there exists a perfect map associated with the (unknown) set of (true) CIDSs. Therefore the algorithm searches for all possible maximal subsets of the set of CIDSs which are consistent in the above sense, i.e. the algorithm searches for all possible network structures with a minimum number of edges such that for a maximum number of edges a rule is fulfilled. Each of these possible networks is a candidate for being the perfect map of the (unknown) set of true CIDSs.

It turns out that the edges of the resulting networks can be divided into three main groups: If there is no rule $X \leftarrow \mathcal{Y}$ for an edge $X$, then no estimated CIS was found, and hence this edge is present in the network. If a rule $X \leftarrow \mathcal{Y}$ can be derived for edge $X$ such that $\mathcal{Y}$ is empty or contains only edges which are present in the perfect map, then we call it a consistent edge, which is absent. An edge for which no such rule can be generated indicates that there does not exist a perfect map given the set of CIDSs. Such edges we call inconsistent, and they might be present or absent in some of the multiple solutions. The multiple solutions differ only in the presence of the latter edges.

Figure 3: This graph sketches the multiple solutions learned from a data set of size [value missing in the source] with the significance level 0.01 before the directions of the edges are added. Solid lines denote edges which are present in all the possible network structures. The multiple solutions differ in the edges belonging to the 4 ambiguous regions, which are depicted in different line styles. The possible structures for each region are as follows: In region (A) either edge $[11, 12]$ or edge $[12, 32]$ is present. In region (B) either $[14, 27]$ or $[27, 33]$ is present. In region (C) either $[15, 22]$ or $[22, 35]$ is present. In region (D) either the single edge $[18, 26]$ or the two edges $[6, 18]$ and $[3, 26]$ are present; here, the only minimum structure is the one comprising the single edge $[18, 26]$. Hence, there are two minimum structures in each of the regions (A), (B) and (C), and one minimum structure in region (D). Therefore, the overall number of multiple solutions is $2 \cdot 2 \cdot 2 \cdot 1 = 8$. 41 edges which are present in all the multiple solutions have correctly been identified. An additional 4 edges have been found due to the Necessary Path Condition implemented in our algorithm. The networks among the multiple solutions which are closest to the original one contain 45 (out of 46) correct edges, and only one edge is missing, namely either $[15, 22]$ or $[22, 35]$. The resulting network of the SGS algorithm (see footnote 3) contains the same 41 edges, which are common to all multiple solutions of our algorithm, so that 5 edges are missing.

As we will see from the experiments (in Section 5), the inconsistent edges can usually be further subdivided: They can be partitioned into sets of edges whose presence depends on each other, whereas edges belonging to different sets do not depend on each other. Each such set we call an ambiguous region. There might be several such regions. Technically speaking, the algorithm finds all the edges belonging to the same ambiguous region in the following way: Only the rules for inconsistent edges are considered.
If for two inconsistent edges $X$ and $W$ rules $X \leftarrow \mathcal{Y}$ and $W \leftarrow \mathcal{Z}$ can be derived such that $X \in \mathcal{Z}$ and $W \in \mathcal{Y}$, then they are grouped into the same ambiguous region, because it might not be possible to fulfill both of those rules simultaneously in a Bayesian network. This can also be seen as searching for cycles in a directed graph which is generated from the rules: Each node of that graph represents an inconsistent edge of the Bayesian network, and for each rule $X \leftarrow \mathcal{Y}$ edges are present pointing from each node $Y \in \mathcal{Y}$ to node $X$ in that graph.

Our algorithm takes advantage of the fact that the multiple solutions differ only in the ambiguous regions and that they are independent of each other. Since each such region usually contains only a few edges, searching for all possible structures such that the number of consistent CISs is maximum can be done very efficiently, simply by carrying out an exhaustive search in each ambiguous region separately. The number of different network structures is then given by the product of the number of different structures in each ambiguous region.

4.4 FINDING DIRECTIONS OF EDGES

Constraint based algorithms of this kind have the property that they can be split up into several steps. First, the algorithm finds the (undirected) edges which are present in the Bayesian network. We have focused on that part in this paper. In the second step, directions are added to those edges which can be derived from the data so that the equivalence classes are identified. This can be done as in [Spirtes et al. 1993], for example. The fact that a data set is of limited size might, however, give rise to additional inconsistencies among the estimated CIDSs regarding the directions of the edges. This additionally increases the number of multiple solutions, which might differ in the directions of their edges although they have the same edges in common. We do not present any details on that issue in this paper.
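The grouping into ambiguous regions and the per-region exhaustive search described in Section 4.3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: rule bodies are assumed to be already reduced to inconsistent edges of the same region, a structure is accepted only if every absent edge of the region has a fulfilled rule, and the rules in the example are hypothetical, chosen so that they reproduce the region structure stated in the caption of Figure 3.

```python
from itertools import combinations
from math import prod


def ambiguous_regions(rules):
    """Group inconsistent edges whose rules refer to each other (transitively)."""
    parent = {x: x for x in rules}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for x, bodies in rules.items():
        for body in bodies:
            for w in body:
                # mutual reference: W occurs in a rule for X and X in a rule for W
                if w in rules and any(x in z for z in rules[w]):
                    parent[find(x)] = find(w)
    regions = {}
    for x in rules:
        regions.setdefault(find(x), []).append(x)
    return list(regions.values())


def minimum_structures(region, rules):
    """Smallest sets of present edges such that every absent edge has a fulfilled rule."""
    for size in range(len(region) + 1):
        found = [
            set(present)
            for present in combinations(region, size)
            if all(
                any(body <= set(present) for body in rules[x])
                for x in region
                if x not in present
            )
        ]
        if found:
            return found
    return []


# Hypothetical rules; edge labels follow the numbering in Figure 3.
rules = {
    "11-12": [{"12-32"}], "12-32": [{"11-12"}],
    "14-27": [{"27-33"}], "27-33": [{"14-27"}],
    "15-22": [{"22-35"}], "22-35": [{"15-22"}],
    "18-26": [{"6-18", "3-26"}], "6-18": [{"18-26"}], "3-26": [{"18-26"}],
}
regions = ambiguous_regions(rules)
counts = [len(minimum_structures(region, rules)) for region in regions]
print(len(regions), sorted(counts), prod(counts))
```

Because the regions are searched separately, the total number of solutions is obtained as a product of per-region counts rather than by enumerating joint configurations.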

4.5 COMPLEXITY

Calculating the Bayesian information criterion or a test statistic for each pair of variables given every subset of the remaining variables is intractable for large numbers of variables. Since not all of those computations are usually necessary for constructing the Bayesian network from data, the complexity of the problem can be reduced by applying heuristics (see for instance [Spirtes et al. 1993]). Carrying out the computations for all pairs of variables in ascending size of the conditioning set and up to a certain maximum order only reduces the complexity greatly, i.e. it becomes polynomial in $|V|$. For CISs of high orders which are not computed from the data set, it is assumed that exactly those are true which can be inferred from the CISs of lower orders according to the Faithfulness and Markov Conditions as described in [Spirtes et al. 1993]. Calculations relying on asymptotic results might yield more unreliable results for higher orders, anyway.

Conditioning only on neighbors of the pair of variables (in the undirected graph) is an additional heuristic to speed up the derivation of the CIDSs. These heuristics require the algorithm to keep track of the network structure when deriving the CIDSs. After finishing all calculations of a certain order of the conditioning set, a new intermediate version of the network can be built up based on the results available so far. Then, this version can be used to decide whether carrying out computations on higher orders is necessary, and if so, what the neighbors of each node are.

Finding all possible structures of the network from the set of rules is, in principle, intractable for a large number of variables, too. As we found in our experiments, however, there are edges which are present in all the multiple solutions and only a few edges which belong to ambiguous regions. Since the structures in different ambiguous regions are independent of each other, all the possible structures can be found for each ambiguous region separately, which speeds up calculation very much.
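Both reductions discussed in this section can be made concrete with a little counting. The sketch below is illustrative only: the function names are assumptions, the first function counts (pair, conditioning set) combinations without the neighbor heuristic, and the second compares the number of edge configurations inspected jointly with the number inspected region by region, using the region sizes from the caption of Figure 3.

```python
from math import comb


def num_ci_tests(n_vars, max_order):
    """Number of (pair, conditioning set) combinations with conditioning sets of size <= max_order."""
    pairs = comb(n_vars, 2)
    return pairs * sum(comb(n_vars - 2, k) for k in range(max_order + 1))


def search_space(region_sizes):
    """Edge configurations to inspect: jointly over all regions vs. region by region."""
    joint = 2 ** sum(region_sizes)
    separate = sum(2 ** size for size in region_sizes)
    return joint, separate


print(num_ci_tests(37, 2))          # alarm network size, orders 0 to 2
print(num_ci_tests(37, 35))         # every subset of the remaining 35 variables
print(search_space([2, 2, 2, 3]))   # regions (A) to (D) of Figure 3
```

For a fixed maximum order the first count is a polynomial in the number of variables, whereas allowing every subset makes it grow exponentially.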
Hence, the problem is exponential in the number of rules involved in the largest ambiguous region of the network. In the experiments we found that the size of the ambiguous regions is much smaller than the total number of edges. Therefore, finding all the possibilities in each ambiguous region becomes tractable. For example, in our alarm network experiments, it took less than a minute on a Sun UltraSPARC-II to generate all the possible structures given the set of CIDSs. It turned out in our experiments that almost all the calculation time is consumed for deriving the CIDSs from the data and only a small fraction is needed for constructing all the multiple solutions.

5 EXPERIMENTS

The alarm network [Beinlich et al. 1989] has evolved as kind of a benchmark for structural learning of Bayesian belief networks. We used the alarm network from Norsys Corp. [Netica Alarm Network] to randomly generate data sets of various sizes. The variables are discrete and their number of states ranges from two to four. The size of the sample data was varied between [lower bound missing in the source] and 1000, which is much smaller than the number of configurations in the joint state space.

[Figure 4. Panels (a) and (b). Vertical axes: number of correct edges (a), number of edges (b). Horizontal axes: sample size (x1000).]

Figure 4: (a) The number of correct edges learned from data depends on the sample size. The networks among the multiple solutions which have the most edges in common with the original network contain almost all the correct edges even for small sample sizes (dots). The number of those edges is significantly closer to the correct number of 46 edges than is the number of edges which are common to all the learned multiple solutions (triangles). The latter edges are identical with the edges found by the SGS algorithm (see footnote 3). (b) The difference of the two curves in (a) is the number of edges which are simultaneously present in the ambiguous regions of any one of the multiple solutions (dots). The number of edges missing in all the multiple solutions rises for smaller sample sizes (squares), but stays at smaller values than the number of edges being present in the ambiguous regions due to the proposed Necessary Path Condition. In contrast, the sum of those edges (dots + squares) is missing in the network resulting from the SGS algorithm (see footnote 3). The number of erroneously added edges stays at small values (crosses) due to a small significance level of 0.01. Each point represents the mean of 5 experiments of the alarm network.

In order to derive the conditional independences, we used the likelihood ratio test with an asymptotic $\chi^2$ distribution as described in [Spirtes et al. 1993] for discrete variables. A typical result of structural learning from a data set of limited size is sketched in Figure 3. For this data set of size [value missing in the source] the algorithm detected that there does not exist a perfect map such that all the CIDSs derived from the data can be fulfilled according to the Necessary Path Condition. The multiple solutions differ in the presence of the edges belonging to 4 ambiguous regions, for each of which the possible structures are described in the caption. The networks among the multiple solutions of our algorithm which are closest to the original network contain 45 correct edges (out of 46), whereas the SGS algorithm (see footnote 3) yields a network with only 41 correct edges.

In particular for fairly small data sets the differences in the solutions of those two algorithms increase. In our approach, the number of missing edges grows much more slowly with decreasing sample size than it does in the resulting network of the SGS algorithm, because the number of edges present in the ambiguous regions increases (cf. Figure 4). The number of edges erroneously present stays particularly small due to the fact that our algorithm allows the use of a small significance level when applying conditional independence tests.

The number of multiple solutions depends on the size of the data set (cf. Figure 5). From a frequentist point of view, the estimated probability distribution of a larger data set is expected to be more similar to the (unknown) true distribution, so that the structure of the causal network can uniquely be determined. As the size of the data set shrinks, a unique causal network cannot be derived any more and the number of possible networks of the data increases. Hence, the number of multiple solutions found by the algorithm indicates, in a way, whether the size of the data set is sufficiently large for learning the network structure with certainty. As the size of the data set decreases, not only does the number of multiple solutions increase, but so does the number of ambiguous regions, whereas the number of edges involved in each ambiguous region increases only slowly.

[Figure 5. Vertical axes: number of multiple solutions (main plot), number of ambiguous regions (inset). Horizontal axes: sample size (x1000).]

Figure 5: As the size of the data set decreases, the number of multiple solutions increases and so does the number of ambiguous regions (cf. inset). Each point represents the mean of 5 experiments of the alarm network with a significance level of 0.01.

6 CONCLUSIONS

Structural learning of causal networks based on this kind of constraint based approach can be split up into several steps which are carried out sequentially. First, it is learned which (undirected) edges are present in the network, then their directions are fixed and eventually the values of the parameters are adapted to the data set. In this paper we focus on the first step, deciding which edges are present and absent in the situation that only a limited amount of data is available.

The proposition presented here serves as a necessary condition for the existence of a perfect map given a set of conditional independence and dependence statements (CIDS). It essentially states that if an edge is absent in the perfect map, certain other paths are required to be present. The proposed algorithm checks the set of CIDSs for consistency with the Bayesian network model according to the Necessary Path Condition. If inconsistencies are found, it is assumed that they are solely due to sampling noise in the limited data set and that there nevertheless exists a perfect map of the true, yet unknown, probability distribution. Therefore, the algorithm searches for all network structures which contain a minimum number of edges and represent a maximum number of consistent CIDSs. This results in multiple solutions. It turned out in our experiments that all the multiple solutions have many edges in common.
There are also some edges (which we call inconsistent edges) in which the structures of the multiple solutions differ from each other. Usually, they can be grouped together in what we call ambiguous regions. Within each region, the possibly multiple structures can be found efficiently, since the ambiguous regions are independent of each other and usually involve only a small number of edges. The overall number of multiple solutions is given by the product of the number of different minimum structures in each of the ambiguous regions.

We found in the experiments that the number of multiple solutions increases when the size of the data set decreases. Furthermore, the number of ambiguous regions also rises as the size of the data set decreases. The multiple solutions of our algorithm contain all the edges which are present in the network found by the SGS algorithm [Spirtes et al. 1993] as well as some additional edges, since an estimated conditional independence statement (CIS) does not necessarily lead to the absence of the corresponding edge in the Bayesian network. When the size of the data set decreases, the overall number of correct edges in the resulting networks of our algorithm drops much more gradually than it does in the network found by the SGS algorithm. Depending on the properties (e.g. size) of the data set, we found in our experiments that the number of edges present in the networks learned by our approach can be larger than in the result of the SGS or PC algorithms by 0 to 30%.

In Bayesian approaches like [Cooper and Herskovits 1992, Heckerman et al. 1994, Heckerman 1995], a cost function for the entire network is evaluated. They can therefore be called global approaches. Constraint based algorithms like [Spirtes et al. 1993, Suzuki 1996, Cheng et al. 1997], which remove all edges for which a conditional independence statement can be derived from the data, do not take into account the structure of the Bayesian network at all and can therefore be considered local. The algorithm presented here is also constraint based, but checks the set of CIDSs for consistency by requiring the presence of certain paths for each edge that is absent in the network. Therefore, our approach is not completely local, but takes into account the neighborhood of each edge. The size of such a neighborhood can vary, as it depends, for example, on the number and lengths of the required paths.
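The product rule for the overall number of multiple solutions can be illustrated with a small sketch. The regions, edges, and variable names below are invented for illustration and are not taken from the alarm network experiments. Each ambiguous region is assumed to be given as a list of its alternative minimum structures, each of which is a set of undirected edges.

```python
from itertools import product
from math import prod

# Edges shared by all solutions (hypothetical).
common_edges = {("B", "C"), ("C", "D")}

# Two hypothetical ambiguous regions with 2 and 3 alternative
# minimum structures, respectively.
regions = [
    [{("A", "B")}, {("A", "C")}],
    [
        {("D", "E"), ("E", "F")},
        {("D", "F"), ("E", "F")},
        {("D", "E"), ("D", "F")},
    ],
]

# Because the regions are independent of each other, the number of
# solutions is the product of the alternatives per region: 2 * 3.
n_solutions = prod(len(region) for region in regions)

# Each solution picks one alternative per region and adds the common edges.
solutions = [
    common_edges.union(*choice) for choice in product(*regions)
]
```

Enumerating the solutions explicitly is only practical while the product stays small. Reporting the regions and their alternatives separately conveys the same information more compactly.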
From an application point of view, we might state that a sufficient amount of data for arriving at a unique solution is rarely, if ever, available. If the amount of data is very small, one cannot really expect too much to start with. In the intermediate region, however, our new approach finds that there is structure which could not be uniquely identified and provides a clear statement about its uncertainty. Without displaying this uncertain structure, much is lost in the interpretation of the data.

In conclusion, we believe that Bayesian networks will play an increasing role in data mining applications, where they are capable of efficiently displaying the important dependences in a domain. The constraint based approach is superior to the global Bayesian approach in terms of learning speed. We have extended the constraint based approach to truthfully display structural ambiguities, which we feel is an important step towards gaining the acceptance of the user.

Acknowledgements

We would like to thank Reimar Hofmann for fruitful discussions. This work has been supported by an Ernst-von-Siemens grant and is part of the PRONEL project (PRObabilistic NEtworks and Learning) funded by the EU (Esprit Project number 28932, partners: Siemens, HUGIN, Aalborg University, TIGA).

References

[Pearl 1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo.

[Wermuth and Lauritzen 1983] N. Wermuth and S. Lauritzen. Graphical and recursive models for contingency tables, Biometrika 72.

[Cooper and Herskovits 1992] G. F. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, Vol. 9, July.

[Heckerman 1995] D. Heckerman. A Tutorial on Learning Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Research.

[Heckerman et al. 1994] D. Heckerman, D. Geiger and D. M. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data, Technical Report MSR-TR-94-09, Microsoft Research.

[Spirtes et al. 1993] P. Spirtes, C. Glymour and R. Scheines. Causation, Prediction, and Search, Springer, Lecture Notes in Statistics 81, 1993; TETRAD.BOOK/book.html

[Cheng et al. 1997] J. Cheng, D. A. Bell, W. Liu. An Algorithm for Bayesian Belief Network Construction from Data, AI & STAT, 1997; Learning Belief Networks from Data: An Information Theory Based Approach, CIKM.

[Fung and Crawford 1990] R. M. Fung and S. L. Crawford. Constructor: A System for the Induction of Probabilistic Models, AAAI-90 Proceedings. Eighth National Conference on Artificial Intelligence, MIT Press, 1990, Vol. 2.

[Suzuki 1996] J. Suzuki. Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: An Efficient Algorithm Using the B & B Technique, Proceedings of the International Conference on Machine Learning, Bari, Italy.

[Beinlich et al. 1989] I. A. Beinlich, H. J. Suermondt, R. M. Chavez and G. F. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks, Proceedings of the Second European Conference on Artificial Intelligence in Medicine, London, UK.

[Netica Alarm Network] Alarm Network from the Network Library of Norsys Software Corp.
