Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers


Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers

Eugene Santos, Dept. of Computer Science and Engineering, University of Connecticut, Storrs, CT 06268, USA
Ahmed Hussein, Dept. of Computer Science and Engineering, University of Connecticut, Storrs, CT 06268, USA

Abstract -- Recent work in Bayesian classifiers has shown that a better and more flexible representation of domain knowledge results in more accurate classifiers. We have recently examined a new type of Bayesian classifier called the Case-Based Bayesian Network (CBBN) classifier. The basic idea is to partition the training data into semantically sound clusters. A local BN classifier is then learned independently from each cluster. Such a flexible organization of domain knowledge can represent dependency assertions among attributes more accurately and more relevantly than is possible in traditional Bayesian classifiers (i.e., BN and BMN classifiers), hence improving classification accuracy. Recursive Bayesian Multi-nets (RBMNs) also provide a more flexible representation scheme than BNs and generalize BMNs. Briefly, an RBMN is a Decision Tree (DT) with component BNs at the leaves. In this paper, we further explore our CBBN classifiers by comparing them to RBMN classifiers. RBMNs partition the data using a DT induction algorithm. By contrast, CBBNs rely on a flexible clustering strategy that handles outliers, thereby allowing more freedom to search for the best way to cluster the data and represent the knowledge. Our experimental results show that CBBN classifiers perform significantly better than RBMN classifiers.

Keywords: Bayesian Classifiers, Data Clustering, Learning Bayesian Networks, Knowledge Representation, Uncertainty.

I. INTRODUCTION

Recent work in Bayesian classifiers has shown that better and more flexible representations of the dependency assertions among attributes can significantly improve classification accuracy. For Bayesian Network (BN) classifiers [2], [5], the basic idea is to exploit learning algorithms such as the MDL score algorithm [7] and the information gain test algorithm [1] to learn more expressive relationships between attributes from the training data. For example, the Tree Augmented naive Bayes (TAN) classifier relaxes the structure of the naive Bayes classifier by approximating the interactions between the attributes using a tree-like structure. In addition, the Bayesian network Augmented naive Bayes (BAN) classifier extends TAN by allowing the attributes to form an arbitrary graph rather than only a tree. Another type of BN classifier that permits an even more flexible structure is the General Bayesian Network (GBN) classifier, which treats the classification node as an ordinary node and identifies a relevant attribute subset around it determined by its Markov blanket. These classifiers have shown a considerable improvement in classification accuracy over naive Bayes [2], [5]. Bayesian Multi-Net (BMN) classifiers [3], [5] are a generalized form of the augmented naive structures (i.e., TAN and BAN) in the sense that they allow different relationships among attributes for different values of the class variable C. A BMN classifier consists of the prior probability distribution of the class variable C and a set of local networks, each corresponding to a value of C.
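Since a BMN classifier pairs the class prior with one local network per class value, its decision rule can be sketched in a few lines. The following is a minimal, hypothetical sketch; the nets dictionary and its likelihood method are illustrative assumptions, not an API from the paper:

    # Minimal sketch of Bayesian multi-net classification (hypothetical API).
    # `prior[c]` is P(C = c); `nets[c]` is the local network learned for class c,
    # assumed to expose likelihood(x) = P(a_1, ..., a_n | B_c).

    def bmn_classify(x, prior, nets):
        """Return the class value c maximizing P(c) * P(x | B_c)."""
        best_c, best_score = None, float("-inf")
        for c, net in nets.items():
            score = prior[c] * net.likelihood(x)
            if score > best_score:
                best_c, best_score = c, score
        return best_c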
Although the structures of BMN classifiers are strictly more expressive than those of BN classifiers (i.e., TAN and BAN), experiments have shown that BMN classifiers perform about as well as BN classifiers and that neither approach clearly dominates [3], [5]. Recursive Bayesian Multi-Nets (RBMNs) [8] are a knowledge representation scheme that partitions the domain knowledge according to a Decision Tree (DT) induction algorithm, with component BNs at the leaves. Thus, they constitute a more flexible tool than BNs and provide structured, specialized domain knowledge, since alternative component BNs are learned for each decision path¹. Moreover, RBMNs generalize the idea behind BMNs by allowing decision paths described by conjunctions of many attribute-value pairs, rather than only the class attribute-value pair. The only constraint is that these decision paths must be representable by a DT. In recent work [9], we introduced Case-Based Bayesian Networks (CBBNs), a Bayesian knowledge representation scheme that allows an even better representation of the dependency assertions among attributes than is possible in BNs and BMNs. In particular, CBBNs can capture and encode case-dependent relationships, which are a generalization of the hypothesis-specific (i.e., class-specific) relationships defined in BMNs. The basic idea behind CBBNs is to intelligently partition the training data into semantically sound clusters of data. Each cluster is described by an assignment of its most important and descriptive attributes, called an index. A local BN can then be learned independently from the data in each cluster, conditioned on its index.

¹ A decision path in a DT is a set of edges from the root to a leaf, described by a set of attribute-value pairs; see Fig. 1.

Fig. 1. An example of the replication problem.
Fig. 2. An example of case-dependent relationships.

In order to validate our CBBN approach, we have compared CBBN classifiers to BN classifiers and BMN classifiers. Our empirical results have shown that CBBN classifiers considerably improve classification accuracy over BN and BMN classifiers. In this paper, we further explore our CBBN knowledge representation approach by comparing CBBN classifiers to RBMN classifiers. Our motivations behind this comparison are as follows. First, while the RBMN partitioning of knowledge is restricted to a DT induction algorithm, which may not fit the inherent (i.e., best) partitioning of the knowledge, our CBBNs relax this restriction by permitting any appropriate clustering methodology that discovers the best way to partition the data; knowledge can thereby be better represented and more accurate classifiers can be constructed. Second, the indices used by CBBNs provide a solution to some potential representational shortcomings of DTs, namely the replication and fragmentation problems described in the next section.

II. REPRESENTATIONAL PROBLEMS OF DTs

The replication problem occurs in a DT when two identical subtrees must be represented. An example is depicted in Fig. 1, where the subtree containing attributes A_3 and A_4 is duplicated. Another aspect of the replication problem, called fragmentation, can occur when the data contains attributes with too many possible values. In this case, a DT quickly fragments the data into many subsets. The effect of these problems is that extra, needless partitions of the data are constructed. This means that a large amount of data is required to build the tree. Furthermore, we may end up with small partitions in which the data is not statistically sufficient to learn a local BN. For example, to build an RBMN using the DT in Fig. 1, we have to learn seven local BNs, one for each decision path. We will show that the indexing scheme used in CBBNs can avoid these problems, since it handles "don't cares" in partition descriptions more robustly than decision paths do.

III. CASE-BASED BAYESIAN NETWORKS

Suppose that a domain is given by a database described by a set of categorical attributes. We use a clustering algorithm to discover meaningful patterns represented by different clusters of data. Each cluster is then characterized and discriminated from the other clusters by a unique assignment of its most relevant and descriptive attributes, called an index. Intuitively, each cluster represents a piece of the domain knowledge described by the context of its index. These clusters can also be viewed as a set of conditionally independent cases, with each case mapped to an index that describes the context of the knowledge relevant to that case. Because of the independent nature of the cases, the knowledge associated with each case can be represented separately by a BN. This representation of the independent cases implies that the relationships among the corresponding attributes might be different for different cases. Thus, instead of assuming fixed relationships between attributes for the whole domain, as in traditional BNs, these relationships can vary according to the context of each case in the same domain. This conclusion is crucial, since it means that two variables X and Y might be directly dependent (X → Y) in case C_i and independent in case C_j. Moreover, X → Y might hold in case C_i while X ← Y holds in case C_j.
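As an illustrative sketch (not taken from the paper's implementation), such case-dependent structure can be stored simply as one directed edge set per case, so the same pair of variables may be linked in one direction, in the opposite direction, or not at all, depending on the case; the third case C_k below is a hypothetical addition for illustration:

    # Illustrative only: one edge set per case captures case-dependent structure.
    case_structures = {
        "C_i": {("X", "Y")},   # X -> Y holds in case C_i
        "C_j": set(),          # X and Y are independent in case C_j
        "C_k": {("Y", "X")},   # the arc is reversed in a hypothetical case C_k
    }

    def parents(case, node):
        """Return the parents of `node` in the local network of `case`."""
        return {src for (src, dst) in case_structures[case] if dst == node}

    # Example: parents("C_i", "Y") == {"X"}, while parents("C_j", "Y") == set()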
Even if the relationships in different cases are the same, the parameters that represent the strength of these relationships might differ. As an example, consider a database of customers applying for a loan at a bank. Such a database might have two different clusters (i.e., cases), where each cluster represents a group of customers. The first group includes customers who have a good balance in their checking and saving accounts. The decision to grant a loan to these customers might not be influenced by whether a customer has a guarantor, owns property, or is a citizen, but it might be highly affected by his residency time and somewhat by his credit history. The second group might include customers who do not have a sufficient balance in their checking account and who have a poor credit history. For this group the situation is different, since the bank's decision will be highly affected by whether they own property, whether they have a guarantor, and whether they are citizens, as well as by their residency time. The bank's decision and its requirements may be considered as domain variables in a BN with case-dependent relationships among them (see Fig. 2). Another example is cyclic knowledge. In a database of patients suffering from diabetes, a doctor can distinguish between two groups of patients: those who have just started taking a specific medication and find that it causes an improvement, with the glucose level starting to decrease, and those who have been taking the medication for a while and, encouraged by the improvement, increase the dose of that medication or even take an additional one. In this example, the relation between medication level and health improvement is not purely unidirectional; it is case-dependent and cyclic.

Obviously, CBBNs subsume BMNs in that they exploit the inherent clusters in the data to partition the domain knowledge instead of relying on a simple partitioning restricted to the classes, which is useful only when the relationships among attributes are very different for different classes. CBBNs also generalize RBMNs, which partition the data using a DT induction algorithm. Such algorithms are restricted to statistical tests (e.g., information gain) for selecting the divisive attributes, and the resulting partitioning may not fit the inherent partitions of the data. An advantage of the partitioning methodology used in CBBNs over those used in BMNs and RBMNs is that it can handle outliers (noise) in the data. We argue that such a flexible partitioning of knowledge can be more expressive than the simple or restricted partitioning employed by BMNs and RBMNs, since it allows for a more natural clustering of the underlying data. Consequently, it can reveal more relevant and more accurate relationships among attributes.

IV. CBBN CLASSIFIERS

In this section, we describe algorithms to learn and test CBBN classifiers.

A. Formal Definitions

Suppose D is a data set described by a set of categorical attributes V = {A_1, A_2, ..., A_n}. Let R(A_i) be the domain of an attribute A_i (i.e., the set of all possible values/states of A_i).

Definition 1: A case C_j is a vector [c_{j,1}, c_{j,2}, ..., c_{j,n}] where c_{j,i} ∈ R(A_i) ∪ {x}, and x is a unique symbol, called "don't care", that is not in the range of any attribute.

Definition 2: Given two cases C_1 = [c_{1,1}, c_{1,2}, ..., c_{1,n}] and C_2 = [c_{2,1}, c_{2,2}, ..., c_{2,n}], C_1 and C_2 are said to be compatible iff, for every i = 1, 2, ..., n, at least one of the following conditions holds: (1) c_{1,i} = c_{2,i}, (2) c_{1,i} = x, or (3) c_{2,i} = x. Otherwise, C_1 and C_2 are said to be mutually exclusive.

Definition 3: A data object [d_1, d_2, ..., d_n] is said to be covered by a case C_j iff, for every i = 1, 2, ..., n, either (1) d_i = c_{j,i} or (2) c_{j,i} = x.

Definition 4: The index of a case C_j, denoted I_j, is the set of pairs (A_i, c_{j,i}) such that (A_i, c_{j,i}) ∈ I_j iff c_{j,i} ≠ x (i.e., I_j = {(A_i, c_{j,i}) : c_{j,i} ≠ x}).

B. Learning CBBN Classifiers From Data

The learning process is accomplished in the following two steps.

1. Clustering and indexing: a clustering algorithm is used to partition D into a set of clusters Q = {Q_1, Q_2, ..., Q_k} characterized by a set of mutually exclusive cases C = {C_1, C_2, ..., C_k}, respectively. Each case C_j is mapped to an index I_j ∈ I = {I_1, I_2, ..., I_k} as defined above. In order to generate such an indexing scheme, Algorithm A, shown below, begins by initializing C with don't-care values for all elements of each vector C_j. For a particular cluster Q_j, the algorithm computes the probability distribution of each attribute A_i (i.e., the frequencies of its possible values estimated from the data in this cluster).
The algorithm then determines the value of each attribute that has the maximum frequency and assigns this value to A_i in C_j (i.e., to c_{j,i}) if its frequency exceeds an indexing threshold α. The resulting case is then used as a description of the objects in Q_j; thus, the algorithm moves all objects that are not covered by C_j from Q_j to the outliers cluster Q_o. This procedure is repeated for each cluster. The algorithm then visits the outliers cluster to check whether any of its objects are covered by the cases describing the other clusters. Such objects are retrieved from the outliers cluster and placed in a cluster if they are compatible with (covered by) the cluster's description case. In order to achieve mutual exclusion between the above cases, Algorithm B checks each pair of cases for the mutual-exclusion condition (at least one common attribute assigned differently). If a pair does not satisfy this condition, the algorithm searches for an attribute whose value is don't-care in both cases and that can be assigned differently in the two cases such that a minimum number of objects is rejected from the two clusters by the new cases. The algorithm then updates the members of all clusters, including the outliers cluster, according to the new mutually exclusive cases. Finally, to produce the index of each case (cluster), the algorithm simply discards any don't-care attributes in each case.

Algorithm A: Clustering and Indexing
Input:
  D: data set described by categorical attributes A_1, ..., A_n, C
  k: number of clusters
  α: indexing threshold
Output:
  Q: set of k clusters Q_1, Q_2, ..., Q_k
  I: set of indices I_1, I_2, ..., I_k
  Q_o: possible outliers cluster
Notation:
  R(A_i): the domain of attribute A_i
  a_{j,i}: the value a_i that maximizes P(A_i = a_i | Q_j)
  P_{j,i}: P(A_i = a_{j,i} | Q_j)
  c_{j,i}: the value of attribute A_i in C_j, including x
Begin
  Call a clustering algorithm on D to form the set Q
  For each cluster Q_j
    Initialize C_j with don't-care values
    For each attribute A_i
      Compute P(A_i = a_i | Q_j) for all a_i ∈ R(A_i)
      Find a_{j,i} and P_{j,i}
      If (P_{j,i} > α) then set c_{j,i} = a_{j,i}
    Move the objects of Q_j not covered by C_j to Q_o
  For each cluster Q_j
    Move from Q_o the objects covered by C_j back to Q_j
  Call Algorithm B to make the vectors in C mutually exclusive
  For each cluster Q_j, using its updated C_j
    Move the objects of Q_j not covered by C_j to Q_o
  For each cluster Q_j, using its updated C_j
    Move from Q_o the objects covered by C_j back to Q_j
End

Algorithm B: Check and Fix
Input:
  Q: a set of k data clusters
  C: a set of k n-dimensional vectors (cases)
Output:
  I: a set of indices
Notation:
  c_{i,t}: the value of attribute A_t in C_i
  a_{i,t}: the value a_t that maximizes P(A_t = a_t | Q_i)
  u_i: number of objects in Q_i not covered by C_i
  u_j: number of objects in Q_j not covered by C_j
Begin
  For i = 1 to k - 1
    For j = i + 1 to k
      If (C_i and C_j are not mutually exclusive) then
        For each attribute A_t (t = 1, 2, ..., n)
          If (c_{i,t} = c_{j,t} = x) then
            Find a_{i,t} and a_{j,t}
            If (a_{i,t} ≠ a_{j,t}) then
              Set c_{i,t} = a_{i,t} and c_{j,t} = a_{j,t}
              Find u_i and u_j
              Compute s_t = u_i + u_j
              Restore the original state of C_i and C_j
        Find the attribute A_p that minimizes s_t
        Set c_{i,p} = a_{i,p} and c_{j,p} = a_{j,p}
  For each updated C_j
    Discard any don't-care attributes to form I_j
End

2. Learning: we apply a BN learning algorithm to learn a local BN classifier B_i, where i ∈ {1, 2, ..., k}, from the data objects in each indexed cluster produced by Algorithms A and B. This local classifier is defined over a subset V_i ⊆ V: if V(I_i) is the set of attributes appearing in I_i, then V_i = V \ V(I_i). We also learn a BN classifier B_o, defined over the whole set V, from the outliers cluster. The set of local classifiers together with the indices constitutes a CBBN classifier.

C. Testing CBBN Classifiers

We test the newly learned CBBN classification model on a given test data set T. Basically, we map each test object (a_1, a_2, ..., a_n) in T to an index in I by comparing their attribute assignments. We then compute P(C | a_1, a_2, ..., a_n) from the local classifier characterized by that index and assign to C the value that maximizes this probability. Because of the mutual-exclusion property of our indexing scheme, an object can map to at most one local classifier B_i. If an object cannot be mapped to any index in I, we map it to B_o as the default classifier. Finally, we compute the accuracy by comparing the predicted values of C to its true values in T.

TABLE I
INDICES GENERATED BY CBBNs

  cluster   index
  Q_1       {(A_3,1), (A_4,1)}
  Q_2       {(A_3,1), (A_4,0)}
  Q_3       {(A_1,1), (A_3,0)}
  Q_o       default

D. CBBN Classifiers vs. RBMN Classifiers

As mentioned above, RBMN classifiers rely on a DT induction algorithm to partition the training data set into subsets, each characterized by a decision path in the DT. A BN classifier is then learned at each leaf of the tree, without considering the attributes involved in the tests on the decision path leading to that leaf. Our CBBN classifiers generalize RBMN classifiers in three respects:

1) They permit more flexibility in partitioning the data by using any appropriate clustering algorithm instead of being restricted to a DT induction algorithm. Algorithms A and B presented above work independently of the clustering algorithm used to partition the data. This allows us to choose a clustering algorithm appropriate to the underlying data set. Moreover, we can optimize the input parameters of a specific clustering algorithm to obtain a clustering scheme that best partitions and represents the domain knowledge.
2) The partitioning approach used for CBBN classifiers can handle outliers in a data set. We argue that this ability plays an important role in improving classifier performance, since such outliers are likely to be incorrectly classified by RBMN classifiers.

3) The indexing approach used in CBBNs is a generalization of the decision paths used in RBMNs, since it allows a more robust handling of don't cares in contexts where certain attributes are not descriptors of a given cluster. This helps avoid the potential replication and fragmentation problems of DTs. For example, an alternative to the partitioning shown in Fig. 1 that might be suggested by our approach is the clustering scheme shown in Table I. This scheme defines only four clusters instead of the seven leaves of the DT. By using don't cares in the clusters' descriptive cases, partitions P_2 and P_5 of the DT are included in cluster Q_1, and partitions P_3 and P_6 are included in cluster Q_2.
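To make the mapping step of Section IV-C concrete, the following minimal sketch dispatches a test object to at most one local classifier and falls back to B_o when no index covers it. The helper names (indices, classifiers, posterior) are hypothetical, assuming indices stored as attribute-value dictionaries and local classifiers exposing a posterior over class values; this is a sketch, not the authors' implementation:

    # Minimal sketch of CBBN classification (Section IV-C), hypothetical API.
    # `indices` maps a cluster id to its index: a dict {attribute: value}
    # (don't-care attributes are simply absent).  `classifiers[cid]` and
    # `default_clf` are assumed to expose posterior(obj) -> {class_value: prob}.

    def covered_by(obj, index):
        """A test object is covered by an index iff it matches every indexed pair."""
        return all(obj.get(attr) == val for attr, val in index.items())

    def cbbn_classify(obj, indices, classifiers, default_clf):
        for cid, index in indices.items():
            if covered_by(obj, index):            # mutual exclusion: at most one hit
                posterior = classifiers[cid].posterior(obj)
                return max(posterior, key=posterior.get)
        posterior = default_clf.posterior(obj)    # fall back to B_o
        return max(posterior, key=posterior.get)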

V. EMPIRICAL EVALUATION

A. Experiment Settings

We have learned classifiers of different structures (i.e., naive, TAN, BAN, BAN*, GBN, and GBN*) from a set of twenty-five benchmark databases. These classifiers have been built using both the RBMN approach and our CBBN approach. Moreover, the structures of the local classifiers have been learned using different learning algorithms. In particular, we used the MDL score algorithm to learn BAN and GBN, and the CBL2 algorithm [1] to learn BAN* and GBN*. For the TAN classifier, we used the Chow and Liu algorithm [4] to learn a tree-like structure. The data sets were obtained from the UCI machine learning repository. In all data sets, objects with missing attribute values have been removed and numerical attributes have been categorized. When comparing a CBBN classifier to an RBMN classifier, we learned corresponding types of structure from exactly the same training set and tested them on exactly the same test set.

For DT induction, we used the well-known induction algorithm C4.5, with the divisive attribute at each node selected according to the sum of information gains over all attributes, rather than the gain with respect to the class attribute alone as in supervised learning. For data clustering in CBBNs, we used the k-modes algorithm [6], which extends the popular k-means algorithm to categorical domains. The biggest advantage of this algorithm is that it scales to very large data sets in terms of both the number of records and the number of clusters. Another advantage of the k-modes algorithm is that the modes provide characteristic descriptions of the clusters. These descriptions are important for indexing clusters in our CBBN model.

The k-modes algorithm, like many clustering algorithms, requires the user to specify the number of clusters k. Furthermore, the user also has to specify the indexing threshold α. Running Algorithm A several times with different values of k and α leads to a sequence of clustering schemes with more granular and less separated clusters. The question of which clustering scheme is preferable is not trivial. Many data mining approaches implicitly require the user to alternate between running the algorithm, modifying the parameters (i.e., k and α), and choosing the results that seem best. In our work, we have determined an acceptable range of k for each data set. More specifically, k can take integer values between k_min = 2 and k_max, the maximum number of clusters such that each cluster still has enough objects to learn a BN classifier. The threshold α lies in the range (0, 1]. We alternately vary k and α within these ranges and select the values that give the best classification accuracy. Optimizing the values of k and α for the best possible results is an issue we plan to consider in future work.

B. Classification Accuracy

In order to compare CBBN classifiers with RBMN classifiers, we have considered their average improvement in accuracy and the winning count over all data sets. Our hypothesis is that the flexible partitioning of knowledge employed by CBBNs and their more sophisticated indexing scheme allow a better organization and representation of knowledge, hence leading to more accurate classification. Table II shows the classification accuracy of CBBN and RBMN classifiers for the BAN* structure, the most accurate classifier we have. Similar results have been obtained for the other structures.
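Returning to the parameter selection described in Section V-A, the alternating search over k and α can be sketched as a simple loop. The helper names (run_algorithm_A, build_cbbn, evaluate) and the particular α grid are hypothetical stand-ins added here for illustration, not the authors' code:

    # Hypothetical sketch of the k / alpha selection loop described in Sec. V-A.
    # run_algorithm_A, build_cbbn and evaluate are assumed helpers; accuracy is
    # measured on held-out data.

    def select_k_alpha(data, heldout, k_max, alphas=(0.3, 0.5, 0.7, 0.9)):
        best = (None, None, -1.0)                  # (k, alpha, accuracy)
        for k in range(2, k_max + 1):              # k_min = 2 per the paper
            for alpha in alphas:                   # alpha lies in (0, 1]
                clusters, indices = run_algorithm_A(data, k, alpha)
                model = build_cbbn(clusters, indices)
                acc = evaluate(model, heldout)
                if acc > best[2]:
                    best = (k, alpha, acc)
        return best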
These results reveal that CBBN classifiers almost always show a considerable improvement over RBMN classifiers, and show similar performance on a few data sets. This observation is confirmed in Table III, which reports the average improvement in accuracy and the winning count of CBBN classifiers over RBMN classifiers. We also notice that in all data sets the number of clusters k constructed by our approach is smaller than the number of leaves L in the DT. This confirms that our indexing scheme can handle the replication and fragmentation problems. For example, on the led24 data set, our CBBN approach used 2 clusters to build a classifier with 94.6% accuracy, while the RBMN approach built a classifier using 9 leaves with only 73.6% accuracy.

C. Computational Complexity

Let N be the number of data objects in D, L̂ the number of non-leaf nodes in the DT, t the number of iterations required for k-modes to converge (where t, k, n << N), and r the maximum number of possible values of an attribute. The time cost of constructing an RBMN classifier is the time required to induce the DT plus the total time required to learn local BNs at the leaves. Similarly, the time cost of a CBBN classifier is the sum of the time spent in the clustering and indexing step and the total time to learn local BNs from the clusters. For RBMNs, the computational complexity of C4.5 is O(nNL̂). By contrast, in CBBNs the clustering time taken by the k-modes algorithm is O(tknN) and the time required for indexing is O(rnNk²). Obviously, a comparison between the time required for DT induction and that needed for clustering and indexing in CBBNs depends on the values of L̂, k, and r. The time to learn a BN from a data set is O(rⁿn²N). This learning process is repeated L times in RBMNs and only k times in CBBNs. However, learning from a cluster is likely to be more expensive than learning from a leaf, since a cluster on average contains more data objects than a leaf. Table II also shows the construction time, in CPU seconds, for both CBBN and RBMN classifiers. A quick inspection of the table reveals that our approach is more expensive than the RBMN approach. We argue that the clustering and indexing phase in our approach consumes more time than DT induction in RBMNs, and this difference is not fully compensated by the extra time RBMNs spend on repeated BN learning.
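Collecting the terms above, the total construction costs can be summarized as follows, where N_leaf and N_cluster denote the average number of objects per leaf and per cluster; this averaging shorthand is introduced here for readability and is not the paper's own notation:

  T_RBMN ≈ O(nNL̂) + L · O(rⁿn²N_leaf)
  T_CBBN ≈ O(tknN) + O(rnNk²) + k · O(rⁿn²N_cluster)

With L > k but N_cluster > N_leaf on average, neither approach dominates the other in construction time a priori, which is consistent with the empirical observation above.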

TABLE II
ACCURACY AND CONSTRUCTION TIME FOR CBBN AND RBMN CLASSIFIERS FOR THE BAN* STRUCTURE
[Columns: no., name, n, class, train, test; RBMN: L, acc., time; CBBN: k, α, acc., time. The twenty-five data sets are: australian, breast, car, chess, cleve, crx, diabetes, DNA, flare, german, glass, heart, led24, liver, letter, mofn, nursery, pima, satimage, segment, shuttle-small, soybean-large, vehicle, vote, and waveform. The numeric entries are not preserved in this copy.]

TABLE III
CBBN CLASSIFIERS VS. RBMN CLASSIFIERS
[Rows: % win and % imp.; columns: naive, TAN, BAN, BAN*, GBN, GBN*. The numeric entries are not preserved in this copy.]

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have compared RBMN classifiers to our CBBN classifiers. We have shown that RBMNs rely on a restricted partitioning of domain knowledge using a DT induction algorithm, which may not yield the best partitioning of that knowledge. By contrast, our CBBN approach uses a flexible clustering methodology to discover the best way to partition the data and to handle outliers; knowledge can thereby be better represented and more accurate classifiers can be constructed. Experimental results have shown that our CBBN classifiers considerably outperform RBMN classifiers across different structures.

One concern about our approach is how to choose the clustering algorithm, and its associated parameters, that best fits the data in a particular database. In this paper, we chose the k-modes algorithm for clustering, with k and α alternately adjusted by the user for better results. However, this choice does not guarantee the best possible results. We believe that by selecting the clustering algorithm that best fits the data and by optimizing the values of k and α, we can obtain even better classification accuracy. One advantage of CBBNs that sets them apart from BMNs and RBMNs is that their partitioning approach is not restricted to well-separated partitions. In particular, we plan to extend the definition of CBBNs to overlapping clusters, since we believe that this organization of knowledge can help improve the performance of CBBNs on at least some data sets.

REFERENCES

[1] J. Cheng, D. Bell, and W. Liu, "Learning Belief Networks from Data: An Information Theory Based Approach," in Proc. of the Sixth ACM International Conference on Information and Knowledge Management.
[2] J. Cheng and R. Greiner, "Comparing Bayesian Network Classifiers," in Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
[3] J. Cheng and R. Greiner, "Learning Bayesian Belief Network Classifiers: Algorithms and Systems," in Proc. of the Fourteenth Canadian Conference on Artificial Intelligence.
[4] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Information Theory, vol. 14.
[5] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian Network Classifiers," Machine Learning, vol. 29.
[6] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets," Data Mining and Knowledge Discovery, vol. 2, no. 3.
[7] W. Lam and F. Bacchus, "Learning Bayesian Belief Networks: An Approach Based on the MDL Principle," Computational Intelligence, vol. 10, no. 4.
[8] J. M. Pena, J. A. Lozano, and P. Larranaga, "Learning Recursive Bayesian Multinets for Data Clustering by Means of Constructive Induction," Machine Learning, vol. 47, no. 1.
[9] E. Santos Jr. and A. Hussein, "Case-Based Bayesian Network Classifiers," to appear in Proc. of the Seventeenth International FLAIRS Conference, 2004.


Neural Network Application Design. Supervised Function Approximation. Supervised Function Approximation. Supervised Function Approximation Supervised Function Approximation There is a tradeoff between a network s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate). This problem is similar

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. Title A fuzzy k-modes algorithm for clustering categorical data Author(s) Huang, Z; Ng, MKP Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. 446-452 Issued Date 1999 URL http://hdl.handle.net/10722/42992

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Forward Feature Selection Using Residual Mutual Information

Forward Feature Selection Using Residual Mutual Information Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics

More information

Discretizing Continuous Attributes Using Information Theory

Discretizing Continuous Attributes Using Information Theory Discretizing Continuous Attributes Using Information Theory Chang-Hwan Lee Department of Information and Communications, DongGuk University, Seoul, Korea 100-715 chlee@dgu.ac.kr Abstract. Many classification

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

CS Machine Learning

CS Machine Learning CS 60050 Machine Learning Decision Tree Classifier Slides taken from course materials of Tan, Steinbach, Kumar 10 10 Illustrating Classification Task Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K

More information

3 Virtual attribute subsetting

3 Virtual attribute subsetting 3 Virtual attribute subsetting Portions of this chapter were previously presented at the 19 th Australian Joint Conference on Artificial Intelligence (Horton et al., 2006). Virtual attribute subsetting

More information

Closed Non-Derivable Itemsets

Closed Non-Derivable Itemsets Closed Non-Derivable Itemsets Juho Muhonen and Hannu Toivonen Helsinki Institute for Information Technology Basic Research Unit Department of Computer Science University of Helsinki Finland Abstract. Itemset

More information

Uplift Modeling with ROC: An SRL Case Study

Uplift Modeling with ROC: An SRL Case Study Appears in the Proc. of International Conference on Inductive Logic Programming (ILP 13), Rio de Janeiro, Brazil, 2013. Uplift Modeling with ROC: An SRL Case Study Houssam Nassif, Finn Kuusisto, Elizabeth

More information

Ordering attributes for missing values prediction and data classification

Ordering attributes for missing values prediction and data classification Ordering attributes for missing values prediction and data classification E. R. Hruschka Jr., N. F. F. Ebecken COPPE /Federal University of Rio de Janeiro, Brazil. Abstract This work shows the application

More information

MetaData for Database Mining

MetaData for Database Mining MetaData for Database Mining John Cleary, Geoffrey Holmes, Sally Jo Cunningham, and Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand. Abstract: At present, a machine

More information