Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers


Comparing Case-Based Bayesian Network and Recursive Bayesian Multi-net Classifiers

Eugene Santos, Dept. of Computer Science and Engineering, University of Connecticut, Storrs, CT 06268, USA
Ahmed Hussein, Dept. of Computer Science and Engineering, University of Connecticut, Storrs, CT 06268, USA

Abstract -- Recent work in Bayesian classifiers has shown that a better and more flexible representation of domain knowledge results in more accurate classifiers. We have recently examined a new type of Bayesian classifier called the Case-Based Bayesian Network (CBBN) classifier. The basic idea is to partition the training data into semantically sound clusters. A local BN classifier is then learned independently from each cluster. Such a flexible organization of domain knowledge can represent dependency assertions among attributes more accurately and more relevantly than is possible in traditional Bayesian classifiers (i.e., BN and BMN classifiers), hence improving classification accuracy. Recursive Bayesian Multi-nets (RBMNs) also provide a more flexible representation scheme than BNs and generalize BMNs. Briefly, an RBMN is a Decision Tree (DT) with component BNs at the leaves. In this paper, we further explore our CBBN classifiers by comparing them to RBMN classifiers. RBMNs partition the data using a DT induction algorithm. By contrast, CBBNs rely on a flexible clustering strategy that handles outliers, thereby allowing more freedom to search for the best way to cluster the data and represent the knowledge. Our experimental results show that CBBN classifiers perform significantly better than RBMN classifiers.

Keywords: Bayesian Classifiers, Data Clustering, Learning Bayesian Networks, Knowledge Representation, Uncertainty.

I. INTRODUCTION

Recent work in Bayesian classifiers has shown that better and more flexible representations of the dependency assertions among attributes can significantly improve classification accuracy. For Bayesian Network (BN) classifiers [2], [5], the basic idea is to exploit learning algorithms such as the MDL score algorithm [7] and the information gain test algorithm [1] to learn more expressive relationships between attributes from the training data. For example, the Tree Augmented naive Bayes (TAN) classifier relaxes the structure of the naive Bayes classifier by approximating the interactions between the attributes using a tree-like structure. In addition, the Bayesian network Augmented naive Bayes (BAN) classifier extends TAN by allowing the attributes to form an arbitrary graph rather than only a tree. Another type of BN classifier that permits an even more flexible structure is the General Bayesian Network (GBN) classifier, which treats the classification node as an ordinary node and identifies a relevant attribute subset around it determined by its Markov blanket. These classifiers have shown a considerable improvement in classification accuracy over naive Bayes [2], [5]. Bayesian Multi-Net (BMN) classifiers [3], [5] are a generalized form of the augmented naive structures (i.e., TAN and BAN) in the sense that they allow different relationships among attributes for different values of the class variable C. A BMN classifier consists of the prior probability distribution of the class variable C and a set of local networks, each corresponding to a value of C.
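Since a BMN classifier pairs the class prior with one local network per class value, its decision rule can be sketched in a few lines. The following is a minimal, hypothetical sketch; the nets dictionary and its likelihood method are illustrative assumptions, not an API from the paper:

    # Minimal sketch of Bayesian multi-net classification (hypothetical API).
    # `prior[c]` is P(C = c); `nets[c]` is the local network learned for class c,
    # assumed to expose likelihood(x) = P(a_1, ..., a_n | B_c).

    def bmn_classify(x, prior, nets):
        """Return the class value c maximizing P(c) * P(x | B_c)."""
        best_c, best_score = None, float("-inf")
        for c, net in nets.items():
            score = prior[c] * net.likelihood(x)
            if score > best_score:
                best_c, best_score = c, score
        return best_c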
Although the structures of BMN classifiers are strictly more expressive than those of BN classifiers (i.e., TAN and BAN), experiments have shown that BMN classifiers perform about as well as BN classifiers and that neither approach clearly dominates [3], [5]. Recursive Bayesian Multi-Nets (RBMNs) [8] are a knowledge representation scheme that partitions the domain knowledge according to a Decision Tree (DT) induction algorithm, with component BNs at the leaves. Thus, they constitute a more flexible tool than BNs and provide structured, specialized domain knowledge, since alternative component BNs are learned for each decision path¹. Moreover, RBMNs generalize the idea behind BMNs by allowing decision paths described by conjunctions of many attribute-value pairs, rather than only the class attribute-value pair. The only constraint is that these decision paths must be representable by a DT. In recent work [9], we introduced Case-Based Bayesian Networks (CBBNs), a Bayesian knowledge representation scheme that allows an even better representation of the dependency assertions among attributes than is possible in BNs and BMNs. In particular, CBBNs can capture and encode case-dependent relationships, which are a generalization of the hypothesis-specific (i.e., class-specific) relationships defined in BMNs. The basic idea behind CBBNs is to intelligently partition the training data into semantically sound clusters of data. Each cluster is described by an assignment of its most important and descriptive attributes, called an index. A local BN can then be learned independently from the data in each cluster, conditioned on its index.

¹ A decision path in a DT is a set of edges from the root to a leaf, described by a set of attribute-value pairs; see Fig. 1.

Fig. 1. An example of the replication problem.
Fig. 2. An example of case-dependent relationships.

In order to validate our CBBN approach, we have compared CBBN classifiers to BN classifiers and BMN classifiers. Our empirical results have shown that CBBN classifiers considerably improve classification accuracy over BN and BMN classifiers. In this paper, we further explore our CBBN knowledge representation approach by comparing CBBN classifiers to RBMN classifiers. Our motivations behind this comparison are as follows. First, while the RBMN partitioning of knowledge is restricted to a DT induction algorithm, which may not fit the inherent (i.e., best) partitioning of the knowledge, our CBBNs relax this restriction by permitting any appropriate clustering methodology that discovers the best way to partition the data; knowledge can thereby be better represented and more accurate classifiers can be constructed. Second, the indices used by CBBNs provide a solution to some potential representational shortcomings of DTs, namely the replication and fragmentation problems described in the next section.

II. REPRESENTATIONAL PROBLEMS OF DTs

The replication problem occurs in a DT when two identical subtrees must be represented. An example is depicted in Fig. 1, where the subtree containing attributes A_3 and A_4 is duplicated. Another aspect of the replication problem, called fragmentation, can occur when the data contains attributes with too many possible values. In this case, a DT quickly fragments the data into many subsets. The effect of these problems is that extra, needless partitions of the data are constructed. This means that a large amount of data is required to build the tree. Furthermore, we may end up with small partitions in which the data is not statistically sufficient to learn a local BN. For example, to build an RBMN using the DT in Fig. 1, we have to learn seven local BNs, one for each decision path. We will show that the indexing scheme used in CBBNs can avoid these problems, since it handles "don't cares" in partition descriptions more robustly than decision paths do.

III. CASE-BASED BAYESIAN NETWORKS

Suppose that a domain is given by a database described by a set of categorical attributes. We use a clustering algorithm to discover meaningful patterns represented by different clusters of data. Each cluster is then characterized and discriminated from the other clusters by a unique assignment of its most relevant and descriptive attributes, called an index. Intuitively, each cluster represents a piece of the domain knowledge described by the context of its index. These clusters can also be viewed as a set of conditionally independent cases, with each case mapped to an index that describes the context of the knowledge relevant to that case. Because of the independent nature of the cases, the knowledge associated with each case can be represented separately by a BN. This representation of the independent cases implies that the relationships among the corresponding attributes might be different for different cases. Thus, instead of assuming fixed relationships between attributes for the whole domain, as in traditional BNs, these relationships can vary according to the context of each case in the same domain. This conclusion is crucial, since it means that two variables X and Y might be directly dependent (X → Y) in case C_i and independent in case C_j. Moreover, X → Y might hold in case C_i while X ← Y holds in case C_j.
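As an illustrative sketch (not taken from the paper's implementation), such case-dependent structure can be stored simply as one directed edge set per case, so the same pair of variables may be linked in one direction, in the opposite direction, or not at all, depending on the case; the third case C_k below is a hypothetical addition for illustration:

    # Illustrative only: one edge set per case captures case-dependent structure.
    case_structures = {
        "C_i": {("X", "Y")},   # X -> Y holds in case C_i
        "C_j": set(),          # X and Y are independent in case C_j
        "C_k": {("Y", "X")},   # the arc is reversed in a hypothetical case C_k
    }

    def parents(case, node):
        """Return the parents of `node` in the local network of `case`."""
        return {src for (src, dst) in case_structures[case] if dst == node}

    # Example: parents("C_i", "Y") == {"X"}, while parents("C_j", "Y") == set()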
Even if the relationships in different cases are the same, the parameters that represent the strength of these relationships might differ. As an example, consider a database of customers applying for a loan at a bank. Such a database might have two different clusters (i.e., cases), where each cluster represents a group of customers. The first group includes customers who have a good balance in their checking and saving accounts. The decision to grant a loan to these customers might not be influenced by whether a customer has a guarantor, owns property, or is a citizen, but it might be highly affected by his residency time and somewhat by his credit history. The second group might include customers who do not have a sufficient balance in their checking account and who have a poor credit history. For this group the situation is different, since the bank's decision will be highly affected by whether they own property, whether they have a guarantor, and whether they are citizens, as well as by their residency time. The bank's decision and its requirements may be considered as domain variables in a BN with case-dependent relationships among them (see Fig. 2). Another example is cyclic knowledge. In a database of patients suffering from diabetes, a doctor can distinguish between two groups of patients: those who have just started taking a specific medication and find that it causes an improvement, with the glucose level starting to decrease, and those who have been taking the medication for a while and, encouraged by the improvement, increase the dose of that medication or even take an additional one. In this example, the relation between medication level and health improvement is not purely unidirectional; it is case-dependent and cyclic.

Obviously, CBBNs subsume BMNs in that they exploit the inherent clusters in the data to partition the domain knowledge instead of relying on a simple partitioning restricted to the classes, which is useful only when the relationships among attributes are very different for different classes. CBBNs also generalize RBMNs, which partition the data using a DT induction algorithm. Such algorithms are restricted to statistical tests (e.g., information gain) for selecting the divisive attributes, and the resulting partitioning may not fit the inherent partitions of the data. An advantage of the partitioning methodology used in CBBNs over those used in BMNs and RBMNs is that it can handle outliers (noise) in the data. We argue that such a flexible partitioning of knowledge can be more expressive than the simple or restricted partitioning employed by BMNs and RBMNs, since it allows for a more natural clustering of the underlying data. Consequently, it can reveal more relevant and more accurate relationships among attributes.

IV. CBBN CLASSIFIERS

In this section, we describe algorithms to learn and test CBBN classifiers.

A. Formal Definitions

Suppose D is a data set described by a set of categorical attributes V = {A_1, A_2, ..., A_n}. Let R(A_i) be the domain of an attribute A_i (i.e., the set of all possible values/states of A_i).

Definition 1: A case C_j is a vector [c_{j,1}, c_{j,2}, ..., c_{j,n}] where c_{j,i} ∈ R(A_i) ∪ {x}, and x is a unique symbol, called "don't care", that is not in the range of any attribute.

Definition 2: Given two cases C_1 = [c_{1,1}, c_{1,2}, ..., c_{1,n}] and C_2 = [c_{2,1}, c_{2,2}, ..., c_{2,n}], C_1 and C_2 are said to be compatible iff, for every i = 1, 2, ..., n, at least one of the following conditions holds: (1) c_{1,i} = c_{2,i}, (2) c_{1,i} = x, or (3) c_{2,i} = x. Otherwise, C_1 and C_2 are said to be mutually exclusive.

Definition 3: A data object [d_1, d_2, ..., d_n] is said to be covered by a case C_j iff, for every i = 1, 2, ..., n, either (1) d_i = c_{j,i} or (2) c_{j,i} = x.

Definition 4: The index of a case C_j, denoted I_j, is the set of pairs (A_i, c_{j,i}) such that (A_i, c_{j,i}) ∈ I_j iff c_{j,i} ≠ x (i.e., I_j = {(A_i, c_{j,i}) : c_{j,i} ≠ x}).

B. Learning CBBN Classifiers From Data

The learning process is accomplished in the following two steps.

1. Clustering and indexing: a clustering algorithm is used to partition D into a set of clusters Q = {Q_1, Q_2, ..., Q_k} characterized by a set of mutually exclusive cases C = {C_1, C_2, ..., C_k}, respectively. Each case C_j is mapped to an index I_j ∈ I = {I_1, I_2, ..., I_k} as defined above. In order to generate such an indexing scheme, Algorithm A, shown below, begins by initializing C with don't-care values for all elements of each vector C_j. For a particular cluster Q_j, the algorithm computes the probability distribution of each attribute A_i (i.e., the frequencies of its possible values estimated from the data in this cluster).
The algorithm then determines the value of each attribute that has the maximum frequency and assigns this value to A_i in C_j (i.e., to c_{j,i}) if its frequency exceeds an indexing threshold α. The resulting case is then used as a description of the objects in Q_j; thus, the algorithm moves all objects that are not covered by C_j from Q_j to the outliers cluster Q_o. This procedure is repeated for each cluster. The algorithm then visits the outliers cluster to check whether any of its objects are covered by the cases describing the other clusters. Such objects are retrieved from the outliers cluster and placed in a cluster if they are compatible with (covered by) the cluster's description case. In order to achieve mutual exclusion between the above cases, Algorithm B checks each pair of cases for the mutual-exclusion condition (at least one common attribute assigned differently). If a pair does not satisfy this condition, the algorithm searches for an attribute whose value is don't-care in both cases and that can be assigned differently in the two cases such that a minimum number of objects is rejected from the two clusters by the new cases. The algorithm then updates the members of all clusters, including the outliers cluster, according to the new mutually exclusive cases. Finally, to produce the index of each case (cluster), the algorithm simply discards any don't-care attributes in each case.

Algorithm A: Clustering and Indexing
Input:
  D: data set described by categorical attributes A_1, ..., A_n, C
  k: number of clusters
  α: indexing threshold
Output:
  Q: set of k clusters Q_1, Q_2, ..., Q_k
  I: set of indices I_1, I_2, ..., I_k
  Q_o: possible outliers cluster
Notation:
  R(A_i): the domain of attribute A_i
  a_{j,i}: the value a_i that maximizes P(A_i = a_i | Q_j)
  P_{j,i}: P(A_i = a_{j,i} | Q_j)
  c_{j,i}: the value of attribute A_i in C_j, including x
Begin
  Call a clustering algorithm on D to form the set Q
  For each cluster Q_j
    Initialize C_j with don't-care values
    For each attribute A_i
      Compute P(A_i = a_i | Q_j) for all a_i ∈ R(A_i)
      Find a_{j,i} and P_{j,i}
      If (P_{j,i} > α) then set c_{j,i} = a_{j,i}
    Move the objects of Q_j not covered by C_j to Q_o
  For each cluster Q_j
    Move from Q_o the objects covered by C_j back to Q_j
  Call Algorithm B to make the vectors in C mutually exclusive
  For each cluster Q_j, using its updated C_j
    Move the objects of Q_j not covered by C_j to Q_o
  For each cluster Q_j, using its updated C_j
    Move from Q_o the objects covered by C_j back to Q_j
End

Algorithm B: Check and Fix
Input:
  Q: a set of k data clusters
  C: a set of k n-dimensional vectors (cases)
Output:
  I: a set of indices
Notation:
  c_{i,t}: the value of attribute A_t in C_i
  a_{i,t}: the value a_t that maximizes P(A_t = a_t | Q_i)
  u_i: number of objects in Q_i not covered by C_i
  u_j: number of objects in Q_j not covered by C_j
Begin
  For i = 1 to k - 1
    For j = i + 1 to k
      If (C_i and C_j are not mutually exclusive) then
        For each attribute A_t (t = 1, 2, ..., n)
          If (c_{i,t} = c_{j,t} = x) then
            Find a_{i,t} and a_{j,t}
            If (a_{i,t} ≠ a_{j,t}) then
              Set c_{i,t} = a_{i,t} and c_{j,t} = a_{j,t}
              Find u_i and u_j
              Compute s_t = u_i + u_j
              Restore the original state of C_i and C_j
        Find the attribute A_p that minimizes s_t
        Set c_{i,p} = a_{i,p} and c_{j,p} = a_{j,p}
  For each updated C_j
    Discard any don't-care attributes to form I_j
End

2. Learning: we apply a BN learning algorithm to learn a local BN classifier B_i, where i ∈ {1, 2, ..., k}, from the data objects in each indexed cluster produced by Algorithms A and B. This local classifier is defined over a subset V_i ⊆ V: if V(I_i) is the set of attributes appearing in I_i, then V_i = V \ V(I_i). We also learn a BN classifier B_o, defined over the whole set V, from the outliers cluster. The set of local classifiers together with the indices constitutes a CBBN classifier.

C. Testing CBBN Classifiers

We test the newly learned CBBN classification model on a given test data set T. Basically, we map each test object (a_1, a_2, ..., a_n) in T to an index in I by comparing their attribute assignments. We then compute P(C | a_1, a_2, ..., a_n) from the local classifier characterized by that index and assign to C the value that maximizes this probability. Because of the mutual-exclusion property of our indexing scheme, an object can map to at most one local classifier B_i. If an object cannot be mapped to any index in I, we map it to B_o as the default classifier. Finally, we compute the accuracy by comparing the predicted values of C to its true values in T.

TABLE I
INDICES GENERATED BY CBBNs

  cluster   index
  Q_1       {(A_3,1), (A_4,1)}
  Q_2       {(A_3,1), (A_4,0)}
  Q_3       {(A_1,1), (A_3,0)}
  Q_o       default

D. CBBN Classifiers vs. RBMN Classifiers

As mentioned above, RBMN classifiers rely on a DT induction algorithm to partition the training data set into subsets, each characterized by a decision path in the DT. A BN classifier is then learned at each leaf of the tree, without considering the attributes involved in the tests on the decision path leading to that leaf. Our CBBN classifiers generalize RBMN classifiers in three respects:

1) They permit more flexibility in partitioning the data by using any appropriate clustering algorithm instead of being restricted to a DT induction algorithm. Algorithms A and B presented above work independently of the clustering algorithm used to partition the data. This allows us to choose a clustering algorithm appropriate to the underlying data set. Moreover, we can optimize the input parameters of a specific clustering algorithm to obtain a clustering scheme that best partitions and represents the domain knowledge.
2) The partitioning approach used for CBBN classifiers can handle outliers in a data set. We argue that this ability plays an important role in improving classifier performance, since such outliers are likely to be incorrectly classified by RBMN classifiers.

3) The indexing approach used in CBBNs is a generalization of the decision paths used in RBMNs, since it allows a more robust handling of don't cares in contexts where certain attributes are not descriptors of a given cluster. This helps avoid the potential replication and fragmentation problems of DTs. For example, an alternative to the partitioning shown in Fig. 1 that might be suggested by our approach is the clustering scheme shown in Table I. This scheme defines only four clusters instead of the seven leaves of the DT. By using don't cares in the clusters' descriptive cases, partitions P_2 and P_5 of the DT are included in cluster Q_1, and partitions P_3 and P_6 are included in cluster Q_2.
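To make the mapping step of Section IV-C concrete, the following minimal sketch dispatches a test object to at most one local classifier and falls back to B_o when no index covers it. The helper names (indices, classifiers, posterior) are hypothetical, assuming indices stored as attribute-value dictionaries and local classifiers exposing a posterior over class values; this is a sketch, not the authors' implementation:

    # Minimal sketch of CBBN classification (Section IV-C), hypothetical API.
    # `indices` maps a cluster id to its index: a dict {attribute: value}
    # (don't-care attributes are simply absent).  `classifiers[cid]` and
    # `default_clf` are assumed to expose posterior(obj) -> {class_value: prob}.

    def covered_by(obj, index):
        """A test object is covered by an index iff it matches every indexed pair."""
        return all(obj.get(attr) == val for attr, val in index.items())

    def cbbn_classify(obj, indices, classifiers, default_clf):
        for cid, index in indices.items():
            if covered_by(obj, index):            # mutual exclusion: at most one hit
                posterior = classifiers[cid].posterior(obj)
                return max(posterior, key=posterior.get)
        posterior = default_clf.posterior(obj)    # fall back to B_o
        return max(posterior, key=posterior.get)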

V. EMPIRICAL EVALUATION

A. Experiment Settings

We have learned classifiers of different structures (i.e., naive, TAN, BAN, BAN*, GBN, and GBN*) from a set of twenty-five benchmark databases. These classifiers have been built using both the RBMN approach and our CBBN approach. Moreover, the structures of the local classifiers have been learned using different learning algorithms. In particular, we used the MDL score algorithm to learn BAN and GBN, and the CBL2 algorithm [1] to learn BAN* and GBN*. For the TAN classifier, we used the Chow and Liu algorithm [4] to learn a tree-like structure. The data sets were obtained from the UCI machine learning repository. In all data sets, objects with missing attribute values have been removed and numerical attributes have been categorized. When comparing a CBBN classifier to an RBMN classifier, we learned corresponding types of structure from exactly the same training set and tested them on exactly the same test set.

For DT induction, we used the well-known induction algorithm C4.5, with the divisive attribute at each node selected according to the sum of information gains over all attributes, rather than the gain with respect to the class attribute alone as in supervised learning. For data clustering in CBBNs, we used the k-modes algorithm [6], which extends the popular k-means algorithm to categorical domains. The biggest advantage of this algorithm is that it scales to very large data sets in terms of both the number of records and the number of clusters. Another advantage of the k-modes algorithm is that the modes provide characteristic descriptions of the clusters. These descriptions are important for indexing clusters in our CBBN model.

The k-modes algorithm, like many clustering algorithms, requires the user to specify the number of clusters k. Furthermore, the user also has to specify the indexing threshold α. Running Algorithm A several times with different values of k and α leads to a sequence of clustering schemes with more granular and less separated clusters. The question of which clustering scheme is preferable is not trivial. Many data mining approaches implicitly require the user to alternate between running the algorithm, modifying the parameters (i.e., k and α), and choosing the results that seem best. In our work, we have determined an acceptable range of k for each data set. More specifically, k can take integer values between k_min = 2 and k_max, the maximum number of clusters such that each cluster still has enough objects to learn a BN classifier. The threshold α lies in the range (0, 1]. We alternately vary k and α within these ranges and select the values that give the best classification accuracy. Optimizing the values of k and α for the best possible results is an issue we plan to consider in future work.

B. Classification Accuracy

In order to compare CBBN classifiers with RBMN classifiers, we have considered their average improvement in accuracy and the winning count over all data sets. Our hypothesis is that the flexible partitioning of knowledge employed by CBBNs and their more sophisticated indexing scheme allow a better organization and representation of knowledge, hence leading to more accurate classification. Table II shows the classification accuracy of CBBN and RBMN classifiers for the BAN* structure, the most accurate classifier we have. Similar results have been obtained for the other structures.
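Returning to the parameter selection described in Section V-A, the alternating search over k and α can be sketched as a simple loop. The helper names (run_algorithm_A, build_cbbn, evaluate) and the particular α grid are hypothetical stand-ins added here for illustration, not the authors' code:

    # Hypothetical sketch of the k / alpha selection loop described in Sec. V-A.
    # run_algorithm_A, build_cbbn and evaluate are assumed helpers; accuracy is
    # measured on held-out data.

    def select_k_alpha(data, heldout, k_max, alphas=(0.3, 0.5, 0.7, 0.9)):
        best = (None, None, -1.0)                  # (k, alpha, accuracy)
        for k in range(2, k_max + 1):              # k_min = 2 per the paper
            for alpha in alphas:                   # alpha lies in (0, 1]
                clusters, indices = run_algorithm_A(data, k, alpha)
                model = build_cbbn(clusters, indices)
                acc = evaluate(model, heldout)
                if acc > best[2]:
                    best = (k, alpha, acc)
        return best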
These results reveal that CBBN classifiers almost always show a considerable improvement over RBMN classifiers, and show similar performance on a few data sets. This observation is confirmed in Table III, which reports the average improvement in accuracy and the winning count of CBBN classifiers over RBMN classifiers. We also notice that in all data sets the number of clusters k constructed by our approach is smaller than the number of leaves L in the DT. This confirms that our indexing scheme can handle the replication and fragmentation problems. For example, on the led24 data set, our CBBN approach used 2 clusters to build a classifier with 94.6% accuracy, while the RBMN approach built a classifier using 9 leaves with only 73.6% accuracy.

C. Computational Complexity

Let N be the number of data objects in D, L̂ the number of non-leaf nodes in the DT, t the number of iterations required for k-modes to converge (where t, k, n << N), and r the maximum number of possible values of an attribute. The time cost of constructing an RBMN classifier is the time required to induce the DT plus the total time required to learn local BNs at the leaves. Similarly, the time cost of a CBBN classifier is the sum of the time spent in the clustering and indexing step and the total time to learn local BNs from the clusters. For RBMNs, the computational complexity of C4.5 is O(nNL̂). By contrast, in CBBNs the clustering time taken by the k-modes algorithm is O(tknN) and the time required for indexing is O(rnNk²). Obviously, a comparison between the time required for DT induction and that needed for clustering and indexing in CBBNs depends on the values of L̂, k, and r. The time to learn a BN from a data set is O(rⁿn²N). This learning process is repeated L times in RBMNs and only k times in CBBNs. However, learning from a cluster is likely to be more expensive than learning from a leaf, since a cluster on average contains more data objects than a leaf. Table II also shows the construction time, in CPU seconds, for both CBBN and RBMN classifiers. A quick inspection of the table reveals that our approach is more expensive than the RBMN approach. We argue that the clustering and indexing phase in our approach consumes more time than DT induction in RBMNs, and this difference is not fully compensated by the extra time RBMNs spend on repeated BN learning.
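Collecting the terms above, the total construction costs can be summarized as follows, where N_leaf and N_cluster denote the average number of objects per leaf and per cluster; this averaging shorthand is introduced here for readability and is not the paper's own notation:

  T_RBMN ≈ O(nNL̂) + L · O(rⁿn²N_leaf)
  T_CBBN ≈ O(tknN) + O(rnNk²) + k · O(rⁿn²N_cluster)

With L > k but N_cluster > N_leaf on average, neither approach dominates the other in construction time a priori, which is consistent with the empirical observation above.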

TABLE II
ACCURACY AND CONSTRUCTION TIME FOR CBBN AND RBMN CLASSIFIERS FOR THE BAN* STRUCTURE
[Columns: no., name, n, class, train, test; RBMN: L, acc., time; CBBN: k, α, acc., time. The twenty-five data sets are: australian, breast, car, chess, cleve, crx, diabetes, DNA, flare, german, glass, heart, led24, liver, letter, mofn, nursery, pima, satimage, segment, shuttle-small, soybean-large, vehicle, vote, and waveform. The numeric entries are not preserved in this copy.]

TABLE III
CBBN CLASSIFIERS VS. RBMN CLASSIFIERS
[Rows: % win and % imp.; columns: naive, TAN, BAN, BAN*, GBN, GBN*. The numeric entries are not preserved in this copy.]

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have compared RBMN classifiers to our CBBN classifiers. We have shown that RBMNs rely on a restricted partitioning of domain knowledge using a DT induction algorithm, which may not yield the best partitioning of that knowledge. By contrast, our CBBN approach uses a flexible clustering methodology to discover the best way to partition the data and to handle outliers; knowledge can thereby be better represented and more accurate classifiers can be constructed. Experimental results have shown that our CBBN classifiers considerably outperform RBMN classifiers across different structures.

One concern about our approach is how to choose the clustering algorithm, and its associated parameters, that best fits the data in a particular database. In this paper, we chose the k-modes algorithm for clustering, with k and α alternately adjusted by the user for better results. However, this choice does not guarantee the best possible results. We believe that by selecting the clustering algorithm that best fits the data and by optimizing the values of k and α, we can obtain even better classification accuracy. One advantage of CBBNs that sets them apart from BMNs and RBMNs is that their partitioning approach is not restricted to well-separated partitions. In particular, we plan to extend the definition of CBBNs to overlapping clusters, since we believe that this organization of knowledge can help improve the performance of CBBNs on at least some data sets.

REFERENCES

[1] J. Cheng, D. Bell, and W. Liu, "Learning Belief Networks from Data: An Information Theory Based Approach," in Proc. of the Sixth ACM International Conference on Information and Knowledge Management.
[2] J. Cheng and R. Greiner, "Comparing Bayesian Network Classifiers," in Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
[3] J. Cheng and R. Greiner, "Learning Bayesian Belief Network Classifiers: Algorithms and Systems," in Proc. of the Fourteenth Canadian Conference on Artificial Intelligence.
[4] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Information Theory, vol. 14.
[5] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian Network Classifiers," Machine Learning, vol. 29.
[6] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets," Data Mining and Knowledge Discovery, vol. 2, no. 3.
[7] W. Lam and F. Bacchus, "Learning Bayesian Belief Networks: An Approach Based on the MDL Principle," Computational Intelligence, vol. 10, no. 4.
[8] J. M. Pena, J. A. Lozano, and P. Larranaga, "Learning Recursive Bayesian Multinets for Data Clustering by Means of Constructive Induction," Machine Learning, vol. 47, no. 1.
[9] E. Santos Jr. and A. Hussein, "Case-Based Bayesian Network Classifiers," to appear in Proc. of the Seventeenth International FLAIRS Conference, 2004.


Neural Network Application Design. Supervised Function Approximation. Supervised Function Approximation. Supervised Function Approximation Supervised Function Approximation There is a tradeoff between a network s ability to precisely learn the given exemplars and its ability to generalize (i.e., inter- and extrapolate). This problem is similar

More information

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3 January 25, 2007 CSE-4412: Data Mining 1 Chapter 6 Classification and Prediction 1. What is classification? What is prediction?

More information

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. Title A fuzzy k-modes algorithm for clustering categorical data Author(s) Huang, Z; Ng, MKP Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. 446-452 Issued Date 1999 URL http://hdl.handle.net/10722/42992

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting

More information

Forward Feature Selection Using Residual Mutual Information

Forward Feature Selection Using Residual Mutual Information Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics

More information

Discretizing Continuous Attributes Using Information Theory

Discretizing Continuous Attributes Using Information Theory Discretizing Continuous Attributes Using Information Theory Chang-Hwan Lee Department of Information and Communications, DongGuk University, Seoul, Korea 100-715 chlee@dgu.ac.kr Abstract. Many classification

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

CS Machine Learning

CS Machine Learning CS 60050 Machine Learning Decision Tree Classifier Slides taken from course materials of Tan, Steinbach, Kumar 10 10 Illustrating Classification Task Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K

More information

3 Virtual attribute subsetting

3 Virtual attribute subsetting 3 Virtual attribute subsetting Portions of this chapter were previously presented at the 19 th Australian Joint Conference on Artificial Intelligence (Horton et al., 2006). Virtual attribute subsetting

More information

Closed Non-Derivable Itemsets

Closed Non-Derivable Itemsets Closed Non-Derivable Itemsets Juho Muhonen and Hannu Toivonen Helsinki Institute for Information Technology Basic Research Unit Department of Computer Science University of Helsinki Finland Abstract. Itemset

More information

Uplift Modeling with ROC: An SRL Case Study

Uplift Modeling with ROC: An SRL Case Study Appears in the Proc. of International Conference on Inductive Logic Programming (ILP 13), Rio de Janeiro, Brazil, 2013. Uplift Modeling with ROC: An SRL Case Study Houssam Nassif, Finn Kuusisto, Elizabeth

More information

Ordering attributes for missing values prediction and data classification

Ordering attributes for missing values prediction and data classification Ordering attributes for missing values prediction and data classification E. R. Hruschka Jr., N. F. F. Ebecken COPPE /Federal University of Rio de Janeiro, Brazil. Abstract This work shows the application

More information

MetaData for Database Mining

MetaData for Database Mining MetaData for Database Mining John Cleary, Geoffrey Holmes, Sally Jo Cunningham, and Ian H. Witten Department of Computer Science University of Waikato Hamilton, New Zealand. Abstract: At present, a machine

More information