Dataset Editing Techniques: A Comparative Study

Nidal Zeidat, Sujing Wang, and Christoph F. Eick
Department of Computer Science, University of Houston, Houston, Texas, USA
{nzeidat, sujingwa, ceick}@cs.uh.edu

Abstract. Editing techniques remove examples from datasets with the goal of obtaining more accurate and faster classifiers. The objective of this paper is to compare several popular dataset editing techniques, including Wilson editing, Citation editing, and Multi-edit, with respect to classification accuracy and training set compression rate. Moreover, supervised clustering editing is introduced, which replaces the examples belonging to a cluster by a cluster representative. Furthermore, we explore the benefits of replacing datasets by the support vectors commonly used in Support Vector Machines (SVMs). We also discuss the results of experiments that compare and analyze the relationships between the investigated editing techniques on a benchmark consisting of UCI and artificial 2D spatial datasets. Our empirical evaluation shows that editing techniques, in general, improve the classification accuracy of a 1-NN classifier significantly, leading to more efficient and accurate classifiers for most of the datasets tested. The experimental results show strong performance for Wilson, Citation, and supervised clustering editing, and poor performance for Multi-edit and SVM editing, with respect to classification accuracy. Furthermore, the training set compression rates obtained by supervised clustering editing were superior to those of all other editing techniques investigated.

1 Introduction

The Nearest Neighbor (NN) rule continues to be one of the more popular non-parametric classification techniques. However, it also has some drawbacks. First, for large datasets of high dimensionality, the required similarity computations are quite time consuming. Second, if the original training dataset contains erroneously labeled examples, classification accuracy decreases considerably. Condensing and editing are two techniques that have been proposed to address these problems [2]. Condensing aims at reducing classifier training time without degrading classification accuracy, by preserving the decision boundaries induced by the original dataset. Editing, on the other hand, seeks to remove noise examples from the original dataset with the goal of improving classification accuracy by producing smoother decision boundaries. Surprisingly, however, the benefits of editing techniques have not been systematically analyzed in the literature. This fact was the main motivation for the research described in this paper. Moreover, as a by-product, a methodology for evaluating and comparing different editing techniques is introduced. As our experimental results will show, editing techniques are very successful in enhancing the accuracy of classifiers.

This paper is organized as follows. Section 2 introduces the algorithms we investigated in more detail. In Section 3, we discuss the experimental results and compare all the editing algorithms. Section 4 concludes the paper.

2 Algorithms Investigated

In the following paragraphs, we describe the editing techniques investigated in this study.

2.1 Wilson Editing

Wilson editing [12] removes from a dataset all examples that are misclassified by the NN rule. Wilson editing cleans interclass overlap regions, thereby leading to smoother boundaries between classes. Pseudo code for the Wilson editing technique is presented in Figure 1.

PREPROCESSING
A: For each example o_i in the dataset O:
   1: Find the K nearest neighbors of o_i in O (excluding o_i).
   2: Label o_i with the class associated with the largest number of examples among the K nearest neighbors (breaking ties randomly).
B: Edit dataset O by deleting all examples that were misclassified in step A.2.
CLASSIFICATION RULE: Classify a new example q using the K-NN rule with the edited subset O_r.

Figure 1: Pseudo code for the Wilson editing algorithm

2.2 Multi-edit

Devijver and Kittler [3] proposed the Multi-edit technique, which repeatedly applies Wilson editing to N random subsets of the original dataset until no more examples are removed. Pseudo code for the Multi-edit algorithm is given in Figure 2. Notice that if we set N (the number of subsets) to 1, Multi-edit becomes Wilson editing.

A: DIFFUSION: Divide the dataset O into N >= 3 random subsets S_1, ..., S_N.
B: CLASSIFICATION: Classify S_i using the K-NN rule with S_((i+1) mod N) as the training set (i = 1, ..., N).
C: EDITING: Discard all incorrectly classified examples.
D: CONFUSION: Replace O by the subset O_r consisting of the union of all remaining examples in S_1, ..., S_N.
E: TERMINATION: If the last iteration produced no editing, terminate; otherwise go to step A.

Figure 2: Pseudo code for the Multi-edit algorithm

2.3 Citation Editing

Citation editing borrows an idea from library and information science, citation indexing, introduced by Eugene Garfield [6]. If a paper cites a published article, the paper is obviously related to that article; similarly, if a paper is cited by an article, it is also related to that article. Thus both the citers and the references of a paper are considered related to it. Applied to dataset editing, this means we should consider not only the nearest neighbors of an example but also the examples that count it among their nearest neighbors. Consequently, in Citation editing, an example is removed from the dataset if its class label does not match the class label of the majority of a group consisting of its K nearest neighbors and C nearest citers. Pseudo code for the Citation editing algorithm is given in Figure 3.

A: For each example o_i in dataset O:
   1: Find the K nearest neighbors of o_i in O (excluding o_i).
   2: Find the C nearest citers of o_i in O, i.e., the examples that count o_i among their K nearest neighbors.
   3: Classify o_i with the class of the majority of the group consisting of the K nearest neighbors and C nearest citers of o_i.
B: Discard from O the examples o_i that were misclassified in step A.3, obtaining O_r.

Figure 3: Pseudo code for the Citation editing algorithm
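To make the preceding descriptions concrete, the sketch below implements the Wilson rule of Figure 1 and the Citation rule of Figure 3 in NumPy, using the Manhattan distance and the K = C = 1 settings of Section 3.2. This is a minimal illustration, not the authors' code: all function names are ours, class labels are assumed to be integer-coded, and ties are broken by the smallest label rather than randomly. Both editors return the indices of the surviving examples, which is convenient for the agreement measure of Section 3.3.1. Multi-edit (Figure 2) can be obtained by applying the Wilson rule to N random subsets in a loop until no example is removed.

```python
import numpy as np
from scipy.spatial.distance import cdist

def _neighbor_index(X, k):
    """Indices of the k nearest neighbors of every example
    (Manhattan distance, the example itself excluded)."""
    d = cdist(X, X, metric="cityblock")
    np.fill_diagonal(d, np.inf)               # an example never neighbors itself
    return d, np.argsort(d, axis=1)[:, :k]

def wilson_edit(X, y, k=1):
    """Figure 1: drop every example misclassified by the k-NN rule."""
    _, nn = _neighbor_index(X, k)
    # majority label among the k neighbors (ties -> smallest label)
    pred = np.array([np.bincount(y[row]).argmax() for row in nn])
    return np.where(pred == y)[0]             # indices of the retained examples

def citation_edit(X, y, k=1, c=1):
    """Figure 3: vote over the k nearest neighbors plus the c nearest citers."""
    d, nn = _neighbor_index(X, k)
    keep = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        # citers of i: examples that count i among their own k nearest neighbors
        citers = np.where((nn == i).any(axis=1))[0]
        citers = citers[np.argsort(d[i, citers])][:c]   # the c closest citers
        group = np.concatenate([nn[i], citers])
        keep[i] = np.bincount(y[group]).argmax() == y[i]
    return np.where(keep)[0]
```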

2.4 Supervised Clustering (SC) Editing

Supervised clustering [4] deviates from traditional clustering in that it is applied to classified examples, with the objective of identifying clusters of high probability density with respect to a single class. In supervised clustering editing [5], a supervised clustering algorithm is used to cluster a dataset O. Then O is replaced by the subset O_r consisting of the cluster representatives selected by the supervised clustering algorithm, as described in Figure 4.

PREPROCESSING
A: Apply a supervised clustering algorithm to dataset O to produce a set of clusters, each having a single representative.
B: Edit dataset O by deleting all non-representative examples, producing the subset O_r.

Figure 4: Supervised clustering editing algorithm
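Figure 4 is straightforward once a supervised clustering routine is available. The sketch below is ours and deliberately generic: any function returning representative indices can be plugged in. As a placeholder we use a naive one-medoid-per-class routine; it is far weaker than the SRIDHCR algorithm actually used in the experiments (sketched in Section 3.2), but it makes the editing step runnable.

```python
import numpy as np
from scipy.spatial.distance import cdist

def one_medoid_per_class(X, y):
    """Naive stand-in for a supervised clustering algorithm: keep, for each
    class, the example minimizing the summed Manhattan distance to its
    classmates. SRIDHCR (Section 3.2) searches far more freely."""
    reps = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        cost = cdist(X[idx], X[idx], metric="cityblock").sum(axis=1)
        reps.append(idx[cost.argmin()])
    return np.array(reps)

def supervised_clustering_edit(X, y, cluster_fn=one_medoid_per_class):
    """Figure 4: replace O by the representatives O_r selected by a
    supervised clustering algorithm."""
    return cluster_fn(X, y)                   # indices of the retained examples
```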

3 Experimental Results

In this section, the performance of the editing techniques is analyzed with respect to classification accuracy and training set compression rate on a benchmark consisting of 11 UCI datasets [10] as well as a set of 2D synthetic datasets. Moreover, the similarity among the investigated editing techniques is assessed. We also analyze how differently the editing techniques cope with artificial noise of different forms and degrees, using synthetic 2D spatial datasets.

3.1 Datasets Used in the Experiments

We used a total of 14 different datasets in our experiments (see Table 1). The first 11 datasets were obtained from the UCI repository [10]. The last three, named Complex9, Complex8, and 9Diamonds, are two-dimensional spatial datasets whose examples are distributed in many different shapes. These three 2D datasets were obtained from the authors of [9] and appear to be similar to proprietary datasets used in [7].

Table 1: Datasets used in the benchmark, with columns for the number of examples, attributes, and classes (the numeric values were lost in this transcription): Iris Plants, Glass, Pima Indians Diabetes, Waveform, Ionosphere, Heart-H, Image Segmentation, Vehicle Silhouettes, Vote, Vowel, Heart-StatLog, Complex9, Complex8, and 9Diamonds.

The Complex9 dataset was used to analyze how well editing techniques can cope with noise. To do so, we created six versions of the Complex9 dataset by adding noise examples of different amounts and types. The first three, Complex9_RN8, Complex9_RN16, and Complex9_RN32, were created by adding 8%, 16%, and 32% random noise examples to the Complex9 dataset. The two attribute values of each noise example were randomly generated, and the class label was randomly assigned based on the prior probabilities of the nine classes. Figure 5 depicts the contents of Complex9_RN16, the Complex9 dataset after 16% noise was injected; the original dataset is represented by dots, while the stars represent the generated random noise.

Figure 5: Complex9 dataset with 16% random noise

In addition to analyzing the effects of random noise, we created three more datasets based on Complex9 by adding Gaussian noise of different intensities. We first pick t% of the examples from each class in the original dataset. Then, for each selected example, we keep its class label but modify its attribute values by adding two zero-mean Gaussian random values, thereby creating a new example. We chose t to be 8, 16, and 32 for the three versions of the dataset, called Complex9_GN8, Complex9_GN16, and Complex9_GN32, respectively. All these 2D datasets are available at the website [11].
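The two noise-injection procedures can be reproduced as follows; this is our reading of the text above, with names of our choosing. The standard deviation of the Gaussian perturbation is not stated in the paper, so sigma below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_random_noise(X, y, fraction):
    """Complex9_RN*: append fraction*|O| noise examples whose attribute
    values are drawn uniformly from the attribute ranges and whose labels
    are drawn from the class priors."""
    m = int(fraction * len(X))
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_noise = rng.uniform(lo, hi, size=(m, X.shape[1]))
    classes, counts = np.unique(y, return_counts=True)
    y_noise = rng.choice(classes, size=m, p=counts / counts.sum())
    return np.vstack([X, X_noise]), np.concatenate([y, y_noise])

def add_gaussian_noise(X, y, fraction, sigma=1.0):
    """Complex9_GN*: copy a fraction of the examples of each class, perturb
    their attributes with zero-mean Gaussian noise (sigma is our
    assumption), and keep the original class labels."""
    X_parts, y_parts = [X], [y]
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        pick = rng.choice(idx, size=int(fraction * len(idx)), replace=False)
        X_parts.append(X[pick] + rng.normal(0.0, sigma, size=(len(pick), X.shape[1])))
        y_parts.append(y[pick])
    return np.vstack(X_parts), np.concatenate(y_parts)
```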

3.2 Parameters Used in Each Technique

In our experiments, Wilson editing was run with K equal to 1. Multi-edit was run using three subsets (N = 3). Citation editing was run with K equal to 1 and C equal to 1. For supervised clustering editing, we used a greedy hill-climbing algorithm with randomized restarts called SRIDHCR [4]. SRIDHCR starts by randomly selecting an initial set of k representatives, and the dataset examples are clustered around these representatives. The algorithm then tries to improve the quality of the clustering by repeatedly adding a single non-representative example to the set of representatives, or by removing a single representative. The algorithm terminates when the solution quality, measured by q(X) as given below, no longer improves:

q(X) = Impurity(X) + β * Penalty(k)    (1)

where

Impurity(X) = (# of minority examples) / n

Penalty(k) = sqrt((k − c) / n) if k >= c, and 0 if k < c,

with n being the total number of examples and c the number of classes in the dataset; X represents a clustering solution consisting of all clusters. The parameter β (0 < β <= 2.0) determines the penalty associated with the number of clusters k in a clustering: higher values of β imply larger penalties for a higher number of clusters.
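Equation (1) and the SRIDHCR search it drives can be sketched as below. This is a simplified, unoptimized reading of the description above: [4] is not quoted here on the initial number of representatives or the scan order, so starting from c representatives and accepting the first improving add/remove move are our choices, and labels are assumed integer-coded.

```python
import numpy as np
from scipy.spatial.distance import cdist

def q(X, y, reps, beta=1.0):
    """Equation (1): impurity of the clustering induced by assigning every
    example to its closest representative, plus the cluster-count penalty."""
    assign = cdist(X, X[reps], metric="cityblock").argmin(axis=1)
    majority = sum(np.bincount(y[assign == j]).max() for j in np.unique(assign))
    impurity = 1.0 - majority / len(y)         # fraction of minority examples
    k, c, n = len(reps), len(np.unique(y)), len(y)
    penalty = np.sqrt((k - c) / n) if k >= c else 0.0
    return impurity + beta * penalty

def sridhcr(X, y, beta=1.0, restarts=10, seed=0):
    """Greedy hill climbing with randomized restarts: repeatedly add a single
    non-representative or remove a single representative while q improves."""
    rng = np.random.default_rng(seed)
    n, c = len(y), len(np.unique(y))
    best_reps, best_q = None, np.inf
    for _ in range(restarts):
        reps = list(rng.choice(n, size=c, replace=False))  # initial size: our choice
        cur_q, improved = q(X, y, reps, beta), True
        while improved:
            improved = False
            moves = [reps + [i] for i in range(n) if i not in reps]
            moves += [[r for r in reps if r != i] for i in reps if len(reps) > 1]
            for cand in moves:                 # first-improvement scan
                cand_q = q(X, y, cand, beta)
                if cand_q < cur_q:
                    reps, cur_q, improved = cand, cand_q, True
                    break
        if cur_q < best_q:
            best_reps, best_q = reps, cur_q
    return np.array(best_reps)
```

Since every accepted move strictly decreases q, each restart terminates; the scan evaluates O(n) candidate clusterings per move and is meant for illustration, not speed.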

3.3 Experimental Results

We used the Manhattan distance to compute the distance between two examples. Similarity, classification accuracy, and training set compression rates were determined using class-stratified 5-fold cross-validation. All experiments were repeated three times, each time reshuffling all examples in the original dataset; the values reported in the tables are averages over the three runs. All detailed results are available at the website [11]. The following subsections discuss the experimental results.

3.3.1 Similarity Between the Editing Techniques

Let R1 and R2 be two dataset editing techniques, and assume that, when applied to dataset O, they produce edited training subsets O_R1 and O_R2. We define the agreement between the two editing algorithms R1 and R2 as follows:

Agreement(R1, R2) = |O_R1 ∩ O_R2| / |O_R1 ∪ O_R2|    (2)

where O_R1 ∪ O_R2 is the union of all examples in the subsets O_R1 and O_R2, while O_R1 ∩ O_R2 is the intersection of the examples in the two subsets. An agreement of 1 means that techniques R1 and R2 produce identical edited training sets; an agreement of 0 means that they produce subsets with no examples in common. Table 2 reports the agreements between the four editing algorithms for the Iris Plants and Glass datasets. The results obtained on the other UCI datasets showed the same trend and have therefore not been included.

As expected, the experimental results show high similarities between Wilson, Citation, and Multi-edit. These three techniques are based on the same principle: remove from the training set the examples misclassified by a K-NN classifier.

Table 2: Similarities among the editing techniques for (a) the Iris Plants dataset and (b) the Glass dataset, giving pairwise agreements between Wilson, Multi-edit, Citation editing, and supervised clustering. Most numeric entries were lost in this transcription; the surviving values are the Wilson/Multi-edit agreements of 97.2 for Iris Plants and 51.4 for Glass.

Notice that the similarity is higher between Citation and Wilson. Supervised clustering editing, on the other hand, is very dissimilar to the other three techniques. Unlike Wilson, Citation, and Multi-edit, supervised clustering editing removes both correctly classified and misclassified examples; consequently, its similarities to all other techniques are very low. As can be seen in the table, the similarities among Wilson, Citation, and Multi-edit decrease considerably for the Glass dataset, while their similarities to supervised clustering increase. We believe the reason for this observation is that Glass is a more difficult dataset (see Table 3), so a higher percentage of examples is removed by Wilson, Citation, and Multi-edit due to a larger percentage of misclassifications for the Glass dataset (58, 17, and 72 examples, respectively) compared to the Iris Plants dataset (9, 5, and 9 examples, respectively).
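Over one training fold, Equation (2) is simply the Jaccard coefficient of the two sets of retained example indices; a one-function sketch (naming ours), directly usable with the index-returning editing sketches of Section 2:

```python
def agreement(kept_r1, kept_r2):
    """Equation (2): |O_R1 ∩ O_R2| / |O_R1 ∪ O_R2| over retained indices."""
    s1, s2 = set(kept_r1), set(kept_r2)
    return len(s1 & s2) / len(s1 | s2)

# e.g.: agreement(wilson_edit(X, y), citation_edit(X, y))
```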

3.3.2 Classification Accuracy of Editing Techniques

We also computed the classification accuracy of a traditional 1-NN classifier to evaluate the editing techniques. The experimental results are shown in Tables 3 and 4. Based on the results reported in Table 3, we see that on 6 of the 11 UCI datasets (Iris, Waveform, Diabetes, Heart-H, Vote, and Heart-StatLog) traditional 1-NN classification was outperformed by nearly all of the editing techniques. In Table 4 we observe that on 7 of the 9 2D datasets (including Complex9_RN8/16/32 and Complex9_GN8/16/32) traditional 1-NN classification was outperformed by three of the editing techniques: Wilson, Multi-edit, and Citation editing.

Table 3: Classification accuracy for the UCI datasets, comparing 1-NN, Wilson editing, Multi-edit, Citation editing, and SC editing on IRIS, Glass, Diabetes, Waveform, Heart-H, Ionosphere, Segmentation, Vehicle, Vote, Vowel, and Heart-StatLog (the numeric values were lost in this transcription).

Table 4: Classification accuracy for the 2D datasets, comparing 1-NN, Wilson editing, Multi-edit, Citation editing, and SC editing on Complex9, Complex9_RN8/16/32, Complex9_GN8/16/32, Complex8, and 9Diamonds (the numeric values were lost in this transcription).

Moreover, as the noise injected into the dataset increases, the benefit of editing increases as well, as can be seen in Table 4. For example, Wilson editing achieves a 2% relative improvement over the 1-NN accuracy of 92.1% for Complex9 with 8% noise (Complex9_RN8), compared to an 11% improvement in classification accuracy when the noise rises to 32% (Complex9_RN32).

Comparing the editing techniques among themselves, Table 3 shows that on 5 of the 11 UCI datasets supervised clustering editing (SCE) outperformed all other editing techniques; nevertheless, SCE performed quite badly on the Vowel dataset. Wilson editing and Citation editing showed very strong performance as well. The results in Table 3 also show that Multi-edit performed quite poorly on the UCI datasets, with the exception of the Heart-StatLog dataset. The results in Table 4 show that Multi-edit performs much better on the 2D datasets, especially as noise increases. We suspect this observation can be attributed to the fact that the 2D datasets are very dense, with a large number of examples of a single class occupying a particular region. It is therefore less likely that an example is erroneously removed because it gets separated from nearest neighbors that share its class label during the diffusion step of Multi-edit.

3.3.3 Training Set Compression Rate (TSCR)

The training set compression rate measures the percentage by which an editing technique reduces the training set. In general, if an editing technique reduces the size of a training set from n examples down to r examples, the training set compression rate is:

Training Set Compression Rate = 1 − r/n    (3)

Tables 5 and 6 report the training set compression rates for the UCI and 2D datasets, respectively.
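The evaluation protocol behind Tables 3 through 6 — edit each training fold, train a 1-NN classifier with Manhattan distance on the edited set, and record accuracy together with the compression rate of Equation (3) — can be sketched with scikit-learn as follows. This is our reconstruction of the setup described in Section 3.3; the paper additionally repeats the whole procedure three times with reshuffled data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_editing(X, y, edit_fn, folds=5, seed=0):
    """Class-stratified cross-validation of one editing technique; edit_fn
    returns the indices of the retained training examples."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accuracies, rates = [], []
    for train, test in skf.split(X, y):
        kept = edit_fn(X[train], y[train])
        clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
        clf.fit(X[train][kept], y[train][kept])
        accuracies.append(clf.score(X[test], y[test]))
        rates.append(1.0 - len(kept) / len(train))   # Equation (3): 1 - r/n
    return float(np.mean(accuracies)), float(np.mean(rates))
```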

Inspecting the results reported in Table 5, we clearly see that the highest training set compression rates are achieved by supervised clustering editing for all datasets. Coupled with the good relative performance of supervised clustering editing with respect to classification accuracy reported in Section 3.3.2, this improves its overall standing considerably. For example, supervised clustering editing reduced the Vote dataset, which contains 348 examples, to a dataset containing only 10 representative examples on average, whereas Wilson, Citation, and Multi-edit reduced the Vote dataset to 319, 338, and 298 examples, respectively. When applying traditional 1-NN classification to a training set containing only the 10 cluster representatives selected by supervised clustering, we achieved better classification accuracy than when applying traditional 1-NN classification to the edited datasets produced by the other three editing techniques.

Table 5: TSCR for the UCI datasets (SC, Wilson, Multi-edit, and Citation editing on IRIS, Glass, Diabetes, Waveform, Heart-H, Ionosphere, Vote, Vehicle, Segmentation, Vowel, and Heart-StatLog; the numeric values were lost in this transcription).

Table 6: TSCR for the 2D spatial datasets (SC, Wilson, Multi-edit, and Citation editing on Complex9, Complex9_RN8/16/32, Complex9_GN8/16/32, Complex8, and 9Diamonds; the numeric values were lost in this transcription).

As mentioned earlier, both Wilson and Citation editing reduce the size of a dataset by removing examples that have been misclassified by a K-NN classifier. Consequently, their training set compression rates are quite low on easy classification tasks for which high prediction accuracies are normally achieved. For example, Wilson editing produces reduction rates of only 0.7%, 2.3%, and 6.0% for the Vowel, Segmentation, and Iris datasets, respectively, and removes even fewer examples from the 2D datasets. Similarly, Wilson and Citation editing produced training set compression rates close to 0% for the noise-free 2D datasets, namely Complex9, Complex8, and 9Diamonds, whereas supervised clustering editing achieved compression rates of more than 98% on those datasets.

3.4 Overall Performance Analysis

Table 7 reports the average classification accuracy and training set compression rates of the investigated editing techniques over the examined datasets. We can see that Wilson editing, Citation editing, and supervised clustering editing perform better than traditional 1-NN classification on both the UCI and the 2D datasets, whereas Multi-edit performs worse than 1-NN classification.

Table 7: Average classification accuracy and training set compression rate of 1-NN, Wilson, Multi-edit, Citation, and SC editing, reported separately for (a) the UCI datasets and (b) the 2D datasets (the numeric values were lost in this transcription).

3.5 Visualization of Different Editing Techniques

This section visualizes the results of the different editing techniques applied to the Complex9_RN32 dataset, providing a better understanding of which examples each technique removes. Figure 6 presents a visualization of the similarity between Citation and Wilson editing. Examples deleted by Citation editing are represented by big (black) dots, examples deleted by Wilson editing are marked by (green) triangles, and the original Complex9_RN32 examples are depicted as small dark (blue) dots. It can be seen that most of the examples deleted by the Wilson and Citation techniques are located between clusters, i.e., in the boundary areas.

Figure 6: Visualization of Citation, Wilson, and SC editing for the Complex9_RN32 dataset (legend: original dataset; deleted by Citation editing; deleted by Wilson editing; selected by SC editing)

Figure 6 also shows, as (red) squares, the cluster representatives selected (to stay) by the supervised clustering algorithm (the clusters themselves are not shown). Notice that these examples do not lie in the boundary regions but in the clusters themselves.
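A plot in the spirit of Figure 6 can be produced directly from the index sets returned by the editing sketches above; the marker choices follow the figure legend, everything else is ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_editing(X, kept_wilson, kept_citation, sc_reps):
    """Rough analogue of Figure 6 for a 2D dataset."""
    deleted_w = np.setdiff1d(np.arange(len(X)), kept_wilson)
    deleted_c = np.setdiff1d(np.arange(len(X)), kept_citation)
    plt.scatter(X[:, 0], X[:, 1], s=4, c="blue", label="original dataset")
    plt.scatter(X[deleted_c, 0], X[deleted_c, 1], s=25, c="black",
                label="deleted by Citation editing")
    plt.scatter(X[deleted_w, 0], X[deleted_w, 1], s=25, c="green", marker="^",
                label="deleted by Wilson editing")
    plt.scatter(X[sc_reps, 0], X[sc_reps, 1], s=45, c="red", marker="s",
                label="selected by SC editing")
    plt.legend()
    plt.show()
```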

Figure 7 shows the clustering of the Complex9 dataset (without noise) obtained by the representative-based SC algorithm SRIDHCR [4]; the selected representatives are marked by squares with an x in the middle. Due to the nature of supervised clustering (see Section 3.2 and [4]), it selects as cluster representatives examples that minimize the number of misclassified examples. To accomplish this objective, some selected representatives are located in the middle of their cluster (the lower-left rectangular clusters in Figure 7), while others are located on the edge of their cluster (the lower-right two clusters that make up the U-shaped pair in Figure 7).

Figure 7: Cluster representatives selected by supervised clustering for Complex9 (legend: SC representatives)

Moreover, representatives of neighboring clusters that belong to different classes are usually lined up opposite each other with proper spacing between them, to minimize wrongly clustered examples; this can be seen in Figure 7 for the two representatives of the two rectangular clusters, and again for the question-mark-shaped cluster and the U-shaped one to its left. This placement makes it less likely that a representative attracts examples belonging to the neighboring cluster. In summary, the representative-based clustering algorithm picked only 21 of the 3031 examples in the training set. Using these 21 examples as cluster representatives, as depicted in Figure 7, the class purity (see Formula 1) was 99.4%, due to the presence of a few minority examples in the right-most cluster and in the two interleaved U-shaped clusters near the bottom center.

In general, representative-based clustering algorithms create clusters by assigning points to the closest representative. A challenge these algorithms face is that they are limited to discovering clusters of convex shape. Therefore, in order for these algorithms to discover clusters of non-convex shape, such clusters have to be approximated by a union of neighboring clusters that are dominated by the same class. An example of such a case is the large circular cluster in Figure 7, which was approximated by 5 neighboring clusters without any loss in purity.
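The purity figure quoted above is the complement of the impurity term in Formula (1); restated as a standalone helper (naming ours) for a given set of representatives:

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_purity(X, y, reps):
    """Fraction of examples whose class matches the majority class of the
    cluster they join when assigned to the closest representative."""
    assign = cdist(X, X[reps], metric="cityblock").argmin(axis=1)
    majority = sum(np.bincount(y[assign == j]).max() for j in np.unique(assign))
    return majority / len(y)
```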

As part of this research, we also investigated a potential dataset editing technique based on support vector machines (SVMs) [8]. Practical implementations of SVM classifiers produce a solution in the form of a discriminant function, which is then used to classify unknown examples. The training examples that participate in the definition of the discriminant function are called support vectors. The technique we call SVM editing edits a dataset by keeping the examples identified by an SVM classifier as support vectors and removing all other examples. Our implementation was based on LIBSVM 2.8 [1]. The experiments were run using the RBF kernel function and the C-SVC SVM type. Values for the parameters C (the cost of constraint violations) and γ (the RBF kernel parameter) were chosen using an optimization tool that LIBSVM 2.8 offers.

Unfortunately, the SVM-based editing technique showed poor performance compared to all other editing techniques. Experimental results showed that SVM-based editing produced a 1-NN classifier with an average classification accuracy of 76.25%, compared to 81.81%, 82.66%, 72.80%, 82.73%, and 82.67% for the traditional 1-NN classifier, Wilson editing, Multi-edit, Citation editing, and supervised clustering editing, respectively. As can clearly be seen from these results, with the exception of the Multi-edit technique, the average performance of SVM editing is not only worse than that of all other editing techniques but also worse than that of the traditional 1-NN classifier.

Figure 8: Visualization of the supervised clustering and SVM editing techniques for Complex9_RN32 (legend: original dataset; selected by SVM; selected by SC)

Figure 8 shows, as (yellow) triangles, the examples selected as support vectors by the SVM editing technique for the Complex9_RN32 dataset. As explained in more detail in [8], support vectors are the examples that participate in defining the margin around the hyperplane that separates the different classes in the higher-dimensional feature space. It is natural that, when such examples are mapped back to the input space, most of them lie in the boundary areas between the classes, as can easily be seen in Figure 8. To be more precise, our experiments show that 59% of the support vectors in Figure 8 (909 of 1546) are noise examples; furthermore, 94% of the added noise examples (909 of 967) were selected as support vectors. SVM editing thus behaves more like a condensing technique than an editing technique, as it preserves the characteristics of the boundary areas between classes. This explains its low classification accuracy compared to the other techniques. Detailed experimental results for SVM editing are also available at [11].
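A sketch of SVM editing follows. The paper's implementation uses LIBSVM 2.8 and its parameter-search tool; below, scikit-learn's SVC (which wraps LIBSVM) and a small, arbitrarily chosen grid for C and gamma stand in for that setup.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def svm_edit(X, y):
    """Keep only the support vectors of an RBF-kernel C-SVC as the edited set."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}  # grid is ours
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
    search.fit(X, y)
    return search.best_estimator_.support_    # indices of the retained examples
```

Returning indices keeps this interchangeable with the other editing sketches, so the same evaluate_editing harness of Section 3.3.3 applies.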

4 Summary

We investigated several dataset editing algorithms, including Wilson, Citation, and Multi-edit. The performance of these algorithms was studied and compared to each other as well as to supervised clustering editing, which replaces the examples belonging to a cluster by a cluster representative. Experimental results show that Wilson and Citation editing are quite similar with respect to the examples they remove, and that supervised clustering editing has very little similarity to the other techniques. With respect to classification accuracy, the experiments showed strong performance for supervised clustering editing as well as for the Wilson and Citation editing techniques. Moreover, all three techniques, in particular Citation and Wilson editing, seem to perform particularly well on datasets that contain a lot of noise. Multi-edit and SVM editing, on the other hand, performed quite poorly with respect to classification accuracy. Our experimental results on training set compression rates show that the strong performance of supervised clustering editing with respect to classification accuracy is further complemented by high training set compression rates. Wilson and Citation editing, on the other hand, achieve relatively low training set compression rates, especially on datasets that are easy to classify.

In summary, our experimental results demonstrate the benefits of editing techniques for enhancing the accuracy and performance of nearest neighbor classifiers. We also believe that exploring the potential of editing techniques for enhancing other classification techniques, such as decision trees and support vector machines, deserves more attention in future research.

References

1. Chang, C.-C. and Lin, C.-J.: LIBSVM -- A Library for Support Vector Machines. Version 2.8, April.
2. Dasarathy, B.V., Sánchez, J.S., and Townsend, S.: Nearest Neighbor Editing and Condensing Tools -- Synergy Exploitation. Pattern Analysis & Applications, Vol. 3, No. 1.
3. Devijver, P. and Kittler, J.: On the Edited Nearest Neighbor Rule. Pattern Recognition (IEEE), Vol. 1, 1980.
4. Eick, C.F., Zeidat, N., and Zhao, Z.: Supervised Clustering -- Algorithms and Benefits. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Boca Raton, Florida, November 2004.
5. Eick, C.F., Zeidat, N., and Vilalta, R.: Using Representative-Based Clustering for Nearest Neighbor Dataset Editing. ICDM 2004.
6. Garfield, E.: Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. John Wiley & Sons, New York.
7. Karypis, G., Han, E.H., and Kumar, V.: Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer, Vol. 32, No. 8, pp. 68-75, August.
8. Runarsson, T.P. and Sigurdsson, S.: Support Vector Machines. January; online document.
9. Salvador, S. and Chan, P.: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. ICTAI 2004.
10. University of California at Irvine, Machine Learning Repository.
11. University of Houston, Machine Learning and Data Mining Group, cs.uh.edu/~sujingwa/pkdd05/
12. Wilson, D.L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 2, 1972.


More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

Dynamic Ensemble Construction via Heuristic Optimization

Dynamic Ensemble Construction via Heuristic Optimization Dynamic Ensemble Construction via Heuristic Optimization Şenay Yaşar Sağlam and W. Nick Street Department of Management Sciences The University of Iowa Abstract Classifier ensembles, in which multiple

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition

A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition Dinesh Mandalapu, Sridhar Murali Krishna HP Laboratories India HPL-2007-109 July

More information

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016 edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract

More information

Multiple Classifier Fusion using k-nearest Localized Templates

Multiple Classifier Fusion using k-nearest Localized Templates Multiple Classifier Fusion using k-nearest Localized Templates Jun-Ki Min and Sung-Bae Cho Department of Computer Science, Yonsei University Biometrics Engineering Research Center 134 Shinchon-dong, Sudaemoon-ku,

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

The Data Mining Application Based on WEKA: Geographical Original of Music

The Data Mining Application Based on WEKA: Geographical Original of Music Management Science and Engineering Vol. 10, No. 4, 2016, pp. 36-46 DOI:10.3968/8997 ISSN 1913-0341 [Print] ISSN 1913-035X [Online] www.cscanada.net www.cscanada.org The Data Mining Application Based on

More information