Dataset Editing Techniques: A Comparative Study

Nidal Zeidat, Sujing Wang, and Christoph F. Eick
Department of Computer Science, University of Houston, Houston, Texas, USA
{nzeidat, sujingwa, ceick}@cs.uh.edu

Abstract. Editing techniques remove examples from datasets with the goal of obtaining more accurate and faster classifiers. The objective of this paper is to compare several popular dataset editing techniques, including Wilson editing, Citation editing, and Multi-edit, with respect to classification accuracy and training set compression rate. Moreover, supervised clustering editing is introduced, which replaces the examples belonging to a cluster by a cluster representative. Furthermore, we explore the benefits of replacing datasets by the support vectors commonly used in Support Vector Machines (SVMs). We also discuss the results of experiments that compare and analyze the relationships between the investigated editing techniques on a benchmark consisting of UCI and artificial 2D spatial datasets. Our empirical evaluation shows that editing techniques, in general, improve the classification accuracy of a 1-NN classifier significantly, leading to more efficient and accurate classifiers for most of the datasets tested. The experimental results show strong performance for Wilson, Citation, and supervised clustering editing, and poor performance for Multi-edit and SVM editing, with respect to classification accuracy. Furthermore, the training set compression rates obtained by supervised clustering editing were superior to those of all other editing techniques investigated.

1 Introduction

The Nearest Neighbor (NN) rule continues to be one of the more popular non-parametric classification techniques. However, it also has some drawbacks. First, for large datasets of high dimensionality, the required similarity computations are quite time consuming. Second, if the original training dataset contains erroneously labeled examples, classification accuracy decreases considerably. Condensing and editing are two techniques that have been proposed to address these problems [2]. Condensing aims at reducing classifier training time without degrading classification accuracy, by preserving the decision boundaries induced by the original dataset. Editing, on the other hand, seeks to remove noise examples from the original dataset with the goal of improving classification accuracy by producing smoother decision boundaries. Surprisingly, however, the benefits of editing techniques have not been systematically analyzed in the literature. This fact was the main motivation for the research described in this paper. Moreover, as a by-product, a methodology for evaluating and comparing different editing techniques is introduced. As our experimental results will show, editing techniques are very successful in enhancing the accuracy of classifiers.

This paper is organized as follows. Section 2 introduces the algorithms we investigated in more detail. In Section 3, we discuss the experimental results and compare all the editing algorithms. Section 4 concludes the paper.

2 Algorithms Investigated

In the following paragraphs, we describe the editing techniques investigated in this study.

2.1 Wilson Editing

Wilson editing [12] removes from a dataset all examples that are misclassified by the NN rule. Wilson editing cleans interclass overlap regions, thereby leading to smoother boundaries between classes. Pseudo code for the Wilson editing technique is presented in Figure 1.

PREPROCESSING
A: For each example o_i in the dataset O:
   1: Find the K nearest neighbors of o_i in O (excluding o_i).
   2: Label o_i with the class associated with the largest number of examples among the K nearest neighbors (breaking ties randomly).
B: Edit dataset O by deleting all examples that were misclassified in step A.2.
CLASSIFICATION RULE: Classify a new example q using the K-NN rule with the edited subset O_r.

Figure 1: Pseudo code for the Wilson editing algorithm

2.2 Multi-edit

Devijver and Kittler [3] proposed the Multi-edit technique, which repeatedly applies Wilson editing to N random subsets of the original dataset until no more examples are removed. Pseudo code for the Multi-edit algorithm is given in Figure 2. Notice that if we set N (the number of subsets) to 1, Multi-edit becomes Wilson editing.

A: DIFFUSION: Divide the dataset O into N >= 3 random subsets S_1, ..., S_N.
B: CLASSIFICATION: Classify S_i using the K-NN rule with S_((i+1) mod N) as the training set (i = 1, ..., N).
C: EDITING: Discard all incorrectly classified examples.
D: CONFUSION: Replace O by the subset O_r consisting of the union of all remaining examples in S_1, ..., S_N.
E: TERMINATION: If the last iteration produced no editing, terminate; otherwise go to step A.

Figure 2: Pseudo code for the Multi-edit algorithm

2.3 Citation Editing

Citation editing borrows an idea from library and information science, citation indexing, introduced by Eugene Garfield [6]. If a paper cites a published article, the paper is obviously related to that article; similarly, if a paper is cited by an article, it is also related to that article. Thus both the citers and the references of a paper are considered related to it. Applied to dataset editing, this means we should consider not only the nearest neighbors of an example but also the examples that count it among their nearest neighbors. Consequently, in Citation editing, an example is removed from the dataset if its class label does not match the class label of the majority of a group consisting of its K nearest neighbors and C nearest citers. Pseudo code for the Citation editing algorithm is given in Figure 3.

A: For each example o_i in dataset O:
   1: Find the K nearest neighbors of o_i in O (excluding o_i).
   2: Find the C nearest citers of o_i in O, i.e., the examples that count o_i among their K nearest neighbors.
   3: Classify o_i with the class of the majority of the group consisting of the K nearest neighbors and C nearest citers of o_i.
B: Discard from O the examples o_i that were misclassified in step A.3, obtaining O_r.

Figure 3: Pseudo code for the Citation editing algorithm
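To make the preceding descriptions concrete, the sketch below implements the Wilson rule of Figure 1 and the Citation rule of Figure 3 in NumPy, using the Manhattan distance and the K = C = 1 settings of Section 3.2. This is a minimal illustration, not the authors' code: all function names are ours, class labels are assumed to be integer-coded, and ties are broken by the smallest label rather than randomly. Both editors return the indices of the surviving examples, which is convenient for the agreement measure of Section 3.3.1. Multi-edit (Figure 2) can be obtained by applying the Wilson rule to N random subsets in a loop until no example is removed.

```python
import numpy as np
from scipy.spatial.distance import cdist

def _neighbor_index(X, k):
    """Indices of the k nearest neighbors of every example
    (Manhattan distance, the example itself excluded)."""
    d = cdist(X, X, metric="cityblock")
    np.fill_diagonal(d, np.inf)               # an example never neighbors itself
    return d, np.argsort(d, axis=1)[:, :k]

def wilson_edit(X, y, k=1):
    """Figure 1: drop every example misclassified by the k-NN rule."""
    _, nn = _neighbor_index(X, k)
    # majority label among the k neighbors (ties -> smallest label)
    pred = np.array([np.bincount(y[row]).argmax() for row in nn])
    return np.where(pred == y)[0]             # indices of the retained examples

def citation_edit(X, y, k=1, c=1):
    """Figure 3: vote over the k nearest neighbors plus the c nearest citers."""
    d, nn = _neighbor_index(X, k)
    keep = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        # citers of i: examples that count i among their own k nearest neighbors
        citers = np.where((nn == i).any(axis=1))[0]
        citers = citers[np.argsort(d[i, citers])][:c]   # the c closest citers
        group = np.concatenate([nn[i], citers])
        keep[i] = np.bincount(y[group]).argmax() == y[i]
    return np.where(keep)[0]
```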

2.4 Supervised Clustering (SC) Editing

Supervised clustering [4] deviates from traditional clustering in that it is applied to classified examples, with the objective of identifying clusters of high probability density with respect to a single class. In supervised clustering editing [5], a supervised clustering algorithm is used to cluster a dataset O. Then O is replaced by the subset O_r consisting of the cluster representatives selected by the supervised clustering algorithm, as described in Figure 4.

PREPROCESSING
A: Apply a supervised clustering algorithm to dataset O to produce a set of clusters, each having a single representative.
B: Edit dataset O by deleting all non-representative examples, producing the subset O_r.

Figure 4: Supervised clustering editing algorithm
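Figure 4 is straightforward once a supervised clustering routine is available. The sketch below is ours and deliberately generic: any function returning representative indices can be plugged in. As a placeholder we use a naive one-medoid-per-class routine; it is far weaker than the SRIDHCR algorithm actually used in the experiments (sketched in Section 3.2), but it makes the editing step runnable.

```python
import numpy as np
from scipy.spatial.distance import cdist

def one_medoid_per_class(X, y):
    """Naive stand-in for a supervised clustering algorithm: keep, for each
    class, the example minimizing the summed Manhattan distance to its
    classmates. SRIDHCR (Section 3.2) searches far more freely."""
    reps = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        cost = cdist(X[idx], X[idx], metric="cityblock").sum(axis=1)
        reps.append(idx[cost.argmin()])
    return np.array(reps)

def supervised_clustering_edit(X, y, cluster_fn=one_medoid_per_class):
    """Figure 4: replace O by the representatives O_r selected by a
    supervised clustering algorithm."""
    return cluster_fn(X, y)                   # indices of the retained examples
```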

3 Experimental Results

In this section, the performance of the editing techniques is analyzed with respect to classification accuracy and training set compression rate on a benchmark consisting of 11 UCI datasets [10] as well as a set of 2D synthetic datasets. Moreover, the similarity among the investigated editing techniques is assessed. We also analyze how differently the editing techniques cope with artificial noise of different forms and degrees, using synthetic 2D spatial datasets.

3.1 Datasets Used in the Experiments

We used a total of 14 different datasets in our experiments (see Table 1). The first 11 datasets were obtained from the UCI repository [10]. The last three, named Complex9, Complex8, and 9Diamonds, are two-dimensional spatial datasets whose examples are distributed in many different shapes. These three 2D datasets were obtained from the authors of [9] and appear to be similar to proprietary datasets used in [7].

Table 1: Datasets used in the benchmark, with columns for the number of examples, attributes, and classes (the numeric values were lost in this transcription): Iris Plants, Glass, Pima Indians Diabetes, Waveform, Ionosphere, Heart-H, Image Segmentation, Vehicle Silhouettes, Vote, Vowel, Heart-StatLog, Complex9, Complex8, and 9Diamonds.

The Complex9 dataset was used to analyze how well editing techniques can cope with noise. To do so, we created six versions of the Complex9 dataset by adding noise examples of different amounts and types. The first three, Complex9_RN8, Complex9_RN16, and Complex9_RN32, were created by adding 8%, 16%, and 32% random noise examples to the Complex9 dataset. The two attribute values of each noise example were randomly generated, and the class label was randomly assigned based on the prior probabilities of the nine classes. Figure 5 depicts the contents of Complex9_RN16, the Complex9 dataset after 16% noise was injected; the original dataset is represented by dots, while the stars represent the generated random noise.

Figure 5: Complex9 dataset with 16% random noise

In addition to analyzing the effects of random noise, we created three more datasets based on Complex9 by adding Gaussian noise of different intensities. We first pick t% of the examples from each class in the original dataset. Then, for each selected example, we keep its class label but modify its attribute values by adding two zero-mean Gaussian random values, thereby creating a new example. We chose t to be 8, 16, and 32 for the three versions of the dataset, called Complex9_GN8, Complex9_GN16, and Complex9_GN32, respectively. All these 2D datasets are available at the website [11].
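The two noise-injection procedures can be reproduced as follows; this is our reading of the text above, with names of our choosing. The standard deviation of the Gaussian perturbation is not stated in the paper, so sigma below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_random_noise(X, y, fraction):
    """Complex9_RN*: append fraction*|O| noise examples whose attribute
    values are drawn uniformly from the attribute ranges and whose labels
    are drawn from the class priors."""
    m = int(fraction * len(X))
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_noise = rng.uniform(lo, hi, size=(m, X.shape[1]))
    classes, counts = np.unique(y, return_counts=True)
    y_noise = rng.choice(classes, size=m, p=counts / counts.sum())
    return np.vstack([X, X_noise]), np.concatenate([y, y_noise])

def add_gaussian_noise(X, y, fraction, sigma=1.0):
    """Complex9_GN*: copy a fraction of the examples of each class, perturb
    their attributes with zero-mean Gaussian noise (sigma is our
    assumption), and keep the original class labels."""
    X_parts, y_parts = [X], [y]
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        pick = rng.choice(idx, size=int(fraction * len(idx)), replace=False)
        X_parts.append(X[pick] + rng.normal(0.0, sigma, size=(len(pick), X.shape[1])))
        y_parts.append(y[pick])
    return np.vstack(X_parts), np.concatenate(y_parts)
```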

3.2 Parameters Used in Each Technique

In our experiments, Wilson editing was run with K equal to 1. Multi-edit was run using three subsets (N = 3). Citation editing was run with K equal to 1 and C equal to 1. For supervised clustering editing, we used a greedy hill-climbing algorithm with randomized restarts called SRIDHCR [4]. SRIDHCR starts by randomly selecting an initial set of k representatives, and the dataset examples are clustered around these representatives. The algorithm then tries to improve the quality of the clustering by repeatedly adding a single non-representative example to the set of representatives, or by removing a single representative. The algorithm terminates when the solution quality, measured by q(X) as given below, no longer improves:

q(X) = Impurity(X) + β * Penalty(k)    (1)

where

Impurity(X) = (# of minority examples) / n

Penalty(k) = sqrt((k − c) / n) if k >= c, and 0 if k < c,

with n being the total number of examples and c the number of classes in the dataset; X represents a clustering solution consisting of all clusters. The parameter β (0 < β <= 2.0) determines the penalty associated with the number of clusters k in a clustering: higher values of β imply larger penalties for a higher number of clusters.
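Equation (1) and the SRIDHCR search it drives can be sketched as below. This is a simplified, unoptimized reading of the description above: [4] is not quoted here on the initial number of representatives or the scan order, so starting from c representatives and accepting the first improving add/remove move are our choices, and labels are assumed integer-coded.

```python
import numpy as np
from scipy.spatial.distance import cdist

def q(X, y, reps, beta=1.0):
    """Equation (1): impurity of the clustering induced by assigning every
    example to its closest representative, plus the cluster-count penalty."""
    assign = cdist(X, X[reps], metric="cityblock").argmin(axis=1)
    majority = sum(np.bincount(y[assign == j]).max() for j in np.unique(assign))
    impurity = 1.0 - majority / len(y)         # fraction of minority examples
    k, c, n = len(reps), len(np.unique(y)), len(y)
    penalty = np.sqrt((k - c) / n) if k >= c else 0.0
    return impurity + beta * penalty

def sridhcr(X, y, beta=1.0, restarts=10, seed=0):
    """Greedy hill climbing with randomized restarts: repeatedly add a single
    non-representative or remove a single representative while q improves."""
    rng = np.random.default_rng(seed)
    n, c = len(y), len(np.unique(y))
    best_reps, best_q = None, np.inf
    for _ in range(restarts):
        reps = list(rng.choice(n, size=c, replace=False))  # initial size: our choice
        cur_q, improved = q(X, y, reps, beta), True
        while improved:
            improved = False
            moves = [reps + [i] for i in range(n) if i not in reps]
            moves += [[r for r in reps if r != i] for i in reps if len(reps) > 1]
            for cand in moves:                 # first-improvement scan
                cand_q = q(X, y, cand, beta)
                if cand_q < cur_q:
                    reps, cur_q, improved = cand, cand_q, True
                    break
        if cur_q < best_q:
            best_reps, best_q = reps, cur_q
    return np.array(best_reps)
```

Since every accepted move strictly decreases q, each restart terminates; the scan evaluates O(n) candidate clusterings per move and is meant for illustration, not speed.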

3.3 Experimental Results

We used the Manhattan distance to compute the distance between two examples. Similarity, classification accuracy, and training set compression rates were determined using class-stratified 5-fold cross-validation. All experiments were repeated three times, each time reshuffling all examples in the original dataset; the values reported in the tables are averages over the three runs. All detailed results are available at the website [11]. The following subsections discuss the experimental results.

3.3.1 Similarity Between the Editing Techniques

Let R1 and R2 be two dataset editing techniques, and assume that, when applied to dataset O, they produce edited training subsets O_R1 and O_R2. We define the agreement between the two editing algorithms R1 and R2 as follows:

Agreement(R1, R2) = |O_R1 ∩ O_R2| / |O_R1 ∪ O_R2|    (2)

where O_R1 ∪ O_R2 is the union of all examples in the subsets O_R1 and O_R2, while O_R1 ∩ O_R2 is the intersection of the examples in the two subsets. An agreement of 1 means that techniques R1 and R2 produce identical edited training sets; an agreement of 0 means that they produce subsets with no examples in common. Table 2 reports the agreements between the four editing algorithms for the Iris Plants and Glass datasets. The results obtained on the other UCI datasets showed the same trend and have therefore not been included.

As expected, the experimental results show high similarities between Wilson, Citation, and Multi-edit. These three techniques are based on the same principle: remove from the training set the examples misclassified by a K-NN classifier.

Table 2: Similarities among the editing techniques for (a) the Iris Plants dataset and (b) the Glass dataset, giving pairwise agreements between Wilson, Multi-edit, Citation editing, and supervised clustering. Most numeric entries were lost in this transcription; the surviving values are the Wilson/Multi-edit agreements of 97.2 for Iris Plants and 51.4 for Glass.

Notice that the similarity is higher between Citation and Wilson. Supervised clustering editing, on the other hand, is very dissimilar to the other three techniques. Unlike Wilson, Citation, and Multi-edit, supervised clustering editing removes both correctly classified and misclassified examples; consequently, its similarities to all other techniques are very low. As can be seen in the table, the similarities among Wilson, Citation, and Multi-edit decrease considerably for the Glass dataset, while their similarities to supervised clustering increase. We believe the reason for this observation is that Glass is a more difficult dataset (see Table 3), so a higher percentage of examples is removed by Wilson, Citation, and Multi-edit due to a larger percentage of misclassifications for the Glass dataset (58, 17, and 72 examples, respectively) compared to the Iris Plants dataset (9, 5, and 9 examples, respectively).
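Over one training fold, Equation (2) is simply the Jaccard coefficient of the two sets of retained example indices; a one-function sketch (naming ours), directly usable with the index-returning editing sketches of Section 2:

```python
def agreement(kept_r1, kept_r2):
    """Equation (2): |O_R1 ∩ O_R2| / |O_R1 ∪ O_R2| over retained indices."""
    s1, s2 = set(kept_r1), set(kept_r2)
    return len(s1 & s2) / len(s1 | s2)

# e.g.: agreement(wilson_edit(X, y), citation_edit(X, y))
```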

3.3.2 Classification Accuracy of Editing Techniques

We also computed the classification accuracy of a traditional 1-NN classifier to evaluate the editing techniques. The experimental results are shown in Tables 3 and 4. Based on the results reported in Table 3, we see that on 6 of the 11 UCI datasets (Iris, Waveform, Diabetes, Heart-H, Vote, and Heart-StatLog) traditional 1-NN classification was outperformed by nearly all of the editing techniques. In Table 4 we observe that on 7 of the 9 2D datasets (including Complex9_RN8/16/32 and Complex9_GN8/16/32) traditional 1-NN classification was outperformed by three of the editing techniques: Wilson, Multi-edit, and Citation editing.

Table 3: Classification accuracy for the UCI datasets, comparing 1-NN, Wilson editing, Multi-edit, Citation editing, and SC editing on IRIS, Glass, Diabetes, Waveform, Heart-H, Ionosphere, Segmentation, Vehicle, Vote, Vowel, and Heart-StatLog (the numeric values were lost in this transcription).

Table 4: Classification accuracy for the 2D datasets, comparing 1-NN, Wilson editing, Multi-edit, Citation editing, and SC editing on Complex9, Complex9_RN8/16/32, Complex9_GN8/16/32, Complex8, and 9Diamonds (the numeric values were lost in this transcription).

Moreover, as the noise injected into the dataset increases, the benefit of editing increases as well, as can be seen in Table 4. For example, Wilson editing achieves a 2% relative improvement over the 1-NN accuracy of 92.1% for Complex9 with 8% noise (Complex9_RN8), compared to an 11% improvement in classification accuracy when the noise rises to 32% (Complex9_RN32).

Comparing the editing techniques among themselves, Table 3 shows that on 5 of the 11 UCI datasets supervised clustering editing (SCE) outperformed all other editing techniques; nevertheless, SCE performed quite badly on the Vowel dataset. Wilson editing and Citation editing showed very strong performance as well. The results in Table 3 also show that Multi-edit performed quite poorly on the UCI datasets, with the exception of the Heart-StatLog dataset. The results in Table 4 show that Multi-edit performs much better on the 2D datasets, especially as noise increases. We suspect this observation can be attributed to the fact that the 2D datasets are very dense, with a large number of examples of a single class occupying a particular region. It is therefore less likely that an example is erroneously removed because it gets separated from nearest neighbors that share its class label during the diffusion step of Multi-edit.

3.3.3 Training Set Compression Rate (TSCR)

The training set compression rate measures the percentage by which an editing technique reduces the training set. In general, if an editing technique reduces the size of a training set from n examples down to r examples, the training set compression rate is:

Training Set Compression Rate = 1 − r/n    (3)

Tables 5 and 6 report the training set compression rates for the UCI and 2D datasets, respectively.
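The evaluation protocol behind Tables 3 through 6 — edit each training fold, train a 1-NN classifier with Manhattan distance on the edited set, and record accuracy together with the compression rate of Equation (3) — can be sketched with scikit-learn as follows. This is our reconstruction of the setup described in Section 3.3; the paper additionally repeats the whole procedure three times with reshuffled data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_editing(X, y, edit_fn, folds=5, seed=0):
    """Class-stratified cross-validation of one editing technique; edit_fn
    returns the indices of the retained training examples."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accuracies, rates = [], []
    for train, test in skf.split(X, y):
        kept = edit_fn(X[train], y[train])
        clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
        clf.fit(X[train][kept], y[train][kept])
        accuracies.append(clf.score(X[test], y[test]))
        rates.append(1.0 - len(kept) / len(train))   # Equation (3): 1 - r/n
    return float(np.mean(accuracies)), float(np.mean(rates))
```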

Inspecting the results reported in Table 5, we clearly see that the highest training set compression rates are achieved by supervised clustering editing for all datasets. Coupled with the good relative performance of supervised clustering editing with respect to classification accuracy reported in Section 3.3.2, this improves its overall standing considerably. For example, supervised clustering editing reduced the Vote dataset, which contains 348 examples, to a dataset containing only 10 representative examples on average, whereas Wilson, Citation, and Multi-edit reduced the Vote dataset to 319, 338, and 298 examples, respectively. When applying traditional 1-NN classification to a training set containing only the 10 cluster representatives selected by supervised clustering, we achieved better classification accuracy than when applying traditional 1-NN classification to the edited datasets produced by the other three editing techniques.

Table 5: TSCR for the UCI datasets (SC, Wilson, Multi-edit, and Citation editing on IRIS, Glass, Diabetes, Waveform, Heart-H, Ionosphere, Vote, Vehicle, Segmentation, Vowel, and Heart-StatLog; the numeric values were lost in this transcription).

Table 6: TSCR for the 2D spatial datasets (SC, Wilson, Multi-edit, and Citation editing on Complex9, Complex9_RN8/16/32, Complex9_GN8/16/32, Complex8, and 9Diamonds; the numeric values were lost in this transcription).

As mentioned earlier, both Wilson and Citation editing reduce the size of a dataset by removing examples that have been misclassified by a K-NN classifier. Consequently, their training set compression rates are quite low on easy classification tasks for which high prediction accuracies are normally achieved. For example, Wilson editing produces reduction rates of only 0.7%, 2.3%, and 6.0% for the Vowel, Segmentation, and Iris datasets, respectively, and removes even fewer examples from the 2D datasets. Similarly, Wilson and Citation editing produced training set compression rates close to 0% for the noise-free 2D datasets, namely Complex9, Complex8, and 9Diamonds, whereas supervised clustering editing achieved compression rates of more than 98% on those datasets.

3.4 Overall Performance Analysis

Table 7 reports the average classification accuracy and training set compression rates of the investigated editing techniques over the examined datasets. We can see that Wilson editing, Citation editing, and supervised clustering editing perform better than traditional 1-NN classification on both the UCI and the 2D datasets, whereas Multi-edit performs worse than 1-NN classification.

Table 7: Average classification accuracy and training set compression rate of 1-NN, Wilson, Multi-edit, Citation, and SC editing, reported separately for (a) the UCI datasets and (b) the 2D datasets (the numeric values were lost in this transcription).

3.5 Visualization of Different Editing Techniques

This section visualizes the results of the different editing techniques applied to the Complex9_RN32 dataset, providing a better understanding of which examples each technique removes. Figure 6 presents a visualization of the similarity between Citation and Wilson editing. Examples deleted by Citation editing are represented by big (black) dots, examples deleted by Wilson editing are marked by (green) triangles, and the original Complex9_RN32 examples are depicted as small dark (blue) dots. It can be seen that most of the examples deleted by the Wilson and Citation techniques are located between clusters, i.e., in the boundary areas.

Figure 6: Visualization of Citation, Wilson, and SC editing for the Complex9_RN32 dataset (legend: original dataset; deleted by Citation editing; deleted by Wilson editing; selected by SC editing)

Figure 6 also shows, as (red) squares, the cluster representatives selected (to stay) by the supervised clustering algorithm (the clusters themselves are not shown). Notice that these examples do not lie in the boundary regions but in the clusters themselves.
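A plot in the spirit of Figure 6 can be produced directly from the index sets returned by the editing sketches above; the marker choices follow the figure legend, everything else is ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_editing(X, kept_wilson, kept_citation, sc_reps):
    """Rough analogue of Figure 6 for a 2D dataset."""
    deleted_w = np.setdiff1d(np.arange(len(X)), kept_wilson)
    deleted_c = np.setdiff1d(np.arange(len(X)), kept_citation)
    plt.scatter(X[:, 0], X[:, 1], s=4, c="blue", label="original dataset")
    plt.scatter(X[deleted_c, 0], X[deleted_c, 1], s=25, c="black",
                label="deleted by Citation editing")
    plt.scatter(X[deleted_w, 0], X[deleted_w, 1], s=25, c="green", marker="^",
                label="deleted by Wilson editing")
    plt.scatter(X[sc_reps, 0], X[sc_reps, 1], s=45, c="red", marker="s",
                label="selected by SC editing")
    plt.legend()
    plt.show()
```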

Figure 7 shows the clustering of the Complex9 dataset (without noise) obtained by the representative-based SC algorithm SRIDHCR [4]; the selected representatives are marked by squares with an x in the middle. Due to the nature of supervised clustering (see Section 3.2 and [4]), it selects as cluster representatives examples that minimize the number of misclassified examples. To accomplish this objective, some selected representatives are located in the middle of their cluster (the lower-left rectangular clusters in Figure 7), while others are located on the edge of their cluster (the lower-right two clusters that make up the U-shaped pair in Figure 7).

Figure 7: Cluster representatives selected by supervised clustering for Complex9 (legend: SC representatives)

Moreover, representatives of neighboring clusters that belong to different classes are usually lined up opposite each other with proper spacing between them, to minimize wrongly clustered examples; this can be seen in Figure 7 for the two representatives of the two rectangular clusters, and again for the question-mark-shaped cluster and the U-shaped one to its left. This placement makes it less likely that a representative attracts examples belonging to the neighboring cluster. In summary, the representative-based clustering algorithm picked only 21 of the 3031 examples in the training set. Using these 21 examples as cluster representatives, as depicted in Figure 7, the class purity (see Formula 1) was 99.4%, due to the presence of a few minority examples in the right-most cluster and in the two interleaved U-shaped clusters near the bottom center.

In general, representative-based clustering algorithms create clusters by assigning points to the closest representative. A challenge these algorithms face is that they are limited to discovering clusters of convex shape. Therefore, in order for these algorithms to discover clusters of non-convex shape, such clusters have to be approximated by a union of neighboring clusters that are dominated by the same class. An example of such a case is the large circular cluster in Figure 7, which was approximated by 5 neighboring clusters without any loss in purity.
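The purity figure quoted above is the complement of the impurity term in Formula (1); restated as a standalone helper (naming ours) for a given set of representatives:

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_purity(X, y, reps):
    """Fraction of examples whose class matches the majority class of the
    cluster they join when assigned to the closest representative."""
    assign = cdist(X, X[reps], metric="cityblock").argmin(axis=1)
    majority = sum(np.bincount(y[assign == j]).max() for j in np.unique(assign))
    return majority / len(y)
```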

As part of this research, we also investigated a potential dataset editing technique based on support vector machines (SVMs) [8]. Practical implementations of SVM classifiers produce a solution in the form of a discriminant function, which is then used to classify unknown examples. The training examples that participate in the definition of the discriminant function are called support vectors. The technique we call SVM editing edits a dataset by keeping the examples identified by an SVM classifier as support vectors and removing all other examples. Our implementation was based on LIBSVM 2.8 [1]. The experiments were run using the RBF kernel function and the C-SVC SVM type. Values for the parameters C (the cost of constraint violations) and γ (the RBF kernel parameter) were chosen using an optimization tool that LIBSVM 2.8 offers.

Unfortunately, the SVM-based editing technique showed poor performance compared to all other editing techniques. Experimental results showed that SVM-based editing produced a 1-NN classifier with an average classification accuracy of 76.25%, compared to 81.81%, 82.66%, 72.80%, 82.73%, and 82.67% for the traditional 1-NN classifier, Wilson editing, Multi-edit, Citation editing, and supervised clustering editing, respectively. As can clearly be seen from these results, with the exception of the Multi-edit technique, the average performance of SVM editing is not only worse than that of all other editing techniques but also worse than that of the traditional 1-NN classifier.

Figure 8: Visualization of the supervised clustering and SVM editing techniques for Complex9_RN32 (legend: original dataset; selected by SVM; selected by SC)

Figure 8 shows, as (yellow) triangles, the examples selected as support vectors by the SVM editing technique for the Complex9_RN32 dataset. As explained in more detail in [8], support vectors are the examples that participate in defining the margin around the hyperplane that separates the different classes in the higher-dimensional feature space. It is natural that, when such examples are mapped back to the input space, most of them lie in the boundary areas between the classes, as can easily be seen in Figure 8. To be more precise, our experiments show that 59% of the support vectors in Figure 8 (909 of 1546) are noise examples; furthermore, 94% of the added noise examples (909 of 967) were selected as support vectors. SVM editing thus behaves more like a condensing technique than an editing technique, as it preserves the characteristics of the boundary areas between classes. This explains its low classification accuracy compared to the other techniques. Detailed experimental results for SVM editing are also available at [11].
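A sketch of SVM editing follows. The paper's implementation uses LIBSVM 2.8 and its parameter-search tool; below, scikit-learn's SVC (which wraps LIBSVM) and a small, arbitrarily chosen grid for C and gamma stand in for that setup.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def svm_edit(X, y):
    """Keep only the support vectors of an RBF-kernel C-SVC as the edited set."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1.0]}  # grid is ours
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
    search.fit(X, y)
    return search.best_estimator_.support_    # indices of the retained examples
```

Returning indices keeps this interchangeable with the other editing sketches, so the same evaluate_editing harness of Section 3.3.3 applies.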

4 Summary

We investigated several dataset editing algorithms, including Wilson, Citation, and Multi-edit. The performance of these algorithms was studied and compared to each other as well as to supervised clustering editing, which replaces the examples belonging to a cluster by a cluster representative. Experimental results show that Wilson and Citation editing are quite similar with respect to the examples they remove, and that supervised clustering editing has very little similarity to the other techniques. With respect to classification accuracy, the experiments showed strong performance for supervised clustering editing as well as for the Wilson and Citation editing techniques. Moreover, all three techniques, in particular Citation and Wilson editing, seem to perform particularly well on datasets that contain a lot of noise. Multi-edit and SVM editing, on the other hand, performed quite poorly with respect to classification accuracy. Our experimental results on training set compression rates show that the strong performance of supervised clustering editing with respect to classification accuracy is further complemented by high training set compression rates. Wilson and Citation editing, on the other hand, achieve relatively low training set compression rates, especially on datasets that are easy to classify.

In summary, our experimental results demonstrate the benefits of editing techniques for enhancing the accuracy and performance of nearest neighbor classifiers. We also believe that exploring the potential of editing techniques for enhancing other classification techniques, such as decision trees and support vector machines, deserves more attention in future research.

References

1. Chang, C.-C. and Lin, C.-J.: LIBSVM -- A Library for Support Vector Machines. Version 2.8, April.
2. Dasarathy, B.V., Sánchez, J.S., and Townsend, S.: Nearest Neighbor Editing and Condensing Tools -- Synergy Exploitation. Pattern Analysis & Applications, Vol. 3, No. 1.
3. Devijver, P. and Kittler, J.: On the Edited Nearest Neighbor Rule. Pattern Recognition (IEEE), Vol. 1, 1980.
4. Eick, C.F., Zeidat, N., and Zhao, Z.: Supervised Clustering -- Algorithms and Benefits. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Boca Raton, Florida, November 2004.
5. Eick, C.F., Zeidat, N., and Vilalta, R.: Using Representative-Based Clustering for Nearest Neighbor Dataset Editing. ICDM 2004.
6. Garfield, E.: Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. John Wiley & Sons, New York.
7. Karypis, G., Han, E.H., and Kumar, V.: Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. IEEE Computer, Vol. 32, No. 8, pp. 68-75, August.
8. Runarsson, T.P. and Sigurdsson, S.: Support Vector Machines. January; online document.
9. Salvador, S. and Chan, P.: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. ICTAI 2004.
10. University of California at Irvine, Machine Learning Repository.
11. University of Houston, Machine Learning and Data Mining Group, cs.uh.edu/~sujingwa/pkdd05/
12. Wilson, D.L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 2, 1972.


More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

Dynamic Ensemble Construction via Heuristic Optimization

Dynamic Ensemble Construction via Heuristic Optimization Dynamic Ensemble Construction via Heuristic Optimization Şenay Yaşar Sağlam and W. Nick Street Department of Management Sciences The University of Iowa Abstract Classifier ensembles, in which multiple

More information

Motivation. Technical Background

Motivation. Technical Background Handling Outliers through Agglomerative Clustering with Full Model Maximum Likelihood Estimation, with Application to Flow Cytometry Mark Gordon, Justin Li, Kevin Matzen, Bryce Wiedenbeck Motivation Clustering

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition

A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition Dinesh Mandalapu, Sridhar Murali Krishna HP Laboratories India HPL-2007-109 July

More information

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016 edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract

More information

Multiple Classifier Fusion using k-nearest Localized Templates

Multiple Classifier Fusion using k-nearest Localized Templates Multiple Classifier Fusion using k-nearest Localized Templates Jun-Ki Min and Sung-Bae Cho Department of Computer Science, Yonsei University Biometrics Engineering Research Center 134 Shinchon-dong, Sudaemoon-ku,

More information

Study on Classifiers using Genetic Algorithm and Class based Rules Generation

Study on Classifiers using Genetic Algorithm and Class based Rules Generation 2012 International Conference on Software and Computer Applications (ICSCA 2012) IPCSIT vol. 41 (2012) (2012) IACSIT Press, Singapore Study on Classifiers using Genetic Algorithm and Class based Rules

More information

The Data Mining Application Based on WEKA: Geographical Original of Music

The Data Mining Application Based on WEKA: Geographical Original of Music Management Science and Engineering Vol. 10, No. 4, 2016, pp. 36-46 DOI:10.3968/8997 ISSN 1913-0341 [Print] ISSN 1913-035X [Online] www.cscanada.net www.cscanada.org The Data Mining Application Based on

More information