A density-based approach for instance selection

2015 IEEE 27th International Conference on Tools with Artificial Intelligence

Joel Luis Carbonera, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, jlcarbonera@inf.ufrgs.br
Mara Abel, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, marabel@inf.ufrgs.br

Abstract: Instance selection is an important preprocessing step that can be applied in many machine learning tasks. Due to the increasing size of datasets, instance selection techniques have been applied for reducing the data to a manageable volume, leading to a reduction of the computational resources that are necessary for performing the learning process. Besides that, instance selection algorithms can also be applied for removing useless, erroneous or noisy instances before applying learning algorithms, which can improve the accuracy in classification problems. In recent years, several approaches for instance selection have been proposed. However, most of them have long runtimes and, due to this, they cannot be used for dealing with large datasets. In this paper, we propose a simple and effective density-based approach for instance selection. Our approach, called LDIS (local density-based instance selection), evaluates the instances of each class separately and keeps only the densest instances in a given (arbitrary) neighborhood. This ensures a reasonably low time complexity. Our approach was evaluated on 15 well-known datasets and its performance was compared with the performance of 5 state-of-the-art algorithms, considering three measures: accuracy, reduction and effectiveness. For evaluating the accuracy achieved using the datasets produced by the algorithms, we applied the KNN algorithm. The results show that LDIS achieves a performance (in terms of balance of accuracy and reduction) that is better than or comparable to the performances of the other algorithms considered in the evaluation.

Keywords: Instance selection; instance reduction; dataset reduction; machine learning; data mining; instance-based learning

I. INTRODUCTION

According to [1], instance selection (IS) is a task that consists of choosing a subset of the total available data to achieve the original purpose of the data mining (or machine learning) application as if the whole data had been used. In general, it constitutes a family of methods that perform the selection of the best possible subset of examples from the original data, by using some rules and/or heuristics. Considering this, the optimal outcome of IS would be the minimum data subset that can accomplish the same task with no performance loss. Thus, in the optimal scenario, P(A_S) = P(A_T), where P is the performance, A is the machine learning algorithm, A_T represents the algorithm A applied over the complete dataset T, and A_S represents the algorithm A applied over a subset S of the dataset T. Thus, every instance selection strategy should face a trade-off between the reduction rate of the dataset and the classification quality [2]. In general, instance selection can be applied for reducing the data to a manageable subset, leading to a reduction of the computational resources (in terms of time and space) that are necessary for performing the learning process [3], [4], [5], [1]. Besides that, instance selection algorithms can also be applied for removing useless (redundant), erroneous or noisy instances, before applying learning algorithms.
In this case, the accuracy of the learned models can increase after applying an instance selection technique [3], [1]. In recent years, several approaches for instance selection have been proposed [6], [7], [8], [9], [4], [5]. Most of these algorithms are designed for preserving the boundaries between different classes in the dataset, because border instances provide relevant information for supporting discrimination between classes [5]. However, in general, these algorithms have a high time complexity, which is not a desirable property for algorithms that should deal with large datasets. The high time complexity of these algorithms is a consequence of the fact that they usually search for the border instances within the whole dataset and, due to this, they usually need to perform comparisons between each pair of instances in the dataset. In this paper, we propose an algorithm for instance selection, called LDIS (Local Density-based Instance Selection)¹, which takes a different approach. Instead of focusing on the border instances, which usually are searched across the whole dataset, our approach analyses the instances of each class separately and focuses on keeping only the densest instances in a given (arbitrary) neighborhood. That is, in a first step, our algorithm determines the set of instances that are classified by each class. In each resulting set, for each instance x, the algorithm calculates its local density (its density within its class) and its partial k-neighborhood (where k is determined by the user), which is the set of k nearest neighbors of x that have the same class label as x. Finally, for each instance, if its density is greater than or equal to the density of the densest instance within its partial k-neighborhood, the instance is preserved in the resulting dataset.

¹ The source code of the algorithm is available in br/ jlcarbonera/?page id=

Our approach was evaluated on 15 well-known datasets and its performance was compared with the performance of 5 important algorithms provided by the literature. The accuracy was evaluated considering the KNN algorithm [10]. The results show that, compared to the other algorithms, LDIS provides the best trade-off between accuracy and reduction, while presenting a reasonably low time complexity. Section II presents some related works. Section III presents the notation that will be used throughout the paper. Section IV presents our approach (LDIS). Section V discusses our experimental evaluation. Finally, Section VI presents our main conclusions and final remarks.

II. RELATED WORKS

In this section, we discuss some important instance reduction methods. For most of the algorithms presented here, we consider T as representing the original set of instances in the training set and S, where S ⊆ T, as the reduced set of instances resulting from the instance selection process.

The Condensed Nearest Neighbor (CNN), proposed in [11], randomly selects one instance that belongs to each class from T and puts them in S. In the next step, each instance in T is classified using only the instances in S. If an instance is misclassified, it is added to S, in order to ensure that it will be classified correctly. This process repeats until there is no instance in T that is misclassified. It is important to notice that CNN can assign noisy and outlier instances to S, causing negative effects on the classification accuracy. Also, CNN is dependent on the instance order in the training set T. The time complexity of CNN is O(|T|²), where |T| is the size of the training set.

The Reduced Nearest Neighbor algorithm (RNN) [12] first assigns all instances in T to S. Then it removes each instance from S, until further removal causes no other instances in T to be misclassified by the remaining instances in S. RNN produces subsets S that are smaller than the subsets produced by CNN, and is less sensitive to noise than CNN. However, the time complexity of RNN is O(|T|³), which is higher than the time complexity of CNN.

In [2], the authors propose an extension of CNN, called Generalized Condensed Nearest Neighbor (GCNN). This approach operates in a way that is similar to CNN. However, GCNN includes in S instances which satisfy an absorption criterion. Considering d_N(x) as the distance between x and its nearest neighbor, and d_E(x) as the distance between x and its nearest enemy (instance of a class that is different from the class of x), x is included in S if d_N(x) − d_E(x) > ρ, where ρ is an arbitrary threshold. GCNN can produce sets S that are smaller than the sets produced by CNN. However, determining the value of ρ can be a challenge.

In the Edited Nearest Neighbor (ENN) algorithm [6], all training instances are first assigned to S. Then, each instance in S is removed if it does not agree with the label of the majority of its k nearest neighbors. This algorithm removes noisy and outlier instances and, thus, it can improve classification accuracy. It keeps internal instances, in contrast to approaches that remove boundary instances; therefore, it cannot reduce the dataset as much as other reduction algorithms. The literature provides some extensions to this method, such as [13].
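To make the incremental condensation idea behind CNN concrete, the sketch below renders the rule described above in Python. It is only an illustration, not the authors' code: the helper names (condensed_nn, nn_label), the use of Euclidean distance over numeric features and the random per-class seeding are assumptions of the sketch.

    # Illustrative CNN-style condensation (hypothetical names, not from the paper).
    import random
    from typing import List, Tuple

    Instance = Tuple[List[float], str]  # (numeric feature vector, class label)

    def euclidean(a: List[float], b: List[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def nn_label(x: List[float], S: List[Instance]) -> str:
        # Label assigned by the 1-NN rule using only the instances already in S.
        return min(S, key=lambda p: euclidean(x, p[0]))[1]

    def condensed_nn(T: List[Instance]) -> List[Instance]:
        # Seed S with one randomly chosen instance per class.
        by_label = {}
        for vec, label in T:
            by_label.setdefault(label, []).append((vec, label))
        S = [random.choice(group) for group in by_label.values()]
        # Keep adding every instance that S still misclassifies, until none is left.
        changed = True
        while changed:
            changed = False
            for vec, label in T:
                if (vec, label) not in S and nn_label(vec, S) != label:
                    S.append((vec, label))
                    changed = True
        return S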
In [7], the authors propose 5 approaches, the Decremental Reduction Optimization Procedure (DROP) algorithms. In these algorithms, each instance x has k nearest neighbors, and the instances that have x as one of their k nearest neighbors are called the associates of x. Among the proposed algorithms, DROP3 has the best trade-off between the reduction of the dataset and the accuracy of the classification. As an initial step, it applies a noise filter algorithm such as ENN. Then it removes an instance x if its associates in the original training set can be correctly classified without x. The main drawback of DROP3 is its high time complexity.

The Iterative Case Filtering algorithm (ICF) [8] is based on the notions of coverage set and reachable set. The coverage set of an instance x is the set of instances in T whose distance from x is less than the distance between x and its nearest enemy (instance with a different class). The reachable set of an instance x is the set of instances in T that have x in their respective coverage sets. In this method, a given instance x is removed from S if |Reachable(x)| > |Coverage(x)|, that is, when the number of other instances that can classify x correctly is greater than the number of instances that x can correctly classify.

Recently, in [5], the authors propose three complementary methods for instance selection that are based on the notion of local sets. In this context, the local set of a given instance x is the set of instances contained in the largest hypersphere centered on x such that it does not contain instances from any other class. The first algorithm, the local set-based smoother (LSSm), was proposed for removing instances that are harmful, that is, instances that misclassify more instances than they correctly classify. It uses two notions for guiding the removal process: usefulness and harmfulness. The usefulness u(x) of a given instance x is the number of instances having x among the members of their local sets, and the harmfulness h(x) is the number of instances having x as their nearest enemy. For each instance x in T, the algorithm includes x in S if u(x) ≥ h(x). The time complexity of LSSm is O(|T|²). The second algorithm, the Local Set-based Centroids Selector method (LSCo), first applies LSSm for removing noise and then applies LS-clustering [14] for identifying clusters in T without invasive points (instances that are surrounded by instances of a different class). The algorithm keeps in S only the centroids of the resulting clusters.
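The usefulness/harmfulness rule of LSSm can also be sketched directly from the definitions above. The snippet below is a rough illustration under simplifying assumptions (numeric features, Euclidean distance, at least two classes); the function name lssm and the tie-breaking choices are not from the paper.

    # Rough sketch of the LSSm filtering rule (illustrative, not the authors' code).
    from typing import List, Tuple

    def euclidean(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def lssm(T: List[Tuple[List[float], str]]) -> List[Tuple[List[float], str]]:
        n = len(T)
        # For every instance, find the distance to (and index of) its nearest enemy.
        enemy_dist, enemy_idx = [0.0] * n, [0] * n
        for i, (xi, yi) in enumerate(T):
            enemy_dist[i], enemy_idx[i] = min(
                (euclidean(xi, xj), j) for j, (xj, yj) in enumerate(T) if yj != yi)
        usefulness = [0] * n   # u(x): how many local sets contain x
        harmfulness = [0] * n  # h(x): how many instances have x as their nearest enemy
        for i, (xi, yi) in enumerate(T):
            harmfulness[enemy_idx[i]] += 1
            for j in range(n):
                # j is in the local set of i if it lies strictly inside the largest
                # enemy-free hypersphere centred on i.
                if i != j and euclidean(xi, T[j][0]) < enemy_dist[i]:
                    usefulness[j] += 1
        # Keep x when u(x) >= h(x), i.e. when it helps at least as much as it harms.
        return [T[i] for i in range(n) if usefulness[i] >= harmfulness[i]]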

Finally, the Local Set Border Selector (LSBo) first applies LSSm for removing noise, and afterwards it computes the local set of every instance in T. Then, the instances in T are sorted in ascending order of the cardinality of their local sets. In the last step, LSBo verifies, for each instance x in T, whether any member of its local set is contained in S, thus ensuring the proper classification of x. If that is not the case, x is included in S to ensure its correct classification. The time complexity of the three approaches is O(|T|²). Among the three algorithms, LSBo provides the best balance between reduction and accuracy.

Other approaches can be found in surveys such as the ones provided in [15], [5], [1].

III. NOTATIONS

In this section, we introduce the following notation, which will be used throughout the paper:

T = {x_1, x_2, ..., x_n} is a non-empty set of n instances (or data objects). It represents the original dataset that should be reduced in the instance selection process. Each x_i ∈ T is an m-tuple, such that x_i = (x_i1, x_i2, ..., x_im), where x_ij represents the value of the j-th feature of the instance x_i, for 1 ≤ j ≤ m.

L = {l_1, l_2, ..., l_p} is the set of p class labels that are used for classifying the instances in T, where each l_i ∈ L represents a given class label.

l: T → L is a function that maps a given instance x_i ∈ T to its corresponding class label l_j ∈ L.

c: L → 2^T is a function that maps a given class label l_j ∈ L to a given set C, such that C ⊆ T, which represents the set of instances in T whose class is l_j. Notice that T = ∪_{l∈L} c(l). In this notation, 2^T represents the powerset of T, that is, the set of all subsets of T, including the empty set and T itself.

pkn: T × N≥1 → 2^T is a function that maps a given instance x_i ∈ T and a given k ∈ N≥1 (k ≥ 1) to a given set C, such that C ⊆ c(l(x_i)), which represents the set of the k nearest neighbors of x_i in c(l(x_i)) (excepting x_i itself). Since the resulting set C includes only the neighbors that have a given class label, it defines a partial k-neighborhood.

S = {x_1, x_2, ..., x_q} is a set of q instances, such that S ⊆ T. It represents the reduced set of instances that results from the instance selection process.

IV. A LOCAL DENSITY-BASED APPROACH FOR INSTANCE SELECTION

As discussed in Section II, most of the algorithms proposed for instance selection in the literature are designed for searching the class boundaries in the whole dataset. That is, they perform a global search in the dataset. Although, in general, keeping the border instances in the dataset results in a high accuracy in classification tasks, the global search for these border points involves a high computational cost. The resulting high time complexity is a consequence of the fact that, in general, identifying the border instances in the dataset requires comparisons between each pair of instances in the dataset. In this paper, we explore another strategy for selecting instances. Instead of identifying the border points, our approach identifies instances that have a high concentration of instances near them. Besides that, instead of searching for these instances in the whole dataset, our approach deals with the instances of each class of the dataset separately, searching for the representative instances within the set of instances of each class. Since our algorithm applies only a local search strategy (within each class), it has runtimes that are lower than the runtimes of approaches that adopt a global search strategy.
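Since the selection criterion presented next operates on c(l) and pkn(x, k), it may help to see these notions rendered directly in code. The sketch below uses illustrative names and a generic distance function d; it is a reading aid, not the authors' implementation.

    # The notation of Section III as plain Python (illustrative only).
    from typing import Callable, Dict, List, Tuple

    Instance = Tuple[Tuple[float, ...], str]   # (feature vector, class label)

    def c(T: List[Instance]) -> Dict[str, List[Instance]]:
        # c maps each class label to the instances of T carrying that label.
        groups: Dict[str, List[Instance]] = {}
        for inst in T:
            groups.setdefault(inst[1], []).append(inst)
        return groups

    def pkn(x: Instance, k: int, same_class: List[Instance],
            d: Callable[[Tuple[float, ...], Tuple[float, ...]], float]) -> List[Instance]:
        # Partial k-neighborhood: the k nearest neighbors of x drawn only from
        # its own class, excluding x itself (k is capped when the class is small).
        others = [y for y in same_class if y is not x]
        return sorted(others, key=lambda y: d(x[0], y[0]))[:min(k, len(others))]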
For identifying the representative instances, in our approach, we adopt the notion of density, adapted from [16], [17], [18], [19], which is formalized by the Dens function:

Dens(x, P) = −(1/|P|) Σ_{y∈P} d(x, y)   (1)

where x is a given instance, P = {x_1, x_2, ..., x_q} is a set of q instances, x ∉ P, and d is a given distance function. Notice that Dens(x, P) provides the density of the instance x relative to the set P of instances. In this way, when P is a subset of the whole dataset, Dens(x, P) represents the local density of x, considering the set P.

Our approach assumes that the local density Dens(x, c) of a given instance x in a set c, where c(l(x)) = c, is proportional to the usefulness of x for classifying new instances of c. Thus, the locally densest instance of a given neighborhood would represent more information about its surroundings than the less dense instances and, due to this, it would be more representative of the neighborhood than its less dense neighbors. Considering this, for each l ∈ L, the LDIS algorithm verifies, for each x ∈ c(l), whether there is some instance y ∈ pkn(x, k) (where k is arbitrarily chosen by the user) such that Dens(y, c(l)) > Dens(x, c(l)). If this is not the case, x is the locally densest instance in its partial k-neighborhood and, due to this, x is included in S. Algorithm 1 formalizes this strategy. Notice that when |c(l(x))| ≤ k, it is necessary to consider k = |c(l(x))| − 1 for calculating the partial k-neighborhood of x.

Algorithm 1: LDIS (Local Density-based Instance Selection)
    Input: a set of instances T and the number k of neighbors.
    Output: a subset S of T.
    begin
        S ← ∅;
        foreach l ∈ L do
            foreach x ∈ c(l) do
                foundDenser ← false;
                foreach neighbor ∈ pkn(x, k) do
                    if Dens(x, c(l)) < Dens(neighbor, c(l)) then
                        foundDenser ← true;
                if not foundDenser then
                    S ← S ∪ {x};
        return S;

Notice that the most expensive steps of the algorithm involve determining the partial density and the partial k-neighborhood of each instance of a given set c(l) (for some class label l). Determining the partial density of every instance of a given set c(l) is a process whose time complexity is proportional to O(|c(l)|²); the time complexity of determining the partial k-neighborhoods is equivalent. An efficient implementation of Algorithm 1 could calculate the partial k-neighborhood and the partial density of each instance of a given set c(l) (for some class label l) just once, as a first step within the main loop, and use this information for further calculations. Considering this, the time complexity of LDIS is proportional to O(Σ_{l∈L} |c(l)|²).
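For illustration, the selection strategy of Algorithm 1 can be sketched as follows. The snippet assumes numeric feature vectors and Euclidean distance, and implements density as the negative mean within-class distance of Equation 1, so that larger values correspond to denser instances; names such as ldis are illustrative, and this is not the authors' released implementation.

    # Minimal sketch of LDIS (illustrative, not the authors' code).
    from typing import Dict, List, Tuple

    def euclidean(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def ldis(T: List[Tuple[Tuple[float, ...], str]], k: int):
        # Group the dataset by class label: all further work is local to each class.
        classes: Dict[str, List[Tuple[float, ...]]] = {}
        for vec, label in T:
            classes.setdefault(label, []).append(vec)
        S = []
        for label, members in classes.items():
            m = len(members)
            # Pairwise distances, densities and neighborhoods are computed once per class.
            dist = [[euclidean(a, b) for b in members] for a in members]
            dens = [-sum(row) / max(m - 1, 1) for row in dist]   # Eq. 1: negative mean distance to the rest of the class
            kk = min(k, m - 1)                                   # cap k when the class is small
            for i in range(m):
                # Partial k-neighborhood of members[i]: its kk nearest same-class instances.
                neighbors = sorted((j for j in range(m) if j != i),
                                   key=lambda j: dist[i][j])[:kk]
                # Keep the instance only if no neighbor is denser than it.
                if all(dens[j] <= dens[i] for j in neighbors):
                    S.append((members[i], label))
        return S

Because every distance is computed inside a single class, the cost grows with the squared class sizes rather than with the squared dataset size: as a purely illustrative figure, a dataset of 10,000 instances split evenly into 10 classes requires about 10 × 1,000² = 10⁷ within-class distance computations, against roughly 10,000² = 10⁸ for a global pairwise search.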

V. EXPERIMENTS

For evaluating our approach, we compared the LDIS algorithm, presented in Section IV, with 5 important instance selection algorithms provided by the literature: DROP3, ENN, ICF, LSBo and LSSm. We considered 15 well-known datasets: breast cancer, cars, E. Coli, glass, iris, letter, lung cancer, mushroom, genetic promoters, segment, soybean², splice-junction gene sequences, congressional voting records, wine and zoo. All datasets were obtained from the UCI Machine Learning Repository. In Table I, we present the details of the datasets that were used.

Table I. Details of the datasets used in the evaluation process (number of instances, attributes and classes for each of the 15 datasets listed above).

² This dataset combines the large soybean dataset and its corresponding test dataset.

Our experimentation was designed for comparing three evaluation measures: accuracy, reduction and effectiveness. Following [5], we assume

accuracy = Success(Test) / |Test|   (2)

reduction = (|T| − |S|) / |T|   (3)

where Test is a given set of instances that are selected for being tested in a classification task, and Success(Test) is the number of instances in Test correctly classified in the classification task. Besides that, in this work we consider effectiveness as a measure of the degree to which an instance selection algorithm is successful in producing a small set of instances that allows a high classification accuracy of new instances. Thus, we consider effectiveness = accuracy × reduction.

For evaluating the accuracy of the classification of new instances, we applied the k-nearest neighbors (KNN) algorithm [10], considering k = 3, as considered in [5]. Besides that, the accuracy and reduction were evaluated in an n-fold cross-validation scheme, where n = 10. In this scheme, the dataset is first randomly partitioned into 10 equal-sized subsamples. From these subsamples, a single subsample is retained as validation data (Test), and the union of the remaining 9 subsamples is considered the initial training set (ITS). After, an instance selection algorithm is applied for reducing the ITS, producing the reduced training set (RTS). At this point, we can measure the reduction of the dataset. Finally, the RTS is used as the training set for the KNN algorithm, for evaluating the instances in Test. At this point, we can measure the accuracy achieved by KNN, using RTS as the training set. The cross-validation is repeated 10 times, with each subsample used once as Test. The 10 values of accuracy and reduction are averaged to produce, respectively, the average accuracy (AA) and average reduction (AR). The average effectiveness (AE) is calculated by considering AA and AR. Tables II, III and IV show, respectively, the resulting AA, AR and AE of each combination of dataset and instance selection algorithm. In these tables, the best results for each dataset are marked in bold typeface.
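The evaluation protocol above can be summarized by the following sketch. It assumes NumPy arrays X and y, a selector function select(X, y) returning the indices of the instances to keep (any of the algorithms above could be wrapped this way), and scikit-learn's KFold and KNeighborsClassifier; these library choices are assumptions made for illustration, not tools named in the paper.

    # Sketch of the 10-fold evaluation protocol (AA, AR and average effectiveness).
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate(X, y, select, n_splits=10, knn_k=3):
        accs, reds = [], []
        for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
            # Reduce the initial training set (ITS) to the reduced training set (RTS).
            X_its, y_its = X[train_idx], y[train_idx]
            keep = select(X_its, y_its)
            X_rts, y_rts = X_its[keep], y_its[keep]
            reds.append(1.0 - len(keep) / len(train_idx))     # reduction = (|T| - |S|) / |T|
            # Classify the held-out fold with KNN (k = 3) trained on the RTS.
            clf = KNeighborsClassifier(n_neighbors=knn_k).fit(X_rts, y_rts)
            accs.append(clf.score(X[test_idx], y[test_idx]))  # accuracy = Success(Test) / |Test|
        aa, ar = float(np.mean(accs)), float(np.mean(reds))
        return aa, ar, aa * ar                                # AA, AR, average effectiveness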
In this evaluation process, we adopted k = 3 for DROP3, ENN, ICF and LDIS. Besides that, we adopted the following distance function d: T × T → R:

d(x, y) = Σ_{j=1}^{m} θ_j(x, y)   (4)

where

θ_j(x, y) = α(x_j, y_j), if j is a categorical feature; |x_j − y_j|, if j is a numerical feature   (5)

α(x_j, y_j) = 1, if x_j ≠ y_j; 0, if x_j = y_j   (6)
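For reference, Equations 4 to 6 correspond to the following straightforward implementation. The categorical argument, which marks the positions of categorical features, is an implementation convenience assumed for the sketch.

    # Heterogeneous distance of Equations 4-6 (sketch).
    def mixed_distance(x, y, categorical):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(x, y)):
            if j in categorical:
                total += 0.0 if xj == yj else 1.0   # alpha(x_j, y_j), Equation 6
            else:
                total += abs(xj - yj)               # |x_j - y_j| for numerical features
        return total

    # Example: mixed_distance((1.0, "red"), (3.5, "blue"), categorical={1}) == 3.5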

Table II. Comparison of the accuracy achieved by the training set produced by each algorithm, for each dataset.

    Dataset          DROP3   ENN    ICF    LSBo   LSSm   LDIS (k=3)   Average
    Breast cancer    0.73    0.74   0.72   0.61   0.74   0.69         0.70
    Cars             0.75    0.76   0.76   0.65   0.76   0.76         0.74
    E. Coli          0.83    0.86   0.81   0.80   0.85   0.86         0.83
    Glass            0.62    0.65   0.62   0.56   0.71   0.59         0.63
    Iris             0.97    0.97   0.94   0.94   0.96   0.95         0.95
    Letter           0.87    0.92   0.80   0.75   0.92   0.77         0.84
    Lung cancer      0.31    0.34   0.41   0.32   0.45   0.38         0.37
    Mushroom         1.00    1.00   0.98   1.00   1.00   1.00         1.00
    Promoters        0.79    0.79   0.72   0.72   0.82   0.75         0.77
    Segment          0.93    0.95   0.91   0.87   0.95   0.91         0.92
    Soybean          0.85    0.90   0.84   0.63   0.91   0.78         0.82
    Splice           0.71    0.73   0.70   0.75   0.76   0.75         0.73
    Voting           0.92    0.92   0.91   0.89   0.92   0.91         0.91
    Wine             0.70    0.74   0.73   0.78   0.76   0.72         0.74
    Zoo              0.91    0.88   0.88   0.71   0.90   0.82         0.85
    Average          0.79    0.81   0.78   0.73   0.83   0.78         0.79

Table III. Comparison of the reduction achieved by each algorithm, for each dataset.

    Dataset          DROP3   ENN    ICF    LSBo   LSSm   LDIS (k=3)   Average
    Breast cancer    0.77    0.29   0.85   0.73   0.14   0.87         0.61
    Cars             0.88    0.18   0.82   0.74   0.12   0.86         0.60
    E. Coli          0.71    0.16   0.86   0.82   0.09   0.90         0.59
    Glass            0.76    0.32   0.69   0.72   0.14   0.91         0.59
    Iris             0.72    0.04   0.58   0.93   0.06   0.89         0.54
    Letter           0.68    0.05   0.80   0.83   0.04   0.83         0.54
    Lung cancer      0.70    0.59   0.74   0.52   0.17   0.86         0.60
    Mushroom         0.86    0.01   0.94   0.99   0.01   0.87         0.61
    Promoters        0.60    0.18   0.70   0.60   0.05   0.84         0.49
    Segment          0.69    0.04   0.83   0.91   0.04   0.83         0.55
    Soybean          0.68    0.09   0.58   0.83   0.05   0.78         0.50
    Splice           0.66    0.23   0.76   0.59   0.05   0.81         0.52
    Voting           0.79    0.07   0.92   0.89   0.04   0.77         0.58
    Wine             0.71    0.23   0.78   0.78   0.10   0.87         0.58
    Zoo              0.65    0.06   0.33   0.88   0.06   0.63         0.43
    Average          0.72    0.17   0.75   0.78   0.08   0.83         0.56

Table II shows that LSSm achieves the highest accuracy in most of the datasets. However, LSSm was designed for removing noisy instances and, due to this, it does not provide high reduction rates. In addition, notice that the accuracies achieved by LDIS are higher than the average for several datasets, and for 3 datasets LDIS achieves the highest accuracy. Besides that, the difference between the accuracy of LDIS and LSSm is not large, and can be compensated by the gain in reduction and in runtime provided by LDIS.

Table III shows that LDIS achieves the highest reduction in most of the datasets, and achieves the highest average reduction rate. Finally, Table IV shows that LDIS achieves the highest effectiveness in most of the datasets, and achieves the highest average effectiveness. Thus, these results show that, although LDIS does not provide the highest accuracies, it provides the highest reduction rates and the best trade-off between both measures (represented by the effectiveness).

We also carried out a comparison of the runtimes of the instance selection algorithms considered in our experiments. In this comparison, we applied the 5 instance selection algorithms to reduce the 3 largest datasets considered in our experiments: letter (Figure 1), splice-junction gene sequences (Figure 2) and mushroom (Figure 3). For conducting the experiments, we used an Intel Core i5-3210M laptop with a 2.5 GHz CPU and 6 GB of RAM. Figures 1, 2 and 3 show that, considering these three datasets, the LDIS algorithm has the lowest runtime compared to the other algorithms. This result is a consequence of the fact that LDIS deals with the set of instances of each class of the dataset separately, instead of performing a global search in the whole dataset.

Figure 1. Comparison of the runtimes of 5 instance selection algorithms, considering the Letter dataset.
Table IV. Comparison of the effectiveness achieved by each algorithm, for each dataset.

    Dataset          DROP3   ENN    ICF    LSBo   LSSm   LDIS (k=3)   Average
    Breast cancer    0.56    0.21   0.61   0.45   0.10   0.60         0.42
    Cars             0.66    0.14   0.62   0.49   0.09   0.65         0.44
    E. Coli          0.59    0.14   0.70   0.66   0.08   0.77         0.49
    Glass            0.47    0.21   0.42   0.40   0.10   0.54         0.36
    Iris             0.69    0.04   0.55   0.87   0.06   0.85         0.51
    Letter           0.59    0.05   0.65   0.62   0.03   0.64         0.43
    Lung cancer      0.22    0.20   0.30   0.17   0.08   0.32         0.21
    Mushroom         0.86    0.01   0.92   0.99   0.01   0.87         0.61
    Promoters        0.47    0.14   0.51   0.43   0.04   0.63         0.37
    Segment          0.64    0.04   0.75   0.79   0.04   0.75         0.50
    Soybean          0.58    0.08   0.49   0.52   0.04   0.61         0.39
    Splice           0.47    0.17   0.53   0.45   0.04   0.61         0.38
    Voting           0.72    0.07   0.84   0.79   0.04   0.71         0.53
    Wine             0.50    0.17   0.57   0.61   0.07   0.63         0.42
    Zoo              0.59    0.05   0.29   0.62   0.05   0.52         0.35
    Average          0.57    0.11   0.58   0.59   0.06   0.65         0.43

Figure 2. Comparison of the runtimes of 5 instance selection algorithms, considering the dataset of splice-junction gene sequences.

Figure 3. Comparison of the runtimes of 5 instance selection algorithms, considering the Mushroom dataset.

Finally, we also carried out experiments for evaluating the impact of the parameter k on the performance of LDIS. Tables V, VI and VII show, respectively, the accuracy, the reduction and the effectiveness of LDIS, with k assuming the values 1, 2, 3, 5, 10 and 20. These results show that the variation of k has a significant impact on the performance of the algorithm. The results suggest that, in general, as the value of k increases, the accuracy tends to decrease, the reduction increases, and the effectiveness increases up to a point from which it begins to decrease. This suggests the possibility of investigating strategies for automatically estimating the best value of k for determining the partial k-neighborhood of a given instance. Also, Table V shows some exceptions to the general rule, as in the case of the datasets cars, E. Coli and iris. We hypothesize that, in these cases, a higher value of k led to an additional removal of noisy instances, resulting in an increase of the accuracy. This hypothesis should be investigated in the future.

Table V. Comparison of the accuracy achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

Table VI. Comparison of the reduction achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

Table VII. Comparison of the effectiveness achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

VI. CONCLUSION

Most of the instance selection algorithms available in the literature usually search for the border instances in the whole dataset. Due to this, in general, these algorithms have a high time complexity. In this paper, we proposed an algorithm, called LDIS (Local Density-based Instance Selection), which adopts a different strategy. It analyzes the instances of each class separately, with the goal of keeping only the densest instances of a given neighborhood within each class. Since LDIS searches for the representative instances within each class, its resulting time complexity is reasonably low, compared to the main algorithms proposed in the literature. In an overview, our experiments showed that LDIS provides the best reduction rates and the best balance between accuracy and reduction, with a lower time complexity, compared with the other algorithms available in the literature.

In future works, we plan to investigate strategies for automatically estimating the best value of the parameter k for each problem. Regarding this point, it is reasonable to hypothesize that the best value of k can be different for different classes within the same dataset. This hypothesis should be investigated as well. We also plan to develop a version of LDIS that abstracts the information of the neighbor instances within the partial k-neighborhood of a dense instance, instead of just eliminating the instances that are less dense.

Besides that, we also plan to investigate how the LDIS algorithm can be combined with other instance selection algorithms. Finally, the performance of LDIS encourages the investigation of novel instance selection strategies that are based on other local properties of the dataset.

ACKNOWLEDGMENT

The authors would like to thank the Brazilian Research Council (CNPq) and the PRH PB-17 program (supported by Petrobras) for the support to this work. Also, we would like to thank Sandro Fiorini for comments and ideas.

REFERENCES

[1] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer, 2015.
[2] C.-H. Chou, B.-H. Kuo, and F. Chang, "The generalized condensed nearest neighbor rule as a data reduction method," in Pattern Recognition, 2006 (ICPR 2006), 18th International Conference on, vol. 2. IEEE, 2006.
[3] H. Liu and H. Motoda, "On issues of instance selection," Data Mining and Knowledge Discovery, vol. 6, no. 2, 2002.
[4] W.-C. Lin, C.-F. Tsai, S.-W. Ke, C.-W. Hung, and W. Eberle, "Learning to detect representative data for large scale instance selection," Journal of Systems and Software, vol. 106, pp. 1-8, 2015.
[5] E. Leyva, A. González, and R. Pérez, "Three new instance selection methods based on local sets: A comparative study with several approaches from a bi-objective perspective," Pattern Recognition, vol. 48, no. 4, 2015.
[6] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man and Cybernetics, no. 3, 1972.
[7] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Machine Learning, vol. 38, no. 3, 2000.
[8] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining and Knowledge Discovery, vol. 6, no. 2, 2002.
[9] K. Nikolaidis, J. Y. Goulermas, and Q. Wu, "A class boundary preserving algorithm for data condensation," Pattern Recognition, vol. 44, no. 3, 2011.
[10] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, 1967.
[11] P. E. Hart, "The condensed nearest neighbor rule," IEEE Transactions on Information Theory, vol. 14, 1968.
[12] G. W. Gates, "Reduced nearest neighbor rule," IEEE Transactions on Information Theory, vol. 18, no. 3, 1972.
[13] I. Tomek, "An experiment with the edited nearest-neighbor rule," IEEE Transactions on Systems, Man, and Cybernetics, no. 6, 1976.
[14] Y. Caises, A. González, E. Leyva, and R. Pérez, "Combining instance selection methods based on data characterization: An approach to increase their effectiveness," Information Sciences, vol. 181, no. 20, 2011.
[15] J. Hamidzadeh, R. Monsefi, and H. S. Yazdi, "IRAHC: Instance reduction algorithm using hyperrectangle clustering," Pattern Recognition, vol. 48, no. 5, 2015.
[16] L. Bai, J. Liang, C. Dang, and F. Cao, "A cluster centers initialization method for clustering categorical data," Expert Systems with Applications, vol. 39, no. 9, 2012.
[17] J. L. Carbonera and M. Abel, "A cognition-inspired knowledge representation approach for knowledge-based interpretation systems," in Proceedings of the 17th ICEIS, 2015.
[18] J. L. Carbonera and M. Abel, "A cognitively inspired approach for knowledge representation and reasoning in knowledge-based systems," in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), 2015.
[19] J. L. Carbonera and M. Abel, "Extended ontologies: a cognitively inspired approach," in Proceedings of the 7th Ontology Research Seminar in Brazil (Ontobras).


Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

Data Preprocessing. Supervised Learning

Data Preprocessing. Supervised Learning Supervised Learning Regression Given the value of an input X, the output Y belongs to the set of real values R. The goal is to predict output accurately for a new input. The predictions or outputs y are

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

We use non-bold capital letters for all random variables in these notes, whether they are scalar-, vector-, matrix-, or whatever-valued.

We use non-bold capital letters for all random variables in these notes, whether they are scalar-, vector-, matrix-, or whatever-valued. The Bayes Classifier We have been starting to look at the supervised classification problem: we are given data (x i, y i ) for i = 1,..., n, where x i R d, and y i {1,..., K}. In this section, we suppose

More information

Multiple Classifier Fusion using k-nearest Localized Templates

Multiple Classifier Fusion using k-nearest Localized Templates Multiple Classifier Fusion using k-nearest Localized Templates Jun-Ki Min and Sung-Bae Cho Department of Computer Science, Yonsei University Biometrics Engineering Research Center 134 Shinchon-dong, Sudaemoon-ku,

More information

RECORD-TO-RECORD TRAVEL ALGORITHM FOR ATTRIBUTE REDUCTION IN ROUGH SET THEORY

RECORD-TO-RECORD TRAVEL ALGORITHM FOR ATTRIBUTE REDUCTION IN ROUGH SET THEORY RECORD-TO-RECORD TRAVEL ALGORITHM FOR ATTRIBUTE REDUCTION IN ROUGH SET THEORY MAJDI MAFARJA 1,2, SALWANI ABDULLAH 1 1 Data Mining and Optimization Research Group (DMO), Center for Artificial Intelligence

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Novel Initialisation and Updating Mechanisms in PSO for Feature Selection in Classification

Novel Initialisation and Updating Mechanisms in PSO for Feature Selection in Classification Novel Initialisation and Updating Mechanisms in PSO for Feature Selection in Classification Bing Xue, Mengjie Zhang, and Will N. Browne School of Engineering and Computer Science Victoria University of

More information

An Ensemble of Classifiers using Dynamic Method on Ambiguous Data

An Ensemble of Classifiers using Dynamic Method on Ambiguous Data An Ensemble of Classifiers using Dynamic Method on Ambiguous Data Dnyaneshwar Kudande D.Y. Patil College of Engineering, Pune, Maharashtra, India Abstract- The aim of proposed work is to analyze the Instance

More information

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data Journal of Computational Information Systems 11: 6 (2015) 2139 2146 Available at http://www.jofcis.com A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

More information

Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure

Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure Selene Hernández-Rodríguez, J. Fco. Martínez-Trinidad, and J. Ariel Carrasco-Ochoa Computer Science Department National

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, and Behrouz Minaei Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL SEARCH ALGORITHM

LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL SEARCH ALGORITHM International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 4, April 2013 pp. 1593 1601 LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

An Efficient Clustering Method for k-anonymization

An Efficient Clustering Method for k-anonymization An Efficient Clustering Method for -Anonymization Jun-Lin Lin Department of Information Management Yuan Ze University Chung-Li, Taiwan jun@saturn.yzu.edu.tw Meng-Cheng Wei Department of Information Management

More information