A density-based approach for instance selection

2015 IEEE 27th International Conference on Tools with Artificial Intelligence

Joel Luis Carbonera, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, jlcarbonera@inf.ufrgs.br
Mara Abel, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, marabel@inf.ufrgs.br

Abstract: Instance selection is an important preprocessing step that can be applied in many machine learning tasks. Due to the increasing size of datasets, instance selection techniques have been applied for reducing the data to a manageable volume, leading to a reduction of the computational resources that are necessary for performing the learning process. Besides that, instance selection algorithms can also be applied for removing useless, erroneous or noisy instances before applying learning algorithms, which can improve the accuracy in classification problems. In recent years, several approaches for instance selection have been proposed. However, most of them have long runtimes and, due to this, they cannot be used for dealing with large datasets. In this paper, we propose a simple and effective density-based approach for instance selection. Our approach, called LDIS (local density-based instance selection), evaluates the instances of each class separately and keeps only the densest instances in a given (arbitrary) neighborhood. This ensures a reasonably low time complexity. Our approach was evaluated on 15 well-known datasets and its performance was compared with the performance of 5 state-of-the-art algorithms, considering three measures: accuracy, reduction and effectiveness. For evaluating the accuracy achieved using the datasets produced by the algorithms, we applied the KNN algorithm. The results show that LDIS achieves a performance (in terms of balance of accuracy and reduction) that is better than or comparable to the performances of the other algorithms considered in the evaluation.

Keywords: Instance selection; instance reduction; dataset reduction; machine learning; data mining; instance-based learning

I. INTRODUCTION

According to [1], instance selection (IS) is a task that consists of choosing a subset of the total available data to achieve the original purpose of the data mining (or machine learning) application as if the whole data had been used. In general, it constitutes a family of methods that perform the selection of the best possible subset of examples from the original data, by using some rules and/or heuristics. Considering this, the optimal outcome of IS would be the minimum data subset that can accomplish the same task with no performance loss. Thus, in the optimal scenario, P(A_S) = P(A_T), where P is the performance, A is the machine learning algorithm, A_T represents the algorithm A applied over the complete dataset T, and A_S represents the algorithm A applied over a subset S of the dataset T. Thus, every instance selection strategy should face a trade-off between the reduction rate of the dataset and the classification quality [2]. In general, instance selection can be applied for reducing the data to a manageable subset, leading to a reduction of the computational resources (in terms of time and space) that are necessary for performing the learning process [3], [4], [5], [1]. Besides that, instance selection algorithms can also be applied for removing useless (redundant), erroneous or noisy instances, before applying learning algorithms.
In this case, the accuracy of the learned models can increase after applying an instance selection technique [3], [1]. In recent years, several approaches for instance selection have been proposed [6], [7], [8], [9], [4], [5]. Most of these algorithms are designed for preserving the boundaries between different classes in the dataset, because border instances provide relevant information for supporting discrimination between classes [5]. However, in general, these algorithms have a high time complexity, which is not a desirable property for algorithms that should deal with large datasets. The high time complexity of these algorithms is a consequence of the fact that they usually search for the border instances within the whole dataset and, due to this, they usually need to perform comparisons between each pair of instances in the dataset. In this paper, we propose an algorithm for instance selection, called LDIS (Local Density-based Instance Selection)¹, which takes a different approach. Instead of focusing on the border instances, which usually are searched across the whole dataset, our approach analyses the instances of each class separately and focuses on keeping only the densest instances in a given (arbitrary) neighborhood. That is, in a first step, our algorithm determines the set of instances that are classified by each class. In each resulting set, for each instance x, the algorithm calculates its local density (its density within its class) and its partial k-neighborhood (where k is determined by the user), which is the set of k nearest neighbors of x that have the same class label as x. Finally, for each instance, if its density is greater than or equal to the density of the densest instance within its partial k-neighborhood, the instance is preserved in the resulting dataset.

¹ The source code of the algorithm is available in br/ jlcarbonera/?page id=

Our approach was evaluated on 15 well-known datasets and its performance was compared with the performance of 5 important algorithms provided by the literature. The accuracy was evaluated considering the KNN algorithm [10]. The results show that, compared to the other algorithms, LDIS provides the best trade-off between accuracy and reduction, while presenting a reasonably low time complexity. Section II presents some related works. Section III presents the notation that will be used throughout the paper. Section IV presents our approach (LDIS). Section V discusses our experimental evaluation. Finally, Section VI presents our main conclusions and final remarks.

II. RELATED WORKS

In this section, we discuss some important instance reduction methods. For most of the algorithms presented here, we consider T as representing the original set of instances in the training set and S, where S ⊆ T, as the reduced set of instances resulting from the instance selection process.

The Condensed Nearest Neighbor (CNN), proposed in [11], randomly selects one instance that belongs to each class from T and puts them in S. In the next step, each instance in T is classified using only the instances in S. If an instance is misclassified, it is added to S, in order to ensure that it will be classified correctly. This process repeats until there is no instance in T that is misclassified. It is important to notice that CNN can assign noisy and outlier instances to S, causing negative effects on the classification accuracy. Also, CNN is dependent on the instance order in the training set T. The time complexity of CNN is O(|T|²), where |T| is the size of the training set.

The Reduced Nearest Neighbor algorithm (RNN) [12] first assigns all instances in T to S. Then it removes each instance from S, until further removal causes no other instances in T to be misclassified by the remaining instances in S. RNN produces subsets S that are smaller than the subsets produced by CNN, and is less sensitive to noise than CNN. However, the time complexity of RNN is O(|T|³), which is higher than the time complexity of CNN.

In [2], the authors propose an extension of CNN, called Generalized Condensed Nearest Neighbor (GCNN). This approach operates in a way that is similar to CNN. However, GCNN includes in S instances which satisfy an absorption criterion. Considering d_N(x) as the distance between x and its nearest neighbor, and d_E(x) as the distance between x and its nearest enemy (instance of a class that is different from the class of x), x is included in S if d_N(x) − d_E(x) > ρ, where ρ is an arbitrary threshold. GCNN can produce sets S that are smaller than the sets produced by CNN. However, determining the value of ρ can be a challenge.

In the Edited Nearest Neighbor (ENN) algorithm [6], all training instances are first assigned to S. Then, each instance in S is removed if it does not agree with the label of the majority of its k nearest neighbors. This algorithm removes noisy and outlier instances and, thus, it can improve classification accuracy. It keeps internal instances, in contrast to approaches that remove boundary instances; therefore, it cannot reduce the dataset as much as other reduction algorithms. The literature provides some extensions to this method, such as [13].
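To make the incremental condensation idea behind CNN concrete, the sketch below renders the rule described above in Python. It is only an illustration, not the authors' code: the helper names (condensed_nn, nn_label), the use of Euclidean distance over numeric features and the random per-class seeding are assumptions of the sketch.

    # Illustrative CNN-style condensation (hypothetical names, not from the paper).
    import random
    from typing import List, Tuple

    Instance = Tuple[List[float], str]  # (numeric feature vector, class label)

    def euclidean(a: List[float], b: List[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def nn_label(x: List[float], S: List[Instance]) -> str:
        # Label assigned by the 1-NN rule using only the instances already in S.
        return min(S, key=lambda p: euclidean(x, p[0]))[1]

    def condensed_nn(T: List[Instance]) -> List[Instance]:
        # Seed S with one randomly chosen instance per class.
        by_label = {}
        for vec, label in T:
            by_label.setdefault(label, []).append((vec, label))
        S = [random.choice(group) for group in by_label.values()]
        # Keep adding every instance that S still misclassifies, until none is left.
        changed = True
        while changed:
            changed = False
            for vec, label in T:
                if (vec, label) not in S and nn_label(vec, S) != label:
                    S.append((vec, label))
                    changed = True
        return S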
In [7], the authors propose 5 approaches, the Decremental Reduction Optimization Procedure (DROP) algorithms. In these algorithms, each instance x has k nearest neighbors, and the instances that have x as one of their k nearest neighbors are called the associates of x. Among the proposed algorithms, DROP3 has the best trade-off between the reduction of the dataset and the accuracy of the classification. As an initial step, it applies a noise filter algorithm such as ENN. Then it removes an instance x if its associates in the original training set can be correctly classified without x. The main drawback of DROP3 is its high time complexity.

The Iterative Case Filtering algorithm (ICF) [8] is based on the notions of coverage set and reachable set. The coverage set of an instance x is the set of instances in T whose distance from x is less than the distance between x and its nearest enemy (instance with a different class). The reachable set of an instance x is the set of instances in T that have x in their respective coverage sets. In this method, a given instance x is removed from S if |Reachable(x)| > |Coverage(x)|, that is, when the number of other instances that can classify x correctly is greater than the number of instances that x can correctly classify.

Recently, in [5], the authors propose three complementary methods for instance selection that are based on the notion of local sets. In this context, the local set of a given instance x is the set of instances contained in the largest hypersphere centered on x such that it does not contain instances from any other class. The first algorithm, the local set-based smoother (LSSm), was proposed for removing instances that are harmful, that is, instances that misclassify more instances than they correctly classify. It uses two notions for guiding the removal process: usefulness and harmfulness. The usefulness u(x) of a given instance x is the number of instances having x among the members of their local sets, and the harmfulness h(x) is the number of instances having x as their nearest enemy. For each instance x in T, the algorithm includes x in S if u(x) ≥ h(x). The time complexity of LSSm is O(|T|²). The second algorithm, the Local Set-based Centroids Selector method (LSCo), first applies LSSm for removing noise and then applies LS-clustering [14] for identifying clusters in T without invasive points (instances that are surrounded by instances of a different class). The algorithm keeps in S only the centroids of the resulting clusters.
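The usefulness/harmfulness rule of LSSm can also be sketched directly from the definitions above. The snippet below is a rough illustration under simplifying assumptions (numeric features, Euclidean distance, at least two classes); the function name lssm and the tie-breaking choices are not from the paper.

    # Rough sketch of the LSSm filtering rule (illustrative, not the authors' code).
    from typing import List, Tuple

    def euclidean(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def lssm(T: List[Tuple[List[float], str]]) -> List[Tuple[List[float], str]]:
        n = len(T)
        # For every instance, find the distance to (and index of) its nearest enemy.
        enemy_dist, enemy_idx = [0.0] * n, [0] * n
        for i, (xi, yi) in enumerate(T):
            enemy_dist[i], enemy_idx[i] = min(
                (euclidean(xi, xj), j) for j, (xj, yj) in enumerate(T) if yj != yi)
        usefulness = [0] * n   # u(x): how many local sets contain x
        harmfulness = [0] * n  # h(x): how many instances have x as their nearest enemy
        for i, (xi, yi) in enumerate(T):
            harmfulness[enemy_idx[i]] += 1
            for j in range(n):
                # j is in the local set of i if it lies strictly inside the largest
                # enemy-free hypersphere centred on i.
                if i != j and euclidean(xi, T[j][0]) < enemy_dist[i]:
                    usefulness[j] += 1
        # Keep x when u(x) >= h(x), i.e. when it helps at least as much as it harms.
        return [T[i] for i in range(n) if usefulness[i] >= harmfulness[i]]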

Finally, the Local Set Border Selector (LSBo) first applies LSSm for removing noise, and afterwards it computes the local set of every instance in T. Then, the instances in T are sorted in ascending order of the cardinality of their local sets. In the last step, LSBo verifies, for each instance x in T, whether any member of its local set is contained in S, thus ensuring the proper classification of x. If that is not the case, x is included in S to ensure its correct classification. The time complexity of the three approaches is O(|T|²). Among the three algorithms, LSBo provides the best balance between reduction and accuracy.

Other approaches can be found in surveys such as the ones provided in [15], [5], [1].

III. NOTATIONS

In this section, we introduce the following notation, which will be used throughout the paper:

T = {x_1, x_2, ..., x_n} is a non-empty set of n instances (or data objects). It represents the original dataset that should be reduced in the instance selection process. Each x_i ∈ T is an m-tuple, such that x_i = (x_i1, x_i2, ..., x_im), where x_ij represents the value of the j-th feature of the instance x_i, for 1 ≤ j ≤ m.

L = {l_1, l_2, ..., l_p} is the set of p class labels that are used for classifying the instances in T, where each l_i ∈ L represents a given class label.

l: T → L is a function that maps a given instance x_i ∈ T to its corresponding class label l_j ∈ L.

c: L → 2^T is a function that maps a given class label l_j ∈ L to a given set C, such that C ⊆ T, which represents the set of instances in T whose class is l_j. Notice that T = ∪_{l∈L} c(l). In this notation, 2^T represents the powerset of T, that is, the set of all subsets of T, including the empty set and T itself.

pkn: T × N≥1 → 2^T is a function that maps a given instance x_i ∈ T and a given k ∈ N≥1 (k ≥ 1) to a given set C, such that C ⊆ c(l(x_i)), which represents the set of the k nearest neighbors of x_i in c(l(x_i)) (excepting x_i itself). Since the resulting set C includes only the neighbors that have a given class label, it defines a partial k-neighborhood.

S = {x_1, x_2, ..., x_q} is a set of q instances, such that S ⊆ T. It represents the reduced set of instances that results from the instance selection process.

IV. A LOCAL DENSITY-BASED APPROACH FOR INSTANCE SELECTION

As discussed in Section II, most of the algorithms proposed for instance selection in the literature are designed for searching the class boundaries in the whole dataset. That is, they perform a global search in the dataset. Although, in general, keeping the border instances in the dataset results in a high accuracy in classification tasks, the global search for these border points involves a high computational cost. The resulting high time complexity is a consequence of the fact that, in general, identifying the border instances in the dataset requires comparisons between each pair of instances in the dataset. In this paper, we explore another strategy for selecting instances. Instead of identifying the border points, our approach identifies instances that have a high concentration of instances near them. Besides that, instead of searching for these instances in the whole dataset, our approach deals with the instances of each class of the dataset separately, searching for the representative instances within the set of instances of each class. Since our algorithm applies only a local search strategy (within each class), it has runtimes that are lower than the runtimes of approaches that adopt a global search strategy.
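Since the selection criterion presented next operates on c(l) and pkn(x, k), it may help to see these notions rendered directly in code. The sketch below uses illustrative names and a generic distance function d; it is a reading aid, not the authors' implementation.

    # The notation of Section III as plain Python (illustrative only).
    from typing import Callable, Dict, List, Tuple

    Instance = Tuple[Tuple[float, ...], str]   # (feature vector, class label)

    def c(T: List[Instance]) -> Dict[str, List[Instance]]:
        # c maps each class label to the instances of T carrying that label.
        groups: Dict[str, List[Instance]] = {}
        for inst in T:
            groups.setdefault(inst[1], []).append(inst)
        return groups

    def pkn(x: Instance, k: int, same_class: List[Instance],
            d: Callable[[Tuple[float, ...], Tuple[float, ...]], float]) -> List[Instance]:
        # Partial k-neighborhood: the k nearest neighbors of x drawn only from
        # its own class, excluding x itself (k is capped when the class is small).
        others = [y for y in same_class if y is not x]
        return sorted(others, key=lambda y: d(x[0], y[0]))[:min(k, len(others))]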
For identifying the representative instances, in our approach, we adopt the notion of density, adapted from [16], [17], [18], [19], which is formalized by the Dens function:

Dens(x, P) = −(1/|P|) Σ_{y∈P} d(x, y)   (1)

where x is a given instance, P = {x_1, x_2, ..., x_q} is a set of q instances, x ∉ P, and d is a given distance function. Notice that Dens(x, P) provides the density of the instance x relative to the set P of instances. In this way, when P is a subset of the whole dataset, Dens(x, P) represents the local density of x, considering the set P.

Our approach assumes that the local density Dens(x, c) of a given instance x in a set c, where c(l(x)) = c, is proportional to the usefulness of x for classifying new instances of c. Thus, the locally densest instance of a given neighborhood would represent more information about its surroundings than the less dense instances and, due to this, it would be more representative of the neighborhood than its less dense neighbors. Considering this, for each l ∈ L, the LDIS algorithm verifies, for each x ∈ c(l), whether there is some instance y ∈ pkn(x, k) (where k is arbitrarily chosen by the user) such that Dens(y, c(l)) > Dens(x, c(l)). If this is not the case, x is the locally densest instance in its partial k-neighborhood and, due to this, x is included in S. Algorithm 1 formalizes this strategy. Notice that when |c(l(x))| ≤ k, it is necessary to consider k = |c(l(x))| − 1 for calculating the partial k-neighborhood of x.

Algorithm 1: LDIS (Local Density-based Instance Selection)
    Input: a set of instances T and the number k of neighbors.
    Output: a subset S of T.
    begin
        S ← ∅;
        foreach l ∈ L do
            foreach x ∈ c(l) do
                foundDenser ← false;
                foreach neighbor ∈ pkn(x, k) do
                    if Dens(x, c(l)) < Dens(neighbor, c(l)) then
                        foundDenser ← true;
                if not foundDenser then
                    S ← S ∪ {x};
        return S;

Notice that the most expensive steps of the algorithm involve determining the partial density and the partial k-neighborhood of each instance of a given set c(l) (for some class label l). Determining the partial density of every instance of a given set c(l) is a process whose time complexity is proportional to O(|c(l)|²); the time complexity of determining the partial k-neighborhoods is equivalent. An efficient implementation of Algorithm 1 could calculate the partial k-neighborhood and the partial density of each instance of a given set c(l) (for some class label l) just once, as a first step within the main loop, and use this information for further calculations. Considering this, the time complexity of LDIS is proportional to O(Σ_{l∈L} |c(l)|²).
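For illustration, the selection strategy of Algorithm 1 can be sketched as follows. The snippet assumes numeric feature vectors and Euclidean distance, and implements density as the negative mean within-class distance of Equation 1, so that larger values correspond to denser instances; names such as ldis are illustrative, and this is not the authors' released implementation.

    # Minimal sketch of LDIS (illustrative, not the authors' code).
    from typing import Dict, List, Tuple

    def euclidean(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def ldis(T: List[Tuple[Tuple[float, ...], str]], k: int):
        # Group the dataset by class label: all further work is local to each class.
        classes: Dict[str, List[Tuple[float, ...]]] = {}
        for vec, label in T:
            classes.setdefault(label, []).append(vec)
        S = []
        for label, members in classes.items():
            m = len(members)
            # Pairwise distances, densities and neighborhoods are computed once per class.
            dist = [[euclidean(a, b) for b in members] for a in members]
            dens = [-sum(row) / max(m - 1, 1) for row in dist]   # Eq. 1: negative mean distance to the rest of the class
            kk = min(k, m - 1)                                   # cap k when the class is small
            for i in range(m):
                # Partial k-neighborhood of members[i]: its kk nearest same-class instances.
                neighbors = sorted((j for j in range(m) if j != i),
                                   key=lambda j: dist[i][j])[:kk]
                # Keep the instance only if no neighbor is denser than it.
                if all(dens[j] <= dens[i] for j in neighbors):
                    S.append((members[i], label))
        return S

Because every distance is computed inside a single class, the cost grows with the squared class sizes rather than with the squared dataset size: as a purely illustrative figure, a dataset of 10,000 instances split evenly into 10 classes requires about 10 × 1,000² = 10⁷ within-class distance computations, against roughly 10,000² = 10⁸ for a global pairwise search.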

V. EXPERIMENTS

For evaluating our approach, we compared the LDIS algorithm, presented in Section IV, with 5 important instance selection algorithms provided by the literature: DROP3, ENN, ICF, LSBo and LSSm. We considered 15 well-known datasets: breast cancer, cars, E. Coli, glass, iris, letter, lung cancer, mushroom, genetic promoters, segment, soybean², splice-junction gene sequences, congressional voting records, wine and zoo. All datasets were obtained from the UCI Machine Learning Repository. In Table I, we present the details of the datasets that were used.

Table I. Details of the datasets used in the evaluation process (number of instances, attributes and classes for each of the 15 datasets listed above).

² This dataset combines the large soybean dataset and its corresponding test dataset.

Our experimentation was designed for comparing three evaluation measures: accuracy, reduction and effectiveness. Following [5], we assume

accuracy = Success(Test) / |Test|   (2)

reduction = (|T| − |S|) / |T|   (3)

where Test is a given set of instances that are selected for being tested in a classification task, and Success(Test) is the number of instances in Test correctly classified in the classification task. Besides that, in this work we consider effectiveness as a measure of the degree to which an instance selection algorithm is successful in producing a small set of instances that allows a high classification accuracy of new instances. Thus, we consider effectiveness = accuracy × reduction.

For evaluating the accuracy of the classification of new instances, we applied the k-nearest neighbors (KNN) algorithm [10], considering k = 3, as considered in [5]. Besides that, the accuracy and reduction were evaluated in an n-fold cross-validation scheme, where n = 10. In this scheme, the dataset is first randomly partitioned into 10 equal-sized subsamples. From these subsamples, a single subsample is retained as validation data (Test), and the union of the remaining 9 subsamples is considered the initial training set (ITS). After, an instance selection algorithm is applied for reducing the ITS, producing the reduced training set (RTS). At this point, we can measure the reduction of the dataset. Finally, the RTS is used as the training set for the KNN algorithm, for evaluating the instances in Test. At this point, we can measure the accuracy achieved by KNN, using RTS as the training set. The cross-validation is repeated 10 times, with each subsample used once as Test. The 10 values of accuracy and reduction are averaged to produce, respectively, the average accuracy (AA) and average reduction (AR). The average effectiveness (AE) is calculated by considering AA and AR. Tables II, III and IV show, respectively, the resulting AA, AR and AE of each combination of dataset and instance selection algorithm. In these tables, the best results for each dataset are marked in bold typeface.
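The evaluation protocol above can be summarized by the following sketch. It assumes NumPy arrays X and y, a selector function select(X, y) returning the indices of the instances to keep (any of the algorithms above could be wrapped this way), and scikit-learn's KFold and KNeighborsClassifier; these library choices are assumptions made for illustration, not tools named in the paper.

    # Sketch of the 10-fold evaluation protocol (AA, AR and average effectiveness).
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate(X, y, select, n_splits=10, knn_k=3):
        accs, reds = [], []
        for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
            # Reduce the initial training set (ITS) to the reduced training set (RTS).
            X_its, y_its = X[train_idx], y[train_idx]
            keep = select(X_its, y_its)
            X_rts, y_rts = X_its[keep], y_its[keep]
            reds.append(1.0 - len(keep) / len(train_idx))     # reduction = (|T| - |S|) / |T|
            # Classify the held-out fold with KNN (k = 3) trained on the RTS.
            clf = KNeighborsClassifier(n_neighbors=knn_k).fit(X_rts, y_rts)
            accs.append(clf.score(X[test_idx], y[test_idx]))  # accuracy = Success(Test) / |Test|
        aa, ar = float(np.mean(accs)), float(np.mean(reds))
        return aa, ar, aa * ar                                # AA, AR, average effectiveness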
In this evaluation process, we adopted k = 3 for DROP3, ENN, ICF and LDIS. Besides that, we adopted the following distance function d: T × T → R:

d(x, y) = Σ_{j=1}^{m} θ_j(x, y)   (4)

where

θ_j(x, y) = α(x_j, y_j), if j is a categorical feature; |x_j − y_j|, if j is a numerical feature   (5)

α(x_j, y_j) = 1, if x_j ≠ y_j; 0, if x_j = y_j   (6)
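For reference, Equations 4 to 6 correspond to the following straightforward implementation. The categorical argument, which marks the positions of categorical features, is an implementation convenience assumed for the sketch.

    # Heterogeneous distance of Equations 4-6 (sketch).
    def mixed_distance(x, y, categorical):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(x, y)):
            if j in categorical:
                total += 0.0 if xj == yj else 1.0   # alpha(x_j, y_j), Equation 6
            else:
                total += abs(xj - yj)               # |x_j - y_j| for numerical features
        return total

    # Example: mixed_distance((1.0, "red"), (3.5, "blue"), categorical={1}) == 3.5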

Table II. Comparison of the accuracy achieved by the training set produced by each algorithm, for each dataset.

    Dataset          DROP3   ENN    ICF    LSBo   LSSm   LDIS (k=3)   Average
    Breast cancer    0.73    0.74   0.72   0.61   0.74   0.69         0.70
    Cars             0.75    0.76   0.76   0.65   0.76   0.76         0.74
    E. Coli          0.83    0.86   0.81   0.80   0.85   0.86         0.83
    Glass            0.62    0.65   0.62   0.56   0.71   0.59         0.63
    Iris             0.97    0.97   0.94   0.94   0.96   0.95         0.95
    Letter           0.87    0.92   0.80   0.75   0.92   0.77         0.84
    Lung cancer      0.31    0.34   0.41   0.32   0.45   0.38         0.37
    Mushroom         1.00    1.00   0.98   1.00   1.00   1.00         1.00
    Promoters        0.79    0.79   0.72   0.72   0.82   0.75         0.77
    Segment          0.93    0.95   0.91   0.87   0.95   0.91         0.92
    Soybean          0.85    0.90   0.84   0.63   0.91   0.78         0.82
    Splice           0.71    0.73   0.70   0.75   0.76   0.75         0.73
    Voting           0.92    0.92   0.91   0.89   0.92   0.91         0.91
    Wine             0.70    0.74   0.73   0.78   0.76   0.72         0.74
    Zoo              0.91    0.88   0.88   0.71   0.90   0.82         0.85
    Average          0.79    0.81   0.78   0.73   0.83   0.78         0.79

Table III. Comparison of the reduction achieved by each algorithm, for each dataset.

    Dataset          DROP3   ENN    ICF    LSBo   LSSm   LDIS (k=3)   Average
    Breast cancer    0.77    0.29   0.85   0.73   0.14   0.87         0.61
    Cars             0.88    0.18   0.82   0.74   0.12   0.86         0.60
    E. Coli          0.71    0.16   0.86   0.82   0.09   0.90         0.59
    Glass            0.76    0.32   0.69   0.72   0.14   0.91         0.59
    Iris             0.72    0.04   0.58   0.93   0.06   0.89         0.54
    Letter           0.68    0.05   0.80   0.83   0.04   0.83         0.54
    Lung cancer      0.70    0.59   0.74   0.52   0.17   0.86         0.60
    Mushroom         0.86    0.01   0.94   0.99   0.01   0.87         0.61
    Promoters        0.60    0.18   0.70   0.60   0.05   0.84         0.49
    Segment          0.69    0.04   0.83   0.91   0.04   0.83         0.55
    Soybean          0.68    0.09   0.58   0.83   0.05   0.78         0.50
    Splice           0.66    0.23   0.76   0.59   0.05   0.81         0.52
    Voting           0.79    0.07   0.92   0.89   0.04   0.77         0.58
    Wine             0.71    0.23   0.78   0.78   0.10   0.87         0.58
    Zoo              0.65    0.06   0.33   0.88   0.06   0.63         0.43
    Average          0.72    0.17   0.75   0.78   0.08   0.83         0.56

Table II shows that LSSm achieves the highest accuracy in most of the datasets. However, LSSm was designed for removing noisy instances and, due to this, it does not provide high reduction rates. In addition, notice that the accuracies achieved by LDIS are higher than the average for several datasets, and for 3 datasets LDIS achieves the highest accuracy. Besides that, the difference between the accuracy of LDIS and LSSm is not large, and can be compensated by the gain in reduction and in runtime provided by LDIS.

Table III shows that LDIS achieves the highest reduction in most of the datasets, and achieves the highest average reduction rate. Finally, Table IV shows that LDIS achieves the highest effectiveness in most of the datasets, and achieves the highest average effectiveness. Thus, these results show that, although LDIS does not provide the highest accuracies, it provides the highest reduction rates and the best trade-off between both measures (represented by the effectiveness).

We also carried out a comparison of the runtimes of the instance selection algorithms considered in our experiments. In this comparison, we applied the 5 instance selection algorithms to reduce the 3 largest datasets considered in our experiments: letter (Figure 1), splice-junction gene sequences (Figure 2) and mushroom (Figure 3). For conducting the experiments, we used an Intel Core i5-3210M laptop with a 2.5 GHz CPU and 6 GB of RAM. Figures 1, 2 and 3 show that, considering these three datasets, the LDIS algorithm has the lowest runtime compared to the other algorithms. This result is a consequence of the fact that LDIS deals with the set of instances of each class of the dataset separately, instead of performing a global search in the whole dataset.

Figure 1. Comparison of the runtimes of 5 instance selection algorithms, considering the Letter dataset.
Table IV. Comparison of the effectiveness achieved by each algorithm, for each dataset.

    Dataset          DROP3   ENN    ICF    LSBo   LSSm   LDIS (k=3)   Average
    Breast cancer    0.56    0.21   0.61   0.45   0.10   0.60         0.42
    Cars             0.66    0.14   0.62   0.49   0.09   0.65         0.44
    E. Coli          0.59    0.14   0.70   0.66   0.08   0.77         0.49
    Glass            0.47    0.21   0.42   0.40   0.10   0.54         0.36
    Iris             0.69    0.04   0.55   0.87   0.06   0.85         0.51
    Letter           0.59    0.05   0.65   0.62   0.03   0.64         0.43
    Lung cancer      0.22    0.20   0.30   0.17   0.08   0.32         0.21
    Mushroom         0.86    0.01   0.92   0.99   0.01   0.87         0.61
    Promoters        0.47    0.14   0.51   0.43   0.04   0.63         0.37
    Segment          0.64    0.04   0.75   0.79   0.04   0.75         0.50
    Soybean          0.58    0.08   0.49   0.52   0.04   0.61         0.39
    Splice           0.47    0.17   0.53   0.45   0.04   0.61         0.38
    Voting           0.72    0.07   0.84   0.79   0.04   0.71         0.53
    Wine             0.50    0.17   0.57   0.61   0.07   0.63         0.42
    Zoo              0.59    0.05   0.29   0.62   0.05   0.52         0.35
    Average          0.57    0.11   0.58   0.59   0.06   0.65         0.43

Figure 2. Comparison of the runtimes of 5 instance selection algorithms, considering the dataset of splice-junction gene sequences.

Figure 3. Comparison of the runtimes of 5 instance selection algorithms, considering the Mushroom dataset.

Finally, we also carried out experiments for evaluating the impact of the parameter k on the performance of LDIS. Tables V, VI and VII show, respectively, the accuracy, the reduction and the effectiveness of LDIS, with k assuming the values 1, 2, 3, 5, 10 and 20. These results show that the variation of k has a significant impact on the performance of the algorithm. The results suggest that, in general, as the value of k increases, the accuracy tends to decrease, the reduction increases, and the effectiveness increases up to a point from which it begins to decrease. This suggests the possibility of investigating strategies for automatically estimating the best value of k for determining the partial k-neighborhood of a given instance. Also, Table V shows some exceptions to the general rule, as in the case of the datasets cars, E. Coli and iris. We hypothesize that, in these cases, a higher value of k led to an additional removal of noisy instances, resulting in an increase of the accuracy. This hypothesis should be investigated in the future.

Table V. Comparison of the accuracy achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

Table VI. Comparison of the reduction achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

Table VII. Comparison of the effectiveness achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

VI. CONCLUSION

Most of the instance selection algorithms available in the literature usually search for the border instances in the whole dataset. Due to this, in general, these algorithms have a high time complexity. In this paper, we proposed an algorithm, called LDIS (Local Density-based Instance Selection), which adopts a different strategy. It analyzes the instances of each class separately, with the goal of keeping only the densest instances of a given neighborhood within each class. Since LDIS searches for the representative instances within each class, its resulting time complexity is reasonably low, compared to the main algorithms proposed in the literature. In an overview, our experiments showed that LDIS provides the best reduction rates and the best balance between accuracy and reduction, with a lower time complexity, compared with the other algorithms available in the literature.

In future works, we plan to investigate strategies for automatically estimating the best value of the parameter k for each problem. Regarding this point, it is reasonable to hypothesize that the best value of k can be different for different classes within the same dataset. This hypothesis should be investigated as well. We also plan to develop a version of LDIS that abstracts the information of the neighbor instances within the partial k-neighborhood of a dense instance, instead of just eliminating the instances that are less dense.

Besides that, we also plan to investigate how the LDIS algorithm can be combined with other instance selection algorithms. Finally, the performance of LDIS encourages the investigation of novel instance selection strategies that are based on other local properties of the dataset.

ACKNOWLEDGMENT

The authors would like to thank the Brazilian Research Council (CNPq) and the PRH PB-17 program (supported by Petrobras) for the support to this work. Also, we would like to thank Sandro Fiorini for comments and ideas.

REFERENCES

[1] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer, 2015.
[2] C.-H. Chou, B.-H. Kuo, and F. Chang, "The generalized condensed nearest neighbor rule as a data reduction method," in Pattern Recognition, 2006 (ICPR 2006), 18th International Conference on, vol. 2. IEEE, 2006.
[3] H. Liu and H. Motoda, "On issues of instance selection," Data Mining and Knowledge Discovery, vol. 6, no. 2, 2002.
[4] W.-C. Lin, C.-F. Tsai, S.-W. Ke, C.-W. Hung, and W. Eberle, "Learning to detect representative data for large scale instance selection," Journal of Systems and Software, vol. 106, pp. 1-8, 2015.
[5] E. Leyva, A. González, and R. Pérez, "Three new instance selection methods based on local sets: A comparative study with several approaches from a bi-objective perspective," Pattern Recognition, vol. 48, no. 4, 2015.
[6] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man and Cybernetics, no. 3, 1972.
[7] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Machine Learning, vol. 38, no. 3, 2000.
[8] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining and Knowledge Discovery, vol. 6, no. 2, 2002.
[9] K. Nikolaidis, J. Y. Goulermas, and Q. Wu, "A class boundary preserving algorithm for data condensation," Pattern Recognition, vol. 44, no. 3, 2011.
[10] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, 1967.
[11] P. E. Hart, "The condensed nearest neighbor rule," IEEE Transactions on Information Theory, vol. 14, 1968.
[12] G. W. Gates, "Reduced nearest neighbor rule," IEEE Transactions on Information Theory, vol. 18, no. 3, 1972.
[13] I. Tomek, "An experiment with the edited nearest-neighbor rule," IEEE Transactions on Systems, Man, and Cybernetics, no. 6, 1976.
[14] Y. Caises, A. González, E. Leyva, and R. Pérez, "Combining instance selection methods based on data characterization: An approach to increase their effectiveness," Information Sciences, vol. 181, no. 20, 2011.
[15] J. Hamidzadeh, R. Monsefi, and H. S. Yazdi, "IRAHC: Instance reduction algorithm using hyperrectangle clustering," Pattern Recognition, vol. 48, no. 5, 2015.
[16] L. Bai, J. Liang, C. Dang, and F. Cao, "A cluster centers initialization method for clustering categorical data," Expert Systems with Applications, vol. 39, no. 9, 2012.
[17] J. L. Carbonera and M. Abel, "A cognition-inspired knowledge representation approach for knowledge-based interpretation systems," in Proceedings of the 17th ICEIS, 2015.
[18] J. L. Carbonera and M. Abel, "A cognitively inspired approach for knowledge representation and reasoning in knowledge-based systems," in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), 2015.
[19] J. L. Carbonera and M. Abel, "Extended ontologies: a cognitively inspired approach," in Proceedings of the 7th Ontology Research Seminar in Brazil (Ontobras).


Data Mining and Data Warehousing Classification-Lazy Learners Motivation Data Mining and Data Warehousing Classification-Lazy Learners Lazy Learners are the most intuitive type of learners and are used in many practical scenarios. The reason of their popularity is

More information

Data Preprocessing. Supervised Learning

Data Preprocessing. Supervised Learning Supervised Learning Regression Given the value of an input X, the output Y belongs to the set of real values R. The goal is to predict output accurately for a new input. The predictions or outputs y are

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

We use non-bold capital letters for all random variables in these notes, whether they are scalar-, vector-, matrix-, or whatever-valued.

We use non-bold capital letters for all random variables in these notes, whether they are scalar-, vector-, matrix-, or whatever-valued. The Bayes Classifier We have been starting to look at the supervised classification problem: we are given data (x i, y i ) for i = 1,..., n, where x i R d, and y i {1,..., K}. In this section, we suppose

More information

Multiple Classifier Fusion using k-nearest Localized Templates

Multiple Classifier Fusion using k-nearest Localized Templates Multiple Classifier Fusion using k-nearest Localized Templates Jun-Ki Min and Sung-Bae Cho Department of Computer Science, Yonsei University Biometrics Engineering Research Center 134 Shinchon-dong, Sudaemoon-ku,

More information

RECORD-TO-RECORD TRAVEL ALGORITHM FOR ATTRIBUTE REDUCTION IN ROUGH SET THEORY

RECORD-TO-RECORD TRAVEL ALGORITHM FOR ATTRIBUTE REDUCTION IN ROUGH SET THEORY RECORD-TO-RECORD TRAVEL ALGORITHM FOR ATTRIBUTE REDUCTION IN ROUGH SET THEORY MAJDI MAFARJA 1,2, SALWANI ABDULLAH 1 1 Data Mining and Optimization Research Group (DMO), Center for Artificial Intelligence

More information

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the

More information

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest Bhakti V. Gavali 1, Prof. Vivekanand Reddy 2 1 Department of Computer Science and Engineering, Visvesvaraya Technological

More information

Novel Initialisation and Updating Mechanisms in PSO for Feature Selection in Classification

Novel Initialisation and Updating Mechanisms in PSO for Feature Selection in Classification Novel Initialisation and Updating Mechanisms in PSO for Feature Selection in Classification Bing Xue, Mengjie Zhang, and Will N. Browne School of Engineering and Computer Science Victoria University of

More information

An Ensemble of Classifiers using Dynamic Method on Ambiguous Data

An Ensemble of Classifiers using Dynamic Method on Ambiguous Data An Ensemble of Classifiers using Dynamic Method on Ambiguous Data Dnyaneshwar Kudande D.Y. Patil College of Engineering, Pune, Maharashtra, India Abstract- The aim of proposed work is to analyze the Instance

More information

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data Journal of Computational Information Systems 11: 6 (2015) 2139 2146 Available at http://www.jofcis.com A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

More information

Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure

Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure Fast k Most Similar Neighbor Classifier for Mixed Data Based on a Tree Structure Selene Hernández-Rodríguez, J. Fco. Martínez-Trinidad, and J. Ariel Carrasco-Ochoa Computer Science Department National

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, and Behrouz Minaei Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University

Cse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before

More information

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique

Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique Research Paper Uncertain Data Classification Using Decision Tree Classification Tool With Probability Density Function Modeling Technique C. Sudarsana Reddy 1 S. Aquter Babu 2 Dr. V. Vasu 3 Department

More information

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm

An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm Proceedings of the National Conference on Recent Trends in Mathematical Computing NCRTMC 13 427 An Effective Performance of Feature Selection with Classification of Data Mining Using SVM Algorithm A.Veeraswamy

More information

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis CHAPTER 3 BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS 3.1 Introduction

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL SEARCH ALGORITHM

LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL SEARCH ALGORITHM International Journal of Innovative Computing, Information and Control ICIC International c 2013 ISSN 1349-4198 Volume 9, Number 4, April 2013 pp. 1593 1601 LEARNING WEIGHTS OF FUZZY RULES BY USING GRAVITATIONAL

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and

More information

An Efficient Clustering Method for k-anonymization

An Efficient Clustering Method for k-anonymization An Efficient Clustering Method for -Anonymization Jun-Lin Lin Department of Information Management Yuan Ze University Chung-Li, Taiwan jun@saturn.yzu.edu.tw Meng-Cheng Wei Department of Information Management

More information