A density-based approach for instance selection
2015 IEEE 27th International Conference on Tools with Artificial Intelligence

Joel Luis Carbonera, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil. jlcarbonera@inf.ufrgs.br
Mara Abel, Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil. marabel@inf.ufrgs.br

Abstract: Instance selection is an important preprocessing step that can be applied in many machine learning tasks. Due to the increasing size of datasets, instance selection techniques have been applied for reducing the data to a manageable volume, leading to a reduction of the computational resources that are necessary for performing the learning process. Besides that, instance selection algorithms can also be applied for removing useless, erroneous or noisy instances before applying learning algorithms; this step can improve the accuracy in classification problems. In recent years, several approaches for instance selection have been proposed. However, most of them have long runtimes and, due to this, they cannot be used for dealing with large datasets. In this paper, we propose a simple and effective density-based approach for instance selection. Our approach, called LDIS (local density-based instance selection), evaluates the instances of each class separately and keeps only the densest instances in a given (arbitrary) neighborhood. This ensures a reasonably low time complexity. Our approach was evaluated on 15 well-known datasets and its performance was compared with the performance of 5 state-of-the-art algorithms, considering three measures: accuracy, reduction and effectiveness. For evaluating the accuracy achieved using the datasets produced by the algorithms, we applied the KNN algorithm.
The results show that LDIS achieves a performance (in terms of the balance between accuracy and reduction) that is better than or comparable to the performances of the other algorithms considered in the evaluation.

Keywords: Instance selection; instance reduction; dataset reduction; machine learning; data mining; instance-based learning

I. INTRODUCTION

According to [1], instance selection (IS) is a task that consists of choosing a subset of the total available data to achieve the original purpose of the data mining (or machine learning) application as if the whole data had been used. In general, it constitutes a family of methods that perform the selection of the best possible subset of examples from the original data, by using some rules and/or heuristics. Considering this, the optimal outcome of IS would be the minimum data subset that can accomplish the same task with no performance loss. Thus, in the optimal scenario, P(A_S) = P(A_T), where P is the performance, A is the machine learning algorithm, A_T represents the algorithm A applied over the complete dataset T, and A_S represents the algorithm A applied over a subset S of the dataset T. Thus, every instance selection strategy faces a trade-off between the reduction rate of the dataset and the classification quality [2]. In general, instance selection can be applied for reducing the data to a manageable subset, leading to a reduction of the computational resources (in terms of time and space) that are necessary for performing the learning process [3], [4], [5], [1]. Besides that, instance selection algorithms can also be applied for removing useless (redundant), erroneous or noisy instances before applying learning algorithms. In this case, the accuracy of the learned models can increase after applying an instance selection technique [3], [1]. In recent years, several approaches for instance selection have been proposed [6], [7], [8], [9], [4], [5].
Most of these algorithms are designed to preserve the boundaries between different classes in the dataset, because border instances provide relevant information for supporting discrimination between classes [5]. However, in general, these algorithms have a high time complexity, which is not a desirable property for algorithms that should deal with large datasets. The high time complexity of these algorithms is a consequence of the fact that they usually search for the border instances within the whole dataset and, due to this, they usually need to perform comparisons between each pair of instances in the dataset. In this paper, we propose an algorithm for instance selection, called LDIS (Local Density-based Instance Selection), which takes a different approach. (The source code of the algorithm is available at: br/ jlcarbonera/?page id= ) Instead of focusing on the border instances, which usually are searched for across the whole dataset, our approach analyses the instances of each class separately and focuses on keeping only the densest instances in a given (arbitrary) neighborhood. That is, in a first step, our algorithm determines the set of instances that are classified by each class. In each resulting set, for each instance x, the algorithm calculates its local density (its density within its class) and its partial k-neighborhood (where k is determined by the user), which is the set of k nearest neighbors of x that have the same class label as x. Finally, for each instance, if its density is
greater than or equal to the density of the densest instance within its partial k-neighborhood, the instance is preserved in the resulting dataset. Our approach was evaluated on 15 well-known datasets and its performance was compared with the performance of 5 important algorithms provided by the literature. The accuracy was evaluated considering the KNN algorithm [10]. The results show that, compared to the other algorithms, LDIS provides the best trade-off between accuracy and reduction, while presenting a reasonably low time complexity. Section II presents some related works. Section III presents the notation that will be used throughout the paper. Section IV presents our approach (LDIS). Section V discusses our experimental evaluation. Finally, Section VI presents our main conclusions and final remarks.

II. RELATED WORKS

In this section, we discuss some important instance reduction methods. For most of the algorithms presented here, we consider T as representing the original set of instances in the training set and S, where S ⊆ T, as the reduced set of instances resulting from the instance selection process. The Condensed Nearest Neighbor (CNN), proposed in [11], randomly selects one instance that belongs to each class from T and puts them in S. In the next step, each instance in T is classified using only the instances in S. If an instance is misclassified, it is added to S, in order to ensure that it will be classified correctly. This process repeats until there is no instance in T that is misclassified. It is important to notice that CNN can assign noisy and outlier instances to S, causing negative effects on the classification accuracy. Also, CNN is dependent on the instance order in the training set T. The time complexity of CNN is O(|T|²), where |T| is the size of the training set. The Reduced Nearest Neighbor algorithm (RNN) [12] first assigns all instances in T to S.
Then it removes each instance from S, until further removal causes no other instances in T to be misclassified by the remaining instances in S. RNN produces subsets S that are smaller than the subsets produced by CNN, and it is less sensitive to noise than CNN. However, the time complexity of RNN is O(|T|³), which is higher than the time complexity of CNN. In [2], the authors propose an extension of CNN, called Generalized Condensed Nearest Neighbor (GCNN). This approach operates in a way that is similar to CNN. However, GCNN includes in S the instances that satisfy an absorption criterion. Considering d_N(x) as the distance between x and its nearest neighbor, and d_E(x) as the distance between x and its nearest enemy (an instance of a class that is different from the class of x), x is included in S if d_N(x) − d_E(x) > ρ, where ρ is an arbitrary threshold. GCNN can produce sets S that are smaller than the sets produced by CNN. However, determining the value of ρ can be a challenge. In the Edited Nearest Neighbor (ENN) algorithm [6], all training instances are first assigned to S. Then, each instance in S is removed if it does not agree with the label of the majority of its k nearest neighbors. This algorithm removes noisy and outlier instances, and thus it can improve classification accuracy. Since it keeps internal instances instead of removing boundary instances, it cannot reduce the dataset as much as other reduction algorithms. The literature provides some extensions to this method, such as [13]. In [7], the authors propose 5 approaches, the Decremental Reduction Optimization Procedure (DROP) algorithms. In these algorithms, each instance x has k nearest neighbors, and the instances that have x as one of their k nearest neighbors are called the associates of x. Among the proposed algorithms, DROP3 has the best trade-off between the reduction of the dataset and the accuracy of the classification.
As an initial step, it applies a noise filter algorithm such as ENN. Then it removes an instance x if its associates in the original training set can be correctly classified without x. The main drawback of DROP3 is its high time complexity. The Iterative Case Filtering algorithm (ICF) [8] is based on the notions of coverage set and reachable set. The coverage set of an instance x is the set of instances in T whose distance from x is less than the distance between x and its nearest enemy (an instance with a different class). The reachable set of an instance x is the set of instances in T that have x in their respective coverage sets. In this method, a given instance x is removed from S if |Reachable(x)| > |Coverage(x)|, that is, when the number of other instances that can classify x correctly is greater than the number of instances that x can correctly classify. Recently, in [5], the authors proposed three complementary methods for instance selection that are based on the notion of local sets. In this context, the local set of a given instance x is the set of instances contained in the largest hypersphere centered on x such that it does not contain instances from any other class. The first algorithm, the local set-based smoother (LSSm), was proposed for removing instances that are harmful, that is, instances that misclassify more instances than they correctly classify. It uses two notions for guiding the removal process: usefulness and harmfulness. The usefulness u(x) of a given instance x is the number of instances having x among the members of their local sets, and the harmfulness h(x) is the number of instances having x as their nearest enemy. For each instance x in T, the algorithm includes x in S if u(x) ≥ h(x). The time complexity of LSSm is O(|T|²).
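To make the LSSm selection rule concrete, the following sketch computes usefulness and harmfulness for a small numerical dataset. It is an illustrative sketch, not the authors' implementation: it assumes Euclidean distance and, as a simplification, excludes each instance from its own local set.

```python
import numpy as np

def lssm_select(X, y):
    """Sketch of LSSm: keep instance x when usefulness(x) >= harmfulness(x).

    usefulness(x): number of instances whose local set contains x;
    harmfulness(x): number of instances whose nearest enemy is x.
    """
    n = len(X)
    # Pairwise Euclidean distances between all instances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Nearest enemy of each instance (closest instance of another class).
    nearest_enemy = np.empty(n, dtype=int)
    for i in range(n):
        enemies = np.where(y != y[i])[0]
        nearest_enemy[i] = enemies[np.argmin(dist[i, enemies])]
    usefulness = np.zeros(n, dtype=int)
    harmfulness = np.zeros(n, dtype=int)
    for i in range(n):
        radius = dist[i, nearest_enemy[i]]
        # Local set of i: instances strictly closer than i's nearest enemy
        # (by construction these all share i's class).
        for j in range(n):
            if j != i and dist[i, j] < radius and y[j] == y[i]:
                usefulness[j] += 1
        harmfulness[nearest_enemy[i]] += 1
    return np.where(usefulness >= harmfulness)[0]
```

Note how, on two small separated clusters, the rule tends to discard the instances that act as nearest enemies of the opposite class (the border points), which matches LSSm's smoothing intent.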
The second algorithm, the Local Set-based Centroids Selector method (LSCo), first applies LSSm for removing noise and then applies LS-clustering [14] for identifying clusters in T without invasive points (instances that are surrounded by instances with a different class). The algorithm keeps in S only the centroids
of the resulting clusters. Finally, the Local Set Border Selector (LSBo) first applies LSSm for removing noise, and afterwards it computes the local set of every instance in T. Then, the instances in T are sorted in ascending order of the cardinality of their local sets. In the last step, LSBo verifies, for each instance x in T, whether any member of its local set is contained in S, thus ensuring the proper classification of x. If that is not the case, x is included in S to ensure its correct classification. The time complexity of the three approaches is O(|T|²). Among the three algorithms, LSBo provides the best balance between reduction and accuracy. Other approaches can be found in surveys such as the ones provided in [15], [5], [1].

III. NOTATIONS

In this section, we introduce the notation that will be used throughout the paper:

- T = {x_1, x_2, ..., x_n} is a non-empty set of n instances (or data objects). It represents the original dataset that should be reduced in the instance selection process.
- Each x_i ∈ T is an m-tuple, such that x_i = (x_i1, x_i2, ..., x_im), where x_ij represents the value of the j-th feature of the instance x_i, for 1 ≤ j ≤ m.
- L = {l_1, l_2, ..., l_p} is the set of p class labels that are used for classifying the instances in T, where each l_i ∈ L represents a given class label.
- l: T → L is a function that maps a given instance x_i ∈ T to its corresponding class label l_j ∈ L.
- c: L → 2^T is a function that maps a given class label l_j ∈ L to a given set C, such that C ⊆ T, which represents the set of instances in T whose class is l_j. Notice that T = ∪_{l ∈ L} c(l). In this notation, 2^T represents the powerset of T, that is, the set of all subsets of T, including the empty set and T itself.
- pkn: T × N_{≥1} → 2^T is a function that maps a given instance x_i ∈ T and a given k ∈ N_{≥1} (k ≥ 1) to a given set C, such that C ⊆ c(l(x_i)), which represents the set of the k nearest neighbors of x_i in c(l(x_i)) (excepting x_i itself).
Since the resulting set C includes only the neighbors that have a given class label, it defines a partial k-neighborhood.

- S = {x_1, x_2, ..., x_q} is a set of q instances, such that S ⊆ T. It represents the reduced set of instances that results from the instance selection process.

IV. A LOCAL DENSITY-BASED APPROACH FOR INSTANCE SELECTION

As discussed in Section II, most of the algorithms proposed for instance selection in the literature are designed for searching for the class boundaries in the whole dataset. That is, they perform a global search in the dataset. Although, in general, keeping the border instances in the dataset results in a high accuracy in classification tasks, the global search for these border points involves a high computational cost. The resulting high time complexity is a consequence of the fact that, in general, identifying the border instances in the dataset requires comparisons between each pair of instances in the dataset. In this paper, we explore another strategy for selecting instances. Instead of identifying the border points, our approach identifies instances that have a high concentration of instances near them. Besides that, instead of searching for these instances in the whole dataset, our approach deals with the instances of each class of the dataset separately, searching for the representative instances within the set of instances of each class. Since our algorithm applies only a local search strategy (within each class), it has runtimes that are lower than the runtimes of approaches that adopt a global search strategy. For identifying the representative instances, our approach adopts the notion of density, adapted from [16], [17], [18], [19], which is formalized by the Dens function:

    Dens(x, P) = − (1/|P|) Σ_{y ∈ P} d(x, y)    (1)

where x is a given instance, P = {x_1, x_2, ..., x_q} is a set of q instances, and d is a given distance function. Under this definition, a higher value of Dens corresponds to a denser instance, since a denser instance has a smaller average distance to the members of P.
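As a concrete illustration, Eq. (1) can be sketched in Python as follows. This is an illustrative sketch, not the authors' code; it assumes Euclidean distance as d when no distance function is supplied.

```python
import numpy as np

def dens(x, P, d=None):
    """Local density of x relative to the set P (Eq. 1): the negated
    average distance from x to the members of P, so that a smaller
    average distance yields a higher (denser) value."""
    if d is None:
        # Default distance: Euclidean norm of the feature-wise difference.
        d = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
    return -sum(d(x, y) for y in P) / len(P)
```

For example, an instance located in the middle of P has a higher density than an instance far away from all members of P.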
Notice that Dens(x, P) provides the density of the instance x relative to the set P of instances. In this way, when P is a subset of the whole dataset, Dens(x, P) represents the local density of x, considering the set P. Our approach assumes that the local density Dens(x, c) of a given instance x in a set c, where c(l(x)) = c, is proportional to the usefulness of x for classifying new instances of c. Thus, the locally densest instance of a given neighborhood represents more information about its surroundings than the less dense instances and, due to this, it is more representative of the neighborhood than its less dense neighbors. Considering this, for each l ∈ L, the LDIS algorithm verifies, for each x ∈ c(l), whether there is some instance y ∈ pkn(x, k) (where k is arbitrarily chosen by the user) such that Dens(y, c(l)) > Dens(x, c(l)). If this is not the case, x is the locally densest instance in its partial k-neighborhood and, due to this, x is included in S. Algorithm 1 formalizes this strategy. Notice that when |c(l(x))| ≤ k, it is necessary to consider k = |c(l(x))| − 1 for calculating the partial k-neighborhood of x. Notice also that the most expensive steps of the algorithm involve determining the partial density and the partial k-neighborhood of each instance of a given set c(l) (for some class label l). Determining the partial density of every instance of a given set c(l) is a process whose time complexity is proportional to O(|c(l)|²). The time complexity of determining the partial k-neighborhood is equivalent. An efficient implementation of Algorithm 1 could calculate the partial k-neighborhood and the partial density of each instance of a given set c(l) (for some class
label l) just once, as a first step within the main loop, and use this information for further calculations. Considering this, the time complexity of LDIS is proportional to O(Σ_{l ∈ L} |c(l)|²).

Algorithm 1: LDIS (Local Density-based Instance Selection) algorithm
    Input: A set of instances T and the number k of neighbors.
    Output: A subset S of T.
    begin
        S ← ∅;
        foreach l ∈ L do
            foreach x ∈ c(l) do
                foundDenser ← false;
                foreach neighbor ∈ pkn(x, k) do
                    if Dens(x, c(l)) < Dens(neighbor, c(l)) then
                        foundDenser ← true;
                if not foundDenser then
                    S ← S ∪ {x};
        return S;

V. EXPERIMENTS

For evaluating our approach, we compared the LDIS algorithm, presented in Section IV, with 5 important instance selection algorithms provided by the literature: DROP3, ENN, ICF, LSBo and LSSm. We considered 15 well-known datasets: breast cancer, cars, E. coli, glass, iris, letter, lung cancer, mushroom, genetic promoters, segment, soybean (this dataset combines the large soybean dataset and its corresponding test dataset), splice-junction gene sequences, congressional voting records, wine and zoo. All datasets were obtained from the UCI Machine Learning Repository. Table I presents the details of the datasets that were used.

Table I. Details of the datasets used in the evaluation process (number of instances, attributes and classes of each dataset).

Our experimentation was designed for comparing three evaluation measures: accuracy, reduction and effectiveness. Following [5], we assume

    accuracy = Success(Test) / |Test|    (2)

and

    reduction = (|T| − |S|) / |T|    (3)

where Test is a given set of instances that are selected for being tested in a classification task, and Success(Test) is the number of instances in Test correctly classified in the classification task.
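Algorithm 1 can be sketched as a runnable Python function. This is an illustrative sketch, not the authors' implementation: it assumes numerical features with Euclidean distance, and computes the density of Eq. (1) as the negated mean within-class distance.

```python
import numpy as np

def ldis(X, y, k=3):
    """Sketch of LDIS (Algorithm 1): keep an instance only if no member
    of its partial k-neighborhood is denser within its class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    selected = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        C = X[idx]
        # Pairwise distances within the class c(l).
        dist = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
        # Local density (Eq. 1): negated mean distance to class members.
        density = -dist.mean(axis=1)
        kk = min(k, len(idx) - 1)  # when |c(l)| <= k, use k = |c(l)| - 1
        for i in range(len(idx)):
            # Partial k-neighborhood: the kk nearest same-class neighbors.
            neighbors = np.argsort(dist[i])[1:kk + 1]
            if not any(density[j] > density[i] for j in neighbors):
                selected.append(idx[i])
    return sorted(selected)
```

Because each class is processed independently, the inner pairwise-distance computation touches only |c(l)|² pairs, which matches the O(Σ_l |c(l)|²) complexity discussed above.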
Besides that, in this work we consider effectiveness as a measure of the degree to which an instance selection algorithm succeeds in producing a small set of instances that allows a high classification accuracy on new instances. Thus, we consider effectiveness = accuracy × reduction. For evaluating the accuracy of the classification of new instances, we applied the k-nearest neighbors (KNN) algorithm [10], with k = 3, as considered in [5]. Besides that, the accuracy and reduction were evaluated in an n-fold cross-validation scheme, where n = 10. In this scheme, first the dataset is randomly partitioned into 10 equally sized subsamples. From these subsamples, a single subsample is retained as validation data (Test), and the union of the remaining 9 subsamples is considered the initial training set (ITS). Afterwards, an instance selection algorithm is applied for reducing the ITS, producing the reduced training set (RTS). At this point, we can measure the reduction of the dataset. Finally, the RTS is used as the training set for the KNN algorithm, for evaluating the instances in Test. At this point, we can measure the accuracy achieved by the KNN, using RTS as the training set. The cross-validation is repeated 10 times, with each subsample used once as Test. The 10 values of accuracy and reduction are averaged to produce, respectively, the average accuracy (AA) and average reduction (AR). The average effectiveness (AE) is calculated by considering AA and AR. Tables II, III and IV show, respectively, the resulting AA, AR and AE of each combination of dataset and instance selection algorithm. In these tables, the best results for each dataset are marked in bold typeface. In this evaluation process, we adopted k = 3 for DROP3, ENN, ICF and LDIS.
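A single fold of this protocol can be sketched as follows. The sketch is illustrative: `knn_predict` is a minimal KNN with Euclidean distance, and `selector` stands for any instance selection algorithm that returns the indices of the instances to keep.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=3):
    """Minimal KNN classifier: majority vote among the k nearest neighbors."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)
        nearest = train_y[np.argsort(d)[:k]]
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

def evaluate_fold(ITS_X, ITS_y, test_X, test_y, selector, k=3):
    """One cross-validation fold: reduce the ITS with `selector`,
    then classify Test with KNN trained on the RTS."""
    keep = selector(ITS_X, ITS_y)
    RTS_X, RTS_y = ITS_X[keep], ITS_y[keep]
    accuracy = np.mean(knn_predict(RTS_X, RTS_y, test_X, k) == test_y)  # Eq. (2)
    reduction = (len(ITS_X) - len(RTS_X)) / len(ITS_X)                  # Eq. (3)
    return accuracy, reduction, accuracy * reduction                    # effectiveness
```

Note that a selector that keeps everything yields reduction 0 and therefore effectiveness 0, which is why effectiveness rewards the balance between both measures rather than accuracy alone.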
Besides that, we adopted the following distance function d: T × T → R:

    d(x, y) = Σ_{j=1}^{m} θ_j(x, y)    (4)

where

    θ_j(x, y) = α(x_j, y_j), if j is a categorical feature;
    θ_j(x, y) = |x_j − y_j|, if j is a numerical feature    (5)

and

    α(x_j, y_j) = 1, if x_j ≠ y_j;
    α(x_j, y_j) = 0, if x_j = y_j    (6)
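The mixed distance of Eqs. (4)-(6) can be sketched in Python as follows. This is a minimal sketch; the `categorical` argument, a set of feature indices flagged as categorical, is an assumption about how feature types would be supplied.

```python
def mixed_distance(x, y, categorical):
    """Distance of Eqs. (4)-(6): overlap (0/1) for categorical features,
    absolute difference for numerical ones, summed over all features."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        if j in categorical:
            total += 0.0 if xj == yj else 1.0   # alpha(x_j, y_j), Eq. (6)
        else:
            total += abs(xj - yj)               # |x_j - y_j|, Eq. (5)
    return total
```

For instance, two instances that differ by 2.0 on a numerical feature and disagree on one categorical feature are at distance 3.0.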
Table II. Comparison of the accuracy achieved by the training set produced by each algorithm, for each dataset.

Dataset        DROP3  ENN   ICF   LSBo  LSSm  LDIS(k=3)  Average
Breast cancer  0.73   0.74  0.72  0.61  0.74  0.69       0.70
Cars           0.75   0.76  0.76  0.65  0.76  0.76       0.74
E. coli        0.83   0.86  0.81  0.80  0.85  0.86       0.83
Glass          0.62   0.65  0.62  0.56  0.71  0.59       0.63
Iris           0.97   0.97  0.94  0.94  0.96  0.95       0.95
Letter         0.87   0.92  0.80  0.75  0.92  0.77       0.84
Lung cancer    0.31   0.34  0.41  0.32  0.45  0.38       0.37
Mushroom       1.00   1.00  0.98  1.00  1.00  1.00       1.00
Promoters      0.79   0.79  0.72  0.72  0.82  0.75       0.77
Segment        0.93   0.95  0.91  0.87  0.95  0.91       0.92
Soybean        0.85   0.90  0.84  0.63  0.91  0.78       0.82
Splice         0.71   0.73  0.70  0.75  0.76  0.75       0.73
Voting         0.92   0.92  0.91  0.89  0.92  0.91       0.91
Wine           0.70   0.74  0.73  0.78  0.76  0.72       0.74
Zoo            0.91   0.88  0.88  0.71  0.90  0.82       0.85
Average        0.79   0.81  0.78  0.73  0.83  0.78       0.79

Table III. Comparison of the reduction achieved by each algorithm, for each dataset.

Dataset        DROP3  ENN   ICF   LSBo  LSSm  LDIS(k=3)  Average
Breast cancer  0.77   0.29  0.85  0.73  0.14  0.87       0.61
Cars           0.88   0.18  0.82  0.74  0.12  0.86       0.60
E. coli        0.71   0.16  0.86  0.82  0.09  0.90       0.59
Glass          0.76   0.32  0.69  0.72  0.14  0.91       0.59
Iris           0.72   0.04  0.58  0.93  0.06  0.89       0.54
Letter         0.68   0.05  0.80  0.83  0.04  0.83       0.54
Lung cancer    0.70   0.59  0.74  0.52  0.17  0.86       0.60
Mushroom       0.86   0.01  0.94  0.99  0.01  0.87       0.61
Promoters      0.60   0.18  0.70  0.60  0.05  0.84       0.49
Segment        0.69   0.04  0.83  0.91  0.04  0.83       0.55
Soybean        0.68   0.09  0.58  0.83  0.05  0.78       0.50
Splice         0.66   0.23  0.76  0.59  0.05  0.81       0.52
Voting         0.79   0.07  0.92  0.89  0.04  0.77       0.58
Wine           0.71   0.23  0.78  0.78  0.10  0.87       0.58
Zoo            0.65   0.06  0.33  0.88  0.06  0.63       0.43
Average        0.72   0.17  0.75  0.78  0.08  0.83       0.56

Table III shows that LDIS achieves the highest reduction in most of the datasets, and achieves the highest average reduction rate. Finally, Table IV shows that LDIS achieves the highest effectiveness in most of the datasets, and achieves the highest average effectiveness.
Thus, these results show that although LDIS does not provide the highest accuracies, it provides the highest reduction rates and the best trade-off between both measures (represented by the effectiveness). We also carried out a comparison of the runtimes of the instance selection algorithms considered in our experiments. In this comparison, we applied the 5 instance selection algorithms to reduce the 3 largest datasets considered in our experiments: letter (Figure 1), splice-junction gene sequences (Figure 2) and mushroom (Figure 3). For conducting the experiments, we used an Intel Core i5-3210M laptop with a 2.5 GHz CPU and 6 GB of RAM. Figures 1, 2 and 3 show that, considering these three datasets, the LDIS algorithm has the lowest runtime compared to the other algorithms. This result is a consequence of the fact that LDIS deals with the set of instances of each class of the dataset separately, instead of performing a global search in the whole dataset. Table II shows that LSSm achieves the highest accuracy in most of the datasets. However, LSSm was designed for removing noisy instances and, due to this, it does not provide high reduction rates. In addition, notice that the accuracies achieved by LDIS are higher than the average for several datasets, and for 3 datasets LDIS achieves the highest accuracy. Besides that, the difference between the accuracy of LDIS and LSSm is not large, and can be compensated by the gain in reduction and in runtime provided by LDIS.

Figure 1. Comparison of the runtimes of 5 instance selection algorithms, considering the Letter dataset.

Table IV. Comparison of the effectiveness achieved by each algorithm, for each dataset.

Dataset        DROP3  ENN   ICF   LSBo  LSSm  LDIS(k=3)  Average
Breast cancer  0.56   0.21  0.61  0.45  0.10  0.60       0.42
Cars           0.66   0.14  0.62  0.49  0.09  0.65       0.44
E. coli        0.59   0.14  0.70  0.66  0.08  0.77       0.49
Glass          0.47   0.21  0.42  0.40  0.10  0.54       0.36
Iris           0.69   0.04  0.55  0.87  0.06  0.85       0.51
Letter         0.59   0.05  0.65  0.62  0.03  0.64       0.43
Lung cancer    0.22   0.20  0.30  0.17  0.08  0.32       0.21
Mushroom       0.86   0.01  0.92  0.99  0.01  0.87       0.61
Promoters      0.47   0.14  0.51  0.43  0.04  0.63       0.37
Segment        0.64   0.04  0.75  0.79  0.04  0.75       0.50
Soybean        0.58   0.08  0.49  0.52  0.04  0.61       0.39
Splice         0.47   0.17  0.53  0.45  0.04  0.61       0.38
Voting         0.72   0.07  0.84  0.79  0.04  0.71       0.53
Wine           0.50   0.17  0.57  0.61  0.07  0.63       0.42
Zoo            0.59   0.05  0.29  0.62  0.05  0.52       0.35
Average        0.57   0.11  0.58  0.59  0.06  0.65       0.43

Figure 2. Comparison of the runtimes of 5 instance selection algorithms, considering the splice-junction gene sequences dataset.
Figure 3. Comparison of the runtimes of 5 instance selection algorithms, considering the Mushroom dataset.

Finally, we also carried out experiments for evaluating the impact of the parameter k on the performance of LDIS. Tables V, VI and VII show, respectively, the accuracy, the reduction and the effectiveness of LDIS, with k assuming the values 1, 2, 3, 5, 10 and 20. These results show that the variation of k has a significant impact on the performance of the algorithm. The results suggest that, in general, as the value of k increases, the accuracy tends to decrease, the reduction increases, and the effectiveness increases up to a point from which it begins to decrease. This suggests the possibility of investigating strategies for automatically estimating the best value of k for determining the partial k-neighborhood of a given instance. Also, Table V shows some exceptions to the general rule, as in the case of the datasets cars, E. coli and iris. We hypothesize that, in these cases, a higher value of k led to an additional removal of noisy instances, resulting in an increase of the accuracy. This hypothesis should be investigated in the future.

Table V. Comparison of the accuracy achieved by LDIS, with different values of k (1, 2, 3, 5, 10 and 20), for each dataset.

Table VI. Comparison of the reduction achieved by LDIS, with different values of k, for each dataset.

Table VII. Comparison of the effectiveness achieved by LDIS, with different values of k, for each dataset.

VI. CONCLUSION

Most of the instance selection algorithms available in the literature search for the border instances in the whole dataset. Due to this, in general, these algorithms have a high time complexity. In this paper, we proposed an algorithm, called LDIS (Local Density-based Instance Selection), which adopts a different strategy. It analyzes the instances of each class separately, with the goal of keeping only the densest instances of a given neighborhood within each class. Since LDIS searches for the representative instances within each class, its resulting time complexity is reasonably low, compared to the main algorithms proposed in the literature. In an overview, our experiments showed that LDIS provides the best reduction rates and the best balance between accuracy and reduction, with the lowest time complexity, compared with other algorithms available in the literature. In future works, we plan to investigate strategies for automatically estimating the best value of the parameter k for each problem. Regarding this point, it is reasonable to hypothesize that the best value of k can be different for different classes within the same dataset. This hypothesis should be investigated as well. We also plan to develop a version of LDIS that abstracts the information of the neighbor instances within the partial
k-neighborhood of a dense instance, instead of just eliminating the instances that are less dense. Besides that, we also plan to investigate how the LDIS algorithm can be combined with other instance selection algorithms. Finally, the performance of LDIS encourages the investigation of novel instance selection strategies that are based on other local properties of the dataset.

ACKNOWLEDGMENT

The authors would like to thank the Brazilian Research Council (CNPq) and the PRH PB-17 program (supported by Petrobras) for the support to this work. Also, we would like to thank Sandro Fiorini for comments and ideas.

REFERENCES

[1] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer.
[2] C.-H. Chou, B.-H. Kuo, and F. Chang, "The generalized condensed nearest neighbor rule as a data reduction method," in Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, vol. 2. IEEE, 2006.
[3] H. Liu and H. Motoda, "On issues of instance selection," Data Mining and Knowledge Discovery, vol. 6, no. 2.
[4] W.-C. Lin, C.-F. Tsai, S.-W. Ke, C.-W. Hung, and W. Eberle, "Learning to detect representative data for large scale instance selection," Journal of Systems and Software, vol. 106, pp. 1-8.
[5] E. Leyva, A. González, and R. Pérez, "Three new instance selection methods based on local sets: A comparative study with several approaches from a bi-objective perspective," Pattern Recognition, vol. 48, no. 4.
[6] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Transactions on Systems, Man and Cybernetics, no. 3.
[7] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Machine Learning, vol. 38, no. 3.
[8] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining and Knowledge Discovery, vol. 6, no. 2.
[9] K. Nikolaidis, J. Y. Goulermas, and Q. Wu, "A class boundary preserving algorithm for data condensation," Pattern Recognition, vol. 44, no. 3.
[10] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1.
[11] P. E. Hart, "The condensed nearest neighbor rule," IEEE Transactions on Information Theory, vol. 14.
[12] G. W. Gates, "Reduced nearest neighbor rule," IEEE Transactions on Information Theory, vol. 18, no. 3.
[13] I. Tomek, "An experiment with the edited nearest-neighbor rule," IEEE Transactions on Systems, Man, and Cybernetics, no. 6.
[14] Y. Caises, A. González, E. Leyva, and R. Pérez, "Combining instance selection methods based on data characterization: An approach to increase their effectiveness," Information Sciences, vol. 181, no. 20.
[15] J. Hamidzadeh, R. Monsefi, and H. S. Yazdi, "IRAHC: Instance reduction algorithm using hyperrectangle clustering," Pattern Recognition, vol. 48, no. 5.
[16] L. Bai, J. Liang, C. Dang, and F. Cao, "A cluster centers initialization method for clustering categorical data," Expert Systems with Applications, vol. 39, no. 9.
[17] J. L. Carbonera and M. Abel, "A cognition-inspired knowledge representation approach for knowledge-based interpretation systems," in Proceedings of the 17th ICEIS, 2015.
[18] J. L. Carbonera and M. Abel, "A cognitively inspired approach for knowledge representation and reasoning in knowledge-based systems," in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015).
[19] J. L. Carbonera and M. Abel, "Extended ontologies: a cognitively inspired approach," in Proceedings of the 7th Ontology Research Seminar in Brazil (Ontobras).