An Improved k-nearest Neighbor Classification Algorithm Using Shared Nearest Neighbor Similarity


Wei Zheng, College of Science, Hebei North University, Zhangjiakou 075000, Hebei, China
HaiDong Wang, Library of Hebei Institute of Architecture and Civil Engineering, Zhangjiakou 075000, Hebei, China
Lin Ma, College of Science, Hebei North University, Zhangjiakou 075000, Hebei, China
RuoYi Wang, School of Information Science and Engineering, Hebei North University, Zhangjiakou 075000, Hebei, China

Abstract. k-Nearest Neighbor (KNN) is one of the most popular algorithms for pattern recognition. Many researchers have found that the KNN classifier may lose classification precision because of the uneven density of training samples. In view of this defect, an improved k-nearest neighbor algorithm is presented that uses shared nearest neighbor similarity to compute the similarity between a test sample and its nearest neighbor samples. Experiments show that this method improves classification precision compared with the traditional KNN.

Key words: KNN Algorithms, Sample Classification, Nearest Neighbor

1. Introduction

Pattern recognition is about assigning labels to objects that are described by a set of measurements, also called attributes or features. There are two major types of pattern recognition problems: unsupervised and supervised; the supervised category is also called supervised learning or classification. Classification of objects is an important area of research and of practical application in a variety of fields, including artificial intelligence, statistics and vision analysis. Considered as a pattern recognition problem, classification has been investigated with numerous techniques [1].

K-Nearest Neighbor (KNN) classification is one of the most fundamental and simple classification methods. When there is little or no prior knowledge about the distribution of the data, the KNN method should be one of the first choices for classification, but it has some disadvantages: a) the computation cost is quite high, because the distance from each query instance to all training samples must be computed; b) the accuracy rate is low on multidimensional data sets; c) the value of the parameter K, the number of nearest neighbors, must be determined; d) for distance-based learning it is not clear which type of distance to use [2]. Researchers have attempted to propose new approaches to improve the performance of the KNN method, e.g., Discriminant Adaptive NN (DANN) [3], Adaptive Metric NN (ADAMENN) [4], Weight Adjusted KNN (WAKNN) [5], Large Margin NN (LMNN) [6], etc. Despite the success and rationale of these methods, most have several constraints in practice.

In this paper, we propose a novel, effective yet easy-to-implement extension of the KNN method named IKNN. The innovation lies in using the shared nearest neighbor method: by calculating the shared nearest neighbor value between the test sample and its nearest neighbor samples, the weight of each nearest neighbor sample is redistributed based on that value. Experiments on UCI data sets show that the IKNN method achieves higher classification accuracy than the KNN method.

The rest of this paper is organized as follows. Section 2 describes the commonly used KNN method. Section 3 studies nearest neighbor similarity and gives an improved KNN method named IKNN. Section 4 describes the classifiers used in the experiments to compare IKNN with other KNN methods, and presents the experimental results and analysis. Finally, we give the conclusion and future work.

2. KNN description

The K-Nearest Neighbor classifier is an instance-based learning method. It computes the similarity between the test instance and the training instances, considers the k top-ranking nearest instances, and finds the category that is most similar. There are two methods for finding the most similar category: majority voting and similarity score summing.

Figure 1. Two instances of nearest neighbor: (a) the 1-nearest neighbor; (b) the 2-nearest neighbor.
In majority voting, a category gets one vote for each instance of that category in the set of k top-ranking nearest neighbors. The most similar category is then the one that receives the highest number of votes. In similarity score summing, each category gets a score equal to the sum of the similarity scores of the instances of that category among the k top-ranking neighbors, and the most similar category is the one with the highest similarity score sum. The similarity value between two instances is based on the distance between them under a distance metric; generally the Euclidean distance metric is used.

The K-Nearest Neighbor classifier works as follows: for each test sample z = (x', y'), it calculates the distance between z and every training sample (x, y) in the training sample set D, to determine the nearest neighbor list of z.

Algorithm 1. K-Nearest Neighbor classifier
1: Let k be the number of nearest neighbors and D the training sample set.
2: for each test sample z = (x', y') do
3:   Calculate the distance d(x', x) between z and each training sample (x, y) in D.
4:   Select D_z, the subset of the k training samples nearest to z.
5:   $y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$
6: end for

Once the nearest neighbor list is obtained, the test sample is classified based on the majority class of its nearest neighbors [7]:

Majority voting: $y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$

where v ranges over the class labels, y_i is the class label of the i-th nearest neighbor, and I(.) is the indicator function, which returns 1 if its argument is true and 0 otherwise. In the majority voting method every neighbor has the same influence on the classification, which makes the result sensitive to the choice of K, as shown in Figure 1, where the test sample is marked by a distinct symbol and the training samples by - and +. One way to reduce the influence of K is to weight each vote according to the distance of the corresponding nearest neighbor, so that training samples located far from z affect the classification less than those close to z. Using the distance-weighted voting scheme, the class label is determined by the following formula, where w_i is the weight of the i-th neighbor (commonly taken inversely proportional to the squared distance); this variant is named WKNN.

Distance-weighted voting: $y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i)$
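For concreteness, the following Python sketch implements the two voting rules above: plain majority voting (KNN) and distance-weighted voting (WKNN). It assumes Euclidean distance, as in the text; the particular weight w_i = 1/d(x', x_i)^2 is a common convention rather than a formula reproduced from this paper, and the data arrays in the example call are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5, weighted=False):
    """Classify x_query by majority voting (KNN) or distance-weighted voting (WKNN)."""
    # Euclidean distance from the query to every training sample
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x_query), axis=1)
    # Indices of the k nearest training samples: the set D_z
    nn_idx = np.argsort(dists)[:k]
    votes = Counter()
    for i in nn_idx:
        # Majority voting adds 1 per neighbor; weighted voting adds w_i,
        # here assumed to be 1 / d^2 (a common choice, not taken from the paper).
        w = 1.0 / (dists[i] ** 2 + 1e-12) if weighted else 1.0
        votes[y_train[i]] += w
    # y' = argmax_v of the (possibly weighted) vote count for class v
    return votes.most_common(1)[0][0]

# Example call on hypothetical data:
# y_hat = knn_predict(X_train, y_train, X_test[0], k=5, weighted=True)
```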
3. Improved k-nearest Neighbor Classification Algorithm Using Shared Nearest Neighbor Similarity

3.1. Problems existing in KNN

When KNN is used in data mining, several problems arise. First, because the prediction is based on local information, KNN is very sensitive to noisy data when the value of K is small. Second, in practical classification applications, each category may show a multi-peak distribution because the data are unevenly distributed. KNN classification depends on the distribution of the training samples, so samples from high-density areas dominate the category judgment. Figures 2 and 3 show these problems graphically.

Figure 2. Noisy data in category C_i
Figure 3. High-density data of category C_j

In the figures, one symbol marks the test sample, the symbol * marks noisy data belonging to category C_i, and the symbol + marks high-density data of category C_j. Figure 2 shows that the result of KNN classification is easily affected by noisy data in category C_i. Figure 3 shows that the result of KNN classification is easily affected by high-density data of category C_j. To overcome these weaknesses of KNN, we propose an improved algorithm based on shared nearest neighbor similarity.

3.2. Improved k-nearest Neighbor Classification

In paper [8], Levent Ertoz proposed the shared nearest neighbor (SNN) method to solve problems with high-dimensional data. The core idea of SNN is that the similarity between two objects is defined by their shared nearest neighbors. It is based on the following principle: if two points are similar to some of the same points, they are also similar to each other, even when a direct similarity measure cannot show this. The computation of the SNN similarity between two points can be described as follows; the construction is illustrated in Figure 4.

Step 1: For each point, find its N nearest points and create its nearest neighbor list.
Step 2: If two points both appear in each other's nearest neighbor lists, there is a link between them. Calculate the link strength of every pair of linked points and build a point link graph.
Step 3: The link strength of every pair of linked points is the number of their shared neighbors.

Figure 4. The calculation of SNN similarity between two points

In Figure 4, the two black points each have eight nearest neighbors, four of which (shown in orange) are shared, so the SNN similarity between the two points is 4.
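The three steps above can be sketched in Python as follows. This is a brute-force illustration under the assumption of Euclidean distance; the neighborhood size N (n_neighbors) is a free parameter, and a k-d tree or similar index would replace the full distance matrix for larger data sets.

```python
import numpy as np

def snn_similarity(X, n_neighbors=8):
    """SNN similarity matrix following Steps 1-3 (brute-force, for illustration)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Step 1: build each point's nearest neighbor list (the point itself excluded)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nn_lists = [set(np.argsort(dists[i])[:n_neighbors]) for i in range(n)]
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Step 2: link i and j only if each appears in the other's neighbor list
            if j in nn_lists[i] and i in nn_lists[j]:
                # Step 3: link strength = number of shared nearest neighbors
                sim[i, j] = sim[j, i] = len(nn_lists[i] & nn_lists[j])
    return sim
```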

Because SNN similarity is good at determining the similarity between points, we propose an improved KNN classification algorithm based on it. The improved algorithm, named IKNN, is as follows:

Algorithm 2. K-Nearest Neighbor classifier using shared nearest neighbor similarity (IKNN)
1: Let k be the number of nearest neighbors and D the training sample set.
2: for each test sample z = (x', y') do
3:   Calculate the distance d(x', x) between z and each training sample (x, y) in D.
4:   Select D_z, the subset of the k training samples nearest to z.
5:   for each training sample K_i in D_z do
6:     Calculate the shared nearest neighbor similarity value W_snn between K_i and z.
7:   end for
8: end for

The W_snn values are then used to redistribute the weights of the nearest neighbor samples when voting on the class label of z, so that neighbors sharing more nearest neighbors with the test sample contribute more to the decision.
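The paper's exact W_snn weighting formula is not given above, so the following Python sketch is only one plausible reading of the description: each of the k nearest neighbors votes with a weight equal to its shared-nearest-neighbor count with the test sample, falling back to plain majority voting when no neighbors are shared. The function name, the fallback rule and the choice of raw shared-neighbor counts as weights are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def iknn_predict(X_train, y_train, x_query, k=5, n_neighbors=8):
    """IKNN sketch: weight each of the k nearest neighbors of the test sample by its
    shared-nearest-neighbor similarity with that sample, then vote (assumed weighting)."""
    X_train = np.asarray(X_train, dtype=float)
    d_query = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nn_idx = np.argsort(d_query)[:k]                    # the set D_z
    query_nn = set(np.argsort(d_query)[:n_neighbors])   # neighbor list of z

    votes = Counter()
    for i in nn_idx:
        # Neighbor list of training sample K_i (itself excluded)
        d_i = np.linalg.norm(X_train - X_train[i], axis=1)
        d_i[i] = np.inf
        nn_i = set(np.argsort(d_i)[:n_neighbors])
        # W_snn: number of neighbors shared by z and K_i
        w_snn = len(query_nn & nn_i)
        votes[y_train[i]] += w_snn

    if max(votes.values()) == 0:
        # If no neighbors are shared at all, fall back to plain majority voting
        for i in nn_idx:
            votes[y_train[i]] += 1
    return votes.most_common(1)[0][0]
```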

4. Experiment and Analysis

4.1. Experimental purposes

We use KNN, WKNN and IKNN to carry out classification experiments in order to verify their classification performance. This section compares the IKNN method with the original KNN and discusses the experimental results.

4.2. Data Collections

The classifiers KNN, WKNN and IKNN are evaluated on three standard data sets, namely Wine, Iris and Breast Cancer Wisconsin. None of the data sets has missing values, and all use continuous attributes. These data sets, obtained from the UCI repository [9], are described in Table 1. In each data set, the instances are divided into training and test sets by randomly choosing two thirds and one third of the instances, respectively. Table 1 shows the number of samples chosen in each category.

Table 1. The data sets used in the experiments

Data                            Vectors   Attributes   Classes
Iris (IR)                       150       5            3
Breast Cancer Wisconsin (BCW)   474       9            2
Wine (WI)                       144       13           3

4.3. Performance measure

To evaluate the performance of a classifier, we use the accuracy measure put forward by Yavuz (1998) [10].

4.4. Results and Analysis

Classification experiments were carried out for K values from 5 to 25. The experimental results are shown in Figures 5, 6 and 7.

Figure 5. Iris data set
Figure 6. Wine data set
Figure 7. BCW data set

In Figure 5, the accuracy curve of the IKNN classifier is clearly higher, and its classification effect is significantly better than that of the other two methods: the accuracy of IKNN is greater than that of KNN and WKNN, and its curve is smooth. The overall classification performance of WKNN is slightly better than that of KNN, and KNN has the largest classification error rate. For the experiments on the Wine data set, the advantage of IKNN is less obvious: as Figure 6 shows, the classification curve fluctuates as the value of K changes. When K equals 16, IKNN achieves its best result, with only nine samples wrongly classified. Figure 7 shows the experiment carried out on the BCW data set. Looking at the three classification curves, the performance of the KNN method is not very stable as K varies; when K equals 25, 29 test samples are misjudged. The WKNN method shows small fluctuations, while the classification result of the IKNN method is more stable and better than that of the other two. The IKNN classifier thus achieves very good results on all three common data sets: using the IKNN method improves the accuracy of classification and provides good stability.

5. Conclusion

This paper has proposed an improved classification method based on KNN, named IKNN. IKNN uses a similarity judgment based on SNN, and thereby avoids the shortcoming that KNN classification depends on the distribution of the training samples, where samples in high-density areas dominate the category judgment. The experiments show that IKNN is an effective method to improve the performance of categorization. In the future, we will continue to work on nearest neighbor clustering based on categories.

Acknowledgments

The work was supported by the National Natural Science Foundation of China (11404088) and the Natural Science Foundation of Hebei Province (F2015405011).

References

1. L. I. Kuncheva (2005) Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience: New York.
2. R. O. Duda, P. E. Hart and D. G. Stork (2001) Pattern Classification. John Wiley & Sons: New York.
3. Hastie, T., Tibshirani, R. (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans. Pattern Anal., 18(6), pp. 607-616.
4. Domeniconi, C., Peng, J., Gunopulos, D. (2002) Locally adaptive metric nearest-neighbor classification. IEEE Trans. Pattern Anal., 24(9), pp. 1281-1285.
5. Han, E.-H. S., Karypis, G., Kumar, V. (2001) Text categorization using weight adjusted k-nearest neighbor classification. Proc. Conf. Knowledge Discovery and Data Mining, pp. 53-65.
6. Weinberger, K. Q., Blitzer, J., Saul, L. K. (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, pp. 207-244.
7. Pang-Ning Tan, Michael Steinbach, Vipin Kumar (2005) Introduction to Data Mining. Addison Wesley: New Jersey.
8. L. Ertoz, M. Steinbach, V. Kumar (2003) Finding topics in collections of documents: a shared nearest neighbor approach. Clustering and Information Retrieval, 11, pp. 83-103.
9. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
10. Yavuz, T. (1998) Application of k-nearest neighbor on feature projections classifier to text categorization. Proc. Conf. on Computer and Information Sciences, pp. 234-246.